PakhtoVoice-Orpheus-60H

PakhtoVoice-Orpheus-60H is an experimental Orpheus-based speech generation model fine-tuned for Peshawari / Pakistani Pashto on approximately 60 hours of proprietary synthetic Pashto speech data.

This model is functional and generally understandable, but it is still an early-stage checkpoint. Some pronunciations may be inaccurate, some outputs may sound robotic, and naturalness is still limited in places.

Overview

  • Base model: canopylabs/3b-hi-pretrain-research_release
  • Language: Pashto
  • Target dialect: Peshawari / Pakistani Pashto
  • Training data: ~60 hours
  • Data type: proprietary synthetic speech data
  • Training method: LoRA fine-tuning
  • Epochs: 4
  • Final training loss: ~3.89
  • Audio codec: SNAC 24 kHz
  • Sample rate: 24 kHz

Model status

This is an experimental baseline.

A fair summary is:

It behaves like a speaker who learned Pashto fairly recently: understandable overall, but still prone to pronunciation mistakes.

Intended use

This model is suitable for:

  • Pashto speech generation experiments
  • research and prototyping
  • exploring Orpheus-style speech token generation for Pashto
  • testing synthetic-data speech fine-tuning pipelines

This model is not recommended yet for production-critical deployment without human review.

Limitations

Known limitations include:

  • pronunciation mistakes on some words
  • occasional robotic- or synthetic-sounding outputs
  • imperfect naturalness and prosody
  • uneven quality depending on prompt content
  • likely sensitivity to spelling and text normalization
  • performance centered on Peshawari / Pakistani Pashto
  • possible inheritance of artifacts from synthetic training data

Because the training data is synthetic, the model may reflect unnatural pronunciations or speaking styles present in the source data.

Dataset

This model was trained on approximately 60 hours of proprietary synthetic Pashto speech data.

The dataset is not included with this release.

Training details

Base model

  • canopylabs/3b-hi-pretrain-research_release

Training setup

  • GPU: RTX 3090 24GB
  • Epochs: 4
  • Final training loss: ~3.89
  • Learning rate: 5e-5
  • Warmup ratio: 0.03
  • Weight decay: 0.01
  • Per-device batch size: 1
  • Gradient accumulation steps: 16
  • Max sequence length: 1536
  • Save steps: 500
  • Logging steps: 10
  • Precision: bf16=True, fp16=False
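
One implication of these settings worth spelling out: with a per-device batch size of 1 and 16 gradient-accumulation steps on a single GPU, the effective batch size is 16. A trivial sketch of that arithmetic:

```python
# Effective batch size implied by the training setup above
# (single RTX 3090, so one device).
per_device_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 16
```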

LoRA configuration

  • r: 32
  • alpha: 64
  • dropout: 0.0

Target modules

  • q_proj
  • k_proj
  • v_proj
  • o_proj
  • gate_proj
  • down_proj
  • up_proj

Modules saved

  • lm_head
  • embed_tokens
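
Assuming the fine-tune used Hugging Face `peft` (the card does not name the library explicitly), the LoRA settings above map onto a `LoraConfig` roughly as sketched below; the `peft` import is commented out so the fragment stands alone:

```python
# Sketch of how the LoRA settings above would map onto peft's LoraConfig.
# Assumes the Hugging Face `peft` library; import commented out so this
# fragment runs without peft installed.
lora_kwargs = dict(
    r=32,                  # LoRA rank
    lora_alpha=64,         # scaling numerator
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    modules_to_save=["lm_head", "embed_tokens"],  # trained fully, not via LoRA
)
# from peft import LoraConfig
# lora_config = LoraConfig(**lora_kwargs)

# Effective LoRA scaling applied to each adapter update (alpha / r):
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0
```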

Token / audio format

This model uses:

  • text tokenization from the base model tokenizer
  • reserved vocabulary space for SNAC-style audio token IDs
  • SNAC decoding at 24 kHz

At inference time, the pipeline is:

  1. tokenize input text
  2. generate discrete audio token IDs
  3. convert generated IDs back into SNAC codebooks
  4. decode waveform audio using the original SNAC 24 kHz model

Example config.yaml

```yaml
# ----------------------------
# SNAC
# ----------------------------
snac_model_name: "hubertsiuzdak/snac_24khz"
target_sample_rate: 24000

# ----------------------------
# Tokenizer / model
# ----------------------------
tokenizer_name: "canopylabs/3b-hi-pretrain-research_release"
output_dir: "E:\\ORPHEUS MODEL"

# ----------------------------
# Precision / resize
# ----------------------------
bf16: true
force_resize_embeddings_if_needed: true

# ----------------------------
# Special token IDs
# ----------------------------
start_of_text: 128000
end_of_text: 128009

start_of_speech: 128257
end_of_speech: 128258

start_of_human: 128259
end_of_human: 128260

start_of_ai: 128261
end_of_ai: 128262
pad_token: 128263

audio_tokens_start: 128266
```
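
A quick sanity check implied by these IDs: with 7 bands of 4096 codes starting at `audio_tokens_start`, the embedding table must cover the vocabulary size computed below. This is the same arithmetic as `maybe_fix_tokenizer_and_resize` in the inference script:

```python
# Largest token id the model must be able to embed, per config.yaml.
audio_tokens_start = 128266
num_bands = 7
codebook_size = 4096

max_audio_token_id = audio_tokens_start + num_bands * codebook_size - 1
required_vocab_size = max_audio_token_id + 1
print(required_vocab_size)  # 156938
```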


## Inference

### Expected folder layout

```text
your_model_folder/
├─ README.md
├─ config.yaml
├─ infer.py
└─ merged/
   ├─ config.json
   ├─ model.safetensors / pytorch_model.bin
   ├─ tokenizer files
   └─ generation config files if present
```

How to run

  1. Put your merged model inside output_dir/merged
  2. Save the inference script below as infer.py
  3. Save a matching config.yaml
  4. Run: python infer.py

Example prompt

دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي

(English: "This is a test sentence for testing speech generation for Peshawari Pashto.")

Example inference script

```python
import os
import re
import yaml
import torch
import numpy as np
import soundfile as sf

from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def load_snac_model(cfg):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print(f"Loading SNAC decoder: {cfg['snac_model_name']}")
    snac_model = SNAC.from_pretrained(cfg["snac_model_name"]).to(device)
    snac_model.eval()
    return snac_model


def maybe_fix_tokenizer_and_resize(model, tokenizer, cfg):
    """
    Keeps tokenizer/model embedding size compatible with reserved audio token IDs.
    """
    pad_token = cfg["pad_token"]

    max_audio_token_id = cfg["audio_tokens_start"] + 7 * 4096 - 1
    special_ids = [
        cfg["start_of_text"],
        cfg["end_of_text"],
        cfg["start_of_speech"],
        cfg["end_of_speech"],
        cfg["start_of_human"],
        cfg["end_of_human"],
        cfg["start_of_ai"],
        cfg["end_of_ai"],
        cfg["pad_token"],
    ]
    max_token_id_needed = max(max_audio_token_id, max(special_ids))

    tokenizer_len = len(tokenizer)
    embedding_rows = model.get_input_embeddings().num_embeddings

    print(f"Tokenizer length before fix: {tokenizer_len}")
    print(f"Embedding rows before fix: {embedding_rows}")
    print(f"Max token id needed: {max_token_id_needed}")

    if embedding_rows <= max_token_id_needed:
        required_vocab_size = max_token_id_needed + 1

        if tokenizer_len < required_vocab_size:
            extra = required_vocab_size - tokenizer_len
            print(f"Tokenizer too small. Adding {extra} placeholder tokens...")
            tokenizer.add_tokens([f"<orpheus_extra_{i}>" for i in range(extra)])

        print(f"Resizing embeddings to {len(tokenizer)}")
        model.resize_token_embeddings(len(tokenizer))

        print(f"Tokenizer length after fix: {len(tokenizer)}")
        print(f"Embedding rows after fix: {model.get_input_embeddings().num_embeddings}")
    else:
        print("No resize needed; model embeddings already cover required token ids.")

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model.config.pad_token_id = pad_token


def load_merged_model(cfg, tokenizer):
    merged_dir = os.path.join(cfg["output_dir"], "merged")
    dtype = torch.bfloat16 if cfg.get("bf16", True) else torch.float16

    if not os.path.isdir(merged_dir):
        raise FileNotFoundError(
            f"Merged model folder not found: {merged_dir}\n"
            f"Run training first so it saves:\n"
            f"  {merged_dir}"
        )

    try:
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
    except Exception:
        print("flash_attention_2 not available; falling back to sdpa")
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="sdpa",
            device_map="auto",
        )

    if cfg.get("force_resize_embeddings_if_needed", True):
        maybe_fix_tokenizer_and_resize(model, tokenizer, cfg)
    else:
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = cfg["pad_token"]

    model.eval()
    return model


def build_prompt_input_ids(text, tokenizer, cfg):
    """
    Must match training prompt format.
    """
    start_of_human = cfg["start_of_human"]
    end_of_human = cfg["end_of_human"]
    start_of_ai = cfg["start_of_ai"]
    start_of_speech = cfg["start_of_speech"]
    end_of_text = cfg["end_of_text"]

    text_ids = tokenizer.encode(text, add_special_tokens=True)

    input_ids = (
        [start_of_human]
        + text_ids
        + [end_of_text]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
    )
    return input_ids


def extract_audio_tokens(generated_ids, cfg):
    """
    Extracts generated speech tokens and maps them back to SNAC code indices.
    """
    start_of_speech = cfg["start_of_speech"]
    end_of_speech = cfg["end_of_speech"]
    audio_tokens_start = cfg["audio_tokens_start"]

    try:
        speech_start_idx = generated_ids.index(start_of_speech) + 1
    except ValueError:
        raise ValueError("start_of_speech token not found in generated sequence.")

    speech_tokens = []
    for tok in generated_ids[speech_start_idx:]:
        if tok == end_of_speech:
            break
        if tok >= audio_tokens_start:
            speech_tokens.append(tok)

    if not speech_tokens:
        raise ValueError("No speech/audio tokens were generated.")

    snac_codes = []
    for i, tok in enumerate(speech_tokens):
        band = i % 7
        code = tok - audio_tokens_start - (band * 4096)
        snac_codes.append(code)

    usable = (len(snac_codes) // 7) * 7
    snac_codes = snac_codes[:usable]

    if len(snac_codes) < 7:
        raise ValueError("Too few usable SNAC codes after cleanup.")

    cleaned = []
    for c in snac_codes:
        if c < 0:
            c = 0
        elif c > 4095:
            c = 4095
        cleaned.append(c)

    return cleaned


def snac_codes_to_audio(snac_codes, snac_model):
    """
    Reverse the 7-code interleaving:
      frame = [c0, c1a, c2a, c2b, c1b, c2c, c2d]
    """
    device = next(snac_model.parameters()).device

    if len(snac_codes) % 7 != 0:
        raise ValueError("snac_codes length must be divisible by 7.")

    n_frames = len(snac_codes) // 7

    codes_0 = []
    codes_1 = []
    codes_2 = []

    for j in range(n_frames):
        i = 7 * j
        codes_0.append(snac_codes[i + 0])

        codes_1.append(snac_codes[i + 1])
        codes_1.append(snac_codes[i + 4])

        codes_2.append(snac_codes[i + 2])
        codes_2.append(snac_codes[i + 3])
        codes_2.append(snac_codes[i + 5])
        codes_2.append(snac_codes[i + 6])

    codes = [
        torch.tensor(codes_0, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_1, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_2, dtype=torch.int32, device=device).unsqueeze(0),
    ]

    with torch.inference_mode():
        audio_hat = snac_model.decode(codes)

    audio = audio_hat.squeeze().detach().float().cpu().numpy()
    audio = np.clip(audio, -1.0, 1.0)
    return audio


def generate_one(
    model,
    tokenizer,
    snac_model,
    text,
    cfg,
    output_wav_path="orpheus_test.wav",
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=2560,
):
    input_ids = build_prompt_input_ids(text, tokenizer, cfg)

    device = next(model.parameters()).device
    input_ids = torch.tensor([input_ids], dtype=torch.long, device=device)
    attention_mask = torch.ones_like(input_ids)

    end_of_speech = cfg["end_of_speech"]
    pad_token = cfg["pad_token"]

    with torch.inference_mode():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=end_of_speech,
            pad_token_id=pad_token,
        )

    generated_ids = output[0].detach().cpu().tolist()

    snac_codes = extract_audio_tokens(generated_ids, cfg)
    print(f"Generated {len(snac_codes)} SNAC codes")

    audio = snac_codes_to_audio(snac_codes, snac_model)

    sr = int(cfg["target_sample_rate"])
    sf.write(output_wav_path, audio, sr)
    print(f"Saved audio to: {output_wav_path}")

    return {
        "generated_ids": generated_ids,
        "snac_codes": snac_codes,
        "wav_path": output_wav_path,
        "sample_rate": sr,
    }


def make_safe_filename(text, max_len=50):
    text = re.sub(r"\s+", "_", text.strip())
    text = re.sub(r'[\\/*?:"<>|]', "", text)
    text = text[:max_len].strip("_")
    return text if text else "sample"


def main():
    cfg = load_config()

    test_texts = [
        "دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي"
    ]

    output_dir = "pashto_test_outputs"
    os.makedirs(output_dir, exist_ok=True)

    tokenizer = AutoTokenizer.from_pretrained(cfg["tokenizer_name"])

    print(f"Loading merged model from: {os.path.join(cfg['output_dir'], 'merged')}")
    model = load_merged_model(cfg, tokenizer)

    snac_model = load_snac_model(cfg)

    results = []

    for idx, test_text in enumerate(test_texts, start=1):
        safe_name = make_safe_filename(test_text)
        output_wav_path = os.path.join(output_dir, f"{idx:02d}_{safe_name}.wav")

        print(f"\n[{idx}/{len(test_texts)}] Generating for: {test_text}")

        result = generate_one(
            model=model,
            tokenizer=tokenizer,
            snac_model=snac_model,
            text=test_text,
            cfg=cfg,
            output_wav_path=output_wav_path,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            max_new_tokens=2560,
        )

        results.append(result)

    print("\nDone.")
    print(f"All WAVs saved in: {output_dir}")


if __name__ == "__main__":
    main()
```

Recommended generation settings

Reasonable defaults:

  • temperature = 0.7
  • top_p = 0.9
  • repetition_penalty = 1.1
  • max_new_tokens = 2560

Practical tips

  • shorter prompts often behave more reliably than very long prompts

  • punctuation and cleaner Pashto spelling can improve output quality

  • generation settings can noticeably affect output quality

  • if outputs terminate too early or become unstable, adjust:

    • temperature
    • top_p
    • repetition_penalty
    • max_new_tokens
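
Since the model is likely sensitive to spelling and text normalization, a light cleanup pass before tokenization may help. The function below is a hypothetical sketch, not part of the released pipeline; it deliberately leaves Pashto letter variants (ی، ې، ۍ، ي) untouched, since those distinctions carry meaning:

```python
import re

def normalize_pashto_text(text: str) -> str:
    """Hypothetical pre-tokenization cleanup; not part of this release."""
    text = text.replace("\u0640", "")   # strip tatweel (kashida)
    text = text.replace("\u200b", "")   # strip zero-width spaces
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()

print(normalize_pashto_text("  دا  یوه   ازموینه ده  "))
```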

Disclaimer

This is an experimental Peshawari / Pakistani Pashto speech generation checkpoint trained on proprietary synthetic data. It is functional, but pronunciation and naturalness remain imperfect.

Citation / reference

If you reference this model, please mention:

  • it is an Orpheus-based speech generation fine-tune
  • it targets Peshawari / Pakistani Pashto
  • it was trained on ~60 hours of proprietary synthetic speech
  • training ran for 4 epochs
  • final training loss was approximately 3.89

Model size: 4B parameters (Safetensors, BF16)
Model repository: TBOGamer22/PakhtoVoice-Orpheus-60H