PakhtoVoice-Orpheus-60H

PakhtoVoice-Orpheus-60H is an experimental Orpheus-based speech generation model fine-tuned for Peshawari / Pakistani Pashto on approximately 60 hours of proprietary synthetic Pashto speech data.

This model is functional and generally understandable, but it is still an early-stage checkpoint. Some pronunciations may be inaccurate, some outputs may sound robotic, and naturalness is still limited in places.

Overview

  • Base model: canopylabs/3b-hi-pretrain-research_release
  • Language: Pashto
  • Target dialect: Peshawari / Pakistani Pashto
  • Training data: ~60 hours
  • Data type: proprietary synthetic speech data
  • Training method: LoRA fine-tuning
  • Epochs: 4
  • Final training loss: ~3.89
  • Audio codec: SNAC 24 kHz
  • Sample rate: 24 kHz

Model status

This is an experimental baseline.

A fair summary is:

It behaves like a speaker who learned Pashto fairly recently: understandable overall, but still prone to pronunciation mistakes.

Intended use

This model is suitable for:

  • Pashto speech generation experiments
  • research and prototyping
  • exploring Orpheus-style speech token generation for Pashto
  • testing synthetic-data speech fine-tuning pipelines

This model is not recommended yet for production-critical deployment without human review.

Limitations

Known limitations include:

  • pronunciation mistakes on some words
  • occasional robotic- or synthetic-sounding outputs
  • imperfect naturalness and prosody
  • uneven quality depending on prompt content
  • likely sensitivity to spelling and text normalization
  • performance centered on Peshawari / Pakistani Pashto
  • possible inheritance of artifacts from synthetic training data

Because the training data is synthetic, the model may reflect unnatural pronunciations or speaking styles present in the source data.

Dataset

This model was trained on approximately 60 hours of proprietary synthetic Pashto speech data.

The dataset is not included with this release.

Training details

Base model

  • canopylabs/3b-hi-pretrain-research_release

Training setup

  • GPU: RTX 3090 24GB
  • Epochs: 4
  • Final training loss: ~3.89
  • Learning rate: 5e-5
  • Warmup ratio: 0.03
  • Weight decay: 0.01
  • Per-device batch size: 1
  • Gradient accumulation steps: 16
  • Max sequence length: 1536
  • Save steps: 500
  • Logging steps: 10
  • Precision: bf16=True, fp16=False
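
One implication of these settings worth spelling out: with a per-device batch size of 1 and 16 gradient-accumulation steps on a single GPU, the effective batch size is 16. A trivial sketch of that arithmetic:

```python
# Effective batch size implied by the training setup above
# (single RTX 3090, so one device).
per_device_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 16
```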

LoRA configuration

  • r: 32
  • alpha: 64
  • dropout: 0.0

Target modules

  • q_proj
  • k_proj
  • v_proj
  • o_proj
  • gate_proj
  • down_proj
  • up_proj

Modules saved

  • lm_head
  • embed_tokens
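
Assuming the fine-tune used Hugging Face `peft` (the card does not name the library explicitly), the LoRA settings above map onto a `LoraConfig` roughly as sketched below; the `peft` import is commented out so the fragment stands alone:

```python
# Sketch of how the LoRA settings above would map onto peft's LoraConfig.
# Assumes the Hugging Face `peft` library; import commented out so this
# fragment runs without peft installed.
lora_kwargs = dict(
    r=32,                  # LoRA rank
    lora_alpha=64,         # scaling numerator
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    modules_to_save=["lm_head", "embed_tokens"],  # trained fully, not via LoRA
)
# from peft import LoraConfig
# lora_config = LoraConfig(**lora_kwargs)

# Effective LoRA scaling applied to each adapter update (alpha / r):
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0
```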

Token / audio format

This model uses:

  • text tokenization from the base model tokenizer
  • reserved vocabulary space for SNAC-style audio token IDs
  • SNAC decoding at 24 kHz

At inference time, the pipeline is:

  1. tokenize input text
  2. generate discrete audio token IDs
  3. convert generated IDs back into SNAC codebooks
  4. decode waveform audio using the original SNAC 24 kHz model

Example config.yaml

```yaml
# ----------------------------
# SNAC
# ----------------------------
snac_model_name: "hubertsiuzdak/snac_24khz"
target_sample_rate: 24000

# ----------------------------
# Tokenizer / model
# ----------------------------
tokenizer_name: "canopylabs/3b-hi-pretrain-research_release"
output_dir: "E:\\ORPHEUS MODEL"

# ----------------------------
# Precision / resize
# ----------------------------
bf16: true
force_resize_embeddings_if_needed: true

# ----------------------------
# Special token IDs
# ----------------------------
start_of_text: 128000
end_of_text: 128009

start_of_speech: 128257
end_of_speech: 128258

start_of_human: 128259
end_of_human: 128260

start_of_ai: 128261
end_of_ai: 128262
pad_token: 128263

audio_tokens_start: 128266
```
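
A quick sanity check implied by these IDs: with 7 bands of 4096 codes starting at `audio_tokens_start`, the embedding table must cover the vocabulary size computed below. This is the same arithmetic as `maybe_fix_tokenizer_and_resize` in the inference script:

```python
# Largest token id the model must be able to embed, per config.yaml.
audio_tokens_start = 128266
num_bands = 7
codebook_size = 4096

max_audio_token_id = audio_tokens_start + num_bands * codebook_size - 1
required_vocab_size = max_audio_token_id + 1
print(required_vocab_size)  # 156938
```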


## Inference

### Expected folder layout

```text
your_model_folder/
├─ README.md
├─ config.yaml
├─ infer.py
└─ merged/
   ├─ config.json
   ├─ model.safetensors / pytorch_model.bin
   ├─ tokenizer files
   └─ generation config files if present
```

How to run

  1. Put your merged model inside output_dir/merged
  2. Save the inference script below as infer.py
  3. Save a matching config.yaml
  4. Run: python infer.py

Example prompt

دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي

(English: "This is a test sentence for testing speech generation for Peshawari Pashto.")

Example inference script

```python
import os
import re
import yaml
import torch
import numpy as np
import soundfile as sf

from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def load_snac_model(cfg):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    print(f"Loading SNAC decoder: {cfg['snac_model_name']}")
    snac_model = SNAC.from_pretrained(cfg["snac_model_name"]).to(device)
    snac_model.eval()
    return snac_model


def maybe_fix_tokenizer_and_resize(model, tokenizer, cfg):
    """
    Keeps tokenizer/model embedding size compatible with reserved audio token IDs.
    """
    pad_token = cfg["pad_token"]

    max_audio_token_id = cfg["audio_tokens_start"] + 7 * 4096 - 1
    special_ids = [
        cfg["start_of_text"],
        cfg["end_of_text"],
        cfg["start_of_speech"],
        cfg["end_of_speech"],
        cfg["start_of_human"],
        cfg["end_of_human"],
        cfg["start_of_ai"],
        cfg["end_of_ai"],
        cfg["pad_token"],
    ]
    max_token_id_needed = max(max_audio_token_id, max(special_ids))

    tokenizer_len = len(tokenizer)
    embedding_rows = model.get_input_embeddings().num_embeddings

    print(f"Tokenizer length before fix: {tokenizer_len}")
    print(f"Embedding rows before fix: {embedding_rows}")
    print(f"Max token id needed: {max_token_id_needed}")

    if embedding_rows <= max_token_id_needed:
        required_vocab_size = max_token_id_needed + 1

        if tokenizer_len < required_vocab_size:
            extra = required_vocab_size - tokenizer_len
            print(f"Tokenizer too small. Adding {extra} placeholder tokens...")
            tokenizer.add_tokens([f"<orpheus_extra_{i}>" for i in range(extra)])

        print(f"Resizing embeddings to {len(tokenizer)}")
        model.resize_token_embeddings(len(tokenizer))

        print(f"Tokenizer length after fix: {len(tokenizer)}")
        print(f"Embedding rows after fix: {model.get_input_embeddings().num_embeddings}")
    else:
        print("No resize needed; model embeddings already cover required token ids.")

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model.config.pad_token_id = pad_token


def load_merged_model(cfg, tokenizer):
    merged_dir = os.path.join(cfg["output_dir"], "merged")
    dtype = torch.bfloat16 if cfg.get("bf16", True) else torch.float16

    if not os.path.isdir(merged_dir):
        raise FileNotFoundError(
            f"Merged model folder not found: {merged_dir}\n"
            f"Run training first so it saves:\n"
            f"  {merged_dir}"
        )

    try:
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
    except Exception:
        print("flash_attention_2 not available; falling back to sdpa")
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="sdpa",
            device_map="auto",
        )

    if cfg.get("force_resize_embeddings_if_needed", True):
        maybe_fix_tokenizer_and_resize(model, tokenizer, cfg)
    else:
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = cfg["pad_token"]

    model.eval()
    return model


def build_prompt_input_ids(text, tokenizer, cfg):
    """
    Must match training prompt format.
    """
    start_of_human = cfg["start_of_human"]
    end_of_human = cfg["end_of_human"]
    start_of_ai = cfg["start_of_ai"]
    start_of_speech = cfg["start_of_speech"]
    end_of_text = cfg["end_of_text"]

    text_ids = tokenizer.encode(text, add_special_tokens=True)

    input_ids = (
        [start_of_human]
        + text_ids
        + [end_of_text]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
    )
    return input_ids


def extract_audio_tokens(generated_ids, cfg):
    """
    Extracts generated speech tokens and maps them back to SNAC code indices.
    """
    start_of_speech = cfg["start_of_speech"]
    end_of_speech = cfg["end_of_speech"]
    audio_tokens_start = cfg["audio_tokens_start"]

    try:
        speech_start_idx = generated_ids.index(start_of_speech) + 1
    except ValueError:
        raise ValueError("start_of_speech token not found in generated sequence.")

    speech_tokens = []
    for tok in generated_ids[speech_start_idx:]:
        if tok == end_of_speech:
            break
        if tok >= audio_tokens_start:
            speech_tokens.append(tok)

    if not speech_tokens:
        raise ValueError("No speech/audio tokens were generated.")

    snac_codes = []
    for i, tok in enumerate(speech_tokens):
        band = i % 7
        code = tok - audio_tokens_start - (band * 4096)
        snac_codes.append(code)

    usable = (len(snac_codes) // 7) * 7
    snac_codes = snac_codes[:usable]

    if len(snac_codes) < 7:
        raise ValueError("Too few usable SNAC codes after cleanup.")

    cleaned = []
    for c in snac_codes:
        if c < 0:
            c = 0
        elif c > 4095:
            c = 4095
        cleaned.append(c)

    return cleaned


def snac_codes_to_audio(snac_codes, snac_model):
    """
    Reverse the 7-code interleaving:
      frame = [c0, c1a, c2a, c2b, c1b, c2c, c2d]
    """
    device = next(snac_model.parameters()).device

    if len(snac_codes) % 7 != 0:
        raise ValueError("snac_codes length must be divisible by 7.")

    n_frames = len(snac_codes) // 7

    codes_0 = []
    codes_1 = []
    codes_2 = []

    for j in range(n_frames):
        i = 7 * j
        codes_0.append(snac_codes[i + 0])

        codes_1.append(snac_codes[i + 1])
        codes_1.append(snac_codes[i + 4])

        codes_2.append(snac_codes[i + 2])
        codes_2.append(snac_codes[i + 3])
        codes_2.append(snac_codes[i + 5])
        codes_2.append(snac_codes[i + 6])

    codes = [
        torch.tensor(codes_0, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_1, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_2, dtype=torch.int32, device=device).unsqueeze(0),
    ]

    with torch.inference_mode():
        audio_hat = snac_model.decode(codes)

    audio = audio_hat.squeeze().detach().float().cpu().numpy()
    audio = np.clip(audio, -1.0, 1.0)
    return audio


def generate_one(
    model,
    tokenizer,
    snac_model,
    text,
    cfg,
    output_wav_path="orpheus_test.wav",
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=2560,
):
    input_ids = build_prompt_input_ids(text, tokenizer, cfg)

    device = next(model.parameters()).device
    input_ids = torch.tensor([input_ids], dtype=torch.long, device=device)
    attention_mask = torch.ones_like(input_ids)

    end_of_speech = cfg["end_of_speech"]
    pad_token = cfg["pad_token"]

    with torch.inference_mode():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=end_of_speech,
            pad_token_id=pad_token,
        )

    generated_ids = output[0].detach().cpu().tolist()

    snac_codes = extract_audio_tokens(generated_ids, cfg)
    print(f"Generated {len(snac_codes)} SNAC codes")

    audio = snac_codes_to_audio(snac_codes, snac_model)

    sr = int(cfg["target_sample_rate"])
    sf.write(output_wav_path, audio, sr)
    print(f"Saved audio to: {output_wav_path}")

    return {
        "generated_ids": generated_ids,
        "snac_codes": snac_codes,
        "wav_path": output_wav_path,
        "sample_rate": sr,
    }


def make_safe_filename(text, max_len=50):
    text = re.sub(r"\s+", "_", text.strip())
    text = re.sub(r'[\\/*?:"<>|]', "", text)
    text = text[:max_len].strip("_")
    return text if text else "sample"


def main():
    cfg = load_config()

    test_texts = [
        "دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي"
    ]

    output_dir = "pashto_test_outputs"
    os.makedirs(output_dir, exist_ok=True)

    tokenizer = AutoTokenizer.from_pretrained(cfg["tokenizer_name"])

    print(f"Loading merged model from: {os.path.join(cfg['output_dir'], 'merged')}")
    model = load_merged_model(cfg, tokenizer)

    snac_model = load_snac_model(cfg)

    results = []

    for idx, test_text in enumerate(test_texts, start=1):
        safe_name = make_safe_filename(test_text)
        output_wav_path = os.path.join(output_dir, f"{idx:02d}_{safe_name}.wav")

        print(f"\n[{idx}/{len(test_texts)}] Generating for: {test_text}")

        result = generate_one(
            model=model,
            tokenizer=tokenizer,
            snac_model=snac_model,
            text=test_text,
            cfg=cfg,
            output_wav_path=output_wav_path,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            max_new_tokens=2560,
        )

        results.append(result)

    print("\nDone.")
    print(f"All WAVs saved in: {output_dir}")


if __name__ == "__main__":
    main()
```

Recommended generation settings

Reasonable defaults:

  • temperature = 0.7
  • top_p = 0.9
  • repetition_penalty = 1.1
  • max_new_tokens = 2560

Practical tips

  • shorter prompts often behave more reliably than very long prompts

  • punctuation and cleaner Pashto spelling can improve output quality

  • generation settings can noticeably affect output quality

  • if outputs terminate too early or become unstable, adjust:

    • temperature
    • top_p
    • repetition_penalty
    • max_new_tokens
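
Since the model is likely sensitive to spelling and text normalization, a light cleanup pass before tokenization may help. The function below is a hypothetical sketch, not part of the released pipeline; it deliberately leaves Pashto letter variants (ی، ې، ۍ، ي) untouched, since those distinctions carry meaning:

```python
import re

def normalize_pashto_text(text: str) -> str:
    """Hypothetical pre-tokenization cleanup; not part of this release."""
    text = text.replace("\u0640", "")   # strip tatweel (kashida)
    text = text.replace("\u200b", "")   # strip zero-width spaces
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    return text.strip()

print(normalize_pashto_text("  دا  یوه   ازموینه ده  "))
```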

Disclaimer

This is an experimental Peshawari / Pakistani Pashto speech generation checkpoint trained on proprietary synthetic data. It is functional, but pronunciation and naturalness remain imperfect.

Citation / reference

If you reference this model, please mention:

  • it is an Orpheus-based speech generation fine-tune
  • it targets Peshawari / Pakistani Pashto
  • it was trained on ~60 hours of proprietary synthetic speech
  • training ran for 4 epochs
  • final training loss was approximately 3.89

Model size: 4B parameters (Safetensors, BF16)
Model repository: TBOGamer22/PakhtoVoice-Orpheus-60H