# PakhtoVoice-Orpheus-60H
PakhtoVoice-Orpheus-60H is an experimental Orpheus-based speech generation model fine-tuned for Peshawari / Pakistani Pashto on approximately 60 hours of proprietary synthetic Pashto speech data.
This model is functional and generally understandable, but it is still an early-stage checkpoint. Some pronunciations may be inaccurate, some outputs may sound robotic, and naturalness is still limited in places.
## Overview

- Base model: canopylabs/3b-hi-pretrain-research_release
- Language: Pashto
- Target dialect: Peshawari / Pakistani Pashto
- Training data: ~60 hours
- Data type: proprietary synthetic speech data
- Training method: LoRA fine-tuning
- Epochs: 4
- Final training loss: ~3.89
- Audio codec: SNAC 24 kHz
- Sample rate: 24 kHz
## Model status

This is an experimental baseline. A fair summary: it behaves like a speaker who learned Pashto fairly recently, understandable overall but still prone to pronunciation mistakes.
## Intended use
This model is suitable for:
- Pashto speech generation experiments
- research and prototyping
- exploring Orpheus-style speech token generation for Pashto
- testing synthetic-data speech fine-tuning pipelines
This model is not recommended yet for production-critical deployment without human review.
## Limitations
Known limitations include:
- pronunciation mistakes on some words
- occasional robotic or synthetic-sounding outputs
- imperfect naturalness and prosody
- uneven quality depending on prompt content
- likely sensitivity to spelling and text normalization
- performance centered on Peshawari / Pakistani Pashto
- possible inheritance of artifacts from synthetic training data
Because the training data is synthetic, the model may reflect unnatural pronunciations or speaking styles present in the source data.
## Dataset
This model was trained on approximately 60 hours of proprietary synthetic Pashto speech data.
The dataset is not included with this release.
## Training details

### Base model

canopylabs/3b-hi-pretrain-research_release
### Training setup

- GPU: RTX 3090 24GB
- Epochs: 4
- Final training loss: ~3.89
- Learning rate: 5e-5
- Warmup ratio: 0.03
- Weight decay: 0.01
- Per-device batch size: 1
- Gradient accumulation steps: 16
- Max sequence length: 1536
- Save steps: 500
- Logging steps: 10
- Precision: `bf16=True`, `fp16=False`
### LoRA configuration

- r: 32
- alpha: 64
- dropout: 0.0
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `down_proj`, `up_proj`
- Modules saved: `lm_head`, `embed_tokens`
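For reference, the hyperparameters above can be sketched as a config object. This is illustrative only, not the exact training code used for this checkpoint; the field names follow the `peft` library's `LoraConfig`.

```python
# Sketch of the LoRA configuration described above (illustrative,
# not the original training code). Keys follow peft.LoraConfig.
lora_kwargs = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "down_proj", "up_proj",
    ],
    modules_to_save=["lm_head", "embed_tokens"],
)

# With peft installed, this could be passed as:
# from peft import LoraConfig
# lora_config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
```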
## Token / audio format
This model uses:
- text tokenization from the base model tokenizer
- reserved vocabulary space for SNAC-style audio token IDs
- SNAC decoding at 24 kHz
At inference time, the pipeline is:
- tokenize input text
- generate discrete audio token IDs
- convert generated IDs back into SNAC codebooks
- decode waveform audio using the original SNAC 24 kHz model
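The middle two steps rely on a fixed arithmetic mapping between reserved token IDs and SNAC code indices. A minimal sketch, using the values from this README (`audio_tokens_start = 128266`, 7 interleaved bands of 4096 codes per frame):

```python
# Mapping between generated token IDs and SNAC code indices, following
# the reserved-vocabulary scheme used by the inference script below:
# band = stream position mod 7, each band holding 4096 codes.
AUDIO_TOKENS_START = 128266
CODEBOOK_SIZE = 4096
BANDS_PER_FRAME = 7


def code_to_token(code: int, position: int) -> int:
    """SNAC code index (0..4095) at a stream position -> global token ID."""
    band = position % BANDS_PER_FRAME
    return AUDIO_TOKENS_START + band * CODEBOOK_SIZE + code


def token_to_code(token_id: int, position: int) -> int:
    """Global token ID back to a 0..4095 SNAC code index."""
    band = position % BANDS_PER_FRAME
    return token_id - AUDIO_TOKENS_START - band * CODEBOOK_SIZE


# Round trip: code 17 at stream position 10 (band 3)
assert token_to_code(code_to_token(17, 10), 10) == 17
```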
## Example config.yaml

```yaml
# ----------------------------
# SNAC
# ----------------------------
snac_model_name: "hubertsiuzdak/snac_24khz"
target_sample_rate: 24000

# ----------------------------
# Tokenizer / model
# ----------------------------
tokenizer_name: "canopylabs/3b-hi-pretrain-research_release"
output_dir: "E:\\ORPHEUS MODEL"

# ----------------------------
# Precision / resize
# ----------------------------
bf16: true
force_resize_embeddings_if_needed: true

# ----------------------------
# Special token IDs
# ----------------------------
start_of_text: 128000
end_of_text: 128009
start_of_speech: 128257
end_of_speech: 128258
start_of_human: 128259
end_of_human: 128260
start_of_ai: 128261
end_of_ai: 128262
pad_token: 128263
audio_tokens_start: 128266
```
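A quick pure-Python sanity check on this ID layout (the 7-bands-of-4096 assumption matches the inference script in this README): every control token must sit below the reserved audio range.

```python
# Sanity check on the special token IDs from config.yaml.
special_ids = {
    "start_of_text": 128000,
    "end_of_text": 128009,
    "start_of_speech": 128257,
    "end_of_speech": 128258,
    "start_of_human": 128259,
    "end_of_human": 128260,
    "start_of_ai": 128261,
    "end_of_ai": 128262,
    "pad_token": 128263,
}
audio_tokens_start = 128266
# Highest reserved audio token ID: 7 bands x 4096 codes.
max_audio_token_id = audio_tokens_start + 7 * 4096 - 1  # 156937

# All control tokens precede the audio token range.
assert max(special_ids.values()) < audio_tokens_start
```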
## Inference
### Expected folder layout
```text
your_model_folder/
├─ README.md
├─ config.yaml
├─ infer.py
└─ merged/
   ├─ config.json
   ├─ model.safetensors / pytorch_model.bin
   ├─ tokenizer files
   └─ generation config files if present
```
### How to run

- Put your merged model inside `output_dir/merged`
- Save the inference script below as `infer.py`
- Save a matching `config.yaml`
- Run: `python infer.py`
### Example prompt

دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي

(English: "This is a test sentence for evaluating speech generation for Peshawari Pashto.")
### Example inference script

```python
import os
import re
import yaml
import torch
import numpy as np
import soundfile as sf
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_config(path="config.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)


def load_snac_model(cfg):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Loading SNAC decoder: {cfg['snac_model_name']}")
    snac_model = SNAC.from_pretrained(cfg["snac_model_name"]).to(device)
    snac_model.eval()
    return snac_model


def maybe_fix_tokenizer_and_resize(model, tokenizer, cfg):
    """
    Keeps tokenizer/model embedding size compatible with reserved audio token IDs.
    """
    pad_token = cfg["pad_token"]
    max_audio_token_id = cfg["audio_tokens_start"] + 7 * 4096 - 1
    special_ids = [
        cfg["start_of_text"],
        cfg["end_of_text"],
        cfg["start_of_speech"],
        cfg["end_of_speech"],
        cfg["start_of_human"],
        cfg["end_of_human"],
        cfg["start_of_ai"],
        cfg["end_of_ai"],
        cfg["pad_token"],
    ]
    max_token_id_needed = max(max_audio_token_id, max(special_ids))
    tokenizer_len = len(tokenizer)
    embedding_rows = model.get_input_embeddings().num_embeddings
    print(f"Tokenizer length before fix: {tokenizer_len}")
    print(f"Embedding rows before fix: {embedding_rows}")
    print(f"Max token id needed: {max_token_id_needed}")
    if embedding_rows <= max_token_id_needed:
        required_vocab_size = max_token_id_needed + 1
        if tokenizer_len < required_vocab_size:
            extra = required_vocab_size - tokenizer_len
            print(f"Tokenizer too small. Adding {extra} placeholder tokens...")
            tokenizer.add_tokens([f"<orpheus_extra_{i}>" for i in range(extra)])
        print(f"Resizing embeddings to {len(tokenizer)}")
        model.resize_token_embeddings(len(tokenizer))
        print(f"Tokenizer length after fix: {len(tokenizer)}")
        print(f"Embedding rows after fix: {model.get_input_embeddings().num_embeddings}")
    else:
        print("No resize needed; model embeddings already cover required token ids.")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = pad_token


def load_merged_model(cfg, tokenizer):
    merged_dir = os.path.join(cfg["output_dir"], "merged")
    dtype = torch.bfloat16 if cfg.get("bf16", True) else torch.float16
    if not os.path.isdir(merged_dir):
        raise FileNotFoundError(
            f"Merged model folder not found: {merged_dir}\n"
            f"Run training first so it saves:\n"
            f"  {merged_dir}"
        )
    try:
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
    except Exception:
        print("flash_attention_2 not available; falling back to sdpa")
        model = AutoModelForCausalLM.from_pretrained(
            merged_dir,
            torch_dtype=dtype,
            attn_implementation="sdpa",
            device_map="auto",
        )
    if cfg.get("force_resize_embeddings_if_needed", True):
        maybe_fix_tokenizer_and_resize(model, tokenizer, cfg)
    else:
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = cfg["pad_token"]
    model.eval()
    return model


def build_prompt_input_ids(text, tokenizer, cfg):
    """
    Must match training prompt format.
    """
    start_of_human = cfg["start_of_human"]
    end_of_human = cfg["end_of_human"]
    start_of_ai = cfg["start_of_ai"]
    start_of_speech = cfg["start_of_speech"]
    end_of_text = cfg["end_of_text"]
    text_ids = tokenizer.encode(text, add_special_tokens=True)
    input_ids = (
        [start_of_human]
        + text_ids
        + [end_of_text]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
    )
    return input_ids


def extract_audio_tokens(generated_ids, cfg):
    """
    Extracts generated speech tokens and maps them back to SNAC code indices.
    """
    start_of_speech = cfg["start_of_speech"]
    end_of_speech = cfg["end_of_speech"]
    audio_tokens_start = cfg["audio_tokens_start"]
    try:
        speech_start_idx = generated_ids.index(start_of_speech) + 1
    except ValueError:
        raise ValueError("start_of_speech token not found in generated sequence.")
    speech_tokens = []
    for tok in generated_ids[speech_start_idx:]:
        if tok == end_of_speech:
            break
        if tok >= audio_tokens_start:
            speech_tokens.append(tok)
    if not speech_tokens:
        raise ValueError("No speech/audio tokens were generated.")
    snac_codes = []
    for i, tok in enumerate(speech_tokens):
        band = i % 7
        code = tok - audio_tokens_start - (band * 4096)
        snac_codes.append(code)
    usable = (len(snac_codes) // 7) * 7
    snac_codes = snac_codes[:usable]
    if len(snac_codes) < 7:
        raise ValueError("Too few usable SNAC codes after cleanup.")
    cleaned = []
    for c in snac_codes:
        if c < 0:
            c = 0
        elif c > 4095:
            c = 4095
        cleaned.append(c)
    return cleaned


def snac_codes_to_audio(snac_codes, snac_model):
    """
    Reverse the 7-code interleaving:
    frame = [c0, c1a, c2a, c2b, c1b, c2c, c2d]
    """
    device = next(snac_model.parameters()).device
    if len(snac_codes) % 7 != 0:
        raise ValueError("snac_codes length must be divisible by 7.")
    n_frames = len(snac_codes) // 7
    codes_0 = []
    codes_1 = []
    codes_2 = []
    for j in range(n_frames):
        i = 7 * j
        codes_0.append(snac_codes[i + 0])
        codes_1.append(snac_codes[i + 1])
        codes_1.append(snac_codes[i + 4])
        codes_2.append(snac_codes[i + 2])
        codes_2.append(snac_codes[i + 3])
        codes_2.append(snac_codes[i + 5])
        codes_2.append(snac_codes[i + 6])
    codes = [
        torch.tensor(codes_0, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_1, dtype=torch.int32, device=device).unsqueeze(0),
        torch.tensor(codes_2, dtype=torch.int32, device=device).unsqueeze(0),
    ]
    with torch.inference_mode():
        audio_hat = snac_model.decode(codes)
    audio = audio_hat.squeeze().detach().float().cpu().numpy()
    audio = np.clip(audio, -1.0, 1.0)
    return audio


def generate_one(
    model,
    tokenizer,
    snac_model,
    text,
    cfg,
    output_wav_path="orpheus_test.wav",
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=2560,
):
    input_ids = build_prompt_input_ids(text, tokenizer, cfg)
    device = next(model.parameters()).device
    input_ids = torch.tensor([input_ids], dtype=torch.long, device=device)
    attention_mask = torch.ones_like(input_ids)
    end_of_speech = cfg["end_of_speech"]
    pad_token = cfg["pad_token"]
    with torch.inference_mode():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            eos_token_id=end_of_speech,
            pad_token_id=pad_token,
        )
    generated_ids = output[0].detach().cpu().tolist()
    snac_codes = extract_audio_tokens(generated_ids, cfg)
    print(f"Generated {len(snac_codes)} SNAC codes")
    audio = snac_codes_to_audio(snac_codes, snac_model)
    sr = int(cfg["target_sample_rate"])
    sf.write(output_wav_path, audio, sr)
    print(f"Saved audio to: {output_wav_path}")
    return {
        "generated_ids": generated_ids,
        "snac_codes": snac_codes,
        "wav_path": output_wav_path,
        "sample_rate": sr,
    }


def make_safe_filename(text, max_len=50):
    text = re.sub(r"\s+", "_", text.strip())
    text = re.sub(r'[\\/*?:"<>|]', "", text)
    text = text[:max_len].strip("_")
    return text if text else "sample"


def main():
    cfg = load_config()
    test_texts = [
        "دا یو ازمایښتي جمله ده چې د پښورۍ پښتو لپاره د وینا جوړولو ازموینه پرې وشي"
    ]
    output_dir = "pashto_test_outputs"
    os.makedirs(output_dir, exist_ok=True)
    tokenizer = AutoTokenizer.from_pretrained(cfg["tokenizer_name"])
    print(f"Loading merged model from: {os.path.join(cfg['output_dir'], 'merged')}")
    model = load_merged_model(cfg, tokenizer)
    snac_model = load_snac_model(cfg)
    results = []
    for idx, test_text in enumerate(test_texts, start=1):
        safe_name = make_safe_filename(test_text)
        output_wav_path = os.path.join(output_dir, f"{idx:02d}_{safe_name}.wav")
        print(f"\n[{idx}/{len(test_texts)}] Generating for: {test_text}")
        result = generate_one(
            model=model,
            tokenizer=tokenizer,
            snac_model=snac_model,
            text=test_text,
            cfg=cfg,
            output_wav_path=output_wav_path,
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.1,
            max_new_tokens=2560,
        )
        results.append(result)
    print("\nDone.")
    print(f"All WAVs saved in: {output_dir}")


if __name__ == "__main__":
    main()
```
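The 7-code interleaving reversed by `snac_codes_to_audio` can be illustrated in isolation. A minimal sketch; the frame pattern `[c0, c1a, c2a, c2b, c1b, c2c, c2d]` comes from the script's docstring:

```python
# De-interleave a flat stream of SNAC codes into the three codebook
# lists expected by the SNAC decoder. Each 7-code frame contributes
# 1 code to codebook 0, 2 to codebook 1, and 4 to codebook 2.
def deinterleave(snac_codes):
    assert len(snac_codes) % 7 == 0
    codes_0, codes_1, codes_2 = [], [], []
    for j in range(len(snac_codes) // 7):
        f = snac_codes[7 * j : 7 * j + 7]
        codes_0.append(f[0])
        codes_1.extend([f[1], f[4]])
        codes_2.extend([f[2], f[3], f[5], f[6]])
    return codes_0, codes_1, codes_2


# One frame, labeled by stream position:
c0, c1, c2 = deinterleave([10, 11, 12, 13, 14, 15, 16])
assert (c0, c1, c2) == ([10], [11, 14], [12, 13, 15, 16])
```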
## Recommended generation settings

Reasonable defaults:

- `temperature = 0.7`
- `top_p = 0.9`
- `repetition_penalty = 1.1`
- `max_new_tokens = 2560`
## Practical tips

- Shorter prompts often behave more reliably than very long prompts.
- Punctuation and cleaner Pashto spelling can improve output quality.
- Generation settings can noticeably affect output quality.
- If outputs terminate too early or become unstable, adjust `temperature`, `top_p`, `repetition_penalty`, and `max_new_tokens`.
## Disclaimer
This is an experimental Peshawari / Pakistani Pashto speech generation checkpoint trained on proprietary synthetic data. It is functional, but pronunciation and naturalness remain imperfect.
## Citation / reference
If you reference this model, please mention:
- it is an Orpheus-based speech generation fine-tune
- it targets Peshawari / Pakistani Pashto
- it was trained on ~60 hours of proprietary synthetic speech
- training ran for 4 epochs
- final training loss was approximately 3.89