Orpheus-3B Sunbird Multilingual TTS

A multilingual, multi-speaker text-to-speech model fine-tuned from unsloth/orpheus-3b-0.1-pretrained on the full Sunbird/tts corpus — 20 language configurations and every speaker present in the dataset.

The model accepts arbitrary text and emits 24 kHz mono speech via the SNAC audio codec. Voice selection happens at the prompt level: prepend the chosen speaker_id followed by ": " to your text, and the model produces audio in that speaker's voice.

Languages covered

| Config | Language | ISO 639-1 | Region |
| --- | --- | --- | --- |
| ach | Acholi | — | Uganda, South Sudan |
| afr | Afrikaans | af | South Africa, Namibia |
| eng | English | en | (control language) |
| ewe | Ewe | ee | Ghana, Togo |
| ful | Fulah | ff | West Africa (Sahel) |
| hau | Hausa | ha | Nigeria, Niger, Chad |
| ibo | Igbo | ig | Nigeria |
| kik | Kikuyu | ki | Kenya |
| kin | Kinyarwanda | rw | Rwanda |
| lgg | Lugbara | — | Uganda, DRC |
| lin | Lingala | ln | DRC, Republic of Congo |
| lug | Luganda | lg | Uganda |
| luo | Luo (Dholuo) | — | Kenya, Tanzania |
| nyn | Runyankole | — | Uganda |
| sot | Sesotho | st | Lesotho, South Africa |
| swa | Swahili | sw | East Africa |
| teo | Ateso | — | Uganda, Kenya |
| tsn | Setswana | tn | Botswana, South Africa |
| xho | Xhosa | xh | South Africa |
| yor | Yoruba | yo | Nigeria, Benin |

Per-language quality scales with the amount of training data Sunbird collected for that language; some configs have many more speaker hours than others. Audition the test split for each language before relying on a particular speaker — see the discovery snippet below.

TL;DR

# After installing the dependencies (see "Inference" below)
wav = synthesize("Mwattu, oli otya?",  speaker_id="salt_lug_0001")  # Luganda
wav = synthesize("Habari yako rafiki.", speaker_id="salt_swa_0001")  # Swahili
wav = synthesize("Bawo ni, ọrẹ mi?",     speaker_id="salt_yor_0001")  # Yoruba

The model has no explicit "language" knob — the language identity travels via the speaker tag, since each salt_<lang>_<NNNN> voice was recorded in exactly one language.


Discovering speaker IDs

The exact speaker_ids in each config can be enumerated from the dataset:

from collections import defaultdict
from datasets import load_dataset, get_dataset_config_names

CONFIGS = get_dataset_config_names("Sunbird/tts")  # the 20 languages

speakers_by_lang = defaultdict(set)
for cfg in CONFIGS:
    ds = load_dataset("Sunbird/tts", cfg, split="train")
    for sid in ds["speaker_id"]:
        speakers_by_lang[cfg].add(sid)

for cfg, sids in sorted(speakers_by_lang.items()):
    print(f"{cfg}: {len(sids)} speaker(s) — {sorted(sids)[:3]}{'...' if len(sids) > 3 else ''}")

Speaker IDs follow the pattern salt_<lang>_<NNNN> (e.g., salt_lug_0001, salt_ach_0007). Pass any one of them as speaker_id to either inference function below.
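
For convenience, a small helper (hypothetical; not part of the model or dataset tooling) can validate that pattern and recover the language config encoded in a speaker ID:

import re

# Hypothetical helper: validate the salt_<lang>_<NNNN> pattern and pull out
# the 3-letter language config. Not part of the model or dataset tooling.
SPEAKER_ID_RE = re.compile(r"^salt_(?P<lang>[a-z]{3})_(?P<num>\d{4})$")

def speaker_language(speaker_id: str) -> str:
    m = SPEAKER_ID_RE.match(speaker_id)
    if m is None:
        raise ValueError(f"Unexpected speaker_id format: {speaker_id!r}")
    return m.group("lang")

assert speaker_language("salt_lug_0001") == "lug"   # a Luganda voice
assert speaker_language("salt_ach_0007") == "ach"   # an Acholi voice

This also gives a cheap guard against the cross-language mismatch described under "Limitations & risks": check that the prefix matches the language of your text before synthesising.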


Inference

Every prompt is wrapped in the multi-speaker tagged format used during training:

[SOH] + tokenize("<speaker_id>: <your text>") + [EOT, EOH]

and the model autoregressively emits Llama-3 special tokens followed by SNAC audio codes that decode to a 24 kHz waveform. Two reference implementations follow.

Option A — transformers + unsloth (single request)

Best for development, notebook-driven iteration, and small batch sizes.

Install:

pip install unsloth snac soundfile torchcodec "datasets>=3.4.1,<4.0.0"

Run:

import os
import numpy as np
import torch
import soundfile as sf
from unsloth import FastLanguageModel
from snac import SNAC

MODEL_ID = "sunbird/orpheus-3b-tts-multilingual"

# Special tokens — must match the training format
END_OF_TEXT     = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH   = 128258
START_OF_HUMAN  = 128259
END_OF_HUMAN    = 128260
PAD_TOKEN       = 128263
AUDIO_TOKEN_LO  = 128266
AUDIO_TOKEN_HI  = 128266 + 7 * 4096   # exclusive

# 1) Load the LM (LoRA already merged into 16-bit weights at training time)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = MODEL_ID,
    max_seq_length = 4096,
    dtype          = None,         # auto bf16 / fp16
    load_in_4bit   = False,        # set True to halve VRAM at slight quality cost
    token          = os.environ.get("HF_TOKEN"),
)
FastLanguageModel.for_inference(model)

# 2) Load SNAC decoder (CPU is fine — frees GPU for the LM)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cpu")


def _redistribute_codes(code_list: list[int]) -> torch.Tensor:
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i + 1] - 4096)
        layer_3.append(code_list[7*i + 2] - 2*4096)
        layer_3.append(code_list[7*i + 3] - 3*4096)
        layer_2.append(code_list[7*i + 4] - 4*4096)
        layer_3.append(code_list[7*i + 5] - 5*4096)
        layer_3.append(code_list[7*i + 6] - 6*4096)
    if not layer_1:
        return torch.zeros(1, 1, 12000)   # ~0.5s silence fallback
    clamp = lambda vals: [max(0, min(4095, v)) for v in vals]
    codes = [torch.tensor(clamp(layer_1)).unsqueeze(0),
             torch.tensor(clamp(layer_2)).unsqueeze(0),
             torch.tensor(clamp(layer_3)).unsqueeze(0)]
    return snac_model.decode(codes)


def synthesize(text: str, speaker_id: str,
               *, max_new_tokens: int = 1200,
               temperature: float = 0.6, top_p: float = 0.95,
               repetition_penalty: float = 1.1,
               seed: int | None = None) -> np.ndarray:
    """Synthesize speech for `text` in the voice of `speaker_id`.

    `speaker_id` must be one of the speakers seen during training,
    e.g. "salt_lug_0001" (Luganda) or "salt_swa_0003" (Swahili).
    """
    if seed is not None:
        torch.manual_seed(seed)

    tagged   = f"{speaker_id}: {text}"
    text_ids = tokenizer(tagged, return_tensors="pt").input_ids
    soh = torch.tensor([[START_OF_HUMAN]], dtype=torch.int64)
    end = torch.tensor([[END_OF_TEXT, END_OF_HUMAN]], dtype=torch.int64)
    input_ids      = torch.cat([soh, text_ids, end], dim=1).to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated = model.generate(
        input_ids = input_ids, attention_mask = attention_mask,
        max_new_tokens = max_new_tokens,
        do_sample = True,
        temperature = temperature, top_p = top_p,
        repetition_penalty = repetition_penalty,
        eos_token_id = END_OF_SPEECH, use_cache = True,
    )

    # Crop on last SOS, filter to audio token range, redistribute, decode
    sos_indices = (generated == START_OF_SPEECH).nonzero(as_tuple=True)
    cropped = generated[:, sos_indices[1][-1].item() + 1:] if len(sos_indices[1]) > 0 else generated
    row = cropped[0]
    audio_only = row[(row >= AUDIO_TOKEN_LO) & (row < AUDIO_TOKEN_HI)]
    n = (audio_only.size(0) // 7) * 7
    code_list = [t.item() - AUDIO_TOKEN_LO for t in audio_only[:n]]
    waveform  = _redistribute_codes(code_list)
    return waveform.detach().squeeze().to("cpu").numpy().astype(np.float32)


# 3) Use it — pick a speaker per language
wav = synthesize("Mwattu, Mukama yeebazibwe.", speaker_id="salt_lug_0001", seed=42)
sf.write("luganda.wav", wav, 24000)

wav = synthesize("Habari yako rafiki.", speaker_id="salt_swa_0001", seed=42)
sf.write("swahili.wav", wav, 24000)

Option B — vllm (high throughput, batched, deployment)

Best for serving traffic. PagedAttention + continuous batching gives roughly 5–10× lower single-request latency and 10–100× higher throughput on batched requests vs. the transformers path. Multi-speaker batching (different speaker_ids in one call) gets the full benefit.

Important: vLLM ships its own torch/transformers and conflicts with Unsloth's pinned versions. Use a fresh Python environment for vLLM serving — do not install on top of an Unsloth env.

Install:

pip install vllm snac soundfile torchcodec "datasets>=3.4.1,<4.0.0"

Run:

import os
import numpy as np
import torch
import soundfile as sf
from snac import SNAC
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_ID = "sunbird/orpheus-3b-tts-multilingual"

END_OF_TEXT     = 128009
START_OF_SPEECH = 128257
END_OF_SPEECH   = 128258
START_OF_HUMAN  = 128259
END_OF_HUMAN    = 128260
AUDIO_TOKEN_LO  = 128266
AUDIO_TOKEN_HI  = 128266 + 7 * 4096

# 1) Load
llm = LLM(
    model = MODEL_ID,
    dtype = "bfloat16",
    max_model_len = 4096,
    gpu_memory_utilization = 0.85,
)
tokenizer  = AutoTokenizer.from_pretrained(MODEL_ID, token=os.environ.get("HF_TOKEN"))
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cpu")


def _build_prompt_token_ids(text: str, speaker_id: str) -> list[int]:
    tagged = f"{speaker_id}: {text}"
    text_ids = tokenizer.encode(tagged, add_special_tokens=True)
    return [START_OF_HUMAN] + text_ids + [END_OF_TEXT, END_OF_HUMAN]


def _codes_to_waveform(generated_token_ids: list[int]) -> np.ndarray:
    ids = torch.tensor(generated_token_ids, dtype=torch.int64)
    sos_pos = (ids == START_OF_SPEECH).nonzero(as_tuple=True)[0]
    if len(sos_pos) > 0:
        ids = ids[sos_pos[-1].item() + 1:]
    audio = ids[(ids >= AUDIO_TOKEN_LO) & (ids < AUDIO_TOKEN_HI)]
    n = (audio.size(0) // 7) * 7
    cl = [t.item() - AUDIO_TOKEN_LO for t in audio[:n]]
    l1, l2, l3 = [], [], []
    for i in range(len(cl) // 7):
        l1.append(cl[7*i])
        l2.append(cl[7*i+1] - 4096); l3.append(cl[7*i+2] - 2*4096)
        l3.append(cl[7*i+3] - 3*4096); l2.append(cl[7*i+4] - 4*4096)
        l3.append(cl[7*i+5] - 5*4096); l3.append(cl[7*i+6] - 6*4096)
    if not l1:
        return np.zeros(12000, dtype=np.float32)
    cb = lambda v: [max(0, min(4095, x)) for x in v]
    codes = [torch.tensor(cb(l1)).unsqueeze(0),
             torch.tensor(cb(l2)).unsqueeze(0),
             torch.tensor(cb(l3)).unsqueeze(0)]
    return snac_model.decode(codes).detach().squeeze().cpu().numpy().astype(np.float32)


def synthesize(text: str, speaker_id: str,
               *, max_tokens: int = 1200,
               temperature: float = 0.6, top_p: float = 0.95,
               repetition_penalty: float = 1.1,
               seed: int | None = None) -> np.ndarray:
    sp = SamplingParams(
        temperature = temperature, top_p = top_p,
        repetition_penalty = repetition_penalty,
        max_tokens = max_tokens,
        stop_token_ids = [END_OF_SPEECH],
        skip_special_tokens = False,
        seed = seed,
    )
    pids = _build_prompt_token_ids(text, speaker_id)
    out  = llm.generate([{"prompt_token_ids": pids}], sp)
    return _codes_to_waveform(list(out[0].outputs[0].token_ids))


def synthesize_batch(items: list[dict], **kwargs) -> list[np.ndarray]:
    """items: list of {"text": str, "speaker_id": str} — different speakers
    can be mixed in one batch."""
    sp = SamplingParams(
        temperature = kwargs.get("temperature", 0.6),
        top_p = kwargs.get("top_p", 0.95),
        repetition_penalty = kwargs.get("repetition_penalty", 1.1),
        max_tokens = kwargs.get("max_tokens", 1200),
        stop_token_ids = [END_OF_SPEECH],
        skip_special_tokens = False,
        seed = kwargs.get("seed"),
    )
    prompts = [{"prompt_token_ids": _build_prompt_token_ids(it["text"], it["speaker_id"])}
               for it in items]
    outputs = llm.generate(prompts, sp)
    return [_codes_to_waveform(list(o.outputs[0].token_ids)) for o in outputs]


# 2) Single — pick a speaker per language
wav = synthesize("Mwattu, oli otya?", speaker_id="salt_lug_0001", seed=42)
sf.write("luganda.wav", wav, 24000)

# 3) Batched — different languages and speakers in one GPU pass
items = [
    {"text": "Mwattu, oli otya?",        "speaker_id": "salt_lug_0001"},
    {"text": "Habari yako rafiki.",      "speaker_id": "salt_swa_0001"},
    {"text": "Bawo ni, ọrẹ mi?",          "speaker_id": "salt_yor_0001"},
    {"text": "Sannu, ina kwana?",        "speaker_id": "salt_hau_0001"},
    {"text": "Goeie môre, hoe gaan dit?", "speaker_id": "salt_afr_0001"},
]
wavs = synthesize_batch(items, seed=123)
for i, (it, w) in enumerate(zip(items, wavs)):
    sf.write(f"batch_{i:02d}_{it['speaker_id']}.wav", w, 24000)

Generation parameters

| Param | Default | What it does |
| --- | --- | --- |
| temperature | 0.6 | Lower = more deterministic, slightly flatter prosody. |
| top_p | 0.95 | Nucleus sampling; dropping below ~0.9 tends to produce robotic audio. |
| repetition_penalty | 1.1 | Discourages stuck-on-one-frame artefacts; 1.0 disables it. |
| max_new_tokens / max_tokens | 1200 | ≈ 9–10 s of audio; raise for longer utterances. |
| seed | None | Pass an int for reproducible output across runs. |
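
For example, a flatter, seeded delivery with the Option A synthesize helper (the vLLM version takes the same parameters):

# Lower temperature with a fixed seed: more deterministic, slightly flatter delivery.
wav = synthesize("Mwattu, oli otya?", speaker_id="salt_lug_0001",
                 temperature=0.4, repetition_penalty=1.1, seed=7)
sf.write("luganda_low_temp.wav", wav, 24000)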

Token format

The tokenizer is Llama-3's, with Orpheus's audio-codebook special tokens laid out above the standard text vocabulary:

| Token | ID | Purpose |
| --- | --- | --- |
| <\|begin_of_text\|> | 128000 | Llama-3 BOS (auto-prepended by the tokenizer) |
| END_OF_TEXT | 128009 | marks the end of the human turn's text portion |
| START_OF_SPEECH | 128257 | emitted by the model just before the audio codes |
| END_OF_SPEECH | 128258 | emitted when the model finishes; used as eos_token_id / stop_token_ids |
| START_OF_HUMAN | 128259 | wraps the text prompt |
| END_OF_HUMAN | 128260 | wraps the text prompt |
| START_OF_AI | 128261 | emitted by the model to begin its response |
| END_OF_AI | 128262 | emitted when the model is fully done |
| PAD_TOKEN | 128263 | left-padding for batched generation |
| audio codebook | 128266 + N·4096 | SNAC codes; N ∈ {0..6} is the position within each 7-token frame |

Training prompt structure (and what the model expects at inference):

[SOH] + tokenize("salt_<lang>_<NNNN>: <text>") + [EOT] + [EOH]
↳ model autoregressively emits:
[SOA] + [SOS] + audio_codes... + [EOS] + [EOA]

To recover audio: find the last START_OF_SPEECH (128257) in the output, take everything after it, drop any token outside the audio codebook range, group into 7-token frames, undo the per-position offsets, and feed the three layers to SNAC.decode. Both inference snippets above implement this end-to-end.
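
As a concrete illustration of that arithmetic, here is a single 7-token frame with made-up code values (only the offsets matter):

# One 7-token frame of generated IDs; the code values 100, 2000, ... are made up.
frame = [128266 + 0*4096 + 100,    # position 0 -> SNAC layer 1
         128266 + 1*4096 + 2000,   # position 1 -> layer 2
         128266 + 2*4096 + 5,      # position 2 -> layer 3
         128266 + 3*4096 + 17,     # position 3 -> layer 3
         128266 + 4*4096 + 900,    # position 4 -> layer 2
         128266 + 5*4096 + 4095,   # position 5 -> layer 3
         128266 + 6*4096 + 42]     # position 6 -> layer 3

# Undo the base offset and the per-position offset N*4096 ...
codes = [t - 128266 - pos * 4096 for pos, t in enumerate(frame)]
assert codes == [100, 2000, 5, 17, 900, 4095, 42]

# ... then route the codes to the three SNAC layers exactly as
# _redistribute_codes / _codes_to_waveform do above.
layer_1 = [codes[0]]
layer_2 = [codes[1], codes[4]]
layer_3 = [codes[2], codes[3], codes[5], codes[6]]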


Training details

| Setting | Value |
| --- | --- |
| Base model | unsloth/orpheus-3b-0.1-pretrained (raw pretrained, not the -ft voice-actor variant) |
| Adapter | LoRA r=64, α=64, dropout=0, bias=none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Optimizer | adamw_8bit, weight decay 0.001 |
| LR schedule | linear, lr=2e-4, warmup steps=5 |
| Per-device batch size | 1 (with gradient_accumulation_steps=4, effective batch = 4) |
| Epochs | 3 |
| max_seq_length | 4096 |
| save_total_limit | 2 |
| Precision | bfloat16 weights, 16-bit LoRA |
| Seed | 3407 |
| Hardware | single NVIDIA RTX 4090 (24 GB) |
| Gradient checkpointing | Unsloth's optimised variant |
| Final save | LoRA merged into 16-bit weights via save_pretrained_merged(save_method="merged_16bit") |
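
For reference, the adapter settings in the table map to roughly the following Unsloth configuration (a sketch reconstructed from the table, not the exact training script, which lives in Orpheus_3B_Sunbird_Multilingual.ipynb):

from unsloth import FastLanguageModel

# Sketch of the adapter setup implied by the table above (not the exact script).
base, tok = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/orpheus-3b-0.1-pretrained",
    max_seq_length = 4096,
    dtype          = None,
    load_in_4bit   = False,
)
peft_model = FastLanguageModel.get_peft_model(
    base,
    r                          = 64,
    lora_alpha                 = 64,
    lora_dropout               = 0,
    bias                       = "none",
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing = "unsloth",
    random_state               = 3407,
)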

The pretrained variant of Orpheus was chosen over the -ft voice-actor variant because the latter carries a strong English voice-actor prior that works against low-resource-language fine-tuning.

Data prep summary

  1. Load all 20 configs of Sunbird/tts (get_dataset_config_names) and concatenate_datasets their train and test splits into one training set and one held-out evaluation set. No speaker filter.
  2. Tag each row with source = example["speaker_id"] (per-row, not constant) — the model learns the multi-speaker prompt format f"{speaker_id}: {text}" across every speaker it sees.
  3. Cast audio to 24 kHz via Audio(sampling_rate=24000).
  4. Drop rows whose tokenised text alone exceeds max_seq_length — saves expensive SNAC encoding on rows that would be filtered out downstream.
  5. Encode each remaining audio clip with hubertsiuzdak/snac_24khz → 7 codes per frame, flattened into the audio-token range with a base offset of 128266 plus N·4096 for frame position N (0..6), matching the token table above.
  6. Filter out rows with empty/None codes; drop consecutive duplicate frames.
  7. Build input_ids = [SOH] + text_ids + [EOT] + [EOH] + [SOA] + [SOS] + audio_codes + [EOS] + [EOA] (sketched after this list).
  8. Drop rows whose total tokenised length exceeds max_seq_length (safety net for rows where text fits but text + audio together overflow the budget).
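
A minimal sketch of step 7's sequence assembly, assuming the special-token constants defined in the inference snippets plus START_OF_AI / END_OF_AI from the token table (the real preprocessing lives in the training notebook):

SOA, EOA = 128261, 128262   # START_OF_AI / END_OF_AI

def build_training_ids(text: str, speaker_id: str, audio_codes: list[int]) -> list[int]:
    """audio_codes: flattened SNAC codes already offset into the
    128266 + N*4096 range described in step 5."""
    text_ids = tokenizer.encode(f"{speaker_id}: {text}", add_special_tokens=True)
    return ([START_OF_HUMAN] + text_ids + [END_OF_TEXT, END_OF_HUMAN]
            + [SOA, START_OF_SPEECH] + audio_codes
            + [END_OF_SPEECH, EOA])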

Evaluation

Quality was evaluated qualitatively on a diverse held-out test sample: during training, up to 10 utterances are pulled from ds_test.shuffle(seed=42) covering as many distinct speaker_ids as possible. Generated audio is saved next to the ground-truth recording under inference_samples/sample_<idx>_<speaker_id>.wav so each language / voice combination can be auditioned individually.

We did not run automated metrics (WER on a downstream STT, MOS prediction, language-confusion eval, etc.) for this release. Numbers will be added if/when those become part of the evaluation pipeline.

Important caveat — quality varies by language. The training corpus is unbalanced across the 20 configs; languages with more speaker hours in Sunbird/tts get more training signal and produce more natural speech. Audition the per-language samples before relying on a specific voice for production traffic.


Intended uses & out-of-scope

Intended:

  • Multilingual voice synthesis for accessibility, language learning, human–computer interaction, audio content creation, and downstream speech research on the 20 covered languages.
  • A reference checkpoint for the Sunbird/tts → Orpheus-3B multilingual fine-tuning pipeline; reproducible training recipe in Orpheus_3B_Sunbird_Multilingual.ipynb.

Out of scope:

  • Voice impersonation / deception. The model imitates the timbres of consenting Sunbird voice donors. Do not use the generated audio to impersonate identifiable real persons or to produce content that could mislead listeners about who is speaking.
  • High-stakes decisions. Generated speech may contain pronunciation errors, prosodic artefacts, or hallucinated phrases — do not deploy in safety-critical contexts (medical, legal, emergency) without human review.
  • Languages outside the 20 configs. The model has no signal for languages not present in Sunbird/tts; sending German text to any speaker will produce garbled output, not "German with a Luganda accent".
  • Code-switching. Each speaker_id was recorded in a single language; the model has not seen mixed-language utterances and will likely produce phonetic artefacts at language boundaries within one prompt.
  • Cross-language voice transfer. Sending Acholi text to salt_lug_0001 (a Luganda speaker_id) is undefined behaviour. The model has no language-conditioning input separate from the speaker tag, so language identity travels via the speaker_id. Use a speaker whose salt_<lang>_NNNN prefix matches the language of your text.

Limitations & risks

  • Quality varies by language. Per-language data volume in Sunbird/tts is unbalanced. Languages with fewer hours produce noticeably less natural speech. Run the per-language test-split audit (script below) before committing to a particular voice.
  • No language conditioning. There is no language token; the model relies entirely on the speaker_id to disambiguate. Mismatching speaker_id and text language is undefined behaviour (see above).
  • Vocabulary coverage. Limited to the lexicon present in each config's training subset. Unfamiliar words, code-switching, and out-of-distribution proper nouns may produce artefacts.
  • Long utterances. The model was trained on utterances up to ~16 s of audio (max_seq_length=4096). Generation may degrade or truncate beyond ~10 s of speech; a simple chunking workaround is sketched after this list.
  • Sampling variance. With do_sample=True, identical prompts can produce noticeably different deliveries between runs. Pass seed= for reproducibility.
  • No emotion/style control. Unlike the upstream orpheus-3b-0.1-ft, this fine-tune was not exposed to in-text emotion tags (<laugh>, <sigh>, …). Such tags will be tokenised as ordinary text and produce no special prosodic effect.
  • Bias. Inherits any biases present in the Sunbird/tts corpus and in Llama-3's pretraining; we have not audited these systematically per language.
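
The chunking workaround mentioned under "Long utterances" could look like the following sketch (our assumption, not something this release was evaluated on): split the text on sentence boundaries, synthesise each chunk with either synthesize helper above, and join the waveforms with a short pause.

import re
import numpy as np

def synthesize_long(text: str, speaker_id: str, pause_s: float = 0.25, **kwargs) -> np.ndarray:
    """Naive long-form synthesis: per-sentence calls to `synthesize`, joined
    with silence. Prosody across chunk boundaries is not guaranteed to match."""
    chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]
    silence = np.zeros(int(24000 * pause_s), dtype=np.float32)
    pieces = []
    for chunk in chunks:
        pieces.append(synthesize(chunk, speaker_id=speaker_id, **kwargs))
        pieces.append(silence)
    return np.concatenate(pieces[:-1]) if pieces else silence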

Quick per-language audit script

from datasets import load_dataset, Audio, get_dataset_config_names
import soundfile as sf
from pathlib import Path

CONFIGS = get_dataset_config_names("Sunbird/tts")
out_dir = Path("language_audit"); out_dir.mkdir(exist_ok=True)

for cfg in CONFIGS:
    ds = load_dataset("Sunbird/tts", cfg, split="test")
    ds = ds.cast_column("audio", Audio(sampling_rate=24000))
    row = ds[0]
    sid, text = row["speaker_id"], row["text"]
    print(f"{cfg}: {sid} -> {text[:80]}")
    wav = synthesize(text, speaker_id=sid, seed=0)
    sf.write(out_dir / f"{cfg}_{sid}.wav", wav, 24000)
    sf.write(out_dir / f"{cfg}_{sid}_groundtruth.wav",
             row["audio"]["array"], 24000)

Hardware requirements

| Mode | Min VRAM | Recommended |
| --- | --- | --- |
| transformers + Unsloth, fp16 | 8 GB (with load_in_4bit=True) | 16 GB |
| transformers + Unsloth, bf16 | 14 GB | 24 GB |
| vLLM, bf16, max_model_len=4096 | 14 GB | 24 GB |

Audio decoding via SNAC runs on CPU and adds ~50–150 ms per utterance.
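
If that decode latency matters, SNAC can also be placed on the GPU; with the Option A code this means moving both the codec and the code tensors (at the cost of a little extra VRAM):

# Optional: decode on the GPU instead of the CPU.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda")

# ...and inside _redistribute_codes / _codes_to_waveform, move the code
# tensors to the same device before calling decode:
#     codes = [c.to("cuda") for c in codes]
#     waveform = snac_model.decode(codes).cpu()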


License & attribution

This fine-tune is released under Apache-2.0, matching the upstream unsloth/orpheus-3b-0.1-pretrained license. It transitively inherits obligations from:

  • The Orpheus-TTS project (CanopyAI).
  • The Llama-3 base architecture and weights — Meta Llama 3 Community License.
  • The SNAC audio codec (Hubert Siuzdak, MIT).
  • The Sunbird/tts dataset and the SALT voice donors who contributed recordings.

If you redistribute the merged weights, please carry these attributions forward.


Citation

If you use this model in your work, please cite both the dataset and the fine-tuning project:

@misc{sunbird_orpheus3b_multilingual_2026,
  title        = {Orpheus-3B Sunbird Multilingual TTS},
  author       = {Sunbird AI},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/sunbird/orpheus-3b-tts-multilingual}},
}

@misc{sunbird_tts_dataset,
  title        = {Sunbird Speech Dataset},
  author       = {Sunbird AI},
  howpublished = {\url{https://huggingface.co/datasets/Sunbird/tts}},
}

@misc{orpheus_tts_2025,
  title        = {Orpheus-TTS},
  author       = {Canopy Labs},
  year         = {2025},
  howpublished = {\url{https://github.com/canopyai/Orpheus-TTS}},
}

Single-speaker variant

If you only need one specific voice and want a smaller, more focused checkpoint, see sunbird/orpheus-3b-tts-salt-lug-0001 — same recipe, scoped to a single Luganda speaker.
