Qwen3-TTS VoiceDesign — T6

A fine-tune of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign focused on fuller prosodic and emotional rendering under free-form English voice descriptions. This checkpoint was selected by listening rather than by automatic metrics: it produces audibly more textured delivery (warmer storytelling, more committed sad/angry/whispered reads, richer pacing) than the auto-metric-best variants of the same training run. A base-model anchor keeps neutral prompts close to the original model's behavior.

Base: Qwen3-TTS-12Hz-1.7B-VoiceDesign (frozen during training, LoRA adapter merged into this repo)
Method: LoRA on the Talker's attention + MLP projections, plus a KL-divergence anchor against the frozen base model on neutral-prompt minibatches; final adapter merged back into the base weights
Training data: EARS (rich-style multi-speaker reads) + Expresso (expressive performances), with free-form natural-language captions
Output: 24 kHz mono wav via the Qwen3 12 Hz multi-codebook codec

This repo is self-contained — it ships the merged transformer weights, the audio codec (speech_tokenizer/), the tokenizer, and all configs. No other HF repo needs to be downloaded at inference time.

What this checkpoint targets

T6 targets a failure mode that emerges when a voice-design adapter is pushed hard on stylistic prompts: the adapter starts to "color" everything, including prompts that should sound neutral. To prevent that, training adds a small KL-divergence loss against the frozen base model on the roughly 10% of minibatches that use neutral prompts (e.g., "a clear, neutral voice reading the sentence"). The adapter is free to specialize on stylistic prompts but is anchored back toward base behavior on plain ones.

This particular checkpoint is the listener-selected variant from the run. Earlier auto-metric-best snapshots from the same training trajectory score better on ASR-based metrics but sound flatter in delivery; this checkpoint shows a small naturalness lift on the automatic MOS proxy and noticeably more committed emotional shape on persona / scene prompts. The trade-off is that ASR transcription of its output may be slightly less accurate than for the base on some prompts (see Known limitations).

Quick start

Install the Qwen3-TTS inference package (it registers the custom Qwen3TTSForConditionalGeneration model class with transformers):

pip install qwen-tts transformers torch soundfile

Generate a clip:

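from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load the merged checkpoint from this repo (no separate PEFT adapter to attach).
wrap = Qwen3TTSModel.from_pretrained("macminix/qwen3_voice_design_t6")

# `text` is what gets spoken; `instruct` is the free-form voice description.
wavs, sr = wrap.generate_voice_design(
    text="Come and look at this, you are not going to believe it.",
    instruct="A male speaker delivers his happy speech at a moderate pace with standard energy.",
    language="english",
    temperature=0.9, top_k=50, top_p=1.0,
    repetition_penalty=1.05, max_new_tokens=600,
)
# Write the first returned waveform at the sample rate reported by the model.
sf.write("out.wav", wavs[0], sr)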

A ready-to-run version with three example prompts is provided at example_inference.py.

The `instruct` prompt format

The instruct field is free-form English describing the voice. The training distribution covers:

gender — "a male/female speaker", "a deep-voiced narrator"
pitch — "high/medium/low pitched", "deep", "thin and high"
speed — "slowly", "at a brisk pace", "at a moderate tempo"
affect / emotion — "happy", "angry", "sad", "whispered", "quiet", "projected"
scene / persona — "a bedtime storyteller", "a news anchor", "a meditation guide"

Example prompts:

A male speaker delivers his happy speech at a moderate pace with standard energy.
A female voice speaks softly with a sad tone, low energy, almost whispering.
An older male narrator reads a bedtime story slowly, with warmth.
A clear, neutral female voice reads the sentence.
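
Prompts like these can also be assembled from the attribute categories listed above. The helper below is a minimal sketch of that idea; `build_instruct` and its parameters are hypothetical and not part of qwen-tts. It puts the speaker description first, which is the ordering the Known limitations section suggests for reducing gender drift.

```python
def build_instruct(speaker, emotion=None, pace=None, persona=None):
    """Assemble a free-form voice description, speaker first.

    Hypothetical helper for illustration only; the model accepts any
    English prose, so this is just one way to keep prompts consistent.
    """
    parts = [speaker.strip()]
    if persona:
        parts.append(f"in the style of {persona}")
    if emotion:
        parts.append(f"with a {emotion} tone")
    if pace:
        parts.append(f"speaking {pace}")
    return ", ".join(parts) + "."


prompt = build_instruct("A male speaker", emotion="sad", pace="slowly")
print(prompt)  # A male speaker, with a sad tone, speaking slowly.
```

The resulting string can be passed as the instruct argument in the Quick start call.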

How the adapter was trained

The training protocol avoids four common silent failure modes in naive VoiceDesign fine-tuning recipes:

Dual-track input layout. Training-time inputs_embeds is built by the exact element-wise sum of text-track and codec-track embeddings used by Qwen3TTSForConditionalGeneration.generate's VoiceDesign path — including the 5-position English think-prefix on the codec track. This matches inference exactly, instead of approximating it with a chat-templated prompt + boundary switch.
Single-shift loss. The loss is computed manually with exactly one shift: F.cross_entropy over logits[:, :-1] and codec_0_labels[:, 1:] (both flattened over batch and time), with ignore_index=-100. The labels= argument is never passed into the wrapped forward, which avoids the double shift that occurs when the causal-LM loss inside the PEFT-wrapped forward applies its own internal shift on top of the collator's.
Conservative LR for LoRA on a 1.7 B base. Cosine schedule with a high LR floor so late training keeps making progress instead of plateauing.
No sub-talker loss with a frozen Code Predictor. The sub-talker auxiliary loss is disabled when the Code Predictor isn't part of the LoRA scope — this combination is known to corrupt training.
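
The single-shift loss item above can be sketched in PyTorch as follows. This is an illustration under assumptions, not the training code: the function name and the toy tensors are hypothetical, and the logits are assumed to have shape (batch, time, vocab). The flattening is needed because F.cross_entropy expects the class dimension second.

```python
import torch
import torch.nn.functional as F


def single_shift_loss(logits, codec_0_labels):
    """Next-token cross-entropy with exactly one shift.

    logits:         (batch, time, vocab) from the forward pass
    codec_0_labels: (batch, time), with -100 at positions to ignore
    """
    pred = logits[:, :-1, :]        # position t predicts ...
    target = codec_0_labels[:, 1:]  # ... the label at position t + 1
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),  # flatten batch and time
        target.reshape(-1),
        ignore_index=-100,
    )


# Toy check: make position t confidently predict the label at t + 1.
labels = torch.tensor([[-100, 3, 1, 4, 1]])
logits = torch.zeros(1, 5, 8)
for t in range(4):
    logits[0, t, int(labels[0, t + 1])] = 20.0

loss = single_shift_loss(logits, labels)  # close to zero for aligned inputs
```

If the labels had already been shifted by a collator and a wrapped loss shifted them again, the targets would be off by one position and the same logits would score poorly, which is the silent failure the item describes.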

On top of those, T6 adds:

KL-to-base regularization. A small fraction of training minibatches (kl_neutral_mix_prob = 0.10) is replaced with neutral-prompt batches: same audio, same text, but the instruct is swapped for a generic "neutral voice" description drawn from a fixed pool. On those minibatches the loss becomes β · KL(p_student || p_teacher), where p_student and p_teacher are the softmax distributions over the student and teacher logits, and the teacher is the same model with the LoRA adapter disabled. This keeps the adapter aligned with the base model's behavior on plain prompts while leaving stylistic prompts free to specialize.
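
A minimal sketch of such a KL term, assuming logits of shape (batch, time, vocab). The function name, the optional mask, and the default β are hypothetical; the source does not state the β value or how positions are averaged.

```python
import torch
import torch.nn.functional as F


def kl_to_base(student_logits, teacher_logits, beta=1.0, mask=None):
    """beta * KL(student || teacher), averaged over unmasked positions.

    beta=1.0 is a placeholder; the value used in training is not given.
    """
    s_logp = F.log_softmax(student_logits.float(), dim=-1)
    # The teacher is a fixed target, so no gradient flows through it.
    t_logp = F.log_softmax(teacher_logits.float(), dim=-1).detach()
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)  # (batch, time)
    if mask is None:
        mask = torch.ones_like(kl)
    mask = mask.to(kl.dtype)
    return beta * (kl * mask).sum() / mask.sum().clamp(min=1.0)


# With a PEFT-wrapped model, the teacher pass might look like this
# (an assumption about the setup, not taken from the training code):
#     with torch.no_grad(), model.disable_adapter():
#         teacher_logits = model(**batch).logits

torch.manual_seed(0)
student = torch.randn(2, 5, 16)
teacher = torch.randn(2, 5, 16)
same = kl_to_base(student, student)       # zero when the distributions match
different = kl_to_base(student, teacher)  # positive otherwise
```

Computing the teacher logits from the same weights with the adapter switched off avoids holding a second copy of the base model in memory.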

The adapter is LoRA r=16, α=32, dropout=0.05 on the Talker's q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj projections only. The Code Predictor and audio codec are frozen end-to-end. Training data combines EARS (clean multi-speaker reads with style descriptors) and Expresso (high-quality expressive performances at 48 kHz, downsampled to 24 kHz to match the base's native rate). Captions are free-form natural-language prose, one canonical caption per clip — no templated descriptions.
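
For reference, the hyperparameters above map onto a PEFT configuration roughly as follows. This is a sketch, not the training script. `target_modules` given as a list matches modules by name suffix anywhere in the model, so restricting the adapter to the Talker may need extra scoping, for example a regex; how that was done here is not stated.

```python
from peft import LoraConfig

# Values taken from the description above; everything else is left at
# PEFT defaults, which is an assumption.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)
```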

The final adapter (~19 M parameters, ~77 MB at fp32) was permanently merged into the Talker weights for this repo so inference does not require PEFT.
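
The size figure follows from simple arithmetic: a LoRA pair on a linear layer adds rank × (d_in + d_out) parameters, and fp32 stores 4 bytes per parameter. The sketch below illustrates this; the layer shape used is a placeholder and is not read from the model config.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def size_mb(n_params, bytes_per_param=4):
    """Storage in decimal megabytes; 4 bytes per parameter corresponds to fp32."""
    return n_params * bytes_per_param / 1e6


# Illustrative shape only; NOT taken from the actual Talker configuration.
example = lora_param_count(d_in=2048, d_out=2048, rank=16)
print(example)              # 65536
print(size_mb(19_000_000))  # 76.0
```

About 19 M parameters at 4 bytes each comes to about 76 MB, which agrees with the ~77 MB figure once the rounding of the parameter count is taken into account.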

Strengths

Richer emotional rendering by ear. Sad, whispered, angry, and projected prompts feel more committed than the base — less of a "narrated" surface, more delivery shape.
Better persona / scene composition. Bedtime storyteller, news anchor, meditation guide and similar persona prompts come through with stronger character.
Stable on neutral prompts. A plain "a clear, neutral voice reading the sentence" produces output close to base behavior — the KL anchor keeps the adapter's specialization on stylistic prompts from contaminating the neutral path.
Slight naturalness lift on the UTMOS automatic MOS proxy versus the base model on the same set of voice-design prompts.

Known limitations

Automatic transcription can run slightly less precise than the base on some prompts. The richer prosodic shape this checkpoint produces (longer drawn-out vowels on sad reads, accented stresses on excited prompts, more rubato in storyteller reads) trades a bit of ASR-friendly clarity for delivery character. If you need the lowest possible WER over expressive depth, an earlier checkpoint from the same trajectory is probably a better fit.
Gender drift on a few strong-emotion prompts. Some sad and fear prompts can bias toward the wrong-gender timbre on free-form descriptions. Mitigation in the prompt: lead with the gender ("A male speaker, sad and quiet, …") rather than the emotion.
English only. All training and evaluation used English prompts and English text. The base model supports 10 languages; the other languages were not trained on, and they have not been validated against this adapter's modified first-codebook (CB-0) distribution.
Research / non-commercial use only; see License.

License

Base model weights (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign): Apache 2.0.
Training data:
- EARS: CC BY-NC-SA 4.0 (research / non-commercial).
- Expresso: CC BY-NC 4.0 (research / non-commercial).

Because both training corpora carry non-commercial restrictions, the derived model effectively inherits a CC BY-NC-SA 4.0 constraint: free to use for research, academic, and non-commercial purposes, with attribution and share-alike. Commercial deployment is not recommended without re-training on a commercially-licensed corpus.

References

Base model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
Inference library: qwen-tts on PyPI
EARS dataset: Effortless and Realistic Speech Dataset
Expresso dataset: ylacombe/expresso
