MisoTTS int8 (BigBlueCeiling)
A weight-only int8 quantization of
BigBlueCeiling/MisoTTS-bf16,
produced with torchao (int8_weight_only). Only the backbone/decoder Linear
layers are quantized; the embeddings, output heads, and projection stay bf16.
Experimental. Weight-only int8; bf16 remains the reference.
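To make "weight-only int8" concrete, here is a minimal pure-Python sketch of the idea: each weight row is stored as int8 values plus one float scale, and is converted back to floating point before the dot product. This is an illustration of the arithmetic only, not torchao's implementation. The per-row symmetric scheme, the function names, and the toy matrix are all assumptions made for the example.

```python
# Illustrative only: per-output-row symmetric int8 weight quantization.
# Not torchao's code; names and the toy data below are hypothetical.

def quantize_rows(weight):
    """Return (int8-valued rows, per-row scales) for a list-of-lists matrix."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def linear(x, q_rows, scales):
    """Compute y = W x, dequantizing each int8 row before the dot product."""
    out = []
    for q_row, scale in zip(q_rows, scales):
        w_row = [q * scale for q in q_row]  # dequantize back to float
        out.append(sum(w * xi for w, xi in zip(w_row, x)))
    return out


weight = [[0.50, -1.00, 0.25], [0.10, 0.20, -0.30]]  # toy "Linear" weight
x = [1.0, 2.0, 3.0]                                   # toy activation vector

q_rows, scales = quantize_rows(weight)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
y_q = linear(x, q_rows, scales)
print("reference:", y_ref)
print("quantized:", y_q)
```

The storage saving comes from keeping one byte per weight instead of two; the matmul itself still runs in floating point, so no speedup should be expected from this scheme alone.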
What it is for
Lowering the hardware floor. Quantization here is a memory lever, not a speed one: MisoTTS decodes one frame at a time, and those tiny per-step matmuls cannot feed the GPU's low-precision tensor cores, so the int8 weights are dequantized to bf16 before each matmul. You get the VRAM saving, not a throughput win.
- Fits: ~16 GB VRAM cards (RTX 4060 Ti 16G, 4070 Ti Super, A4000, ...)
- Quality: preserved. Mean CER 0.11, WER 0.14, UTMOS 3.96, statistically even with bf16 (CER 0.10 / WER 0.15 / UTMOS 3.94).
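As a rough illustration of why this lowers the hardware floor, the weight footprint can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: the 8e9 parameter count is assumed from the "8b" in load_miso_8b, the 0.9 quantized fraction is a hypothetical placeholder (the card does not state how many parameters sit in the quantized Linear layers), and scales, activations, and caches are ignored.

```python
# Back-of-the-envelope weight memory. All inputs are assumptions, not measurements.
GIB = 1024 ** 3


def weight_gib(n_params, quantized_fraction=0.0):
    """Weight footprint in GiB when `quantized_fraction` of the parameters
    are int8 (1 byte each) and the rest stay bf16 (2 bytes each).
    Ignores quantization scales, activations, and any caches."""
    int8_bytes = n_params * quantized_fraction * 1
    bf16_bytes = n_params * (1 - quantized_fraction) * 2
    return (int8_bytes + bf16_bytes) / GIB


N_PARAMS = 8e9  # assumed from the "8b" in load_miso_8b; not stated in this card

bf16_gib = weight_gib(N_PARAMS)
int8_gib = weight_gib(N_PARAMS, quantized_fraction=0.9)  # 0.9 is hypothetical
print(f"bf16: {bf16_gib:.1f} GiB, mostly-int8: {int8_gib:.1f} GiB")
```

Under these assumptions the bf16 weights alone come to roughly 14.9 GiB, which would leave little headroom on a 16 GB card, while the mostly-int8 variant lands near 8.2 GiB.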
Use
This checkpoint is a torch.save'd torchao state_dict (model.pt). The serving
core in the MisoTTS repo pulls it
automatically when GPU-sense detects a matching VRAM tier. To load it directly:
    from generator import load_miso_8b  # from the MisoTTS repo
    gen = load_miso_8b("cuda", model_path_or_repo_id="BigBlueCeiling/MisoTTS-int8",
                       prequantized=True)
Requires torch>=2.7 and a matching torchao. Loading unpickles a torchao tensor
subclass, so weights_only=False is used; because unpickling can execute arbitrary
code, load only checkpoints you trust.
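The requirements above suggest two cheap checks before loading: confirm the installed torch version meets the minimum, and confirm the downloaded file matches a hash obtained from a source you trust. The helpers below are a hypothetical sketch using only the standard library; version_at_least, sha256_of, and EXPECTED_SHA256 are names invented for this example and are not part of the MisoTTS repo.

```python
import hashlib
import re


def version_at_least(version, minimum=(2, 7)):
    """True if a version string such as '2.7.1+cu126' is >= `minimum`.
    Only the leading numeric components are compared; local build tags
    after '+' are ignored."""
    numbers = re.findall(r"\d+", version.split("+")[0])
    parts = tuple(int(n) for n in numbers[: len(minimum)])
    return parts >= minimum


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a large checkpoint need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage before loading the checkpoint:
#   import torch
#   assert version_at_least(torch.__version__), "torch>=2.7 required"
#   assert sha256_of("model.pt") == EXPECTED_SHA256  # hash from a trusted source
```

A matching hash only shows the file is the one you expected; it does not make an untrusted checkpoint safe to unpickle.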
Model and original inference code are MisoLabs' work; see the upstream license.