parakeet-unified-en-0.6b-mlx-int8

8-bit affine-quantized MLX weights for NVIDIA's parakeet-unified-en-0.6b Cache-Aware FastConformer-RNNT, for the witness MLX C++ engine on Apple Silicon.

Only the linear / projection matmuls are quantized (group size 64, affine): encoder FFN + attention + pointwise convs, the subsampling output projection, the RNNT prediction LSTM + embedding, and the joint network. Conv2d subsampling, depthwise conv, all norms / biases / batch-norm stats, and the relative-position bias vectors stay dense fp32 (the engine reads them directly).
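To make "group size 64, affine" concrete, here is a minimal pure-Python sketch of group-wise affine quantization. It is an illustration only, not the MLX kernel or the quantize script: the function names are hypothetical, and the exact way MLX picks the per-group scale and bias may differ in detail. Each group of 64 weights is stored as 8-bit codes plus one scale and one bias, and is reconstructed as code * scale + bias.

```python
import random

GROUP_SIZE = 64  # weights sharing one scale/bias pair
BITS = 8         # bits per stored code


def quantize_group(weights, bits=BITS):
    """Map one group of floats to integer codes plus a per-group scale and bias."""
    levels = (1 << bits) - 1                   # 255 for 8-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0    # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min                 # bias is the group minimum


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


random.seed(0)
group = [random.uniform(-0.1, 0.1) for _ in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.6g} bias={bias:.6g} max_err={max_err:.3g}")
```

In this scheme the reconstruction error per weight is at most half a step (scale / 2), which is why 8-bit codes over small groups can leave accuracy essentially untouched.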

Why int8

The autoregressive RNNT decode is a batch-1, memory-bandwidth-bound GEMV, and at typical utterance lengths the encoder is partly weight-bandwidth-bound too. Cutting the weight bytes read per step (the checkpoint shrinks from 2.47 GB to 0.70 GB, roughly 3.5x) is therefore a latency win on Apple's apple9 GPU family (M3/M4), not just a footprint win. WER is unchanged from dense.
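A back-of-the-envelope sketch of the byte savings follows. It assumes each group of 64 weights carries one scale and one bias stored as 16-bit values; that storage width is an assumption, not something this card states, so treat the result as an estimate.

```python
def bits_per_weight(bits=8, group_size=64, param_bits=16):
    """Effective storage per weight: the code plus its share of scale and bias.

    param_bits is the assumed width of each per-group scale/bias value.
    """
    return bits + 2 * param_bits / group_size


dense_bits = 32                      # dense weights are fp32
quant_bits = bits_per_weight()       # 8 + 32/64 = 8.5 bits under the assumption above
tensor_ratio = dense_bits / quant_bits

# Whole-checkpoint ratio from the sizes reported in this card.
file_ratio = 2.47 / 0.70

print(f"per-tensor reduction ~{tensor_ratio:.2f}x, whole file ~{file_ratio:.2f}x")
```

The whole-file ratio comes out lower than the per-tensor one, as expected given that the conv subsampling, norms, and other small tensors stay dense fp32.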

Measured (M4, 45 LibriSpeech samples / 300s, witness rtf_bench)

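| Variant    | Size    | Offline WER | Offline RTF   | Streaming WER | Streaming RTF |
|------------|---------|-------------|---------------|---------------|---------------|
| dense fp32 | 2.47 GB | 1.78%       | 0.0084 (119x) | 11.35%        | 0.0319 (31x)  |
| int8       | 0.70 GB | 1.78%       | 0.0075 (134x) | 11.35%        | 0.0197 (51x)  |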

(RTF measured with the witness engine's RNNT decoder optimizations enabled; lower RTF is faster, and the figure in parentheses is the real-time speedup, approximately 1/RTF. WER is identical to dense on this set.)
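The relationship between the RTF values and the speedup figures can be checked with a few lines of arithmetic. The numbers below are copied from the table; because they are already rounded to four decimals, the recomputed real-time factors can differ from the published ones by a unit.

```python
# RTF values as reported in the table above (rounded to four decimals).
rtf = {
    "dense fp32": {"offline": 0.0084, "streaming": 0.0319},
    "int8": {"offline": 0.0075, "streaming": 0.0197},
}


def realtime_factor(value):
    """Seconds of audio processed per second of compute (1 / RTF)."""
    return 1.0 / value


# How much faster int8 runs than dense in each mode.
offline_speedup = rtf["dense fp32"]["offline"] / rtf["int8"]["offline"]
streaming_speedup = rtf["dense fp32"]["streaming"] / rtf["int8"]["streaming"]

for variant, modes in rtf.items():
    for mode, value in modes.items():
        print(f"{variant:10s} {mode:9s} ~{realtime_factor(value):.0f}x real time")
print(f"int8 vs dense: offline {offline_speedup:.2f}x, streaming {streaming_speedup:.2f}x")
```

On these figures the int8 gain is modest offline (about 1.1x) and larger in streaming mode (about 1.6x).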

Use

WITNESS_PARAKEET_UNIFIED_MODEL_DIR=/path/to/this/dir

The witness loader probes config.json for quantization.{bits,group_size} and routes the packed weights through quantized_matmul automatically.
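The probe described above might look roughly like the following sketch. It is not the witness loader (which is C++): it assumes config.json holds a top-level "quantization" object with "bits" and "group_size" keys, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def read_quantization(model_dir):
    """Return (bits, group_size) from config.json, or None for a dense model.

    Assumes a top-level "quantization" object; the real layout may differ.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization")
    if not quant:
        return None  # no quantization block: treat weights as dense
    return quant["bits"], quant["group_size"]
```

For this checkpoint, a call such as read_quantization("/path/to/this/dir") would be expected to return (8, 64) if the config follows that layout.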

Produced by crates/mlx-parakeet/scripts/quantize_parakeet_unified.py --bits 8.
