parakeet-unified-en-0.6b-mlx-int8

8-bit affine-quantized MLX weights for NVIDIA's parakeet-unified-en-0.6b Cache-Aware FastConformer-RNNT, for the witness MLX C++ engine on Apple Silicon.

Only the linear / projection matmuls are quantized (group size 64, affine): encoder FFN + attention + pointwise convs, the subsampling output projection, the RNNT prediction LSTM + embedding, and the joint network. Conv2d subsampling, depthwise conv, all norms / biases / batch-norm stats, and the relative-position bias vectors stay dense fp32 (the engine reads them directly).
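As an illustration of what "group size 64, affine" means, here is a minimal pure-Python sketch of per-group affine quantization. It is not the conversion script's code and does not reproduce MLX's packed storage layout; the function names are hypothetical, and the min/max choice of scale and bias is an assumption made for clarity.

```python
import math

def quantize_group(values, bits=8):
    """Affine-quantize one group: value ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # A constant group has zero range; any non-zero scale reproduces it exactly.
    scale = (hi - lo) / levels if hi > lo else 1.0
    bias = lo
    q = [min(levels, max(0, round((v - bias) / scale))) for v in values]
    return q, scale, bias

def dequantize_group(q, scale, bias):
    """Map the integer codes back to floats."""
    return [scale * qi + bias for qi in q]

def quantize_row(row, group_size=64, bits=8):
    """Split one weight row into groups, each with its own scale and bias."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]

# Hypothetical 128-element weight row, so two groups of 64.
row = [math.sin(0.1 * i) for i in range(128)]
groups = quantize_row(row)
restored = [v for q, s, b in groups for v in dequantize_group(q, s, b)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_error)
```

Because every group carries its own scale and bias, the rounding error of each weight is bounded by half of that group's scale, which is why a small group size keeps the quantization error low.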

Why int8

The autoregressive RNNT decode is a batch-1, memory-bandwidth-bound GEMV, and at typical utterance lengths the encoder is partly weight-bandwidth-bound too. Cutting the weight bytes read per step (2.47 GB to 0.70 GB in total, per the table below) is therefore a latency win on Apple's apple9 GPU family (M3/M4), not just a footprint win. WER is unchanged from dense.
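The bandwidth argument can be made concrete with a rough byte count for one weight matrix. This is a sketch under stated assumptions: the 1024 x 1024 layer shape is hypothetical, and storing each group's scale and bias as 4-byte values is an assumption about the file format, not something this card states.

```python
def weight_bytes_fp32(rows, cols):
    """Bytes read for one dense fp32 weight matrix."""
    return rows * cols * 4

def weight_bytes_quantized(rows, cols, bits=8, group_size=64, param_bytes=4):
    """Packed weights plus one scale and one bias per group.

    param_bytes=4 assumes fp32 scales and biases (an assumption).
    """
    n = rows * cols
    packed = n * bits // 8
    groups = n // group_size
    return packed + groups * 2 * param_bytes

rows, cols = 1024, 1024  # hypothetical layer shape
dense = weight_bytes_fp32(rows, cols)
quant = weight_bytes_quantized(rows, cols)
print(dense, quant, round(dense / quant, 2))
```

Under these assumptions a quantized layer reads about 3.56x fewer weight bytes per GEMV than its fp32 counterpart. The whole-model ratio will differ somewhat because several tensors stay dense.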

Measured (M4, 45 LibriSpeech samples / 300s, witness rtf_bench)

Variant	Size	Offline WER	Offline RTF	Streaming WER	Streaming RTF
dense fp32	2.47 GB	1.78%	0.0084 (119x)	11.35%	0.0319 (31x)
int8	0.70 GB	1.78%	0.0075 (134x)	11.35%	0.0197 (51x)

(RTF measured with the witness engine's RNNT decoder optimizations enabled; lower RTF is faster. WER is identical to dense on this set.)
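To read the table as relative gains, the dense and int8 rows can be divided directly. The snippet below uses only the numbers reported above; the dictionary keys are illustrative names, not rtf_bench output fields.

```python
# Values copied from the measurement table above.
measurements = {
    "dense fp32": {"size_gb": 2.47, "offline_rtf": 0.0084, "streaming_rtf": 0.0319},
    "int8": {"size_gb": 0.70, "offline_rtf": 0.0075, "streaming_rtf": 0.0197},
}

def ratio(metric):
    """How many times larger (or slower) dense is than int8 for a metric."""
    return measurements["dense fp32"][metric] / measurements["int8"][metric]

for metric in ("size_gb", "offline_rtf", "streaming_rtf"):
    print(f"{metric}: dense/int8 = {ratio(metric):.2f}")
```

On these figures the int8 weights are about 3.5x smaller, about 1.12x faster offline, and about 1.62x faster in streaming mode, which is consistent with streaming decode being the more bandwidth-bound path.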

Use

WITNESS_PARAKEET_UNIFIED_MODEL_DIR=/path/to/this/dir

The witness loader probes config.json for quantization.{bits,group_size} and routes the packed weights through quantized_matmul automatically.
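The probe described above can be sketched as follows. The real loader is part of the C++ engine; this Python version is only an illustration, `read_quantization` is a hypothetical helper, and the config layout (a top-level `quantization` object with `bits` and `group_size`) is inferred from the sentence above.

```python
import json
import tempfile
from pathlib import Path

def read_quantization(model_dir):
    """Return (bits, group_size) from config.json, or None for a dense model."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    quant = config.get("quantization")
    if not quant:
        return None
    return int(quant["bits"]), int(quant["group_size"])

# Demo against throwaway directories holding minimal, made-up configs.
with tempfile.TemporaryDirectory() as quant_dir, tempfile.TemporaryDirectory() as dense_dir:
    (Path(quant_dir) / "config.json").write_text(
        json.dumps({"quantization": {"bits": 8, "group_size": 64}}))
    (Path(dense_dir) / "config.json").write_text(json.dumps({}))
    quantized_settings = read_quantization(quant_dir)
    dense_settings = read_quantization(dense_dir)

print(quantized_settings, dense_settings)
```

A loader built this way can treat a missing `quantization` entry as "dense weights" and fall back to the ordinary matmul path.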

Produced by crates/mlx-parakeet/scripts/quantize_parakeet_unified.py --bits 8.
