# Parakeet TDT 0.6B v3: ONNX Quantized (int4/int8 hybrid)

Quantized ONNX export of nvidia/parakeet-tdt-0.6b-v3 for browser and edge inference with parakeet.js.

409 MB total: 6x smaller than fp32 and 39% smaller than istupakov/int8, with half the quantization degradation and 17% faster inference.

## Files

| File | Size | Quantization | Description |
|---|---|---|---|
| `encoder-model.int4.onnx` | 391 MB | int4 MatMul + int4 pointwise Conv (block_size=64) | Fast Conformer encoder |
| `decoder_joint-model.int8.onnx` | 18 MB | int8 dynamic | Decoder + joint network (LSTM + embedding) |
| `nemo128.int8.onnx` | 42 KB | int8 dynamic | Mel preprocessor (optional; JS preprocessor recommended) |
| `vocab.txt` | 94 KB | - | SentencePiece vocabulary (8193 tokens) |
| `config.json` | 97 B | - | Model config |

## Usage with parakeet.js (browser)

```js
import { fromUrls } from 'parakeet.js';

const BASE = 'https://huggingface.co/efederici/parakeet-tdt-0.6b-v3-onnx-int4/resolve/main';

const model = await fromUrls({
  encoderUrl: `${BASE}/encoder-model.int4.onnx`,
  decoderUrl: `${BASE}/decoder_joint-model.int8.onnx`,
  tokenizerUrl: `${BASE}/vocab.txt`,
  preprocessorBackend: 'js',
  backend: 'webgpu', // or 'wasm'
});

const result = await model.transcribe(pcm, 16000, {
  returnTimestamps: true,
  returnConfidences: true,
});
console.log(result.utterance_text);
```

For long recordings:

```js
const result = await model.transcribeLongAudio(pcm, 16000, {
  returnTimestamps: true,
  chunkLengthS: 95,
});
console.log(result.text);
console.log(result.chunks);
```
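Both calls expect `pcm` to be a `Float32Array` of mono samples at the stated sample rate (16 kHz here). In a browser you would normally decode and resample audio via the Web Audio API; purely as an illustration of the expected input format, here is a minimal linear-interpolation resampler (a hypothetical helper, not part of parakeet.js):

```js
// Resample mono Float32Array PCM to a target rate via linear interpolation.
// Illustrative only: real apps usually resample with OfflineAudioContext.
function resampleLinear(input, fromRate, toRate) {
  const outLength = Math.round(input.length * toRate / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Example: downsample one second of 48 kHz audio to 16 kHz.
const pcm48k = new Float32Array(48000);
const pcm16k = resampleLinear(pcm48k, 48000, 16000);
console.log(pcm16k.length); // 16000
```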

## Benchmark

LibriSpeech test-clean (200 samples, 26 min of audio), CPU inference with onnxruntime.

### Quality (WER vs ground truth, lower is better)

| Model | Size | WER | RTF |
|---|---|---|---|
| fp32 | ~2.5 GB | 1.72% | 0.072x |
| istupakov int8 | 670 MB | 1.67% | 0.089x |
| this model | 409 MB | 1.67% | 0.074x |

All three models reach essentially the same ground-truth WER (~1.7%); the quantized models even score marginally better, likely due to a slight regularization effect.

### Quantization degradation (WER vs fp32 output)

| Model | Degradation | Word diffs |
|---|---|---|
| istupakov int8 | 0.79% | 36 / 4541 |
| this model | 0.42% | 19 / 4541 |

Half the degradation of int8, at a 39% smaller size and 17% faster inference.
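The degradation numbers above are word error rates computed against the fp32 model's transcripts rather than ground truth. As a sketch of the metric itself (simplified: no text normalization, and `wer` is a hypothetical helper, not part of parakeet.js):

```js
// Word error rate: word-level Levenshtein distance / reference word count.
// Simplified sketch: no lowercasing or punctuation stripping.
function wer(reference, hypothesis) {
  const ref = reference.split(/\s+/).filter(Boolean);
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  // prev[j] = edit distance between ref[0..i) and hyp[0..j)
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const sub = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      cur[j] = Math.min(sub, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}

console.log(wer('the cat sat', 'the cat sat')); // 0
console.log(wer('the cat sat', 'the bat sat')); // ≈ 0.333 (1 substitution / 3 words)
```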

### Individual samples (median of 5 runs)

#### Speed (RTF = processing time / audio duration, lower is better)

| Model | Size | JFK 11s | MLK 13s | TED 60s | French 6s | Avg RTF |
|---|---|---|---|---|---|---|
| fp32 | ~2.5 GB | 0.101 | 0.090 | 0.100 | 0.111 | 0.099 |
| istupakov int8 | 670 MB | 0.107 | 0.108 | 0.110 | 0.122 | 0.110 |
| this model | 409 MB | 0.086 | 0.085 | 0.084 | 0.104 | 0.086 |
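To make the RTF numbers concrete: expected processing time is simply RTF x audio duration, so at this model's 0.086 average RTF a clip is transcribed in well under a tenth of its duration.

```js
// RTF = processing time / audio duration, so
// expected processing time = RTF * duration.
const rtf = 0.086;      // average RTF of this model (table above)
const durationS = 60;   // a 60 s clip
const processingS = rtf * durationS;
console.log(processingS.toFixed(2)); // "5.16"
```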

#### Quality (WER vs fp32 reference)

| Model | JFK 11s | MLK 13s | TED 60s | French 6s | Overall WER |
|---|---|---|---|---|---|
| istupakov int8 | ~ (punct. diffs) | ✓ exact | 4.2% (8 errors) | ✓ exact | 3.35% |
| this model | ✓ exact | ✓ exact | 2.1% (4 errors) | ✓ exact | 1.67% |

## Quantization details

Hybrid approach for optimal size/quality/speed:

- Encoder pointwise Conv layers converted to MatMul for better int4 coverage (using onnx-conv2matmul)
- Encoder linear + pointwise Conv (87.5% of weights): int4 MatMulNBits, block_size=64, asymmetric
- Encoder depthwise Conv (small): fp32
- Decoder (LSTM + embedding + linear): int8 dynamic quantization
- Compatible with ONNX Runtime (CPU, WASM, WebGPU)

Source fp32 model: istupakov/parakeet-tdt-0.6b-v3-onnx

## License

CC-BY-4.0, inherited from nvidia/parakeet-tdt-0.6b-v3.
