# Parakeet TDT 0.6B v3 – ONNX Quantized (int4/int8 hybrid)

Quantized ONNX export of nvidia/parakeet-tdt-0.6b-v3 for browser and edge inference with parakeet.js.

409 MB total: 6x smaller than fp32, 39% smaller than istupakov/int8, with half the quantization degradation and 17% faster inference.
## Files

| File | Size | Quantization | Description |
|---|---|---|---|
| `encoder-model.int4.onnx` | 391 MB | int4 MatMul + int4 pointwise Conv (block_size=64) | Fast Conformer encoder |
| `decoder_joint-model.int8.onnx` | 18 MB | int8 dynamic | Decoder + joint network (LSTM + embedding) |
| `nemo128.int8.onnx` | 42 KB | int8 dynamic | Mel preprocessor (optional; the JS preprocessor is recommended) |
| `vocab.txt` | 94 KB | – | SentencePiece vocabulary (8193 tokens) |
| `config.json` | 97 B | – | Model config |
## Usage with parakeet.js (browser)

```js
import { fromUrls } from 'parakeet.js';

const BASE = 'https://huggingface.co/efederici/parakeet-tdt-0.6b-v3-onnx-int4/resolve/main';

const model = await fromUrls({
  encoderUrl: `${BASE}/encoder-model.int4.onnx`,
  decoderUrl: `${BASE}/decoder_joint-model.int8.onnx`,
  tokenizerUrl: `${BASE}/vocab.txt`,
  preprocessorBackend: 'js',
  backend: 'webgpu', // or 'wasm'
});

const result = await model.transcribe(pcm, 16000, {
  returnTimestamps: true,
  returnConfidences: true,
});
console.log(result.utterance_text);
```
For long recordings:

```js
const result = await model.transcribeLongAudio(pcm, 16000, {
  returnTimestamps: true,
  chunkLengthS: 95,
});
console.log(result.text);
console.log(result.chunks);
```
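`transcribe` expects 16 kHz mono Float32 PCM. A minimal sketch of getting there from decoded audio channels; the helper name is illustrative (not part of parakeet.js), and linear interpolation is adequate for speech though not band-limited:

```javascript
// Downmix multi-channel Float32 PCM to mono and resample to 16 kHz.
// Illustrative helper, not shipped by parakeet.js.
function toMono16k(channels, inputRate, targetRate = 16000) {
  // Average all channels into one mono buffer.
  const n = channels[0].length;
  const mono = new Float32Array(n);
  for (const ch of channels) {
    for (let i = 0; i < n; i++) mono[i] += ch[i] / channels.length;
  }
  if (inputRate === targetRate) return mono;
  // Linear-interpolation resample.
  const outLen = Math.round(n * targetRate / inputRate);
  const out = new Float32Array(outLen);
  const step = outLen > 1 ? (n - 1) / (outLen - 1) : 0;
  for (let i = 0; i < outLen; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, n - 1);
    out[i] = mono[i0] + (pos - i0) * (mono[i1] - mono[i0]);
  }
  return out;
}
```

In the browser, the input channels typically come from `AudioContext.decodeAudioData` via `AudioBuffer.getChannelData`.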
## Benchmark

LibriSpeech test-clean (200 samples, 26 min of audio), CPU inference with onnxruntime.

### Quality (WER vs ground truth, lower is better)
| Model | Size | WER | RTF |
|---|---|---|---|
| fp32 | ~2.5 GB | 1.72% | 0.072x |
| istupakov int8 | 670 MB | 1.67% | 0.089x |
| this model | 409 MB | 1.67% | 0.074x |
All three models achieve essentially the same ground-truth WER (~1.7%); the quantized models score marginally better, likely due to a slight regularization effect.
### Quantization degradation (WER vs fp32 output)
| Model | Degradation | Word diffs |
|---|---|---|
| istupakov int8 | 0.79% | 36 / 4541 |
| this model | 0.42% | 19 / 4541 |
Half the degradation of int8, at 39% smaller size and 17% faster inference.
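The degradation figures above are word error rates of the quantized output against the fp32 transcripts. A minimal sketch of that metric (standard word-level Levenshtein distance; the actual benchmark script may normalize differently):

```javascript
// Word error rate: word-level edit distance / number of reference words.
function wer(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // prev[j] = edit distance between ref[0..i) and hyp[0..j).
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const sub = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      cur[j] = Math.min(sub, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}
```

By this definition, 19 word diffs over 4541 reference words gives the ~0.42% degradation in the table.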
### Individual samples (median of 5 runs)

#### Speed (RTF = processing time / audio duration, lower is better)
| Model | Size | JFK 11s | MLK 13s | TED 60s | French 6s | Avg RTF |
|---|---|---|---|---|---|---|
| fp32 | ~2.5 GB | 0.101 | 0.090 | 0.100 | 0.111 | 0.099 |
| istupakov int8 | 670 MB | 0.107 | 0.108 | 0.110 | 0.122 | 0.110 |
| this model | 409 MB | 0.086 | 0.085 | 0.084 | 0.104 | 0.086 |
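The Avg RTF column is consistent with a duration-weighted average (total processing time over total audio), rather than a plain mean of the per-sample RTFs; durations here are taken from the table headers:

```javascript
// Duration-weighted RTF: total processing time / total audio duration.
function avgRtf(samples) {
  const processing = samples.reduce((sum, s) => sum + s.rtf * s.seconds, 0);
  const audio = samples.reduce((sum, s) => sum + s.seconds, 0);
  return processing / audio;
}

// Per-sample RTFs for this model, from the table above.
const thisModel = [
  { seconds: 11, rtf: 0.086 }, // JFK
  { seconds: 13, rtf: 0.085 }, // MLK
  { seconds: 60, rtf: 0.084 }, // TED
  { seconds: 6,  rtf: 0.104 }, // French
];
// avgRtf(thisModel) ≈ 0.086, matching the Avg RTF column.
```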
#### Quality (WER vs fp32 reference)
| Model | JFK 11s | MLK 13s | TED 60s | French 6s | Overall WER |
|---|---|---|---|---|---|
| istupakov int8 | punctuation diffs only | exact | 4.2% (8 errors) | exact | 3.35% |
| this model | exact | exact | 2.1% (4 errors) | exact | 1.67% |
## Quantization details

A hybrid approach balancing size, quality, and speed:
- Encoder pointwise Conv layers converted to MatMul for better int4 coverage (using onnx-conv2matmul)
- Encoder linear + pointwise Conv (87.5% of weights): int4 MatMulNBits, block_size=64, asymmetric
- Encoder depthwise Conv (small): fp32
- Decoder (LSTM + embedding + linear): int8 dynamic quantization
- Compatible with ONNX Runtime (CPU, WASM, WebGPU)
Source fp32 model: istupakov/parakeet-tdt-0.6b-v3-onnx
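The int4 scheme stores one scale and one 4-bit zero point per block of 64 weights, so each weight dequantizes as `w ≈ scale * (q - zeroPoint)`. A sketch of that asymmetric blockwise convention (the actual MatMulNBits kernel also packs two 4-bit codes per byte, omitted here):

```javascript
// Asymmetric blockwise int4: each block of weights shares a scale and a
// 4-bit zero point. Sketch of the convention, not the onnxruntime kernel.
function quantizeInt4Block(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  const scale = (max - min) / 15 || 1;        // 4 bits -> 16 levels
  const zeroPoint = Math.round(-min / scale); // maps min to code 0
  const quant = weights.map(w =>
    Math.max(0, Math.min(15, Math.round(w / scale + zeroPoint))));
  return { quant, scale, zeroPoint };
}

function dequantizeInt4Block(quant, scale, zeroPoint) {
  return quant.map(q => scale * (q - zeroPoint));
}
```

The round-trip error per weight is bounded by half the block scale, which is why the fine-grained block_size=64 grouping keeps degradation low.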
## License
CC-BY-4.0, inherited from nvidia/parakeet-tdt-0.6b-v3.