# Parakeet TDT 0.6B v3: ONNX Quantized (int4/int8 hybrid)

Quantized ONNX export of nvidia/parakeet-tdt-0.6b-v3 for browser and edge inference with parakeet.js.

409 MB total: 6x smaller than fp32 and 39% smaller than istupakov/int8, with half the quantization degradation and 17% faster inference.

## Files

| File | Size | Quantization | Description |
|---|---|---|---|
| `encoder-model.int4.onnx` | 391 MB | int4 MatMul + int4 pointwise Conv (block_size=64) | Fast Conformer encoder |
| `decoder_joint-model.int8.onnx` | 18 MB | int8 dynamic | Decoder + joint network (LSTM + embedding) |
| `nemo128.int8.onnx` | 42 KB | int8 dynamic | Mel preprocessor (optional; JS preprocessor recommended) |
| `vocab.txt` | 94 KB | - | SentencePiece vocabulary (8193 tokens) |
| `config.json` | 97 B | - | Model config |

## Usage with parakeet.js (browser)

```js
import { fromUrls } from 'parakeet.js';

const BASE = 'https://huggingface.co/efederici/parakeet-tdt-0.6b-v3-onnx-int4/resolve/main';

const model = await fromUrls({
  encoderUrl: `${BASE}/encoder-model.int4.onnx`,
  decoderUrl: `${BASE}/decoder_joint-model.int8.onnx`,
  tokenizerUrl: `${BASE}/vocab.txt`,
  preprocessorBackend: 'js',
  backend: 'webgpu', // or 'wasm'
});

const result = await model.transcribe(pcm, 16000, {
  returnTimestamps: true,
  returnConfidences: true,
});
console.log(result.utterance_text);
```

For long recordings:

```js
const result = await model.transcribeLongAudio(pcm, 16000, {
  returnTimestamps: true,
  chunkLengthS: 95,
});
console.log(result.text);
console.log(result.chunks);
```
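Both calls expect `pcm` to be a `Float32Array` of mono samples at the stated sample rate (16 kHz here). In a browser you would normally decode and resample audio via the Web Audio API; purely as an illustration of the expected input format, here is a minimal linear-interpolation resampler (a hypothetical helper, not part of parakeet.js):

```js
// Resample mono Float32Array PCM to a target rate via linear interpolation.
// Illustrative only: real apps usually resample with OfflineAudioContext.
function resampleLinear(input, fromRate, toRate) {
  const outLength = Math.round(input.length * toRate / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Example: downsample one second of 48 kHz audio to 16 kHz.
const pcm48k = new Float32Array(48000);
const pcm16k = resampleLinear(pcm48k, 48000, 16000);
console.log(pcm16k.length); // 16000
```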

## Benchmark

LibriSpeech test-clean (200 samples, 26 min of audio), CPU inference with onnxruntime.

### Quality (WER vs ground truth, lower is better)

| Model | Size | WER | RTF |
|---|---|---|---|
| fp32 | ~2.5 GB | 1.72% | 0.072x |
| istupakov int8 | 670 MB | 1.67% | 0.089x |
| this model | 409 MB | 1.67% | 0.074x |

All three models reach essentially the same ground-truth WER (~1.7%); the quantized models even score marginally better, likely due to a slight regularization effect.

### Quantization degradation (WER vs fp32 output)

| Model | Degradation | Word diffs |
|---|---|---|
| istupakov int8 | 0.79% | 36 / 4541 |
| this model | 0.42% | 19 / 4541 |

Half the degradation of int8, at a 39% smaller size and 17% faster inference.
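The degradation numbers above are word error rates computed against the fp32 model's transcripts rather than ground truth. As a sketch of the metric itself (simplified: no text normalization, and `wer` is a hypothetical helper, not part of parakeet.js):

```js
// Word error rate: word-level Levenshtein distance / reference word count.
// Simplified sketch: no lowercasing or punctuation stripping.
function wer(reference, hypothesis) {
  const ref = reference.split(/\s+/).filter(Boolean);
  const hyp = hypothesis.split(/\s+/).filter(Boolean);
  // prev[j] = edit distance between ref[0..i) and hyp[0..j)
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const cur = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const sub = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      cur[j] = Math.min(sub, prev[j] + 1, cur[j - 1] + 1);
    }
    prev = cur;
  }
  return prev[hyp.length] / ref.length;
}

console.log(wer('the cat sat', 'the cat sat')); // 0
console.log(wer('the cat sat', 'the bat sat')); // ≈ 0.333 (1 substitution / 3 words)
```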

### Individual samples (median of 5 runs)

#### Speed (RTF = processing time / audio duration, lower is better)

| Model | Size | JFK 11s | MLK 13s | TED 60s | French 6s | Avg RTF |
|---|---|---|---|---|---|---|
| fp32 | ~2.5 GB | 0.101 | 0.090 | 0.100 | 0.111 | 0.099 |
| istupakov int8 | 670 MB | 0.107 | 0.108 | 0.110 | 0.122 | 0.110 |
| this model | 409 MB | 0.086 | 0.085 | 0.084 | 0.104 | 0.086 |
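To make the RTF numbers concrete: expected processing time is simply RTF x audio duration, so at this model's 0.086 average RTF a clip is transcribed in well under a tenth of its duration.

```js
// RTF = processing time / audio duration, so
// expected processing time = RTF * duration.
const rtf = 0.086;      // average RTF of this model (table above)
const durationS = 60;   // a 60 s clip
const processingS = rtf * durationS;
console.log(processingS.toFixed(2)); // "5.16"
```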

#### Quality (WER vs fp32 reference)

| Model | JFK 11s | MLK 13s | TED 60s | French 6s | Overall WER |
|---|---|---|---|---|---|
| istupakov int8 | ~ (punct. diffs) | ✓ exact | 4.2% (8 errors) | ✓ exact | 3.35% |
| this model | ✓ exact | ✓ exact | 2.1% (4 errors) | ✓ exact | 1.67% |

## Quantization details

Hybrid approach for optimal size/quality/speed:

- Encoder pointwise Conv layers converted to MatMul for better int4 coverage (using onnx-conv2matmul)
- Encoder linear + pointwise Conv (87.5% of weights): int4 MatMulNBits, block_size=64, asymmetric
- Encoder depthwise Conv (small): fp32
- Decoder (LSTM + embedding + linear): int8 dynamic quantization
- Compatible with ONNX Runtime (CPU, WASM, WebGPU)

Source fp32 model: istupakov/parakeet-tdt-0.6b-v3-onnx

## License

CC-BY-4.0, inherited from nvidia/parakeet-tdt-0.6b-v3.
