---
license: cc-by-4.0
language: en
tags:
  - speech
  - asr
  - ctc
  - onnx
  - parakeet
  - nemo
  - nvidia
  - vocabulary-boost
base_model: nvidia/parakeet-tdt_ctc-110m
pipeline_tag: automatic-speech-recognition
---

# Parakeet CTC 110M (INT8)

CTC-based speech recognition model for vocabulary-rescored transcription in [Heydict](https://github.com/Entropora/heydict).

## Overview

This is the CTC decoder head of NVIDIA's [parakeet-tdt_ctc-110m](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m), exported to ONNX by [csukuangfj/sherpa-onnx](https://huggingface.co/csukuangfj/sherpa-onnx-nemo-parakeet_tdt_ctc_110m-en-36000) and dynamically quantized to INT8.

It runs as a **companion model** alongside the primary Parakeet TDT transducer. The CTC model's frame-level logits are rescored against the user's custom vocabulary list (domain terms, company names, technical jargon) to improve recognition accuracy for specialized terms.

## Files

| File | Size | Description |
|------|------|-------------|
| `encoder.int8.onnx` | 126 MB | INT8 dynamically quantized CTC encoder |
| `encoder.fp32.onnx` | 437 MB | Original FP32 encoder (for reference/GPU) |
| `tokens.txt` | 10 KB | SentencePiece vocabulary (sherpa-onnx format) |

## Architecture

- **Encoder**: FastConformer (17 layers, 256 dim, 4 heads)
- **Decoder**: CTC (encoder-only, no transducer joiner)
- **Vocabulary**: 1025 SentencePiece tokens
- **Input**: 128-dim log-mel spectrogram (NeMo convention)
- **Output**: Frame-level logits `[1, T', 1025]`

## Quantization

Dynamic INT8 quantization via `onnxruntime.quantization.quantize_dynamic`: weights are stored as INT8, activations are quantized at runtime. The result is roughly 3.5x smaller than FP32 with minimal accuracy loss, which is acceptable for a companion rescoring model.

## License

CC-BY-4.0 (inherited from NVIDIA's original model)
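The `tokens.txt` file maps the model's 1025 SentencePiece pieces to output indices. A minimal parsing sketch, assuming the common sherpa-onnx convention of one `<piece> <id>` pair per line (the sample pieces below are illustrative, not taken from the actual file):

```python
def load_tokens(text: str) -> dict[int, str]:
    """Parse a sherpa-onnx style tokens.txt: one '<piece> <id>' per line."""
    id2tok = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # rsplit so pieces containing spaces would still parse
        piece, idx = line.rsplit(maxsplit=1)
        id2tok[int(idx)] = piece
    return id2tok
```

In practice you would pass `open("tokens.txt").read()`; the dict then turns frame-level argmax ids back into text pieces.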
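The encoder's frame-level logits (shape `[1, T', 1025]`, see Architecture) can be turned into a token sequence with standard greedy CTC decoding: take the argmax per frame, collapse consecutive repeats, then drop blanks. A minimal pure-Python sketch, assuming blank id 0 (the real blank id comes from `tokens.txt`):

```python
def ctc_greedy_decode(frame_logits: list[list[float]], blank_id: int = 0) -> list[int]:
    """Greedy CTC: per-frame argmax -> collapse repeats -> remove blanks."""
    # argmax over the vocabulary dimension for each frame
    frame_ids = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    for tok in frame_ids:
        if tok != prev and tok != blank_id:  # new non-blank symbol
            out.append(tok)
        prev = tok
    return out
```

This is the standard CTC collapse rule, not necessarily the exact decoding path Heydict uses; it is shown to clarify what "frame-level logits" means for this model.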
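The Overview describes rescoring the CTC logits against a user vocabulary list. One simple way to implement such a boost, shown here as a hypothetical sketch (not Heydict's actual algorithm), is to add a log-space bonus to the ids of boosted pieces and renormalize each frame:

```python
import math

def boost_frames(frame_logprobs: list[list[float]],
                 boost_ids: set[int],
                 bonus: float = 2.0) -> list[list[float]]:
    """Add a log-space bonus to boosted token ids, renormalizing per frame."""
    rescored = []
    for frame in frame_logprobs:
        bumped = [s + bonus if i in boost_ids else s
                  for i, s in enumerate(frame)]
        # log-sum-exp renormalization keeps each frame a valid distribution
        lse = math.log(sum(math.exp(s) for s in bumped))
        rescored.append([s - lse for s in bumped])
    return rescored
```

With a large enough `bonus`, a vocabulary piece that was a close runner-up in a frame overtakes the original argmax, which is the effect a custom-vocabulary boost is after.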