---
license: cc-by-4.0
language: en
tags:
  - speech
  - asr
  - ctc
  - onnx
  - parakeet
  - nemo
  - nvidia
  - vocabulary-boost
base_model: nvidia/parakeet-tdt_ctc-110m
pipeline_tag: automatic-speech-recognition
---

# Parakeet CTC 110M (INT8)

CTC-based speech recognition model used for custom-vocabulary rescoring in Heydict.

## Overview

This is the CTC decoder head of NVIDIA's `parakeet-tdt_ctc-110m`, exported to ONNX with csukuangfj/sherpa-onnx and dynamically quantized to INT8.

It runs as a companion model alongside the primary Parakeet TDT transducer. The CTC model's frame-level logits are rescored against the user's custom vocabulary list (domain terms, company names, technical jargon) to improve recognition accuracy for specialized terms.
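Heydict's exact rescoring pipeline is not detailed here, but the idea can be sketched in a few lines of numpy: add a score bonus to the token ids of the user's vocabulary terms, then greedy-decode the boosted frame-level logits with the standard CTC collapse rule. The function name, bonus value, and blank id below are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def boost_and_decode(logits, boosted_ids, bonus=2.0, blank_id=1024):
    """Greedy CTC decode after adding a score bonus to boosted tokens.

    logits: [T, V] frame-level scores; boosted_ids: token ids drawn from
    the user's custom vocabulary; blank_id: CTC blank (assumed to be the
    last token here).
    """
    scores = logits.copy()
    scores[:, list(boosted_ids)] += bonus      # vocabulary boost
    frame_ids = scores.argmax(axis=-1)         # best token per frame
    out, prev = [], blank_id
    for t in frame_ids:                        # CTC collapse rule:
        if t != blank_id and t != prev:        # drop blanks and repeats
            out.append(int(t))
        prev = t
    return out

# Toy example: 5 frames, 4-token vocab, blank id 3.
logits = np.array([
    [2.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 3.0],
    [0.0, 1.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
])
print(boost_and_decode(logits, boosted_ids={2}, bonus=1.0, blank_id=3))  # → [0, 2]
```

Without the boost, frame 3 would pick token 1 (score 1.5) over token 2 (1.0); the +1.0 bonus flips that decision, which is exactly the effect the vocabulary list is meant to have on domain terms.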

## Files

| File | Size | Description |
|---|---|---|
| `encoder.int8.onnx` | 126 MB | INT8 dynamically quantized CTC encoder |
| `encoder.fp32.onnx` | 437 MB | Original FP32 encoder (for reference/GPU) |
| `tokens.txt` | 10 KB | SentencePiece vocabulary (sherpa-onnx format) |
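Mapping CTC output ids back to text requires `tokens.txt`. A minimal parser, assuming the usual sherpa-onnx layout of one `<symbol> <id>` pair per line (the helper name and sample lines below are illustrative, not taken from the actual file):

```python
def load_tokens(text):
    """Parse a sherpa-onnx style tokens.txt: one "<symbol> <id>" per line.

    Returns an id -> symbol mapping for turning CTC output ids back into
    SentencePiece pieces.
    """
    id2sym = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        sym, idx = line.rsplit(maxsplit=1)  # split on the last space
        id2sym[int(idx)] = sym
    return id2sym

# Toy excerpt (the real file has 1025 entries, ids 0..1024).
sample = "▁the 0\n▁speech 1\ns 2\n"
print(load_tokens(sample)[1])  # → ▁speech
```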

## Architecture

- **Encoder:** FastConformer (17 layers, 256 dim, 4 heads)
- **Decoder:** CTC (encoder-only, no transducer joiner)
- **Vocabulary:** 1025 SentencePiece tokens
- **Input:** 128-dim log-mel spectrogram (NeMo convention)
- **Output:** Frame-level logits `[1, T', 1025]`

## Quantization

Dynamic INT8 quantization via `onnxruntime.quantization.quantize_dynamic`. Weights are stored as INT8; activations are quantized on the fly at inference time. The result is ~3.5x smaller than FP32 with minimal accuracy loss — suitable for a companion rescoring model.
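The export itself used the `quantize_dynamic` API; the numpy sketch below is not that API but a rough illustration of what symmetric per-tensor INT8 weight quantization does, using a synthetic weight matrix: each FP32 value (4 bytes) is replaced by an INT8 value (1 byte) plus one shared FP32 scale, which is where the ~3.5x size reduction comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # fake FP32 weight

# Symmetric per-tensor INT8: one FP32 scale, weights stored as int8.
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# What the runtime recovers when it dequantizes for the matmul.
w_deq = w_int8.astype(np.float32) * scale

err = np.abs(w - w_deq).max()
print(f"max abs error: {err:.4f} (scale={scale:.4f})")
```

The maximum reconstruction error is bounded by half the scale, which for typical weight distributions is small enough that downstream accuracy barely moves.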

## License

CC-BY-4.0 (inherited from NVIDIA's original model)