Gipformer - Efficient Vietnamese Speech Recognition

Highlights

  • State-of-the-art accuracy: Achieves the lowest WER on 9 of the 12 Vietnamese ASR benchmarks reported below and ranks in the top three on the remaining ones.
  • Robust handling of telephonic domains: Performs well on challenging, noisy real-world call center recordings across Northern, Central, and Southern Vietnamese accents.
  • Outstanding parameter efficiency: At 65M parameters, it is among the smallest models in the comparison; only one baseline (30M) is smaller.
  • Seamless edge deployment: Its low resource requirements enable fast inference on mobile and embedded systems, making it well suited to offline, on-device applications.
  • Built-in data privacy: Because the model can run fully locally, sensitive audio can be processed on-device without being sent to third-party cloud services.
  • gipformer-65M-rnnt is based on the Zipformer Transducer architecture.

Benchmark Results

We evaluate gipformer-65M-rnnt against 10 established Vietnamese ASR models across 12 benchmarks spanning call center, medical, broadcast, and read speech, among other domains. All numbers are Word Error Rate (WER, %); lower is better.
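
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is illustrative only; the function name `word_error_rate` is hypothetical and this is not the evaluation script used for the table. Benchmark WER is also usually aggregated over a whole test set rather than computed per utterance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for a single reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("xin chao cac ban", "xin chao ban"))
```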

Normalization: Both predictions and labels are normalized before computing WER. Text is lowercased, diacritics are removed, and numbers are converted to spoken form.
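
A minimal sketch of such a normalization step, under stated assumptions: the exact rules used for the benchmark are not given here, so the function names are hypothetical, and the number handling below simply spells each digit one by one as a stand-in for a real Vietnamese number verbalizer (which would read "15" as a single number, not as two digits).

```python
import re
import unicodedata

# Assumption: digit-by-digit reading, used only as a placeholder for
# proper number-to-words conversion.
_DIGIT_WORDS = ["không", "một", "hai", "ba", "bốn",
                "năm", "sáu", "bảy", "tám", "chín"]


def strip_diacritics(text: str) -> str:
    """Remove tone and vowel marks; map the letter đ to d."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # "đ" has no Unicode decomposition, so it is handled explicitly.
    return without_marks.replace("đ", "d").replace("Đ", "D")


def normalize_transcript(text: str) -> str:
    """Lowercase, spell out digits, strip diacritics, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[0-9]", lambda m: f" {_DIGIT_WORDS[int(m.group())]} ", text)
    text = strip_diacritics(text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_transcript("Tôi có 2 Đồng"))
```

Digits are spelled out before diacritics are stripped so that the inserted words go through the same diacritic removal as the rest of the text.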

| Model | Params | tele-medium | tele-diff-north | tele-diff-middle | tele-diff-south | MultiMED | VietMed | vlsp-t1 | vlsp-t2 | LSVSC | Fleurs | ViMD | vivos |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vinai/PhoWhisper-small | 244M | 33.96 | 55.88 | 65.41 | 62.35 | 26.02 | 25.50 | 15.99 | 34.20 | 11.23 | 16.11 | 14.09 | 6.23 |
| vinai/PhoWhisper-medium | 769M | 26.46 | 51.20 | 59.04 | 55.39 | 24.76 | 24.90 | 14.06 | 26.38 | 10.25 | 14.44 | 11.34 | 4.93 |
| vinai/PhoWhisper-large | 1.5B | 26.82 | 50.39 | 59.44 | 56.70 | 24.47 | 24.37 | 13.70 | 27.45 | 10.08 | 12.62 | 11.18 | 4.73 |
| khanhld/chunkformer-large-vie | 110M | 27.60 | 46.30 | 51.91 | 49.09 | 22.60 | 19.59 | 14.09 | 25.81 | 8.85 | 14.17 | 11.77 | 4.18 |
| nguyenvulebinh/wav2vec2-base-vi | 95M | 23.71 | 40.49 | 48.90 | 46.33 | 23.03 | 22.96 | 13.14 | 37.33 | 9.89 | 20.09 | 11.42 | 6.60 |
| hynt/Zipformer-30M-RNNT-6000h | 30M | 19.95 | 38.77 | 45.19 | 43.89 | 19.85 | 19.93 | 11.76 | 28.63 | 9.12 | 13.16 | 7.28 | 4.60 |
| VietASR-zipformer | 65M | 20.30 | 42.21 | 49.01 | 47.86 | 22.05 | 21.90 | 14.54 | 31.18 | 10.23 | 14.76 | 10.15 | 6.92 |
| Qwen/Qwen3-ASR-1.7B | 1.7B | 26.34 | 46.80 | 59.85 | 51.84 | 20.11 | 20.21 | 16.29 | 34.26 | 9.64 | 10.13 | 11.16 | 7.17 |
| Qwen/Qwen3-ASR-0.6B | 600M | 32.29 | 48.57 | 61.88 | 55.43 | 22.65 | 22.51 | 18.62 | 43.44 | 10.96 | 13.11 | 14.37 | 10.23 |
| nvidia/parakeet-ctc-0.6b-Vietnamese | 600M | 31.82 | 55.33 | 61.65 | 56.70 | 23.79 | 23.53 | 17.00 | 37.94 | 10.46 | 16.11 | 12.95 | 7.76 |
| g-group-ai-lab/gipformer-65M-rnnt | 65M | 15.53 | 25.10 | 32.27 | 32.62 | 19.35 | 19.41 | 13.39 | 20.40 | 8.96 | 12.92 | 7.17 | 4.12 |

Rankings Summary

| Rank | Count | Benchmarks |
| --- | --- | --- |
| #1 | 9 / 12 | tele-medium, tele-difficult-north, tele-difficult-middle, tele-difficult-south, MultiMED, VietMed, vlsp-2020-task-2, ViMD, vivos |
| #2 | 1 / 12 | LSVSC (8.96) |
| #3 | 2 / 12 | vlsp-2020-task-1 (13.39), Fleurs (12.92) |
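
The ranks can be re-derived from the benchmark table by counting how many models have a strictly lower WER on a given benchmark. The sketch below does this for the LSVSC and Fleurs columns, with values copied from the table; the helper name `rank_of` is hypothetical.

```python
MODELS = [
    "vinai/PhoWhisper-small",
    "vinai/PhoWhisper-medium",
    "vinai/PhoWhisper-large",
    "khanhld/chunkformer-large-vie",
    "nguyenvulebinh/wav2vec2-base-vi",
    "hynt/Zipformer-30M-RNNT-6000h",
    "VietASR-zipformer",
    "Qwen/Qwen3-ASR-1.7B",
    "Qwen/Qwen3-ASR-0.6B",
    "nvidia/parakeet-ctc-0.6b-Vietnamese",
    "g-group-ai-lab/gipformer-65M-rnnt",
]

# WER (%) per model, in the same order as MODELS.
WER = {
    "LSVSC": [11.23, 10.25, 10.08, 8.85, 9.89, 9.12,
              10.23, 9.64, 10.96, 10.46, 8.96],
    "Fleurs": [16.11, 14.44, 12.62, 14.17, 20.09, 13.16,
               14.76, 10.13, 13.11, 16.11, 12.92],
}


def rank_of(model: str, benchmark: str) -> int:
    """1-based rank of a model on a benchmark; lower WER ranks higher."""
    scores = dict(zip(MODELS, WER[benchmark]))
    return 1 + sum(1 for wer in scores.values() if wer < scores[model])


if __name__ == "__main__":
    target = "g-group-ai-lab/gipformer-65M-rnnt"
    for benchmark in WER:
        print(benchmark, rank_of(target, benchmark))
```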

Dataset Descriptions

Private test sets (call center domain):

  • tele-medium: Call center recordings of medium difficulty
  • tele-difficult-north: Low-quality call center audio with hard-to-hear speakers, Northern Vietnamese accent
  • tele-difficult-middle: Low-quality call center audio with hard-to-hear speakers, Central Vietnamese accent
  • tele-difficult-south: Low-quality call center audio with hard-to-hear speakers, Southern Vietnamese accent

Public test sets:

  • MultiMED: Multi-domain medical conversations
  • VietMed: Vietnamese medical domain
  • vlsp-2020-task-1: VLSP 2020 ASR Shared Task 1
  • vlsp-2020-task-2: VLSP 2020 ASR Shared Task 2
  • LSVSC: Large-Scale Vietnamese Speech Corpus
  • Fleurs: Google's Few-shot Learning Evaluation of Universal Representations of Speech (Vietnamese subset)
  • ViMD: Vietnamese Multi-Domain
  • vivos: Vietnamese read speech corpus

Call Center Domain: Where It Matters Most

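Call center ASR is one of the most challenging real-world domains: noisy phone lines, overlapping speech, diverse regional accents, and spontaneous conversation. gipformer-65M-rnnt achieves the lowest WER on all four call center test sets: 15.53 on tele-medium and 25.10 to 32.62 on the three tele-difficult sets, compared with 19.95 and 38.77 to 45.19 for the next-best model (hynt/Zipformer-30M-RNNT-6000h).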

Resources

Usage

See the Quick Start guide for detailed usage instructions.

Citation

@misc{gipformer,
  title={Gipformer - Efficient Vietnamese Speech Recognition},
  author={G-Group AI Lab},
  year={2026},
  url={https://huggingface.co/g-group-ai-lab/gipformer-65M-rnnt}
}

License

This model is released under the MIT License.

Acknowledgments

Developed by G-Group AI Lab. For questions, issues, or collaboration inquiries, please visit our HuggingFace organization page.
