Based on CR-CTC: Consistency regularization on CTC for improved speech recognition (paper: arXiv 2410.05101).
Lightweight Japanese ASR that outputs hiragana only with dual CTC heads (kana + phoneme).
Built on reazon-research/japanese-wav2vec2-large with InterCTC and CR-CTC.
| Property | Value |
|---|---|
| Parameters | 315.6M |
| Base model | reazon-research/japanese-wav2vec2-large (pretrained on 35,000h) |
| Training data | ReazonSpeech medium (1,000h) |
| Training cost | ~$17 (H100 80GB, 8 hours) |
| Kana vocabulary | 84 tokens (82 hiragana + long vowel mark + CTC blank) |
| Phoneme vocabulary | 43 tokens (42 phonemes + CTC blank) |
| Precision | BF16 training, FP16 inference |
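CR-CTC regularizes CTC training by enforcing consistency between two differently augmented views of the same audio. A rough sketch of the idea is below; the `alpha` weight and the symmetric-KL consistency term are assumptions for illustration, not this repo's exact implementation (see arXiv 2410.05101 for details).

```python
import torch
import torch.nn.functional as F

def cr_ctc_loss(log_probs_a, log_probs_b, targets, input_lengths, target_lengths,
                blank=0, alpha=0.2):
    """CTC loss on two augmented views plus a consistency term.

    log_probs_a / log_probs_b: (T, N, C) log-softmax outputs for two
    differently augmented views of the same audio. `alpha` is an assumed
    weight, not taken from the repo.
    """
    ctc_a = F.ctc_loss(log_probs_a, targets, input_lengths, target_lengths, blank=blank)
    ctc_b = F.ctc_loss(log_probs_b, targets, input_lengths, target_lengths, blank=blank)
    # Symmetric KL between the frame-level posteriors of the two views.
    kl_ab = F.kl_div(log_probs_a, log_probs_b, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_probs_b, log_probs_a, log_target=True, reduction="batchmean")
    return 0.5 * (ctc_a + ctc_b) + alpha * 0.5 * (kl_ab + kl_ba)
```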
Audio (16kHz) → CNN Feature Extractor (frozen, 7 conv layers)
              → Transformer Encoder (24 layers, hidden=1024)
                 ├── Layer 12 → Phoneme CTC Head (InterCTC, 43 classes)
                 └── Layer 24 → Kana CTC Head (CR-CTC, 84 classes)
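The dual-head design above can be sketched as two linear classifiers tapping different encoder layers. Hidden size, layer indices, and vocabulary sizes follow the model card; the class and attribute names are hypothetical, not the repo's actual code.

```python
import torch
import torch.nn as nn

class DualCTCHeads(nn.Module):
    """Sketch: kana head on the final encoder layer, phoneme InterCTC head
    on an intermediate layer. The encoder is assumed to return all hidden
    states (index 0 = embeddings, index 12 = layer 12, index 24 = layer 24).
    The total loss would then be CTC(kana) + 0.3 * CTC(phoneme)."""

    def __init__(self, hidden=1024, n_kana=84, n_phoneme=43, inter_layer=12):
        super().__init__()
        self.inter_layer = inter_layer
        self.kana_head = nn.Linear(hidden, n_kana)
        self.phoneme_head = nn.Linear(hidden, n_phoneme)

    def forward(self, hidden_states):
        # hidden_states: sequence of (N, T, hidden) tensors, one per layer
        phoneme_logits = self.phoneme_head(hidden_states[self.inter_layer])
        kana_logits = self.kana_head(hidden_states[-1])
        return kana_logits, phoneme_logits
```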
Loss: CR-CTC(kana) + 0.3 * CTC(phoneme)

| Dataset | Condition | KER | PER |
|---|---|---|---|
| JSUT-BASIC5000 | Studio, single speaker | 7.47% | 10.42% |
| JVS parallel100 | 100 speakers | 15.68% | 21.43% |
| ReazonSpeech | TV broadcast (wild audio) | 21.65% | 21.87% |
KER = Kana Error Rate (character-level edit distance on the hiragana output); PER = Phoneme Error Rate.
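Both metrics are standard edit-distance error rates. A minimal implementation (for illustration only, not the repo's evaluation code) might look like:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def error_rate(ref, hyp):
    """Error rate = edit distance / reference length (KER on kana strings)."""
    return edit_distance(ref, hyp) / len(ref)
```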
import torch

# Load the inference checkpoint
checkpoint = torch.load("best-medium-ep5-inference.pt", map_location="cpu")

# The checkpoint contains:
# - model_state_dict: model weights
# - kana_vocab: hiragana vocabulary mapping
# - phoneme_vocab: phoneme vocabulary mapping
# - config: training configuration
state_dict = checkpoint["model_state_dict"]
kana_vocab = checkpoint["kana_vocab"]

# See https://github.com/nyosegawa/hiragana-asr for full inference code
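Given per-frame logits from the kana head, the simplest way to turn them into text is greedy CTC decoding: take the argmax per frame, collapse repeats, and drop blanks. A sketch, assuming the checkpoint's `kana_vocab` can be inverted into an id-to-token mapping and that blank is id 0 (check the checkpoint):

```python
import torch

def greedy_ctc_decode(logits, id_to_token, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    logits: (T, C) tensor. id_to_token maps token id -> string; the
    blank id is an assumption, not confirmed by the repo.
    """
    ids = logits.argmax(dim=-1).tolist()                      # best token per frame
    collapsed = [i for i, prev in zip(ids, [None] + ids[:-1])
                 if i != prev]                                # collapse repeats
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)
```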
git clone https://github.com/nyosegawa/hiragana-asr
cd hiragana-asr
uv sync && uv run python -m unidic download
# Download model from this repo and place in models/checkpoints/
uv run python scripts/03_infer.py --audio your_audio.wav  # file inference
uv run python scripts/realtime_asr.py                     # realtime ASR
Training was performed in two stages; see the training report for details.
@misc{sakasegawa2026hiraganaasr,
  title={hiragana-asr: Lightweight Japanese ASR with Dual CTC},
  author={Sakasegawa},
  year={2026},
  url={https://github.com/nyosegawa/hiragana-asr}
}
Apache-2.0. Training data: ReazonSpeech (CDLA-Sharing-1.0 → model weights are unrestricted).