japanese-wav2vec2-large-hiragana-ctc

Lightweight Japanese ASR that outputs hiragana only, using dual CTC heads (kana + phoneme).

Built on reazon-research/japanese-wav2vec2-large with InterCTC and CR-CTC.

Why hiragana?

  • No hallucination: CTC is structurally incapable of generating content not in the input (no autoregressive decoder)
  • Lightweight: 315M parameters, FP16 inference ~630MB, runs real-time on MacBook Air M2
  • LLM-friendly: Pass hiragana to an LLM for kanji conversion and intent understanding; disambiguation is an LLM's strength

Model Details

| Property | Value |
|---|---|
| Parameters | 315.6M |
| Base model | reazon-research/japanese-wav2vec2-large (pretrained on 35,000h) |
| Training data | ReazonSpeech medium (1,000h) |
| Training cost | ~$17 (H100 80GB, 8 hours) |
| Kana vocabulary | 84 tokens (82 hiragana + long vowel mark + CTC blank) |
| Phoneme vocabulary | 43 tokens (42 phonemes + CTC blank) |
| Precision | BF16 training, FP16 inference |
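The ~630MB FP16 figure in the summary follows directly from the parameter count, since FP16 stores each parameter in 2 bytes:

```python
# FP16 = 2 bytes per parameter
n_params = 315.6e6
fp16_bytes = n_params * 2
print(f"{fp16_bytes / 1e6:.0f} MB")  # ~631 MB, consistent with the ~630MB figure above
```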

Architecture

Audio (16kHz) → CNN Feature Extractor (frozen, 7 conv layers)
              → Transformer Encoder (24 layers, hidden=1024)
                  ├── Layer 12 → Phoneme CTC Head (InterCTC, 43 classes)
                  └── Layer 24 → Kana CTC Head (CR-CTC, 84 classes)
  • InterCTC (Lee & Watanabe, ICASSP 2021): Auxiliary phoneme CTC loss at an intermediate layer to improve gradient flow
  • CR-CTC (Yao et al., ICLR 2025): Consistency regularization to smooth CTC spike distributions
  • Loss: CR-CTC(kana) + 0.3 * CTC(phoneme)
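The combined objective above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the training code: the CR-CTC consistency term is written here as a symmetric KL divergence between the frame posteriors of two augmented views, and the `cr_weight` value is an assumption (only the 0.3 phoneme weight is stated in this card).

```python
import torch
import torch.nn.functional as F


def dual_ctc_loss(kana_lp1, kana_lp2, phon_lp,
                  kana_tgt, phon_tgt,
                  in_lens, kana_lens, phon_lens,
                  inter_weight=0.3, cr_weight=0.2):
    """Total loss = CR-CTC(kana) + inter_weight * CTC(phoneme).

    kana_lp1 / kana_lp2: (T, N, 84) log-probs from two augmented views
    phon_lp: (T, N, 43) log-probs from the intermediate (layer 12) head
    cr_weight is a hypothetical value; the card does not state it.
    """
    ctc = lambda lp, tgt, tl: F.ctc_loss(lp, tgt, in_lens, tl,
                                         blank=0, zero_infinity=True)
    # Average CTC loss over the two views of the kana head
    kana_ctc = 0.5 * (ctc(kana_lp1, kana_tgt, kana_lens)
                      + ctc(kana_lp2, kana_tgt, kana_lens))
    # Consistency: symmetric KL between the two views' frame posteriors
    kl = 0.5 * (F.kl_div(kana_lp1, kana_lp2, log_target=True, reduction="batchmean")
                + F.kl_div(kana_lp2, kana_lp1, log_target=True, reduction="batchmean"))
    # Auxiliary InterCTC phoneme loss, weighted 0.3 as stated above
    phon_ctc = ctc(phon_lp, phon_tgt, phon_lens)
    return kana_ctc + cr_weight * kl + inter_weight * phon_ctc
```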

Evaluation Results

| Dataset | Condition | KER | PER |
|---|---|---|---|
| JSUT-BASIC5000 | Studio, single speaker | 7.47% | 10.42% |
| JVS parallel100 | 100 speakers | 15.68% | 21.43% |
| ReazonSpeech | TV broadcast (wild audio) | 21.65% | 21.87% |

KER = Kana Error Rate (character-level edit distance on the hiragana output); PER = Phoneme Error Rate.
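As a concrete illustration of the metric, KER is the Levenshtein distance between the reference and hypothesis kana strings, divided by the reference length:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]


def ker(ref: str, hyp: str) -> float:
    """Kana Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


# One substitution in five reference characters -> KER of 0.2
print(ker("こんにちは", "こんいちは"))  # 0.2
```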

Usage

import torch

# Load the inference checkpoint on CPU; move to GPU/MPS as needed
checkpoint = torch.load("best-medium-ep5-inference.pt", map_location="cpu")

# The checkpoint contains:
# - model_state_dict: model weights
# - kana_vocab: hiragana vocabulary mapping
# - phoneme_vocab: phoneme vocabulary mapping
# - config: training configuration

# See https://github.com/nyosegawa/hiragana-asr for the full inference code
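The repo's inference script is the authoritative path; as a sketch of the decoding step, CTC output is decoded greedily by collapsing repeated frame predictions and dropping blanks. The `id_to_kana` mapping here is hypothetical and would be built from the checkpoint's `kana_vocab`:

```python
import torch


def greedy_ctc_decode(log_probs: torch.Tensor, id_to_kana: dict, blank_id: int = 0) -> str:
    """Standard CTC best-path decoding for a single utterance.

    log_probs: (T, V) per-frame scores from the kana head.
    Collapses consecutive repeats, then removes blank tokens.
    """
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != blank_id and i != prev:
            out.append(id_to_kana[i])
        prev = i
    return "".join(out)


# Toy example with a 3-token vocabulary (0 = CTC blank)
vocab = {1: "あ", 2: "き"}
frames = torch.nn.functional.one_hot(torch.tensor([0, 1, 1, 0, 2, 2]),
                                     num_classes=3).float()
print(greedy_ctc_decode(frames, vocab))  # あき
```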

Quick inference with the repo

git clone https://github.com/nyosegawa/hiragana-asr
cd hiragana-asr
uv sync && uv run python -m unidic download

# Download model from this repo and place in models/checkpoints/
uv run python scripts/03_infer.py --audio your_audio.wav

Real-time ASR

uv run python scripts/realtime_asr.py

Training

Trained in 2 stages:

  1. Stage 1 (100h, A100): wav2vec2-large fine-tuning on ReazonSpeech small, 15 epochs
  2. Stage 2 (1,000h, H100): Continued training on ReazonSpeech medium, 5 epochs with lr=5e-5

See training report for details.

Known Limitations

  • Katakana loanwords: Long vowel marks and small kana are error-prone for foreign words
  • Long vowel mark "γƒΌ": Most unstable token across all datasets
  • Small kana: 28.1% error rate on JVS (e.g., "ぉ" confused with "ほ")

Citation

@misc{sakasegawa2026hiraganaasr,
  title={hiragana-asr: Lightweight Japanese ASR with Dual CTC},
  author={Sakasegawa},
  year={2026},
  url={https://github.com/nyosegawa/hiragana-asr}
}

License

Apache-2.0. Training data: ReazonSpeech (CDLA-Sharing-1.0; model weights are unrestricted).
