japanese-wav2vec2-large-hiragana-ctc

Lightweight Japanese ASR that outputs hiragana only, using dual CTC heads (kana + phoneme).

Built on reazon-research/japanese-wav2vec2-large with InterCTC and CR-CTC.

Why hiragana?

  • No hallucination: CTC is structurally incapable of generating content not in the input (no autoregressive decoder)
  • Lightweight: 315M parameters, FP16 inference ~630MB, runs real-time on MacBook Air M2
  • LLM-friendly: Pass hiragana to an LLM for kanji conversion and intent understanding; disambiguation is an LLM's strength

Model Details

| Property | Value |
|---|---|
| Parameters | 315.6M |
| Base model | reazon-research/japanese-wav2vec2-large (pretrained on 35,000h) |
| Training data | ReazonSpeech medium (1,000h) |
| Training cost | ~$17 (H100 80GB, 8 hours) |
| Kana vocabulary | 84 tokens (82 hiragana + long vowel mark + CTC blank) |
| Phoneme vocabulary | 43 tokens (42 phonemes + CTC blank) |
| Precision | BF16 training, FP16 inference |
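The ~630MB FP16 figure in the summary follows directly from the parameter count, since FP16 stores each parameter in 2 bytes:

```python
# FP16 = 2 bytes per parameter
n_params = 315.6e6
fp16_bytes = n_params * 2
print(f"{fp16_bytes / 1e6:.0f} MB")  # ~631 MB, consistent with the ~630MB figure above
```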

Architecture

Audio (16kHz) → CNN Feature Extractor (frozen, 7 conv layers)
              → Transformer Encoder (24 layers, hidden=1024)
                  ├── Layer 12 → Phoneme CTC Head (InterCTC, 43 classes)
                  └── Layer 24 → Kana CTC Head (CR-CTC, 84 classes)
  • InterCTC (Lee & Watanabe, ICASSP 2021): Auxiliary phoneme CTC loss at an intermediate layer to improve gradient flow
  • CR-CTC (Yao et al., ICLR 2025): Consistency regularization to smooth CTC spike distributions
  • Loss: CR-CTC(kana) + 0.3 * CTC(phoneme)
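The combined objective above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the training code: the CR-CTC consistency term is written here as a symmetric KL divergence between the frame posteriors of two augmented views, and the `cr_weight` value is an assumption (only the 0.3 phoneme weight is stated in this card).

```python
import torch
import torch.nn.functional as F


def dual_ctc_loss(kana_lp1, kana_lp2, phon_lp,
                  kana_tgt, phon_tgt,
                  in_lens, kana_lens, phon_lens,
                  inter_weight=0.3, cr_weight=0.2):
    """Total loss = CR-CTC(kana) + inter_weight * CTC(phoneme).

    kana_lp1 / kana_lp2: (T, N, 84) log-probs from two augmented views
    phon_lp: (T, N, 43) log-probs from the intermediate (layer 12) head
    cr_weight is a hypothetical value; the card does not state it.
    """
    ctc = lambda lp, tgt, tl: F.ctc_loss(lp, tgt, in_lens, tl,
                                         blank=0, zero_infinity=True)
    # Average CTC loss over the two views of the kana head
    kana_ctc = 0.5 * (ctc(kana_lp1, kana_tgt, kana_lens)
                      + ctc(kana_lp2, kana_tgt, kana_lens))
    # Consistency: symmetric KL between the two views' frame posteriors
    kl = 0.5 * (F.kl_div(kana_lp1, kana_lp2, log_target=True, reduction="batchmean")
                + F.kl_div(kana_lp2, kana_lp1, log_target=True, reduction="batchmean"))
    # Auxiliary InterCTC phoneme loss, weighted 0.3 as stated above
    phon_ctc = ctc(phon_lp, phon_tgt, phon_lens)
    return kana_ctc + cr_weight * kl + inter_weight * phon_ctc
```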

Evaluation Results

| Dataset | Condition | KER | PER |
|---|---|---|---|
| JSUT-BASIC5000 | Studio, single speaker | 7.47% | 10.42% |
| JVS parallel100 | 100 speakers | 15.68% | 21.43% |
| ReazonSpeech | TV broadcast (wild audio) | 21.65% | 21.87% |

KER = Kana Error Rate (character-level edit distance on the hiragana output); PER = Phoneme Error Rate.
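As a concrete illustration of the metric, KER is the Levenshtein distance between the reference and hypothesis kana strings, divided by the reference length:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]


def ker(ref: str, hyp: str) -> float:
    """Kana Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


# One substitution in five reference characters -> KER of 0.2
print(ker("こんにちは", "こんいちは"))  # 0.2
```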

Usage

import torch

# Load the inference checkpoint on CPU; move to GPU/MPS as needed
checkpoint = torch.load("best-medium-ep5-inference.pt", map_location="cpu")

# The checkpoint contains:
# - model_state_dict: model weights
# - kana_vocab: hiragana vocabulary mapping
# - phoneme_vocab: phoneme vocabulary mapping
# - config: training configuration

# See https://github.com/nyosegawa/hiragana-asr for the full inference code
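The repo's inference script is the authoritative path; as a sketch of the decoding step, CTC output is decoded greedily by collapsing repeated frame predictions and dropping blanks. The `id_to_kana` mapping here is hypothetical and would be built from the checkpoint's `kana_vocab`:

```python
import torch


def greedy_ctc_decode(log_probs: torch.Tensor, id_to_kana: dict, blank_id: int = 0) -> str:
    """Standard CTC best-path decoding for a single utterance.

    log_probs: (T, V) per-frame scores from the kana head.
    Collapses consecutive repeats, then removes blank tokens.
    """
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != blank_id and i != prev:
            out.append(id_to_kana[i])
        prev = i
    return "".join(out)


# Toy example with a 3-token vocabulary (0 = CTC blank)
vocab = {1: "あ", 2: "き"}
frames = torch.nn.functional.one_hot(torch.tensor([0, 1, 1, 0, 2, 2]),
                                     num_classes=3).float()
print(greedy_ctc_decode(frames, vocab))  # あき
```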

Quick inference with the repo

git clone https://github.com/nyosegawa/hiragana-asr
cd hiragana-asr
uv sync && uv run python -m unidic download

# Download model from this repo and place in models/checkpoints/
uv run python scripts/03_infer.py --audio your_audio.wav

Real-time ASR

uv run python scripts/realtime_asr.py

Training

Trained in 2 stages:

  1. Stage 1 (100h, A100): wav2vec2-large fine-tuning on ReazonSpeech small, 15 epochs
  2. Stage 2 (1,000h, H100): Continued training on ReazonSpeech medium, 5 epochs with lr=5e-5

See training report for details.

Known Limitations

  • Katakana loanwords: Long vowel marks and small kana are error-prone for foreign words
  • Long vowel mark "γƒΌ": Most unstable token across all datasets
  • Small kana: 28.1% error rate on JVS (e.g., "ぉ" confused with "ほ")

Citation

@misc{sakasegawa2026hiraganaasr,
  title={hiragana-asr: Lightweight Japanese ASR with Dual CTC},
  author={Sakasegawa},
  year={2026},
  url={https://github.com/nyosegawa/hiragana-asr}
}

License

Apache-2.0. Training data: ReazonSpeech (CDLA-Sharing-1.0; model weights are unrestricted).
