wav2vec2-ja-cv25-reazon-mixed

Japanese ASR model based on facebook/wav2vec2-large-xlsr-53.

Task: Automatic Speech Recognition (CTC)
Language: Japanese
License: apache-2.0
Base model: facebook/wav2vec2-large-xlsr-53
Datasets:
- Common Voice Scripted Speech 25.0 Japanese
- ReazonSpeech (small)
Decoding: greedy CTC, without external language model

Usage

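import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2Processor

repo_id = "takehika/wav2vec2-ja-cv25-reazon-mixed"

# Load the processor (feature extractor + tokenizer) and the CTC model.
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load the audio as mono and resample it to the 16 kHz rate the model expects.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token per frame, then let the
# tokenizer collapse repeated tokens and drop blanks.
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)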
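
The argmax and batch_decode steps in the usage code perform greedy CTC decoding. As a rough illustration of what happens to the per-frame IDs, the sketch below collapses consecutive repeats and then drops the blank token. The function name greedy_ctc_collapse and the blank ID in the example are hypothetical. In the transformers CTC tokenizer the pad token normally acts as the blank, but the actual ID depends on the saved vocab.

```python
from itertools import groupby


def greedy_ctc_collapse(frame_ids, blank_id):
    """Collapse consecutive repeated IDs, then remove the CTC blank.

    frame_ids: per-frame argmax token IDs for one utterance.
    blank_id:  ID of the CTC blank token (assumed here; check the vocab).
    """
    # groupby yields one key per run of identical consecutive IDs.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Blanks separate genuine repeats, so they are removed only after collapsing.
    return [token_id for token_id in collapsed if token_id != blank_id]


# Example with a made-up blank ID of 0: the two runs of 5 are separated
# by a blank, so both survive as distinct tokens.
print(greedy_ctc_collapse([0, 5, 5, 0, 5, 7, 7, 0], blank_id=0))  # [5, 5, 7]
```

Collapsing before removing blanks is what allows a doubled character in the transcript to be represented at all.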

Data

Datasets:
- Common Voice Scripted Speech 25.0 Japanese
- ReazonSpeech (small)

Text normalization:

- Unicode NFKC normalization
- Whitespace cleanup
- Keep Japanese characters, digits, and selected punctuation
- Remove most other symbols
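
The normalization steps listed above could look roughly like the following sketch. It is not the project's actual preprocessing code. The function name normalize_text and the exact character ranges (hiragana, katakana, and the basic CJK ideograph block) are assumptions chosen for illustration, and the real pipeline may keep a different set.

```python
import re
import unicodedata

# Assumed "keep" set: ASCII digits, hiragana, katakana, basic CJK ideographs,
# the punctuation 、。「」, and the space character.
_DISALLOWED = re.compile(r"[^0-9\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF、。「」 ]")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text):
    """Apply NFKC, collapse whitespace, and drop characters outside the keep set."""
    # NFKC folds full-width digits and ideographic spaces into ASCII forms.
    text = unicodedata.normalize("NFKC", text)
    text = _WHITESPACE.sub(" ", text)
    text = _DISALLOWED.sub("", text)
    # Removing symbols can leave doubled or edge spaces, so clean up again.
    return _WHITESPACE.sub(" ", text).strip()


print(normalize_text("週に４回、　フランスの授業！"))  # 週に4回、 フランスの授業
```

Under NFKC the full-width ！ becomes an ASCII !, which the filter then removes. That is in line with the vocab not containing ？ or ！.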

Tokenizer/vocab behavior:

- Character-level CTC vocab
- Built from normalized training text
- Space is mapped to |
- Includes digits 0-9
- Includes 、, 。, 「, 」
- Does not include ？, ！ in the final saved vocab
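
A character-level CTC vocab of this kind can be built in a few lines. The sketch below is an assumption about the general approach, not the project's script. The function name build_vocab is hypothetical, and the [UNK] and [PAD] special tokens follow a common wav2vec2 fine-tuning convention that the model card does not confirm.

```python
def build_vocab(normalized_texts):
    """Build a character-to-ID mapping from already-normalized transcripts."""
    # One entry per distinct character, in a stable sorted order.
    chars = sorted(set("".join(normalized_texts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}

    # Use "|" as the visible word delimiter in place of the space character.
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")

    # Assumed special tokens; [PAD] typically doubles as the CTC blank.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


vocab = build_vocab(["あい う", "う。"])
print(vocab)
```

The resulting dictionary can be written to a vocab.json file and loaded by a CTC tokenizer. Which characters end up in it depends entirely on the normalized training text.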

Evaluation

External evaluation results saved in this project:

- Common Voice test: CER = 0.3343 on 8,957 samples
- JSUT basic5000: CER = 0.2632 on 4,999 samples

CER was computed with nearly the same text normalization as the training preprocessing. References and predictions were both Unicode NFKC-normalized and whitespace-cleaned, | was mapped to a space, and only Japanese characters, digits 0-9, and selected punctuation were retained.
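
For readers unfamiliar with the metric, character error rate is the character-level edit distance divided by the reference length. The sketch below aggregates over a corpus as total edits over total reference characters. That is an assumption, since the model card does not say whether CER was pooled this way or averaged per utterance. The function names are hypothetical, and the inputs are assumed to be normalized already.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# One substituted character in a 10-character reference.
print(corpus_cer(["元気の出る曲をかけて"], ["元金の出る曲をかけて"]))  # 0.1
```

Because insertions count as errors, CER can exceed 1.0 when a prediction is much longer than its reference.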

Transcription Examples

Examples from the saved external evaluation:

Common Voice test

Reference: 私はその人の記憶を呼び起すごとに、すぐ「先生」といいたくなる。
Prediction: 私はその人の記憶を呼び起こすごとにすぐ先生とい痛くなる。

Reference: 元気の出る曲をかけて
Prediction: 元金の出る曲をかけて

Reference: 日本語は美しいです。
Prediction: 日本語は美しいです。

Reference: 札幌スクーリングへ行く
Prediction: 札幌札幌ロスクーリグへ行く

JSUT basic5000

Reference: 木曜日、停戦会談は、何の進展もないまま終了しました。
Prediction: 木曜日、低戦会段は何の浸点もないまま終了しました。

Reference: 週に四回、フランスの授業があります。
Prediction: 週に4回、フランスの授業があります。す。すすす。。

Reference: 大声で泣きながら、女の子は母親を探していた。
Prediction: 大声で泣きながら、女の子は母親を探していた。

Reference: 末期試験に備えて、本当に気合いを入れて勉強しなきゃ。
Prediction: 真記資県に備えて本当に気合を入れて勉強しなきゃ。

Intended Use and Limitations

- Intended use: Japanese ASR for 16 kHz speech.
- Robustness: performance may degrade on noisy speech, conversational speech, dialect-heavy speech, or domains far from Common Voice Scripted Speech and ReazonSpeech.
- Text style: punctuation handling and formatting may differ from natural writing due to CTC-style decoding and preprocessing.
- Reliability: outputs may include transcription errors; human review is recommended for high-stakes use.

Attribution

Base model:
- facebook/wav2vec2-large-xlsr-53 - Apache-2.0
Datasets:
- Common Voice Scripted Speech 25.0 Japanese - CC0-1.0
- ReazonSpeech - CDLA-Sharing-1.0 (see the dataset page for the license terms before use)

License

This model is licensed under Apache-2.0.

Model size: 0.3B params
Tensor type: F32 (Safetensors)
