# wav2vec2-ja-cv25-reazon-mixed

Japanese ASR model based on facebook/wav2vec2-large-xlsr-53.

- Task: Automatic Speech Recognition (CTC)
- Language: Japanese
- License: apache-2.0
- Base model: facebook/wav2vec2-large-xlsr-53
- Datasets:
  - Common Voice Scripted Speech 25.0 Japanese
  - ReazonSpeech (small)
- Decoding: greedy CTC, without an external language model

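
Greedy CTC decoding takes the per-frame argmax, collapses consecutive repeats, and drops the blank token. A minimal sketch of the idea (the token IDs and mapping below are hypothetical; in practice the processor's `batch_decode` performs this step):

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(id_to_char[i])
        prev = i
    return "".join(out)

# Hypothetical example: blank=0, 1='日', 2='本'
print(greedy_ctc_decode([1, 1, 0, 2, 2, 0, 0], {1: "日", 2: "本"}))  # → 日本
```

Note that a blank between two identical tokens separates them, which is how CTC can emit genuine double characters.
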
## Usage

```python
import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2Processor

repo_id = "takehika/wav2vec2-ja-cv25-reazon-mixed"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load audio at the 16 kHz sampling rate the model expects
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: per-frame argmax, collapsed/cleaned by the tokenizer
pred_ids = torch.argmax(logits, dim=-1)
text = processor.batch_decode(pred_ids)[0]
print(text)
```
## Data

Text normalization:
- Unicode NFKC normalization
- Whitespace cleanup
- Keep Japanese characters, digits, and selected punctuation
- Remove most other symbols
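
The normalization steps above can be sketched as follows. The exact character set kept by the training pipeline is not published, so the regex below is an assumption (hiragana, katakana, CJK ideographs, ASCII digits, and the punctuation listed in the vocab section):

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Approximate the card's normalization: NFKC, whitespace cleanup,
    then keep Japanese characters, digits, and selected punctuation.
    The keep-set regex is a hypothetical reconstruction."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Hiragana, katakana (incl. 'ー'), CJK ideographs, digits, 、。「」, space
    text = re.sub(r"[^\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF0-9、。「」 ]", "", text)
    return text

print(normalize_text("週に４回"))  # → 週に4回 (full-width digit folded by NFKC)
```
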
Tokenizer/vocab behavior:

- Character-level CTC vocab
- Built from normalized training text
- Space is mapped to `|`
- Includes digits `0-9`
- Includes `、`, `。`, `「`, `」`
- Does not include `?`, `!` in the final saved vocab

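
A character-level CTC vocab of this kind can be built from normalized text roughly as follows (a hypothetical helper for illustration, not the card's actual build script):

```python
def build_char_vocab(texts):
    """Character-level CTC vocab: unique characters from normalized text,
    with space remapped to the CTC word-delimiter token '|'."""
    chars = sorted({ch for text in texts for ch in text})
    vocab = {ch: i for i, ch in enumerate(chars)}
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")  # space -> '|'
    # Special tokens appended at the end
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

vocab = build_char_vocab(["日本語は 美しいです。", "週に4回"])
print("|" in vocab and " " not in vocab)  # → True
```
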
## Evaluation

External evaluation results saved in this project:

- Common Voice test: CER = 0.3343 on 8,957 samples
- JSUT basic5000: CER = 0.2632 on 4,999 samples

CER was computed using nearly the same text normalization as the training preprocessing: both references and predictions were Unicode NFKC-normalized and whitespace-cleaned, `|` was mapped to a space, and only Japanese characters, digits 0-9, and selected punctuation were retained.
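
Character error rate is the character-level Levenshtein (edit) distance divided by the reference length. A self-contained sketch (the project's evaluation script may instead use a library such as jiwer):

```python
def cer(reference: str, prediction: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(prediction)
    # prev[j] = edit distance between reference[:i-1] and prediction[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == prediction[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

# One substitution (気 -> 金) over a 10-character reference
print(cer("元気の出る曲をかけて", "元金の出る曲をかけて"))  # → 0.1
```
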
## Transcription Examples

Examples from the saved external evaluation:

### Common Voice test
Reference: 私はその人の記憶を呼び起すごとに、すぐ「先生」といいたくなる。
Prediction: 私はその人の記憶を呼び起こすごとにすぐ先生とい痛くなる。
Reference: 元気の出る曲をかけて
Prediction: 元金の出る曲をかけて
Reference: 日本語は美しいです。
Prediction: 日本語は美しいです。
Reference: 札幌スクーリングへ行く
Prediction: 札幌札幌ロスクーリグへ行く
### JSUT basic5000
Reference: 木曜日、停戦会談は、何の進展もないまま終了しました。
Prediction: 木曜日、低戦会段は何の浸点もないまま終了しました。
Reference: 週に四回、フランスの授業があります。
Prediction: 週に4回、フランスの授業があります。す。すすす。。
Reference: 大声で泣きながら、女の子は母親を探していた。
Prediction: 大声で泣きながら、女の子は母親を探していた。
Reference: 末期試験に備えて、本当に気合いを入れて勉強しなきゃ。
Prediction: 真記資県に備えて本当に気合を入れて勉強しなきゃ。
## Intended Use and Limitations
- Intended use: Japanese ASR for 16 kHz speech.
- Performance may degrade on noisy speech, conversational speech, dialect-heavy speech, or domains far from Common Voice Scripted Speech and ReazonSpeech.
- Text style: punctuation handling and formatting may differ from natural writing due to CTC-style decoding and preprocessing.
- Reliability: outputs may include transcription errors; human review is recommended for high-stakes use.
## Attribution

- Base model: facebook/wav2vec2-large-xlsr-53 - Apache-2.0
- Datasets:
  - Common Voice Scripted Speech 25.0 Japanese - CC0-1.0
  - ReazonSpeech - CDLA-Sharing-1.0 (see the dataset page for license/terms before use)

## License

This model is licensed under Apache-2.0.