wav2vec2-ja-cv25-reazon-mixed

Japanese ASR model based on facebook/wav2vec2-large-xlsr-53.

  • Task: Automatic Speech Recognition (CTC)
  • Language: Japanese
  • License: apache-2.0
  • Base model: facebook/wav2vec2-large-xlsr-53
  • Dataset:
    • Common Voice Scripted Speech 25.0 Japanese
    • ReazonSpeech (small)
  • Decoding: greedy CTC, without external language model

Usage

import torch
import librosa
from transformers import AutoModelForCTC, Wav2Vec2Processor

repo_id = "takehika/wav2vec2-ja-cv25-reazon-mixed"

processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = AutoModelForCTC.from_pretrained(repo_id)

# Load the audio as mono and resample it to the 16 kHz rate the model expects.
speech, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits

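# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# then collapses repeated tokens and drops CTC blanks.
pred_ids = torch.argmax(logits, dim=-1)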
text = processor.batch_decode(pred_ids)[0]
print(text)
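
The greedy CTC step that argmax and batch_decode perform above can be sketched in plain Python. This is only an illustration under assumptions: the toy vocabulary, the token ids, and the choice of id 0 as the blank are hypothetical and are not taken from this model's saved vocab (in Wav2Vec2 CTC models the padding token usually plays the role of the blank).

```python
from itertools import groupby

def greedy_ctc_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop the blank id."""
    collapsed = [key for key, _ in groupby(frame_ids)]
    return [i for i in collapsed if i != blank_id]

# Hypothetical toy vocabulary; id 0 is assumed to be the CTC blank.
id_to_char = {0: "[PAD]", 1: "日", 2: "本", 3: "語", 4: "|"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce them.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 1]

token_ids = greedy_ctc_collapse(frame_ids)
# The word delimiter "|" is turned back into a space after collapsing.
text = "".join(id_to_char[i] for i in token_ids).replace("|", " ")
print(text)
```

Because no language model rescoring is involved, each frame is decided independently, which is one reason near-homophone errors can appear in the output.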

Data

Text normalization:

  • Unicode NFKC normalization
  • Whitespace cleanup
  • Keep Japanese characters, digits, and selected punctuation
  • Remove most other symbols
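
The normalization steps above could look roughly like the following sketch. The function name normalize_text, the Unicode ranges, and the kept punctuation (、 and 。) are assumptions made for illustration; the exact character set used for this model is not reproduced here.

```python
import re
import unicodedata

# Assumed allow-list: hiragana, katakana, common CJK ideographs,
# ASCII digits, the space, and two punctuation marks. Hypothetical.
_ALLOWED = (
    "\u3040-\u309F"   # hiragana
    "\u30A0-\u30FF"   # katakana
    "\u4E00-\u9FFF"   # CJK unified ideographs
    "0-9"
    "、。 "
)
_DISALLOWED = re.compile("[^" + _ALLOWED + "]")

def normalize_text(text: str) -> str:
    # NFKC folds full-width digits, half-width katakana, and the
    # ideographic space into their canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Turn every whitespace run (tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop everything outside the assumed allow-list.
    text = _DISALLOWED.sub("", text)
    # Removing symbols can leave doubled spaces, so clean up again.
    return re.sub(r" +", " ", text).strip()

print(normalize_text("１２３　日本語！"))
```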

Tokenizer/vocab behavior:

  • Character-level CTC vocab
  • Built from normalized training text
  • Space is mapped to |
  • Includes digits 0-9
  • Includes , , ,
  • Does not include , in the final saved vocab
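
A character-level CTC vocab of this kind is typically built by collecting the unique characters of the normalized text. The sketch below is a generic illustration, not the script used for this model; the function name build_ctc_vocab and the special tokens [UNK] and [PAD] are assumptions.

```python
def build_ctc_vocab(texts, word_delimiter="|", unk_token="[UNK]", pad_token="[PAD]"):
    """Map each unique character to an integer id."""
    chars = set()
    for text in texts:
        chars.update(text)
    # The space is represented by the word delimiter in the vocab.
    chars.discard(" ")
    chars.add(word_delimiter)
    vocab = {ch: idx for idx, ch in enumerate(sorted(chars))}
    # Assumed special tokens; the pad token usually doubles as the CTC blank.
    vocab[unk_token] = len(vocab)
    vocab[pad_token] = len(vocab)
    return vocab

vocab = build_ctc_vocab(["日本語 です", "123"])
print(len(vocab), "|" in vocab, " " in vocab)
```

Characters that never occur in the normalized training text do not get an id, which is how a symbol can be allowed by the normalization rules and still be missing from the final saved vocab.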

Evaluation

External evaluation results saved in this project:

  • Common Voice test: CER = 0.3343 (33.43%) on 8,957 samples
  • JSUT basic5000: CER = 0.2632 (26.32%) on 4,999 samples

CER was computed with nearly the same text normalization as the training preprocessing. Both references and predictions were Unicode NFKC-normalized and whitespace-cleaned, | was mapped to a space, and only Japanese characters, digits 0-9, and selected punctuation were retained.
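
For reference, character error rate is the character-level edit distance divided by the number of reference characters. The sketch below assumes corpus-level aggregation (total edits over total reference characters) and assumes the inputs are already normalized; whether the reported scores were aggregated this way or averaged per utterance is not stated here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, predictions):
    """Corpus-level CER: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars

# One substitution in a 10-character reference.
print(cer(["元気の出る曲をかけて"], ["元金の出る曲をかけて"]))  # 0.1
```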

Transcription Examples

Examples from the saved external evaluation:

Common Voice test

Reference: 私はその人の記憶を呼び起すごとに、すぐ「先生」といいたくなる。
Prediction: 私はその人の記憶を呼び起こすごとにすぐ先生とい痛くなる。

Reference: 元気の出る曲をかけて
Prediction: 元金の出る曲をかけて

Reference: 日本語は美しいです。
Prediction: 日本語は美しいです。

Reference: 札幌スクーリングへ行く
Prediction: 札幌札幌ロスクーリグへ行く

JSUT basic5000

Reference: 木曜日、停戦会談は、何の進展もないまま終了しました。
Prediction: 木曜日、低戦会段は何の浸点もないまま終了しました。

Reference: 週に四回、フランスの授業があります。
Prediction: 週に4回、フランスの授業があります。す。すすす。。

Reference: 大声で泣きながら、女の子は母親を探していた。
Prediction: 大声で泣きながら、女の子は母親を探していた。

Reference: 末期試験に備えて、本当に気合いを入れて勉強しなきゃ。
Prediction: 真記資県に備えて本当に気合を入れて勉強しなきゃ。

Intended Use and Limitations

  • Intended use: Japanese ASR for 16 kHz speech.
  • Performance may degrade on noisy speech, conversational speech, dialect-heavy speech, or domains far from Common Voice Scripted Speech and ReazonSpeech.
  • Text style: punctuation handling and formatting may differ from natural writing due to CTC-style decoding and preprocessing.
  • Reliability: outputs may include transcription errors; human review is recommended for high-stakes use.

Attribution

License

This model is licensed under Apache-2.0.
