whisper-large-v3-yue-lora-dec-enc4
Fine-tuned openai/whisper-large-v3 for Cantonese (yue) speech recognition on Common Voice.
Evaluation Results
| Metric | Value |
|---|---|
| CER (no punctuation) | 3.28% |
| CER (raw) | 4.03% |
| Eval Loss | 0.0625 |
| Best Step | 36000 |
| Best Epoch | 9.08 |
Training History
| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
|---|---|---|---|---|
| 1000 | 0.03 | 0.2421 | 9.02% | 11.76% |
| 2000 | 0.05 | 0.2165 | 9.05% | 11.38% |
| 3000 | 0.08 | 0.2069 | 8.57% | 11.01% |
| 4000 | 1.01 | 0.1925 | 8.55% | 10.60% |
| 5000 | 1.04 | 0.1785 | 7.53% | 9.69% |
| 6000 | 1.06 | 0.1698 | 7.36% | 9.47% |
| 7000 | 1.09 | 0.1639 | 7.13% | 9.23% |
| 8000 | 2.02 | 0.1551 | 6.74% | 8.63% |
| 9000 | 2.05 | 0.1476 | 6.42% | 8.43% |
| 10000 | 2.07 | 0.1371 | 6.22% | 7.99% |
| 11000 | 2.10 | 0.1374 | 6.03% | 7.91% |
| 12000 | 3.03 | 0.1248 | 6.12% | 7.79% |
| 13000 | 3.05 | 0.1188 | 5.74% | 7.34% |
| 14000 | 3.08 | 0.1143 | 5.34% | 6.96% |
| 15000 | 4.01 | 0.1095 | 5.25% | 6.60% |
| 16000 | 4.04 | 0.1070 | 5.26% | 6.52% |
| 17000 | 4.06 | 0.0989 | 5.06% | 6.28% |
| 18000 | 4.09 | 0.0969 | 4.69% | 5.96% |
| 19000 | 5.02 | 0.0972 | 4.88% | 6.03% |
| 20000 | 5.04 | 0.0920 | 4.59% | 5.78% |
| 21000 | 5.07 | 0.0873 | 4.19% | 5.22% |
| 22000 | 5.10 | 0.0890 | 4.49% | 5.56% |
| 23000 | 6.03 | 0.0847 | 4.11% | 5.18% |
| 24000 | 6.05 | 0.0832 | 4.15% | 5.32% |
| 25000 | 6.08 | 0.0800 | 3.87% | 4.91% |
| 26000 | 7.01 | 0.0763 | 4.05% | 4.97% |
| 27000 | 7.04 | 0.0734 | 3.84% | 4.64% |
| 28000 | 7.06 | 0.0724 | 3.74% | 4.65% |
| 29000 | 7.09 | 0.0722 | 3.60% | 4.53% |
| 30000 | 8.02 | 0.0707 | 3.60% | 4.47% |
| 31000 | 8.04 | 0.0683 | 3.36% | 4.17% |
| 32000 | 8.07 | 0.0669 | 3.41% | 4.17% |
| 33000 | 8.10 | 0.0645 | 3.37% | 4.19% |
| 34000 | 9.03 | 0.0632 | 3.36% | 4.16% |
| 35000 | 9.05 | 0.0634 | 3.30% | 4.10% |
| 36000 | 9.08 | 0.0625 | 3.28% | 4.03% |
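As a rough illustration of how the "Best Step" row relates to this history, the sketch below picks the row with the lowest CER (nopunct) from the last few entries and computes the relative CER reduction between step 1000 and step 36000. The selection criterion is an assumption; the card does not state which metric chose the best checkpoint. The helper names `best_checkpoint` and `relative_reduction` are hypothetical.

```python
# Rows copied from the end of the table above:
# (step, eval_loss, cer_nopunct_percent, cer_raw_percent)
history = [
    (33000, 0.0645, 3.37, 4.19),
    (34000, 0.0632, 3.36, 4.16),
    (35000, 0.0634, 3.30, 4.10),
    (36000, 0.0625, 3.28, 4.03),
]


def best_checkpoint(rows, metric_index=2):
    """Return the row with the smallest value in the chosen metric column.

    Assumption: lower CER (nopunct) is the selection criterion.
    """
    return min(rows, key=lambda row: row[metric_index])


def relative_reduction(start, end):
    """Fractional improvement from `start` to `end` (both in the same unit)."""
    return (start - end) / start


best = best_checkpoint(history)
print("best step:", best[0])
# CER (nopunct) at step 1000 vs. step 36000, taken from the table
print(f"relative CER reduction: {relative_reduction(9.02, 3.28):.1%}")
```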
Final Evaluation
| Split | CER (raw) | CER (nopunct) |
|---|---|---|
| test_yue | 4.58% | 4.03% |
| holdback_yue | 5.21% | 4.65% |
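The difference between the two CER columns can be sketched in a few lines of plain Python. This is not the evaluation script used for this model. It assumes "nopunct" means dropping Unicode punctuation and whitespace before scoring, and that CER is computed at corpus level (total character edits divided by total reference characters). The function names are hypothetical.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop Unicode punctuation (category P*) and whitespace characters."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P") and not ch.isspace()
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses, remove_punct=False):
    """Corpus-level CER: total edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if remove_punct:
            ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    return total_edits / total_chars if total_chars else 0.0


refs = ["你好，世界。"]
hyps = ["你好世界"]
print(cer(refs, hyps))                     # raw: missing punctuation counts as errors
print(cer(refs, hyps, remove_punct=True))  # nopunct: punctuation is ignored
```

Under these assumptions, a hypothesis that differs from the reference only in punctuation is penalised in the raw score but not in the nopunct score, which is consistent with the raw CER being the higher of the two in the table above.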
Training Details
- Base model: openai/whisper-large-v3
- Dataset: mozilla-foundation/common_voice_17_0 (yue)
- Language: Cantonese (yue)
- Task: Automatic Speech Recognition (ASR)
- Architecture: Encoder-Decoder (Seq2Seq)
- Metric: Character Error Rate (CER)
- Total training steps: 36690
Training Metrics
TensorBoard logs are included in the runs/ directory of this repository.
# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-lora-dec-enc4
tensorboard --logdir whisper-large-v3-yue-lora-dec-enc4/runs
Usage
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

model_id = "awong-dev/whisper-large-v3-yue-lora-dec-enc4"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)
# Downmix to mono so stereo files also yield a 1-D waveform
audio = audio.mean(dim=0)

input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
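By default `generate` lets Whisper detect the language, which may not always resolve to Cantonese. The sketch below pins the language and task explicitly, given a `model`, `processor`, and `input_features` prepared as in the usage snippet. It assumes the installed `transformers` version accepts `language="yue"` and that this repository's generation config includes the Cantonese language token; neither is confirmed by this card. The wrapper name `transcribe_cantonese` is hypothetical.

```python
def transcribe_cantonese(model, processor, input_features):
    """Decode with the language and task pinned instead of auto-detected.

    Assumption: the model's generation config knows the "yue" language code.
    """
    predicted_ids = model.generate(
        input_features,
        language="yue",     # assumed to be supported; see the note above
        task="transcribe",  # transcribe rather than translate to English
    )
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

If the language code is rejected, falling back to the plain `model.generate(input_features)` call from the usage snippet restores the default behaviour.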