whisper-large-v3-cantonese-tristage
Fine-tuned openai/whisper-large-v3 for Cantonese (yue) speech recognition on Common Voice.
Evaluation Results
| Metric | Value |
|---|---|
| CER (no punctuation) | 8.83% |
| CER (raw) | 11.96% |
| Eval Loss | 0.2169 |
| Best Step | 162000 |
| Best Epoch | 14.05 |
Training History
| Step | Epoch | Eval Loss | CER (nopunct) | CER (raw) |
|---|---|---|---|---|
| 2000 | 0.01 | 1.3117 | 27.13% | 31.91% |
| 4000 | 0.02 | 0.8686 | 12.97% | 17.53% |
| 6000 | 0.04 | 0.6329 | 10.92% | 15.26% |
| 8000 | 0.05 | 0.4910 | 10.51% | 14.51% |
| 10000 | 0.06 | 0.4129 | 10.31% | 14.18% |
| 12000 | 1.01 | 0.3756 | 10.06% | 13.73% |
| 14000 | 1.02 | 0.3514 | 10.04% | 13.59% |
| 16000 | 1.03 | 0.3345 | 10.11% | 13.55% |
| 18000 | 1.04 | 0.3219 | 10.20% | 13.50% |
| 20000 | 1.05 | 0.3086 | 10.16% | 13.31% |
| 22000 | 1.07 | 0.3003 | 10.25% | 13.33% |
| 24000 | 2.01 | 0.2930 | 10.13% | 13.19% |
| 26000 | 2.02 | 0.2854 | 10.00% | 12.96% |
| 28000 | 2.04 | 0.2789 | 9.92% | 12.85% |
| 30000 | 2.05 | 0.2749 | 9.89% | 12.79% |
| 32000 | 2.06 | 0.2693 | 9.84% | 12.75% |
| 34000 | 3.01 | 0.2657 | 9.71% | 12.61% |
| 36000 | 3.02 | 0.2624 | 9.78% | 12.69% |
| 38000 | 3.03 | 0.2594 | 9.69% | 12.62% |
| 40000 | 3.04 | 0.2575 | 9.63% | 12.55% |
| 42000 | 3.05 | 0.2558 | 9.67% | 12.64% |
| 44000 | 3.07 | 0.2524 | 9.56% | 12.51% |
| 46000 | 4.01 | 0.2524 | 9.51% | 12.52% |
| 48000 | 4.02 | 0.2496 | 9.47% | 12.50% |
| 50000 | 4.04 | 0.2491 | 9.41% | 12.43% |
| 52000 | 4.05 | 0.2461 | 9.46% | 12.46% |
| 54000 | 4.06 | 0.2437 | 9.39% | 12.40% |
| 56000 | 5.01 | 0.2430 | 9.40% | 12.41% |
| 58000 | 5.02 | 0.2426 | 9.39% | 12.41% |
| 60000 | 5.03 | 0.2418 | 9.34% | 12.39% |
| 62000 | 5.04 | 0.2402 | 9.41% | 12.49% |
| 64000 | 5.05 | 0.2398 | 9.32% | 12.38% |
| 66000 | 5.07 | 0.2373 | 9.28% | 12.31% |
| 68000 | 6.01 | 0.2379 | 9.25% | 12.33% |
| 70000 | 6.02 | 0.2362 | 9.25% | 12.31% |
| 72000 | 6.04 | 0.2351 | 9.22% | 12.28% |
| 74000 | 6.05 | 0.2345 | 9.20% | 12.26% |
| 76000 | 6.06 | 0.2331 | 9.16% | 12.23% |
| 78000 | 7.01 | 0.2326 | 9.21% | 12.24% |
| 80000 | 7.02 | 0.2324 | 9.24% | 12.27% |
| 82000 | 7.03 | 0.2320 | 9.21% | 12.27% |
| 84000 | 7.04 | 0.2300 | 9.11% | 12.16% |
| 86000 | 7.05 | 0.2303 | 9.07% | 12.14% |
| 88000 | 7.07 | 0.2298 | 9.08% | 12.14% |
| 90000 | 8.01 | 0.2291 | 9.15% | 12.20% |
| 92000 | 8.02 | 0.2285 | 9.03% | 12.10% |
| 94000 | 8.04 | 0.2273 | 8.93% | 11.99% |
| 96000 | 8.05 | 0.2271 | 8.99% | 12.04% |
| 98000 | 8.06 | 0.2259 | 8.93% | 11.99% |
| 100000 | 9.01 | 0.2258 | 8.93% | 12.02% |
| 102000 | 9.02 | 0.2253 | 8.98% | 12.11% |
| 104000 | 9.03 | 0.2259 | 8.94% | 12.03% |
| 106000 | 9.04 | 0.2242 | 8.96% | 12.04% |
| 108000 | 9.05 | 0.2234 | 8.97% | 12.09% |
| 110000 | 9.07 | 0.2241 | 9.03% | 12.11% |
| 112000 | 10.01 | 0.2233 | 8.97% | 12.05% |
| 114000 | 10.02 | 0.2233 | 8.99% | 12.07% |
| 116000 | 10.04 | 0.2217 | 8.89% | 11.97% |
| 118000 | 10.05 | 0.2215 | 8.97% | 12.05% |
| 120000 | 10.06 | 0.2207 | 8.96% | 12.03% |
| 122000 | 11.01 | 0.2201 | 9.06% | 12.16% |
| 124000 | 11.02 | 0.2198 | 8.96% | 12.01% |
| 126000 | 11.03 | 0.2190 | 8.92% | 11.97% |
| 128000 | 11.04 | 0.2197 | 8.89% | 11.97% |
| 130000 | 11.05 | 0.2188 | 8.97% | 12.08% |
| 132000 | 11.07 | 0.2189 | 8.95% | 12.05% |
| 134000 | 12.01 | 0.2186 | 8.95% | 12.03% |
| 136000 | 12.02 | 0.2183 | 8.90% | 12.02% |
| 138000 | 12.04 | 0.2184 | 8.92% | 12.01% |
| 140000 | 12.05 | 0.2183 | 8.94% | 12.03% |
| 142000 | 12.06 | 0.2182 | 8.95% | 12.04% |
| 144000 | 13.01 | 0.2175 | 8.94% | 12.03% |
| 146000 | 13.02 | 0.2173 | 8.89% | 11.99% |
| 148000 | 13.03 | 0.2174 | 8.93% | 12.03% |
| 150000 | 13.04 | 0.2171 | 8.89% | 12.00% |
| 152000 | 13.05 | 0.2171 | 8.84% | 11.95% |
| 154000 | 13.07 | 0.2171 | 8.86% | 11.99% |
| 156000 | 14.01 | 0.2174 | 8.92% | 12.04% |
| 158000 | 14.02 | 0.2173 | 8.86% | 11.97% |
| 160000 | 14.04 | 0.2171 | 8.85% | 11.95% |
| 162000 | 14.05 | 0.2169 | 8.83% | 11.96% |
| 164000 | 14.06 | 0.2170 | 8.84% | 11.95% |
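The "Best Step" reported under Evaluation Results can be cross-checked against this table. The sketch below is illustrative only: it copies the last seven rows of the table and assumes the best checkpoint is the one with the lowest no-punctuation CER (the card does not state the selection criterion). `HISTORY`, `COLUMNS`, and `best_step` are hypothetical names and are not part of the training code.

```python
# Last rows of the Training History table:
# (step, eval_loss, cer_nopunct_percent, cer_raw_percent)
HISTORY = [
    (152000, 0.2171, 8.84, 11.95),
    (154000, 0.2171, 8.86, 11.99),
    (156000, 0.2174, 8.92, 12.04),
    (158000, 0.2173, 8.86, 11.97),
    (160000, 0.2171, 8.85, 11.95),
    (162000, 0.2169, 8.83, 11.96),
    (164000, 0.2170, 8.84, 11.95),
]

# Column index of each metric inside a HISTORY row
COLUMNS = {"eval_loss": 1, "cer_nopunct": 2, "cer_raw": 3}


def best_step(history, metric):
    """Return the step whose value for `metric` is lowest (lower is better)."""
    idx = COLUMNS[metric]
    return min(history, key=lambda row: row[idx])[0]


print(best_step(HISTORY, "cer_nopunct"))  # 162000
print(best_step(HISTORY, "eval_loss"))    # 162000
```

Under this assumption, both the no-punctuation CER and the eval loss select step 162000, which matches the reported best step. Raw CER reaches its minimum of 11.95% at several other steps, so it does not appear to be the selection criterion.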
Training Details
- Base model: openai/whisper-large-v3
- Dataset: mozilla-foundation/common_voice_17_0 (yue)
- Language: Cantonese (yue)
- Task: Automatic Speech Recognition (ASR)
- Architecture: Encoder-Decoder (Seq2Seq)
- Metric: Character Error Rate (CER)
- Total training steps: 164000
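The two CER figures differ only in whether punctuation counts toward the error. The following is a minimal sketch of that distinction, not the evaluation script used for this model: it assumes "no punctuation" means dropping every character in a Unicode punctuation category before scoring, and it computes CER as character-level edit distance divided by reference length. `strip_punctuation`, `edit_distance`, and `cer` are hypothetical helper names.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
print(cer(reference, hypothesis))                           # raw: 2 / 6
print(cer(reference, hypothesis, ignore_punctuation=True))  # 0.0
```

In this example the hypothesis has every character right but omits both punctuation marks, so the raw CER is penalized while the no-punctuation CER is zero. This is consistent with the raw CER sitting about three points above the no-punctuation CER throughout the training history.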
Training Metrics
TensorBoard logs are included in the runs/ directory of this repository.
# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-cantonese-tristage
tensorboard --logdir whisper-large-v3-cantonese-tristage/runs
Usage
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "awong-dev/whisper-large-v3-cantonese-tristage"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load audio, downmix to mono, and resample to the 16 kHz rate Whisper expects
audio, sr = torchaudio.load("audio.mp3")
audio = audio.mean(dim=0)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Convert the waveform to log-Mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)