whisper-large-v3-cantonese-tristage

Fine-tuned openai/whisper-large-v3 for Cantonese (yue) speech recognition on Common Voice.

Evaluation Results

| Metric | Value |
|---|---|
| CER (no punctuation) | 8.83% |
| CER (raw) | 11.96% |
| Eval Loss | 0.2169 |
| Best Step | 162000 |
| Best Epoch | 14.05 |
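
The two CER figures differ only in whether punctuation is removed before scoring. The card does not state the exact normalization used, so the following is a minimal sketch under stated assumptions: CER is the character-level edit distance divided by the reference length, and "no punctuation" means dropping every character in a Unicode punctuation category. The helper names `strip_punctuation`, `edit_distance`, and `cer` are illustrative, not taken from the training code.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters whose Unicode category is punctuation (P*).

    Assumption: this approximates the card's "no punctuation" normalization.
    """
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好,世界。"
hypothesis = "你好世界"

raw_cer = cer(reference, hypothesis)
nopunct_cer = cer(strip_punctuation(reference), strip_punctuation(hypothesis))
print(f"raw: {raw_cer:.2%}, no punctuation: {nopunct_cer:.2%}")
```

In this toy pair the hypothesis misses only the two punctuation marks, so the raw score counts two deletions while the no-punctuation score is zero. This is the kind of gap that separates the two rows above.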

Training History

| Step | Epoch | Eval Loss | CER (no punctuation) | CER (raw) |
|---|---|---|---|---|
| 2000 | 0.01 | 1.3117 | 27.13% | 31.91% |
| 4000 | 0.02 | 0.8686 | 12.97% | 17.53% |
| 6000 | 0.04 | 0.6329 | 10.92% | 15.26% |
| 8000 | 0.05 | 0.4910 | 10.51% | 14.51% |
| 10000 | 0.06 | 0.4129 | 10.31% | 14.18% |
| 12000 | 1.01 | 0.3756 | 10.06% | 13.73% |
| 14000 | 1.02 | 0.3514 | 10.04% | 13.59% |
| 16000 | 1.03 | 0.3345 | 10.11% | 13.55% |
| 18000 | 1.04 | 0.3219 | 10.20% | 13.50% |
| 20000 | 1.05 | 0.3086 | 10.16% | 13.31% |
| 22000 | 1.07 | 0.3003 | 10.25% | 13.33% |
| 24000 | 2.01 | 0.2930 | 10.13% | 13.19% |
| 26000 | 2.02 | 0.2854 | 10.00% | 12.96% |
| 28000 | 2.04 | 0.2789 | 9.92% | 12.85% |
| 30000 | 2.05 | 0.2749 | 9.89% | 12.79% |
| 32000 | 2.06 | 0.2693 | 9.84% | 12.75% |
| 34000 | 3.01 | 0.2657 | 9.71% | 12.61% |
| 36000 | 3.02 | 0.2624 | 9.78% | 12.69% |
| 38000 | 3.03 | 0.2594 | 9.69% | 12.62% |
| 40000 | 3.04 | 0.2575 | 9.63% | 12.55% |
| 42000 | 3.05 | 0.2558 | 9.67% | 12.64% |
| 44000 | 3.07 | 0.2524 | 9.56% | 12.51% |
| 46000 | 4.01 | 0.2524 | 9.51% | 12.52% |
| 48000 | 4.02 | 0.2496 | 9.47% | 12.50% |
| 50000 | 4.04 | 0.2491 | 9.41% | 12.43% |
| 52000 | 4.05 | 0.2461 | 9.46% | 12.46% |
| 54000 | 4.06 | 0.2437 | 9.39% | 12.40% |
| 56000 | 5.01 | 0.2430 | 9.40% | 12.41% |
| 58000 | 5.02 | 0.2426 | 9.39% | 12.41% |
| 60000 | 5.03 | 0.2418 | 9.34% | 12.39% |
| 62000 | 5.04 | 0.2402 | 9.41% | 12.49% |
| 64000 | 5.05 | 0.2398 | 9.32% | 12.38% |
| 66000 | 5.07 | 0.2373 | 9.28% | 12.31% |
| 68000 | 6.01 | 0.2379 | 9.25% | 12.33% |
| 70000 | 6.02 | 0.2362 | 9.25% | 12.31% |
| 72000 | 6.04 | 0.2351 | 9.22% | 12.28% |
| 74000 | 6.05 | 0.2345 | 9.20% | 12.26% |
| 76000 | 6.06 | 0.2331 | 9.16% | 12.23% |
| 78000 | 7.01 | 0.2326 | 9.21% | 12.24% |
| 80000 | 7.02 | 0.2324 | 9.24% | 12.27% |
| 82000 | 7.03 | 0.2320 | 9.21% | 12.27% |
| 84000 | 7.04 | 0.2300 | 9.11% | 12.16% |
| 86000 | 7.05 | 0.2303 | 9.07% | 12.14% |
| 88000 | 7.07 | 0.2298 | 9.08% | 12.14% |
| 90000 | 8.01 | 0.2291 | 9.15% | 12.20% |
| 92000 | 8.02 | 0.2285 | 9.03% | 12.10% |
| 94000 | 8.04 | 0.2273 | 8.93% | 11.99% |
| 96000 | 8.05 | 0.2271 | 8.99% | 12.04% |
| 98000 | 8.06 | 0.2259 | 8.93% | 11.99% |
| 100000 | 9.01 | 0.2258 | 8.93% | 12.02% |
| 102000 | 9.02 | 0.2253 | 8.98% | 12.11% |
| 104000 | 9.03 | 0.2259 | 8.94% | 12.03% |
| 106000 | 9.04 | 0.2242 | 8.96% | 12.04% |
| 108000 | 9.05 | 0.2234 | 8.97% | 12.09% |
| 110000 | 9.07 | 0.2241 | 9.03% | 12.11% |
| 112000 | 10.01 | 0.2233 | 8.97% | 12.05% |
| 114000 | 10.02 | 0.2233 | 8.99% | 12.07% |
| 116000 | 10.04 | 0.2217 | 8.89% | 11.97% |
| 118000 | 10.05 | 0.2215 | 8.97% | 12.05% |
| 120000 | 10.06 | 0.2207 | 8.96% | 12.03% |
| 122000 | 11.01 | 0.2201 | 9.06% | 12.16% |
| 124000 | 11.02 | 0.2198 | 8.96% | 12.01% |
| 126000 | 11.03 | 0.2190 | 8.92% | 11.97% |
| 128000 | 11.04 | 0.2197 | 8.89% | 11.97% |
| 130000 | 11.05 | 0.2188 | 8.97% | 12.08% |
| 132000 | 11.07 | 0.2189 | 8.95% | 12.05% |
| 134000 | 12.01 | 0.2186 | 8.95% | 12.03% |
| 136000 | 12.02 | 0.2183 | 8.90% | 12.02% |
| 138000 | 12.04 | 0.2184 | 8.92% | 12.01% |
| 140000 | 12.05 | 0.2183 | 8.94% | 12.03% |
| 142000 | 12.06 | 0.2182 | 8.95% | 12.04% |
| 144000 | 13.01 | 0.2175 | 8.94% | 12.03% |
| 146000 | 13.02 | 0.2173 | 8.89% | 11.99% |
| 148000 | 13.03 | 0.2174 | 8.93% | 12.03% |
| 150000 | 13.04 | 0.2171 | 8.89% | 12.00% |
| 152000 | 13.05 | 0.2171 | 8.84% | 11.95% |
| 154000 | 13.07 | 0.2171 | 8.86% | 11.99% |
| 156000 | 14.01 | 0.2174 | 8.92% | 12.04% |
| 158000 | 14.02 | 0.2173 | 8.86% | 11.97% |
| 160000 | 14.04 | 0.2171 | 8.85% | 11.95% |
| 162000 | 14.05 | 0.2169 | 8.83% | 11.96% |
| 164000 | 14.06 | 0.2170 | 8.84% | 11.95% |
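
The reported best step (162000) is the row with the lowest no-punctuation CER, and also the lowest eval loss. The card does not say which criterion was actually used to pick it, so the sketch below only illustrates how such a selection could be made. It uses the last four rows of the table, and the `best_checkpoint` helper is hypothetical.

```python
# (step, epoch, eval_loss, cer_nopunct_pct, cer_raw_pct), copied from the
# last four rows of the training history table.
history = [
    (158000, 14.02, 0.2173, 8.86, 11.97),
    (160000, 14.04, 0.2171, 8.85, 11.95),
    (162000, 14.05, 0.2169, 8.83, 11.96),
    (164000, 14.06, 0.2170, 8.84, 11.95),
]


def best_checkpoint(rows, key_index=3):
    """Return the row that minimizes the chosen column (default: no-punctuation CER)."""
    return min(rows, key=lambda row: row[key_index])


best = best_checkpoint(history)
print(best[0])  # prints 162000
```

Passing `key_index=2` selects by eval loss instead and returns the same row for this run.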

Training Details

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-cantonese-tristage
tensorboard --logdir whisper-large-v3-cantonese-tristage/runs

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

processor = WhisperProcessor.from_pretrained("awong-dev/whisper-large-v3-cantonese-tristage")
model = WhisperForConditionalGeneration.from_pretrained("awong-dev/whisper-large-v3-cantonese-tristage")

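# Load audio; torchaudio returns a (channels, samples) tensor
audio, sr = torchaudio.load("audio.mp3")
# Downmix to mono so stereo files are not treated as a batch of two clips
audio = audio.mean(dim=0)
# Whisper expects 16 kHz input
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features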

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
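
By default, the Whisper feature extractor pads or truncates each input to a 30-second window, so the snippet above covers at most the first 30 seconds of a recording. For longer files, one option is to split the waveform into consecutive windows and transcribe each in turn. The following is a minimal sketch of the index arithmetic only. The `chunk_bounds` helper is hypothetical, and non-overlapping hard cuts are an assumption made for simplicity.

```python
SAMPLE_RATE = 16000   # matches the resampling step in the usage snippet
WINDOW_SECONDS = 30   # Whisper's fixed input window


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Return (start, end) sample indices for consecutive non-overlapping windows."""
    window = sample_rate * window_seconds
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 70-second clip at 16 kHz yields two full windows and one 10-second remainder.
bounds = chunk_bounds(70 * SAMPLE_RATE)
print(bounds)
```

Each slice `audio[start:end]` can then go through the processor and `model.generate` as in the snippet above. Hard cuts can split a word across two windows. Overlapping windows, or the `chunk_length_s` option of the transformers automatic-speech-recognition pipeline, are alternatives worth considering.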