whisper-large-v3-cantonese-tristage

A fine-tune of openai/whisper-large-v3 for Cantonese (yue) speech recognition, trained on the Cantonese subset of Common Voice 17.0 (mozilla-foundation/common_voice_17_0).

Evaluation Results

Self-reported results on the Common Voice (Cantonese) test set at the best checkpoint:

Metric	Value
CER (no punctuation)	8.83%
CER (raw)	11.96%
Eval Loss	0.2169
Best Step	162000
Best Epoch	14.05
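
The two CER rows differ only in whether punctuation is counted. The card does not state the exact text normalization used, so the following is a minimal sketch under one assumption: that "no punctuation" means dropping every character in a Unicode punctuation category before scoring. The function names and the example sentence pair are illustrative and do not come from the evaluation code or data.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category starts with 'P' (assumed normalization)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# Made-up sentence pair, not taken from the evaluation set.
reference = "你好，世界！"
hypothesis = "你好世界"
print(f"raw CER:            {cer(reference, hypothesis):.2%}")
print(f"no-punctuation CER: {cer(reference, hypothesis, ignore_punctuation=True):.2%}")
```

In this example the hypothesis misses only the two punctuation marks, so the raw CER is 2/6 while the no-punctuation CER is 0. The same effect, at a smaller scale, is one plausible reason the raw figure in the table sits about three points above the no-punctuation figure.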

Training History

Step	Epoch	Eval Loss	CER (no punctuation)	CER (raw)
2000	0.01	1.3117	27.13%	31.91%
4000	0.02	0.8686	12.97%	17.53%
6000	0.04	0.6329	10.92%	15.26%
8000	0.05	0.4910	10.51%	14.51%
10000	0.06	0.4129	10.31%	14.18%
12000	1.01	0.3756	10.06%	13.73%
14000	1.02	0.3514	10.04%	13.59%
16000	1.03	0.3345	10.11%	13.55%
18000	1.04	0.3219	10.20%	13.50%
20000	1.05	0.3086	10.16%	13.31%
22000	1.07	0.3003	10.25%	13.33%
24000	2.01	0.2930	10.13%	13.19%
26000	2.02	0.2854	10.00%	12.96%
28000	2.04	0.2789	9.92%	12.85%
30000	2.05	0.2749	9.89%	12.79%
32000	2.06	0.2693	9.84%	12.75%
34000	3.01	0.2657	9.71%	12.61%
36000	3.02	0.2624	9.78%	12.69%
38000	3.03	0.2594	9.69%	12.62%
40000	3.04	0.2575	9.63%	12.55%
42000	3.05	0.2558	9.67%	12.64%
44000	3.07	0.2524	9.56%	12.51%
46000	4.01	0.2524	9.51%	12.52%
48000	4.02	0.2496	9.47%	12.50%
50000	4.04	0.2491	9.41%	12.43%
52000	4.05	0.2461	9.46%	12.46%
54000	4.06	0.2437	9.39%	12.40%
56000	5.01	0.2430	9.40%	12.41%
58000	5.02	0.2426	9.39%	12.41%
60000	5.03	0.2418	9.34%	12.39%
62000	5.04	0.2402	9.41%	12.49%
64000	5.05	0.2398	9.32%	12.38%
66000	5.07	0.2373	9.28%	12.31%
68000	6.01	0.2379	9.25%	12.33%
70000	6.02	0.2362	9.25%	12.31%
72000	6.04	0.2351	9.22%	12.28%
74000	6.05	0.2345	9.20%	12.26%
76000	6.06	0.2331	9.16%	12.23%
78000	7.01	0.2326	9.21%	12.24%
80000	7.02	0.2324	9.24%	12.27%
82000	7.03	0.2320	9.21%	12.27%
84000	7.04	0.2300	9.11%	12.16%
86000	7.05	0.2303	9.07%	12.14%
88000	7.07	0.2298	9.08%	12.14%
90000	8.01	0.2291	9.15%	12.20%
92000	8.02	0.2285	9.03%	12.10%
94000	8.04	0.2273	8.93%	11.99%
96000	8.05	0.2271	8.99%	12.04%
98000	8.06	0.2259	8.93%	11.99%
100000	9.01	0.2258	8.93%	12.02%
102000	9.02	0.2253	8.98%	12.11%
104000	9.03	0.2259	8.94%	12.03%
106000	9.04	0.2242	8.96%	12.04%
108000	9.05	0.2234	8.97%	12.09%
110000	9.07	0.2241	9.03%	12.11%
112000	10.01	0.2233	8.97%	12.05%
114000	10.02	0.2233	8.99%	12.07%
116000	10.04	0.2217	8.89%	11.97%
118000	10.05	0.2215	8.97%	12.05%
120000	10.06	0.2207	8.96%	12.03%
122000	11.01	0.2201	9.06%	12.16%
124000	11.02	0.2198	8.96%	12.01%
126000	11.03	0.2190	8.92%	11.97%
128000	11.04	0.2197	8.89%	11.97%
130000	11.05	0.2188	8.97%	12.08%
132000	11.07	0.2189	8.95%	12.05%
134000	12.01	0.2186	8.95%	12.03%
136000	12.02	0.2183	8.90%	12.02%
138000	12.04	0.2184	8.92%	12.01%
140000	12.05	0.2183	8.94%	12.03%
142000	12.06	0.2182	8.95%	12.04%
144000	13.01	0.2175	8.94%	12.03%
146000	13.02	0.2173	8.89%	11.99%
148000	13.03	0.2174	8.93%	12.03%
150000	13.04	0.2171	8.89%	12.00%
152000	13.05	0.2171	8.84%	11.95%
154000	13.07	0.2171	8.86%	11.99%
156000	14.01	0.2174	8.92%	12.04%
158000	14.02	0.2173	8.86%	11.97%
160000	14.04	0.2171	8.85%	11.95%
162000	14.05	0.2169	8.83%	11.96%
164000	14.06	0.2170	8.84%	11.95%
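
The card does not say which metric was used to pick the best step. As a rough sketch, the snippet below takes the last seven rows of the table and finds the minimum of each column. The `EvalRow` type and variable names are illustrative only.

```python
from typing import NamedTuple


class EvalRow(NamedTuple):
    step: int
    eval_loss: float
    cer_nopunct: float  # percent
    cer_raw: float      # percent


# Last seven rows of the training history table, copied as-is.
history = [
    EvalRow(152000, 0.2171, 8.84, 11.95),
    EvalRow(154000, 0.2171, 8.86, 11.99),
    EvalRow(156000, 0.2174, 8.92, 12.04),
    EvalRow(158000, 0.2173, 8.86, 11.97),
    EvalRow(160000, 0.2171, 8.85, 11.95),
    EvalRow(162000, 0.2169, 8.83, 11.96),
    EvalRow(164000, 0.2170, 8.84, 11.95),
]

best_by_loss = min(history, key=lambda r: r.eval_loss)
best_by_nopunct = min(history, key=lambda r: r.cer_nopunct)

# Raw CER has ties, so collect every step that reaches the minimum.
lowest_raw = min(r.cer_raw for r in history)
steps_with_lowest_raw = [r.step for r in history if r.cer_raw == lowest_raw]

print("lowest eval loss at step:", best_by_loss.step)
print("lowest CER (no punctuation) at step:", best_by_nopunct.step)
print("lowest CER (raw) at steps:", steps_with_lowest_raw)
```

Step 162000 has both the lowest eval loss and the lowest no-punctuation CER, which is consistent with the reported best step. The raw CER is marginally lower (11.95%) at three neighbouring checkpoints, so the selection was presumably not based on the raw figure.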

Training Details

Base model: openai/whisper-large-v3
Dataset: mozilla-foundation/common_voice_17_0 (yue)
Language: Cantonese (yue)
Task: Automatic Speech Recognition (ASR)
Architecture: Encoder-Decoder (Seq2Seq)
Metric: Character Error Rate (CER)
Total training steps: 164000

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-cantonese-tristage
tensorboard --logdir whisper-large-v3-cantonese-tristage/runs

Usage

import torch
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "awong-dev/whisper-large-v3-cantonese-tristage"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Load audio and resample to the 16 kHz rate Whisper expects
audio, sr = torchaudio.load("audio.mp3")
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Down-mix to mono: torchaudio returns [channels, samples], and
# squeeze() alone would leave a stereo file as a 2-D array
audio = audio.mean(dim=0)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs without tracking gradients, then decode to text
with torch.no_grad():
    predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
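
Whisper works on a fixed 30-second input window, and the feature extractor pads or truncates to that length, so the snippet above covers only about the first 30 seconds of a longer recording. One naive workaround, sketched here, is to cut the waveform into consecutive 30-second windows and run each through the processor and `model.generate` in turn. The helper `iter_windows` and its parameters are hypothetical names for this sketch and are not part of transformers or this repository.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000   # Hz, the rate used in the usage code above
WINDOW_SECONDS = 30    # Whisper's fixed input window


def iter_windows(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping windows of at most `window_seconds`."""
    window = sample_rate * window_seconds
    for start in range(0, len(samples), window):
        yield samples[start:start + window]


# 70 seconds of silence stands in for a real recording.
waveform = [0.0] * (70 * SAMPLE_RATE)
windows = list(iter_windows(waveform))
print([len(w) / SAMPLE_RATE for w in windows])
```

Slicing works the same way on a 1-D NumPy array or tensor. Hard cuts like these can split a word across two windows. The transformers automatic-speech-recognition pipeline exposes a `chunk_length_s` argument for overlap-aware chunking, which may be the better choice for long audio.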

Model size: 2B params (Safetensors)
Tensor type: F32
