whisper-large-v3-yue-lora-dec-enc4

A fine-tune of openai/whisper-large-v3 for Cantonese (yue) speech recognition, trained on the Cantonese subset of Common Voice (mozilla-foundation/common_voice_17_0).

Evaluation Results

Metric	Value
CER (no punctuation)	3.28%
CER (raw)	4.03%
Eval Loss	0.0625
Best Step	36000
Best Epoch	9.08
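
The two CER figures above differ only in whether punctuation is counted. The following is a minimal sketch of how such a pair of numbers can be computed, not the evaluation script used for this model: the helper names (edit_distance, strip_punctuation, cer) are hypothetical, and the rule of dropping every character in a Unicode punctuation category is an assumption about what "no punctuation" means here.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def strip_punctuation(text: str) -> str:
    """Assumed normalization: drop characters in any Unicode punctuation category (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def cer(ref: str, hyp: str, ignore_punctuation: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
raw_cer = cer(reference, hypothesis)
nopunct_cer = cer(reference, hypothesis, ignore_punctuation=True)
print(f"raw: {raw_cer:.2%}  nopunct: {nopunct_cer:.2%}")
```

In this toy pair the hypothesis is missing only the two punctuation marks, so the raw CER is nonzero while the no-punctuation CER is zero, which is the same direction as the gap between the two rows in the table.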

Training History

Step	Epoch	Eval Loss	CER (nopunct)	CER (raw)
1000	0.03	0.2421	9.02%	11.76%
2000	0.05	0.2165	9.05%	11.38%
3000	0.08	0.2069	8.57%	11.01%
4000	1.01	0.1925	8.55%	10.60%
5000	1.04	0.1785	7.53%	9.69%
6000	1.06	0.1698	7.36%	9.47%
7000	1.09	0.1639	7.13%	9.23%
8000	2.02	0.1551	6.74%	8.63%
9000	2.05	0.1476	6.42%	8.43%
10000	2.07	0.1371	6.22%	7.99%
11000	2.10	0.1374	6.03%	7.91%
12000	3.03	0.1248	6.12%	7.79%
13000	3.05	0.1188	5.74%	7.34%
14000	3.08	0.1143	5.34%	6.96%
15000	4.01	0.1095	5.25%	6.60%
16000	4.04	0.1070	5.26%	6.52%
17000	4.06	0.0989	5.06%	6.28%
18000	4.09	0.0969	4.69%	5.96%
19000	5.02	0.0972	4.88%	6.03%
20000	5.04	0.0920	4.59%	5.78%
21000	5.07	0.0873	4.19%	5.22%
22000	5.10	0.0890	4.49%	5.56%
23000	6.03	0.0847	4.11%	5.18%
24000	6.05	0.0832	4.15%	5.32%
25000	6.08	0.0800	3.87%	4.91%
26000	7.01	0.0763	4.05%	4.97%
27000	7.04	0.0734	3.84%	4.64%
28000	7.06	0.0724	3.74%	4.65%
29000	7.09	0.0722	3.60%	4.53%
30000	8.02	0.0707	3.60%	4.47%
31000	8.04	0.0683	3.36%	4.17%
32000	8.07	0.0669	3.41%	4.17%
33000	8.10	0.0645	3.37%	4.19%
34000	9.03	0.0632	3.36%	4.16%
35000	9.05	0.0634	3.30%	4.10%
36000	9.08	0.0625	3.28%	4.03%
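
As a rough reading of the table above, the sketch below picks the row with the lowest CER (nopunct) from a handful of checkpoints and computes the relative CER reduction since the first evaluation. Only a subset of rows is copied in, and selecting by CER rather than by eval loss is an assumption; the card does not state which metric chose the best step.

```python
# (step, eval_loss, cer_nopunct_percent), copied from a subset of the rows above
history = [
    (1000, 0.2421, 9.02),
    (10000, 0.1371, 6.22),
    (20000, 0.0920, 4.59),
    (30000, 0.0707, 3.60),
    (35000, 0.0634, 3.30),
    (36000, 0.0625, 3.28),
]

# Lowest CER (nopunct) wins; in these rows the same step also has the lowest eval loss
best_step, best_loss, best_cer = min(history, key=lambda row: row[2])

# Relative reduction in CER from the first logged evaluation to the best one
first_cer = history[0][2]
relative_reduction = (first_cer - best_cer) / first_cer

print(f"best step: {best_step}, relative CER reduction: {relative_reduction:.1%}")
```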

Final Evaluation

Split	CER (nopunct)	CER (raw)
test_yue	4.03%	4.58%
holdback_yue	4.65%	5.21%

Training Details

Base model: openai/whisper-large-v3
Dataset: mozilla-foundation/common_voice_17_0 (yue)
Language: Cantonese (yue)
Task: Automatic Speech Recognition (ASR)
Architecture: Encoder-Decoder (Seq2Seq)
Metric: Character Error Rate (CER)
Total training steps: 36690

Training Metrics

TensorBoard logs are included in the runs/ directory of this repository.

# Clone and view locally
git clone https://huggingface.co/awong-dev/whisper-large-v3-yue-lora-dec-enc4
tensorboard --logdir whisper-large-v3-yue-lora-dec-enc4/runs

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torchaudio

processor = WhisperProcessor.from_pretrained("awong-dev/whisper-large-v3-yue-lora-dec-enc4")
model = WhisperForConditionalGeneration.from_pretrained("awong-dev/whisper-large-v3-yue-lora-dec-enc4")

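# Load audio and resample to the 16 kHz rate Whisper expects
audio, sr = torchaudio.load("audio.mp3")
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Downmix to mono: torchaudio returns (channels, samples), and squeeze()
# alone would leave a stereo file as a 2-D array
audio = audio.mean(dim=0)

# Convert the waveform to log-mel input features
input_features = processor(
    audio.numpy(), sampling_rate=16000, return_tensors="pt"
).input_features

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)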

Model size: 2B params (Safetensors)
Tensor type: F32
