How to use azeem-ahmed/wav2vec2-xls-r-1b-urdu with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="azeem-ahmed/wav2vec2-xls-r-1b-urdu")
# Load model directly
from transformers import AutoProcessor, AutoModelForCTC
processor = AutoProcessor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = AutoModelForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")

This repository hosts a fine-tuned version of facebook/wav2vec2-xls-r-1b for Automatic Speech Recognition (ASR) in Urdu.
The model has been trained on the Common Voice Corpus 22.0 (Urdu subset) with extensive enhancements in preprocessing, error handling, and training monitoring.
facebook/wav2vec2-xls-r-1b
Hyperparameters
Metrics
At the final checkpoint (step 27480, epoch 30), the training log below reports a word error rate (WER) of 0.3376 and a character error rate (CER) of 0.2701.
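For readers unfamiliar with the two metrics in this card, the following is a minimal sketch of how WER and CER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length, counted over words or over characters. It is written in plain Python for illustration. The install line in this card lists `jiwer`, which suggests that library was used for the reported numbers, but that is an assumption. The function names and the example strings here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    # Assumes a non-empty reference; an empty one would divide by zero.
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcripts, used only to exercise the functions.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"CER = {cer(reference, hypothesis):.4f}")
```

Both metrics are ratios where lower is better, and either can exceed 1.0 when the hypothesis contains many insertions.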
The model was trained for 30 epochs (27,480 steps) with a batch size tuned for an RTX 4090. Training loss, validation loss, WER, and CER were logged every 1,000 steps:
| Step | Epoch | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 1000 | 1.09 | 3.1996 | 1.0216 | 0.6107 | 0.4886 |
| 2000 | 2.18 | 5.5422 | 0.8069 | 0.4751 | 0.3801 |
| 3000 | 3.28 | 3.8995 | 0.7641 | 0.4441 | 0.3553 |
| 4000 | 4.37 | 1.7375 | 0.714 | 0.4175 | 0.334 |
| 5000 | 5.46 | 1.8486 | 0.7205 | 0.3998 | 0.3198 |
| 6000 | 6.55 | 4.2864 | 0.6949 | 0.397 | 0.3176 |
| 7000 | 7.64 | 5.7143 | 0.7016 | 0.3783 | 0.3026 |
| 8000 | 8.73 | 3.0777 | 0.6733 | 0.3817 | 0.3053 |
| 9000 | 9.83 | 3.3163 | 0.6827 | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399 | 0.6645 | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039 | 0.7104 | 0.3684 | 0.2947 |
| 12000 | 13.1 | 2.7625 | 0.693 | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.189 | 0.7066 | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301 | 0.7281 | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099 | 0.7179 | 0.354 | 0.2832 |
| 16000 | 17.47 | 2.191 | 0.7339 | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916 | 0.7245 | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375 | 0.7599 | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273 | 0.7414 | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164 | 0.7877 | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591 | 0.7595 | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049 | 0.7994 | 0.343 | 0.2744 |
| 23000 | 25.11 | 4.7571 | 0.8182 | 0.3457 | 0.2766 |
| 24000 | 26.2 | 2.9164 | 0.8067 | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302 | 0.8132 | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031 | 0.8328 | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038 | 0.8367 | 0.3375 | 0.27 |
| 27480 | 30 | 5.8839 | 0.8261 | 0.3376 | 0.2701 |
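In the table above, validation loss is lowest at step 10000 (0.6645) and drifts upward afterwards, while WER keeps falling until step 27000 (0.3375). Which checkpoint counts as "best" therefore depends on the metric chosen. The sketch below illustrates this with a hand-copied subset of rows. The selection criteria are illustrative assumptions, since the card does not say how the published checkpoint was picked.

```python
# Subset of rows copied from the training log:
# (step, epoch, validation_loss, wer, cer)
rows = [
    (1000, 1.09, 1.0216, 0.6107, 0.4886),
    (10000, 10.92, 0.6645, 0.3647, 0.2918),
    (20000, 21.83, 0.7877, 0.3519, 0.2815),
    (27000, 29.48, 0.8367, 0.3375, 0.2700),
    (27480, 30.0, 0.8261, 0.3376, 0.2701),
]

# Pick the checkpoint that minimises each metric.
best_by_loss = min(rows, key=lambda r: r[2])
best_by_wer = min(rows, key=lambda r: r[3])

# Steps per epoch, derived from the final row (27480 steps over 30 epochs).
steps_per_epoch = rows[-1][0] / rows[-1][1]

print("lowest validation loss at step", best_by_loss[0])
print("lowest WER at step", best_by_wer[0])
print("steps per epoch:", steps_per_epoch)
```

For an ASR model, selecting on WER is the more common choice, because CTC validation loss can rise as the model becomes more confident even while transcription accuracy improves.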
pip install torch librosa soundfile transformers datasets jiwer
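from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf

# Load the processor (feature extractor + tokenizer) and the fine-tuned CTC model
processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")

# sample.wav should be mono audio at the sampling rate the model expects (16 kHz for wav2vec2 XLS-R)
speech, sr = sf.read("sample.wav")
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

# Greedy CTC decoding: most likely token per frame, then collapse repeats and blanks
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])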
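The `argmax` followed by `batch_decode` in the snippet above is greedy CTC decoding. As a rough illustration of what the decode step does with the per-frame ids, here is a minimal pure-Python sketch: merge consecutive repeats, then drop the blank token. The vocabulary and the `blank_id=0` default are hypothetical. In a real Wav2Vec2 tokenizer the blank is the pad token and its id comes from the model's vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame CTC predictions into a token id sequence."""
    tokens = []
    prev = None
    for idx in frame_ids:
        # Keep an id only if it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            tokens.append(idx)
        prev = idx
    return tokens


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "a", 2: "b"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one utterance.
frame_ids = [1, 1, 0, 1, 2, 2, 0, 0]

token_ids = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in token_ids)
print(token_ids, text)
```

The blank between the two runs of `1` is what lets CTC emit the same character twice in a row. Without it, the repeated frames would merge into a single token.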
@misc{azeem2025wav2vec2urdu,
title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
author={Ahmed, Azeem},
year={2025},
howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
Built with ❤️ for the Urdu language community