Whisper Small — Quranic Arabic ASR with Full Tashkeel
Fine-tuned openai/whisper-small on tarteel-ai/everyayah for Automatic Speech Recognition of Quranic recitation with complete tashkeel (Arabic diacritics) preserved in the output.
Model Summary
| Property | Value |
|---|---|
| Base model | openai/whisper-small (244 M parameters) |
| Language | Arabic (ar) |
| Task | Automatic Speech Recognition |
| Dataset | tarteel-ai/everyayah |
| Output | Arabic text with full tashkeel (harakat) |
| Fine-tuning type | Full fine-tuning — all 244 M parameters trained |
| Precision | fp16 mixed precision |
| Hardware | 4 x NVIDIA RTX 2080 Ti (11 GB VRAM each, 44 GB total) |
| Best checkpoint | Step 1500 (epoch 9.94) |
Performance
Validation Set Results (500 samples, mixed reciters)
| Step | Epoch | eval loss | CER (with tashkeel) | WER (with tashkeel) | WER (normalized, no tashkeel) |
|---|---|---|---|---|---|
| 500 | 3.31 | 0.02615 | 1.3581% | 6.6743% | 6.0468% |
| 1000 | 6.62 | 0.01690 | 0.7120% | 3.6794% | 3.4227% |
| 1500 | 9.94 | 0.01766 | 0.6922% | 3.2801% | 3.0519% |
| 2000 | 13.25 | 0.02102 | 0.8241% | 4.1928% | 3.8220% |
Step 1500 is the best checkpoint by all primary metrics. Overfitting becomes visible at step 2000, where both the eval loss and the error rates increase despite continued training.
Training Loss Progression
| Step | Train Loss | Learning Rate |
|---|---|---|
| 50 | 0.7672 | 9.80e-07 |
| 100 | 0.2892 | 1.98e-06 |
| 200 | 0.0990 | 3.98e-06 |
| 300 | 0.0628 | 5.98e-06 |
| 500 | 0.0149 | 9.98e-06 |
| 700 | 0.0051 | 9.98e-06 |
| 1000 | 0.0009 | 9.89e-06 |
| 1500 | 0.0003 | 9.57e-06 |
The model converges rapidly — training loss drops from 0.7672 at step 50 to 0.0003 at step 1500, a reduction of 99.96%.
Comparison with Baseline
| Metric | openai/whisper-small (no fine-tuning) | This model (step 1500) | Improvement |
|---|---|---|---|
| CER with tashkeel | 61.97% | 0.6922% | 89.5x reduction |
| WER with tashkeel | 102.22% | 3.2801% | 31.2x reduction |
The baseline whisper-small model produces a WER above 100% on Quranic text (WER can exceed 100% because inserted words count as errors) since it was not trained on Tajweed recitation and does not reliably output tashkeel. Fine-tuning reduces the character error rate from 61.97% to 0.69%.
Inference Test Results
Test on Known Reciters — Unseen Ayahs (20 samples, test split)
Samples were drawn from the held-out test split: the reciters appear in the training data, but the ayahs were never seen during fine-tuning.
| Metric | Result |
|---|---|
| CER (with tashkeel) | 0.9055% |
| WER (with tashkeel) | 5.6122% |
| WER (normalized) | 5.6122% |
14 out of 20 samples achieved perfect 0.00% CER. Remaining errors were single-character phonological confusions on difficult Tajweed transitions.
Generalization Test — Completely Unseen Reciters (12 samples)
Samples drawn from reciters abdullah_matroud and abdurrahmaan_as-sudais, whose voices were never present in the training data.
| Reciter | Samples | CER range |
|---|---|---|
| abdullah_matroud | 6 | 0.00% – 29.03% |
| abdurrahmaan_as-sudais | 6 | 0.00% – 8.96% |
| Metric | Result |
|---|---|
| CER (with tashkeel) | 3.9918% |
| WER (with tashkeel) | 15.9292% |
| WER (normalized) | 14.1593% |
3 out of 12 samples achieved a perfect 0.00% CER on completely new voices. The gap from trained reciters (0.69% CER) to unseen reciters (3.99% CER) demonstrates strong transfer within the Quranic recitation domain, despite all 244 M parameters having been trained on only 19,284 samples in approximately 1.5 hours.
Why Full Fine-Tuning
Quranic recitation (Tajweed) differs substantially from modern spoken Arabic in several dimensions: phonological rules (idgham, ikhfa, madd), recitation style, and the strict requirement to reproduce full tashkeel in the output. LoRA or adapter-based approaches were considered but full fine-tuning was chosen because:
- All 12 encoder and all 12 decoder layers need to adapt to the Tajweed acoustic domain.
- Complete diacritic generation (tashkeel) requires the decoder vocabulary distribution to be fully reshaped, not just steered by adapter weights.
- With 19,284 training samples across 6 diverse reciters, the dataset is large enough to justify full fine-tuning without severe overfitting.
Training Details
Dataset
Training used 19,284 verse-level recordings from 6 Quranic reciters selected from the tarteel-ai/everyayah dataset. Reciters were chosen to maximize acoustic diversity and reduce reciter-specific memorization.
| Reciter | Approximate samples |
|---|---|
| Abdulsamad | 4,269 |
| Abdul Basit | 4,269 |
| Abdullah Basfar | 4,269 |
| Husary | 4,269 |
| Menshawi | 2,846 |
| Minshawi | 4,269 |
| Total | 19,284 |
Validation: 500 samples (mixed reciters, not seen during training). Test: 1,000 samples (mixed reciters).
All splits were pre-filtered to remove:
- Audio samples longer than 30 seconds (typically full-chapter audio mislabeled with a verse-level transcript).
- Text samples whose tokenized length exceeds 448 tokens (Whisper decoder hard limit). No truncation is applied — samples exceeding the limit are removed entirely to preserve tashkeel integrity.
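The two filters above can be sketched as a single predicate. This is a minimal reconstruction, not the original script: the `audio`/`text` field names follow the common Hugging Face datasets audio schema, and the token counter is passed in as a callable (e.g. `lambda t: len(processor.tokenizer(t).input_ids)` for the Whisper tokenizer).

```python
MAX_AUDIO_SECONDS = 30
MAX_LABEL_TOKENS = 448  # Whisper decoder hard limit (max_target_positions)

def keep_sample(example, count_tokens):
    """Return True if the sample passes both length filters.

    example: dict with "audio" ({"array", "sampling_rate"}) and "text" keys
    count_tokens: callable mapping a transcript string to its token count
    """
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    if duration > MAX_AUDIO_SECONDS:
        return False  # likely full-chapter audio carrying a verse-level label
    # Over-limit transcripts are dropped, never truncated,
    # so the tashkeel of every kept sample stays intact.
    return count_tokens(example["text"]) <= MAX_LABEL_TOKENS

# dataset = dataset.filter(lambda ex: keep_sample(ex, count_tokens))
```

With a `datasets.Dataset`, this predicate would be applied to the train, validation, and test splits via `.filter()` before feature extraction.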
Hyperparameters
| Setting | Value |
|---|---|
| Learning rate | 1e-5 |
| LR scheduler | Cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (per device) x 4 (gradient accumulation) x 4 (GPUs) = 128 |
| Weight decay | 0.05 |
| Dropout | 0.1 (encoder, decoder, attention) |
| Max steps | 8000 (early stopping by CER patience=5) |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| Primary eval metric | CER with tashkeel |
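A sketch of how the table above maps onto `Seq2SeqTrainingArguments`. This is a hypothetical reconstruction, not the released training script: the output directory, eval schedule, and metric name are assumptions, and dropout (0.1) is set on the model config rather than here.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameter table; output_dir,
# eval_steps, and metric_for_best_model are assumptions.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-quran-tashkeel",
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # x 4 GPUs => effective batch of 128
    weight_decay=0.05,
    max_steps=8000,                  # early stopping on CER ends training sooner
    fp16=True,
    gradient_checkpointing=True,
    max_grad_norm=1.0,
    per_device_eval_batch_size=2,
    eval_accumulation_steps=8,
    predict_with_generate=True,
    eval_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="cer",     # CER with tashkeel
    greater_is_better=False,
)
```

The CER-based early stopping (patience = 5) would then be attached as an `EarlyStoppingCallback` on the trainer; dropout is enabled on the model via its config (`model.config.dropout`, `attention_dropout`, `activation_dropout`).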
Evaluation Configuration
- `per_device_eval_batch_size`: 2 (required to prevent OOM during autoregressive generation on 11 GB VRAM)
- `eval_accumulation_steps`: 8
- `predict_with_generate`: True
- Decoding: greedy (`num_beams=1`) during training evaluation
Usage
Using the high-level pipeline API:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NightPrince/stt-arabic-whisper-finetuned-diactires",
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

result = pipe("surah_fatiha.mp3")
print(result["text"])
# Example output: بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```
Or with the processor and model directly:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

processor = WhisperProcessor.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model = WhisperForConditionalGeneration.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model.eval()

# audio_array: NumPy array sampled at 16000 Hz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="arabic",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```
Metrics Definitions
- CER (with tashkeel): Character Error Rate computed on the raw output including all harakat (fatha, damma, kasra, tanwin, shadda, sukun). This is the primary metric — it directly measures diacritization accuracy.
- WER (with tashkeel): Word Error Rate on the raw output with full tashkeel.
- WER normalized: Word Error Rate after stripping all tashkeel from both prediction and reference. Measures word-level recognition independent of diacritization.
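The tashkeel-stripping step behind the normalized WER can be sketched as follows. The character ranges are the standard Unicode Arabic harakat block plus the Quranic annotation signs used in Uthmani script (the exact normalization used during evaluation is an assumption); a library such as `jiwer` would then compute the WER on the stripped strings.

```python
import re

# Harakat (fathatan..sukun, U+064B-U+0652), dagger alif (U+0670), and
# Quranic annotation signs (U+06D6-U+06ED) found in Uthmani-script text.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670\u06D6-\u06ED]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritics so WER measures word recognition only."""
    return TASHKEEL.sub("", text)

# Normalized WER: strip both sides, then compare word sequences,
# e.g. jiwer.wer(strip_tashkeel(reference), strip_tashkeel(prediction))
```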
Intended Use
This model is designed for transcribing Quranic recitation audio to text with full tashkeel preserved. Suitable applications include:
- Quranic recitation evaluation and feedback tools
- Quran learning applications requiring accurate text alignment
- Islamic education platforms
- Research in Arabic speech recognition with diacritics
Limitations
- Trained exclusively on Quranic recitation. Performance on non-Quranic Arabic speech will be significantly degraded.
- Optimized for verse-level (ayah-level) audio segments. Very long continuous recitations may require segmentation.
- Tashkeel accuracy reflects the Uthmani script style used in the training corpus.
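For recitations longer than Whisper's 30-second receptive window, one simple approach is fixed-window segmentation before transcription. This is a sketch, not part of the released model; `pipeline(..., chunk_length_s=30)` in `transformers` is a ready-made alternative.

```python
import numpy as np

def segment_audio(audio: np.ndarray, sr: int = 16000, window_s: float = 30.0):
    """Split a long recitation into chunks of at most window_s seconds."""
    window = int(window_s * sr)
    return [audio[i:i + window] for i in range(0, len(audio), window)]

# chunks = segment_audio(long_recitation)
# text = " ".join(pipe({"array": c, "sampling_rate": 16000})["text"] for c in chunks)
```

Fixed windows can cut a word (or a madd) at a boundary; overlapping windows or silence-based splitting reduces such boundary errors.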
Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{elnawasany2026whisperquran,
  author       = {Yahya Mohamed Elnawasany},
  title        = {Whisper Small Fine-Tuned for Quranic Arabic ASR with Full Tashkeel},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NightPrince/stt-arabic-whisper-finetuned-diactires}},
  note         = {Portfolio: https://yahya-portfoli-app.netlify.app/}
}
```
Author: Yahya Mohamed Elnawasany
Email: yahyaalnwsany39@gmail.com
Portfolio: https://yahya-portfoli-app.netlify.app/
Training Infrastructure
Training ran on a shared server with 4 x NVIDIA RTX 2080 Ti GPUs under WSL2 using PyTorch DDP via Accelerate. Total training time to step 1500 was approximately 1.5 hours.
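A launch command consistent with this setup might look like the following; `train.py` is a placeholder name (the original training script is not published), so treat this as a configuration sketch rather than the exact invocation.

```shell
# 4-GPU DDP launch via Accelerate (fp16, one process per RTX 2080 Ti).
# train.py is hypothetical; the actual script is not released.
accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --mixed_precision fp16 \
  train.py
```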