Whisper Small — Quranic Arabic ASR with Full Tashkeel

Fine-tuned openai/whisper-small on tarteel-ai/everyayah for Automatic Speech Recognition of Quranic recitation with complete tashkeel (Arabic diacritics) preserved in the output.

Model Summary

| Property | Value |
| --- | --- |
| Base model | openai/whisper-small (244 M parameters) |
| Language | Arabic (ar) |
| Task | Automatic Speech Recognition |
| Dataset | tarteel-ai/everyayah |
| Output | Arabic text with full tashkeel (harakat) |
| Fine-tuning type | Full fine-tuning (all 244 M parameters trained) |
| Precision | fp16 mixed precision |
| Hardware | 4 x NVIDIA RTX 2080 Ti (11 GB VRAM each, 44 GB total) |
| Best checkpoint | Step 1500 (epoch 9.94) |

Performance

Validation Set Results (500 samples, mixed reciters)

| Step | Epoch | Eval loss | CER (with tashkeel) | WER (with tashkeel) | WER (normalized, no tashkeel) |
| --- | --- | --- | --- | --- | --- |
| 500 | 3.31 | 0.02615 | 1.3581% | 6.6743% | 6.0468% |
| 1000 | 6.62 | 0.01690 | 0.7120% | 3.6794% | 3.4227% |
| 1500 | 9.94 | 0.01766 | 0.6922% | 3.2801% | 3.0519% |
| 2000 | 13.25 | 0.02102 | 0.8241% | 4.1928% | 3.8220% |

Step 1500 is the best checkpoint on all primary metrics. Overfitting becomes visible at step 2000, where both the eval loss and the error rates increase despite continued training.

Training Loss Progression

| Step | Train loss | Learning rate |
| --- | --- | --- |
| 50 | 0.7672 | 9.80e-07 |
| 100 | 0.2892 | 1.98e-06 |
| 200 | 0.0990 | 3.98e-06 |
| 300 | 0.0628 | 5.98e-06 |
| 500 | 0.0149 | 9.98e-06 |
| 700 | 0.0051 | 9.98e-06 |
| 1000 | 0.0009 | 9.89e-06 |
| 1500 | 0.0003 | 9.57e-06 |
The model converges rapidly — training loss drops from 0.7672 at step 50 to 0.0003 at step 1500, a reduction of 99.96%.

Comparison with Baseline

| Metric | openai/whisper-small (no fine-tuning) | This model (step 1500) | Improvement |
| --- | --- | --- | --- |
| CER with tashkeel | 61.97% | 0.6922% | 89.5x reduction |
| WER with tashkeel | 102.22% | 3.2801% | 31.2x reduction |

The baseline whisper-small model produces WER above 100% on Quranic text (WER can exceed 100% when insertion errors outnumber the reference words) because it was not trained on Tajweed recitation and does not reliably output tashkeel. This fine-tuned model reduces character error rate from 61.97% to 0.69%.

Inference Test Results

Test on Known Reciters — Unseen Ayahs (20 samples, test split)

Samples drawn from the held-out test split: the reciters were present in training, but these ayahs were never seen during fine-tuning.

| Metric | Result |
| --- | --- |
| CER (with tashkeel) | 0.9055% |
| WER (with tashkeel) | 5.6122% |
| WER (normalized) | 5.6122% |

14 out of 20 samples achieved perfect 0.00% CER. Remaining errors were single-character phonological confusions on difficult Tajweed transitions.

Generalization Test — Completely Unseen Reciters (12 samples)

Samples drawn from reciters abdullah_matroud and abdurrahmaan_as-sudais, whose voices were never present in the training data.

| Reciter | Samples | CER range |
| --- | --- | --- |
| abdullah_matroud | 6 | 0.00% – 29.03% |
| abdurrahmaan_as-sudais | 6 | 0.00% – 8.96% |

| Metric | Result |
| --- | --- |
| CER (with tashkeel) | 3.9918% |
| WER (with tashkeel) | 15.9292% |
| WER (normalized) | 14.1593% |

3 out of 12 samples achieved perfect 0.00% CER on completely new voices. The generalization gap from trained reciters (0.69% CER) to unseen reciters (3.99% CER) demonstrates strong domain transfer within the Quranic recitation domain despite the model having 244 M parameters trained on only 19,284 samples in approximately 1.5 hours.

Why Full Fine-Tuning

Quranic recitation (Tajweed) differs substantially from modern spoken Arabic in several dimensions: phonological rules (idgham, ikhfa, madd), recitation style, and the strict requirement to reproduce full tashkeel in the output. LoRA or adapter-based approaches were considered but full fine-tuning was chosen because:

  1. All 12 encoder and 12 decoder layers need to adapt to the Tajweed acoustic domain.
  2. Complete diacritic generation (tashkeel) requires the decoder vocabulary distribution to be fully reshaped, not just steered by adapter weights.
  3. With 19,284 training samples across 6 diverse reciters, the dataset is large enough to justify full fine-tuning without severe overfitting.

Training Details

Dataset

Training used 19,284 verse-level recordings from 6 Quranic reciters selected from the tarteel-ai/everyayah dataset. Reciters were chosen to maximize acoustic diversity and reduce reciter-specific memorization.

| Reciter | Approximate samples |
| --- | --- |
| Abdulsamad | 4,269 |
| Abdul Basit | 4,269 |
| Abdullah Basfar | 4,269 |
| Husary | 4,269 |
| Menshawi | 2,846 |
| Minshawi | 4,269 |
| **Total** | **19,284** |

Validation: 500 samples (mixed reciters, not seen during training). Test: 1,000 samples (mixed reciters).

All splits were pre-filtered to remove:

  • Audio samples longer than 30 seconds (corrupted signal — full-chapter audio with verse-level label).
  • Text samples whose tokenized length exceeds 448 tokens (Whisper decoder hard limit). No truncation is applied — samples exceeding the limit are removed entirely to preserve tashkeel integrity.
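The two filters above can be sketched as a single predicate. This is an illustrative reconstruction, not the actual training code; the function name and the way token counts are obtained are assumptions (in practice the transcript would be tokenized with the Whisper tokenizer):

```python
# Pre-filtering sketch: drop over-long audio and over-long transcripts.
# Both limits mirror the rules described above; no truncation is applied.

MAX_AUDIO_SECONDS = 30.0   # longer clips are likely full-chapter audio with a verse-level label
MAX_DECODER_TOKENS = 448   # Whisper decoder hard limit

def keep_sample(num_audio_samples: int, sampling_rate: int, num_tokens: int) -> bool:
    """Return True if the sample passes both filters; otherwise it is removed entirely."""
    duration_s = num_audio_samples / sampling_rate
    return duration_s <= MAX_AUDIO_SECONDS and num_tokens <= MAX_DECODER_TOKENS

# A 25 s clip at 16 kHz whose transcript tokenizes to 120 tokens is kept:
print(keep_sample(25 * 16000, 16000, 120))   # True
print(keep_sample(45 * 16000, 16000, 120))   # False: audio too long
print(keep_sample(25 * 16000, 16000, 600))   # False: transcript exceeds the decoder limit
```

Removing (rather than truncating) over-limit transcripts keeps every retained reference fully diacritized, so the CER-with-tashkeel metric is never computed against a clipped label.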

Hyperparameters

| Setting | Value |
| --- | --- |
| Learning rate | 1e-5 |
| LR scheduler | Cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (per device) x 4 (gradient accumulation) x 4 (GPUs) = 128 |
| Weight decay | 0.05 |
| Dropout | 0.1 (encoder, decoder, attention) |
| Max steps | 8000 (early stopping on CER, patience = 5) |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| Primary eval metric | CER with tashkeel |
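The early-stopping rule referenced in the hyperparameters (stop once eval CER has failed to improve for 5 consecutive evaluations) can be sketched as follows; the class name is illustrative, not from the training code:

```python
class CerEarlyStopper:
    """Stop training when eval CER has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_cer = float("inf")
        self.bad_evals = 0

    def update(self, cer: float) -> bool:
        """Record one eval CER; return True if training should stop."""
        if cer < self.best_cer:
            self.best_cer = cer
            self.bad_evals = 0        # improvement resets the counter
        else:
            self.bad_evals += 1       # no improvement this evaluation
        return self.bad_evals >= self.patience

# Feeding in the validation CERs from the table above (steps 500-2000):
stopper = CerEarlyStopper(patience=5)
for cer in [1.3581, 0.7120, 0.6922, 0.8241]:
    stopper.update(cer)
print(stopper.best_cer)  # 0.6922 (step 1500, the best checkpoint)
```

With the step-2000 regression counting as only one bad evaluation, training would have continued past step 2000 before the patience threshold triggered; step 1500 remains the best checkpoint retained.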

Evaluation Configuration

  • per_device_eval_batch_size: 2 (required to prevent OOM during autoregressive generation on 11 GB VRAM)
  • eval_accumulation_steps: 8
  • predict_with_generate: True
  • Decoding: greedy (num_beams=1) during training evaluation

Usage

Using the `transformers` pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NightPrince/stt-arabic-whisper-finetuned-diactires",
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

result = pipe("surah_fatiha.mp3")
print(result["text"])
# Example output: بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

Or with the processor and model directly:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

processor = WhisperProcessor.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model = WhisperForConditionalGeneration.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model.eval()

# audio_array: NumPy array sampled at 16,000 Hz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="arabic",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

Metrics Definitions

  • CER (with tashkeel): Character Error Rate computed on the raw output including all harakat (fatha, damma, kasra, tanwin, shadda, sukun). This is the primary metric — it directly measures diacritization accuracy.
  • WER (with tashkeel): Word Error Rate on the raw output with full tashkeel.
  • WER normalized: Word Error Rate after stripping all tashkeel from both prediction and reference. Measures word-level recognition independent of diacritization.
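The normalization step for the last metric (stripping tashkeel from both prediction and reference) can be sketched with a regex over the Arabic combining-mark range. This is an assumption about the exact character set; the author's normalization may differ, e.g. in whether the dagger alif (U+0670) is included:

```python
import re

# Harakat covered: tanwin (U+064B-U+064D), fatha/damma/kasra (U+064E-U+0650),
# shadda (U+0651), sukun (U+0652), plus the dagger alif (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove all diacritic marks, leaving only base letters and spaces."""
    return TASHKEEL.sub("", text)

print(strip_tashkeel("بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"))
# بسم الله الرحمن الرحيم
```

After stripping, standard WER tooling (e.g. the `jiwer` library) can be applied to the normalized strings; computing CER/WER on the raw, fully diacritized strings gives the "with tashkeel" variants.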

Intended Use

This model is designed for transcribing Quranic recitation audio to text with full tashkeel preserved. Suitable applications include:

  • Quranic recitation evaluation and feedback tools
  • Quran learning applications requiring accurate text alignment
  • Islamic education platforms
  • Research in Arabic speech recognition with diacritics

Limitations

  • Trained exclusively on Quranic recitation. Performance on non-Quranic Arabic speech will be significantly degraded.
  • Optimized for verse-level (ayah-level) audio segments. Very long continuous recitations may require segmentation.
  • Tashkeel accuracy reflects the Uthmani script style used in the training corpus.
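For recitations longer than Whisper's 30-second window, one option is a simple fixed-window splitter with a small overlap; this is a sketch only (a production pipeline would more likely use silence-aware segmentation or the `chunk_length_s` option of the `transformers` ASR pipeline):

```python
import numpy as np

def split_audio(audio: np.ndarray, sampling_rate: int = 16000,
                window_s: float = 30.0, overlap_s: float = 1.0):
    """Yield fixed-length windows with a small overlap to avoid cutting words."""
    window = int(window_s * sampling_rate)
    step = int((window_s - overlap_s) * sampling_rate)
    for start in range(0, len(audio), step):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break  # this window already covers the end of the recording

# A 70 s recording at 16 kHz splits into three overlapping windows.
audio = np.zeros(70 * 16000, dtype=np.float32)
chunks = list(split_audio(audio))
print(len(chunks))  # 3
```

Each chunk can then be transcribed independently and the texts concatenated, accepting that a hard cut mid-word at a window boundary may introduce local errors the overlap does not fully prevent.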

Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{elnawasany2026whisperquran,
  author    = {Yahya Mohamed Elnawasany},
  title     = {Whisper Small Fine-Tuned for Quranic Arabic ASR with Full Tashkeel},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/NightPrince/stt-arabic-whisper-finetuned-diactires}},
  email     = {yahyaalnwsany39@gmail.com},
  note      = {Portfolio: https://yahya-portfoli-app.netlify.app/}
}
```

Author: Yahya Mohamed Elnawasany
Email: yahyaalnwsany39@gmail.com
Portfolio: https://yahya-portfoli-app.netlify.app/

Training Infrastructure

Training ran on a shared server with 4 x NVIDIA RTX 2080 Ti GPUs under WSL2 using PyTorch DDP via Accelerate. Total training time to step 1500 was approximately 1.5 hours.

