Whisper Small — Quranic Arabic ASR with Full Tashkeel

Fine-tuned openai/whisper-small on tarteel-ai/everyayah for Automatic Speech Recognition of Quranic recitation with complete tashkeel (Arabic diacritics) preserved in the output.

Model Summary

| Property | Value |
| --- | --- |
| Base model | openai/whisper-small (244 M parameters) |
| Language | Arabic (ar) |
| Task | Automatic Speech Recognition |
| Dataset | tarteel-ai/everyayah |
| Output | Arabic text with full tashkeel (harakat) |
| Fine-tuning type | Full fine-tuning (all 244 M parameters trained) |
| Precision | fp16 mixed precision |
| Hardware | 4 x NVIDIA RTX 2080 Ti (11 GB VRAM each, 44 GB total) |
| Best checkpoint | Step 1500 (epoch 9.94) |

Performance

Validation Set Results (500 samples, mixed reciters)

| Step | Epoch | Eval loss | CER (with tashkeel) | WER (with tashkeel) | WER (normalized, no tashkeel) |
| --- | --- | --- | --- | --- | --- |
| 500 | 3.31 | 0.02615 | 1.3581% | 6.6743% | 6.0468% |
| 1000 | 6.62 | 0.01690 | 0.7120% | 3.6794% | 3.4227% |
| 1500 | 9.94 | 0.01766 | 0.6922% | 3.2801% | 3.0519% |
| 2000 | 13.25 | 0.02102 | 0.8241% | 4.1928% | 3.8220% |

Step 1500 is the best checkpoint on all primary metrics. Overfitting becomes visible at step 2000, where both the eval loss and the error rates increase despite continued training.

Training Loss Progression

| Step | Train loss | Learning rate |
| --- | --- | --- |
| 50 | 0.7672 | 9.80e-07 |
| 100 | 0.2892 | 1.98e-06 |
| 200 | 0.0990 | 3.98e-06 |
| 300 | 0.0628 | 5.98e-06 |
| 500 | 0.0149 | 9.98e-06 |
| 700 | 0.0051 | 9.98e-06 |
| 1000 | 0.0009 | 9.89e-06 |
| 1500 | 0.0003 | 9.57e-06 |
The model converges rapidly — training loss drops from 0.7672 at step 50 to 0.0003 at step 1500, a reduction of 99.96%.

Comparison with Baseline

| Metric | openai/whisper-small (no fine-tuning) | This model (step 1500) | Improvement |
| --- | --- | --- | --- |
| CER with tashkeel | 61.97% | 0.6922% | 89.5x reduction |
| WER with tashkeel | 102.22% | 3.2801% | 31.2x reduction |

The baseline whisper-small model produces WER above 100% on Quranic text (WER can exceed 100% when insertion errors outnumber the reference words) because it was not trained on Tajweed recitation and does not reliably output tashkeel. This fine-tuned model reduces character error rate from 61.97% to 0.69%.

Inference Test Results

Test on Known Reciters — Unseen Ayahs (20 samples, test split)

Samples drawn from the held-out test split: the reciters were present in training, but these ayahs were never seen during fine-tuning.

| Metric | Result |
| --- | --- |
| CER (with tashkeel) | 0.9055% |
| WER (with tashkeel) | 5.6122% |
| WER (normalized) | 5.6122% |

14 out of 20 samples achieved perfect 0.00% CER. Remaining errors were single-character phonological confusions on difficult Tajweed transitions.

Generalization Test — Completely Unseen Reciters (12 samples)

Samples drawn from reciters abdullah_matroud and abdurrahmaan_as-sudais, whose voices were never present in the training data.

| Reciter | Samples | CER range |
| --- | --- | --- |
| abdullah_matroud | 6 | 0.00% – 29.03% |
| abdurrahmaan_as-sudais | 6 | 0.00% – 8.96% |

| Metric | Result |
| --- | --- |
| CER (with tashkeel) | 3.9918% |
| WER (with tashkeel) | 15.9292% |
| WER (normalized) | 14.1593% |

3 out of 12 samples achieved perfect 0.00% CER on completely new voices. The generalization gap from trained reciters (0.69% CER) to unseen reciters (3.99% CER) demonstrates strong domain transfer within the Quranic recitation domain despite the model having 244 M parameters trained on only 19,284 samples in approximately 1.5 hours.

Why Full Fine-Tuning

Quranic recitation (Tajweed) differs substantially from modern spoken Arabic in several dimensions: phonological rules (idgham, ikhfa, madd), recitation style, and the strict requirement to reproduce full tashkeel in the output. LoRA or adapter-based approaches were considered but full fine-tuning was chosen because:

  1. All 12 encoder and 12 decoder layers need to adapt to the Tajweed acoustic domain.
  2. Complete diacritic generation (tashkeel) requires the decoder vocabulary distribution to be fully reshaped, not just steered by adapter weights.
  3. With 19,284 training samples across 6 diverse reciters, the dataset is large enough to justify full fine-tuning without severe overfitting.

Training Details

Dataset

Training used 19,284 verse-level recordings from 6 Quranic reciters selected from the tarteel-ai/everyayah dataset. Reciters were chosen to maximize acoustic diversity and reduce reciter-specific memorization.

| Reciter | Approximate samples |
| --- | --- |
| Abdulsamad | 4,269 |
| Abdul Basit | 4,269 |
| Abdullah Basfar | 4,269 |
| Husary | 4,269 |
| Menshawi | 2,846 |
| Minshawi | 4,269 |
| **Total** | **19,284** |

Validation: 500 samples (mixed reciters, not seen during training). Test: 1,000 samples (mixed reciters).

All splits were pre-filtered to remove:

  • Audio samples longer than 30 seconds (corrupted signal — full-chapter audio with verse-level label).
  • Text samples whose tokenized length exceeds 448 tokens (Whisper decoder hard limit). No truncation is applied — samples exceeding the limit are removed entirely to preserve tashkeel integrity.
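The two filters above can be sketched as a single predicate. This is an illustrative reconstruction, not the actual training code; the function name and the way token counts are obtained are assumptions (in practice the transcript would be tokenized with the Whisper tokenizer):

```python
# Pre-filtering sketch: drop over-long audio and over-long transcripts.
# Both limits mirror the rules described above; no truncation is applied.

MAX_AUDIO_SECONDS = 30.0   # longer clips are likely full-chapter audio with a verse-level label
MAX_DECODER_TOKENS = 448   # Whisper decoder hard limit

def keep_sample(num_audio_samples: int, sampling_rate: int, num_tokens: int) -> bool:
    """Return True if the sample passes both filters; otherwise it is removed entirely."""
    duration_s = num_audio_samples / sampling_rate
    return duration_s <= MAX_AUDIO_SECONDS and num_tokens <= MAX_DECODER_TOKENS

# A 25 s clip at 16 kHz whose transcript tokenizes to 120 tokens is kept:
print(keep_sample(25 * 16000, 16000, 120))   # True
print(keep_sample(45 * 16000, 16000, 120))   # False: audio too long
print(keep_sample(25 * 16000, 16000, 600))   # False: transcript exceeds the decoder limit
```

Removing (rather than truncating) over-limit transcripts keeps every retained reference fully diacritized, so the CER-with-tashkeel metric is never computed against a clipped label.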

Hyperparameters

| Setting | Value |
| --- | --- |
| Learning rate | 1e-5 |
| LR scheduler | Cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (per device) x 4 (gradient accumulation) x 4 (GPUs) = 128 |
| Weight decay | 0.05 |
| Dropout | 0.1 (encoder, decoder, attention) |
| Max steps | 8000 (early stopping on CER, patience = 5) |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| Primary eval metric | CER with tashkeel |
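The early-stopping rule referenced in the hyperparameters (stop once eval CER has failed to improve for 5 consecutive evaluations) can be sketched as follows; the class name is illustrative, not from the training code:

```python
class CerEarlyStopper:
    """Stop training when eval CER has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_cer = float("inf")
        self.bad_evals = 0

    def update(self, cer: float) -> bool:
        """Record one eval CER; return True if training should stop."""
        if cer < self.best_cer:
            self.best_cer = cer
            self.bad_evals = 0        # improvement resets the counter
        else:
            self.bad_evals += 1       # no improvement this evaluation
        return self.bad_evals >= self.patience

# Feeding in the validation CERs from the table above (steps 500-2000):
stopper = CerEarlyStopper(patience=5)
for cer in [1.3581, 0.7120, 0.6922, 0.8241]:
    stopper.update(cer)
print(stopper.best_cer)  # 0.6922 (step 1500, the best checkpoint)
```

With the step-2000 regression counting as only one bad evaluation, training would have continued past step 2000 before the patience threshold triggered; step 1500 remains the best checkpoint retained.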

Evaluation Configuration

  • per_device_eval_batch_size: 2 (required to prevent OOM during autoregressive generation on 11 GB VRAM)
  • eval_accumulation_steps: 8
  • predict_with_generate: True
  • Decoding: greedy (num_beams=1) during training evaluation

Usage

Using the `transformers` pipeline:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NightPrince/stt-arabic-whisper-finetuned-diactires",
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

result = pipe("surah_fatiha.mp3")
print(result["text"])
# Example output: بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

Or with the processor and model directly:

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

processor = WhisperProcessor.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model = WhisperForConditionalGeneration.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model.eval()

# audio_array: NumPy array sampled at 16,000 Hz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="arabic",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

Metrics Definitions

  • CER (with tashkeel): Character Error Rate computed on the raw output including all harakat (fatha, damma, kasra, tanwin, shadda, sukun). This is the primary metric — it directly measures diacritization accuracy.
  • WER (with tashkeel): Word Error Rate on the raw output with full tashkeel.
  • WER normalized: Word Error Rate after stripping all tashkeel from both prediction and reference. Measures word-level recognition independent of diacritization.
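The normalization step for the last metric (stripping tashkeel from both prediction and reference) can be sketched with a regex over the Arabic combining-mark range. This is an assumption about the exact character set; the author's normalization may differ, e.g. in whether the dagger alif (U+0670) is included:

```python
import re

# Harakat covered: tanwin (U+064B-U+064D), fatha/damma/kasra (U+064E-U+0650),
# shadda (U+0651), sukun (U+0652), plus the dagger alif (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove all diacritic marks, leaving only base letters and spaces."""
    return TASHKEEL.sub("", text)

print(strip_tashkeel("بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"))
# بسم الله الرحمن الرحيم
```

After stripping, standard WER tooling (e.g. the `jiwer` library) can be applied to the normalized strings; computing CER/WER on the raw, fully diacritized strings gives the "with tashkeel" variants.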

Intended Use

This model is designed for transcribing Quranic recitation audio to text with full tashkeel preserved. Suitable applications include:

  • Quranic recitation evaluation and feedback tools
  • Quran learning applications requiring accurate text alignment
  • Islamic education platforms
  • Research in Arabic speech recognition with diacritics

Limitations

  • Trained exclusively on Quranic recitation. Performance on non-Quranic Arabic speech will be significantly degraded.
  • Optimized for verse-level (ayah-level) audio segments. Very long continuous recitations may require segmentation.
  • Tashkeel accuracy reflects the Uthmani script style used in the training corpus.
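For recitations longer than Whisper's 30-second window, one option is a simple fixed-window splitter with a small overlap; this is a sketch only (a production pipeline would more likely use silence-aware segmentation or the `chunk_length_s` option of the `transformers` ASR pipeline):

```python
import numpy as np

def split_audio(audio: np.ndarray, sampling_rate: int = 16000,
                window_s: float = 30.0, overlap_s: float = 1.0):
    """Yield fixed-length windows with a small overlap to avoid cutting words."""
    window = int(window_s * sampling_rate)
    step = int((window_s - overlap_s) * sampling_rate)
    for start in range(0, len(audio), step):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break  # this window already covers the end of the recording

# A 70 s recording at 16 kHz splits into three overlapping windows.
audio = np.zeros(70 * 16000, dtype=np.float32)
chunks = list(split_audio(audio))
print(len(chunks))  # 3
```

Each chunk can then be transcribed independently and the texts concatenated, accepting that a hard cut mid-word at a window boundary may introduce local errors the overlap does not fully prevent.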

Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{elnawasany2026whisperquran,
  author    = {Yahya Mohamed Elnawasany},
  title     = {Whisper Small Fine-Tuned for Quranic Arabic ASR with Full Tashkeel},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/NightPrince/stt-arabic-whisper-finetuned-diactires}},
  email     = {yahyaalnwsany39@gmail.com},
  note      = {Portfolio: https://yahya-portfoli-app.netlify.app/}
}
```

Author: Yahya Mohamed Elnawasany
Email: yahyaalnwsany39@gmail.com
Portfolio: https://yahya-portfoli-app.netlify.app/

Training Infrastructure

Training ran on a shared server with 4 x NVIDIA RTX 2080 Ti GPUs under WSL2 using PyTorch DDP via Accelerate. Total training time to step 1500 was approximately 1.5 hours.

