---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: openai/whisper-small
datasets:
- tarteel-ai/everyayah
tags:
- whisper
- automatic-speech-recognition
- arabic
- quran
- tashkeel
- diacritics
- tajweed
metrics:
- wer
- cer
model-index:
- name: NightPrince/stt-arabic-whisper-finetuned-diactires
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (validation split — 500 samples, trained reciters)
      type: tarteel-ai/everyayah
      split: validation
    metrics:
    - type: cer
      value: 0.6922
      name: CER (with tashkeel)
    - type: wer
      value: 3.2801
      name: WER (with tashkeel)
    - type: wer
      value: 3.0519
      name: WER (normalized, no tashkeel)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (inference test — 20 samples, known reciters, unseen ayahs)
      type: tarteel-ai/everyayah
      split: test
    metrics:
    - type: cer
      value: 0.9055
      name: CER (with tashkeel)
    - type: wer
      value: 5.6122
      name: WER (with tashkeel)
    - type: wer
      value: 5.6122
      name: WER (normalized, no tashkeel)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (generalization test — 12 samples, unseen reciters)
      type: tarteel-ai/everyayah
      split: train
    metrics:
    - type: cer
      value: 3.9918
      name: CER (with tashkeel) — unseen reciters
    - type: wer
      value: 15.9292
      name: WER (with tashkeel) — unseen reciters
    - type: wer
      value: 14.1593
      name: WER (normalized, no tashkeel) — unseen reciters
---

# Whisper Small — Quranic Arabic ASR with Full Tashkeel

Fine-tuned [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) on [`tarteel-ai/everyayah`](https://huggingface.co/datasets/tarteel-ai/everyayah) for automatic speech recognition of Quranic recitation, with complete tashkeel (Arabic diacritics) preserved in the output.
## Model Summary

| Property | Value |
|---|---|
| Base model | openai/whisper-small (244 M parameters) |
| Language | Arabic (ar) |
| Task | Automatic Speech Recognition |
| Dataset | tarteel-ai/everyayah |
| Output | Arabic text with full tashkeel (harakat) |
| Fine-tuning type | Full fine-tuning — all 244 M parameters trained |
| Precision | fp16 mixed precision |
| Hardware | 4 x NVIDIA RTX 2080 Ti (11 GB VRAM each, 44 GB total) |
| Best checkpoint | Step 1500 (epoch 9.94) |

## Performance

### Validation Set Results (500 samples, mixed reciters)

| Step | Epoch | Eval loss | CER (with tashkeel) | WER (with tashkeel) | WER (normalized, no tashkeel) |
|------|-------|-----------|---------------------|---------------------|-------------------------------|
| 500 | 3.31 | 0.02615 | 1.3581% | 6.6743% | 6.0468% |
| 1000 | 6.62 | 0.01690 | 0.7120% | 3.6794% | 3.4227% |
| **1500** | **9.94** | **0.01766** | **0.6922%** | **3.2801%** | **3.0519%** |
| 2000 | 13.25 | 0.02102 | 0.8241% | 4.1928% | 3.8220% |

Step 1500 is the best checkpoint by all primary metrics. Overfitting becomes visible at step 2000, where both loss and error rates increase despite continued training.

### Training Loss Progression

| Step | Train loss | Learning rate |
|------|------------|---------------|
| 50 | 0.7672 | 9.80e-07 |
| 100 | 0.2892 | 1.98e-06 |
| 200 | 0.0990 | 3.98e-06 |
| 300 | 0.0628 | 5.98e-06 |
| 500 | 0.0149 | 9.98e-06 |
| 700 | 0.0051 | 9.98e-06 |
| 1000 | 0.0009 | 9.89e-06 |
| 1500 | 0.0003 | 9.57e-06 |

The model converges rapidly: training loss drops from 0.7672 at step 50 to 0.0003 at step 1500, a reduction of 99.96%.
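The CER and WER figures above follow the standard edit-distance definitions. A minimal, dependency-free sketch of both metrics is shown below; the actual evaluation presumably used a metrics library such as `jiwer` or `evaluate`, so this is an illustration of the computation, not the training code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: harakat count as ordinary characters,
    so one wrong diacritic is one character error."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

# One wrong haraka on the last letter (damma instead of kasra):
reference = "بِسْمِ اللَّهِ"
prediction = "بِسْمِ اللَّهُ"
print(f"CER: {cer(reference, prediction):.2%}  WER: {wer(reference, prediction):.2%}")
# → CER: 7.14%  WER: 50.00%
```

Because CER is computed over individual characters including combining marks, it is far more sensitive to diacritization mistakes than WER, which is why the card treats CER with tashkeel as the primary metric.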
### Comparison with Baseline

| Metric | openai/whisper-small (no fine-tuning) | This model (step 1500) | Improvement |
|---|---|---|---|
| CER with tashkeel | 61.97% | 0.6922% | 89.5x reduction |
| WER with tashkeel | 102.22% | 3.2801% | 31.2x reduction |

The baseline whisper-small model produces a WER above 100% on Quranic text because it was not trained on Tajweed recitation and does not reliably output tashkeel. Fine-tuning reduces the character error rate from 61.97% to 0.69%.

## Inference Test Results

### Test on Known Reciters — Unseen Ayahs (20 samples, test split)

Samples were drawn from the held-out test split: reciters present in training, but ayahs never seen during fine-tuning.

| Metric | Result |
|---|---|
| CER (with tashkeel) | 0.9055% |
| WER (with tashkeel) | 5.6122% |
| WER (normalized) | 5.6122% |

14 out of 20 samples achieved a perfect 0.00% CER. The remaining errors were single-character phonological confusions on difficult Tajweed transitions.

### Generalization Test — Completely Unseen Reciters (12 samples)

Samples were drawn from reciters `abdullah_matroud` and `abdurrahmaan_as-sudais`, whose voices were never present in the training data.

| Reciter | Samples | CER range |
|---|---|---|
| abdullah_matroud | 6 | 0.00% – 29.03% |
| abdurrahmaan_as-sudais | 6 | 0.00% – 8.96% |

| Metric | Result |
|---|---|
| CER (with tashkeel) | 3.9918% |
| WER (with tashkeel) | 15.9292% |
| WER (normalized) | 14.1593% |

3 out of 12 samples achieved a perfect 0.00% CER on completely new voices. The generalization gap from trained reciters (0.69% CER) to unseen reciters (3.99% CER) demonstrates strong transfer within the Quranic recitation domain, even though all 244 M parameters were trained on only 19,284 samples in approximately 1.5 hours.
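The "normalized" WER rows strip tashkeel from both prediction and reference before scoring. A minimal sketch of that normalization; the exact mark set used for the card's metric is an assumption here (the Arabic combining marks U+064B–U+0652 plus the Quranic superscript alef U+0670):

```python
import re

# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B..U+0652) and the
# superscript alef (U+0670) common in Uthmani script. Assumed set —
# the card does not list the exact marks it strips.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

print(strip_tashkeel("بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"))
# → بسم الله الرحمن الرحيم
```

For known reciters, WER is identical with and without tashkeel (5.6122%), meaning every residual word error there is a base-letter error rather than a diacritization slip; for unseen reciters the gap (15.93% vs 14.16%) shows some word errors are diacritics-only.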
## Why Full Fine-Tuning

Quranic recitation (Tajweed) differs substantially from modern spoken Arabic in several dimensions: phonological rules (idgham, ikhfa, madd), recitation style, and the strict requirement to reproduce full tashkeel in the output. LoRA and adapter-based approaches were considered, but full fine-tuning was chosen because:

1. All 12 encoder and 12 decoder layers need to adapt to the Tajweed acoustic domain.
2. Complete diacritic generation (tashkeel) requires the decoder's vocabulary distribution to be fully reshaped, not just steered by adapter weights.
3. With 19,284 training samples across 6 diverse reciters, the dataset is large enough to justify full fine-tuning without severe overfitting.

## Training Details

### Dataset

Training used 19,284 verse-level recordings from 6 Quranic reciters selected from the tarteel-ai/everyayah dataset. Reciters were chosen to maximize acoustic diversity and reduce reciter-specific memorization.

| Reciter | Approximate samples |
|---|---|
| Abdulsamad | 4,269 |
| Abdul Basit | 4,269 |
| Abdullah Basfar | 4,269 |
| Husary | 4,269 |
| Menshawi | 2,846 |
| Minshawi | 4,269 |
| **Total** | **19,284** |

Validation: 500 samples (mixed reciters, not seen during training). Test: 1,000 samples (mixed reciters).

All splits were pre-filtered to remove:

- Audio samples longer than 30 seconds (corrupted signal: full-chapter audio carrying a verse-level label).
- Text samples whose tokenized length exceeds 448 tokens (the Whisper decoder's hard limit).

No truncation is applied — samples exceeding the limit are removed entirely to preserve tashkeel integrity.
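The two pre-filters above can be sketched as a single predicate. The function name, the `tokenize` callable, and the argument layout are assumptions for illustration, not the actual training code:

```python
def keep_sample(audio_array, sampling_rate, text, tokenize,
                max_seconds=30.0, max_label_tokens=448):
    """Return True if a sample survives both pre-filters.

    - Audio longer than 30 s is treated as corrupted (full-chapter
      audio carrying a verse-level label) and dropped.
    - Transcripts longer than the Whisper decoder's 448-token limit
      are dropped whole rather than truncated, so tashkeel is never
      cut off mid-verse.
    """
    duration_seconds = len(audio_array) / sampling_rate
    if duration_seconds > max_seconds:
        return False
    if len(tokenize(text)) > max_label_tokens:
        return False
    return True
```

With 🤗 Datasets this would typically be applied via `dataset.filter(...)`, and `tokenize` would wrap the Whisper processor's tokenizer (e.g. `processor.tokenizer.encode`).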
### Hyperparameters

| Setting | Value |
|---|---|
| Learning rate | 1e-5 |
| LR scheduler | Cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (per device) x 4 (gradient accumulation) x 4 (GPUs) = 128 |
| Weight decay | 0.05 |
| Dropout | 0.1 (encoder, decoder, attention) |
| Max steps | 8000 (early stopping on CER, patience = 5) |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| Primary eval metric | CER with tashkeel |

### Evaluation Configuration

- `per_device_eval_batch_size`: 2 (required to prevent OOM during autoregressive generation on 11 GB VRAM)
- `eval_accumulation_steps`: 8
- `predict_with_generate`: True
- Decoding: greedy (`num_beams=1`) during training evaluation

## Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NightPrince/stt-arabic-whisper-finetuned-diactires",
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

result = pipe("surah_fatiha.mp3")
print(result["text"])
# Example output: بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model = WhisperForConditionalGeneration.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model.eval()

# audio_array: numpy array sampled at 16000 Hz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="arabic",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

## Metrics Definitions

- **CER (with tashkeel)**: Character Error Rate computed on the raw output including all harakat (fatha, damma, kasra, tanwin, shadda, sukun).
This is the primary metric — it directly measures diacritization accuracy.
- **WER (with tashkeel)**: Word Error Rate on the raw output with full tashkeel.
- **WER (normalized)**: Word Error Rate after stripping all tashkeel from both prediction and reference. Measures word-level recognition independent of diacritization.

## Intended Use

This model is designed for transcribing Quranic recitation audio to text with full tashkeel preserved. Suitable applications include:

- Quranic recitation evaluation and feedback tools
- Quran learning applications requiring accurate text alignment
- Islamic education platforms
- Research in Arabic speech recognition with diacritics

## Limitations

- Trained exclusively on Quranic recitation. Performance on non-Quranic Arabic speech will be significantly degraded.
- Optimized for verse-level (ayah-level) audio segments. Very long continuous recitations may require segmentation.
- Tashkeel accuracy reflects the Uthmani script style used in the training corpus.

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{elnawasany2026whisperquran,
  author       = {Yahya Mohamed Elnawasany},
  title        = {Whisper Small Fine-Tuned for Quranic Arabic ASR with Full Tashkeel},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/NightPrince/stt-arabic-whisper-finetuned-diactires}},
  email        = {yahyaalnwsany39@gmail.com},
  note         = {Portfolio: https://yahya-portfoli-app.netlify.app/}
}
```

**Author**: Yahya Mohamed Elnawasany
**Email**: yahyaalnwsany39@gmail.com
**Portfolio**: https://yahya-portfoli-app.netlify.app/

## Training Infrastructure

Training ran on a shared server with 4 x NVIDIA RTX 2080 Ti GPUs under WSL2, using PyTorch DDP via Accelerate. Total training time to step 1500 was approximately 1.5 hours.
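A multi-GPU DDP run of this shape is typically started with the Accelerate launcher. The sketch below is a hypothetical launch command consistent with the setup described (4 GPUs, fp16); the script name `train.py` is a placeholder, not the actual training script:

```shell
# Hypothetical launch: 4-process DDP with fp16 mixed precision.
# Effective batch size then follows from the per-device batch size (8),
# gradient accumulation (4), and the 4 processes: 8 x 4 x 4 = 128.
accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --mixed_precision fp16 \
  train.py
```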