---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: openai/whisper-small
datasets:
- tarteel-ai/everyayah
tags:
- whisper
- automatic-speech-recognition
- arabic
- quran
- tashkeel
- diacritics
- tajweed
metrics:
- wer
- cer
model-index:
- name: NightPrince/stt-arabic-whisper-finetuned-diactires
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (validation split — 500 samples, trained reciters)
      type: tarteel-ai/everyayah
      split: validation
    metrics:
    - type: cer
      value: 0.6922
      name: CER (with tashkeel)
    - type: wer
      value: 3.2801
      name: WER (with tashkeel)
    - type: wer
      value: 3.0519
      name: WER (normalized, no tashkeel)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (inference test — 20 samples, known reciters, unseen ayahs)
      type: tarteel-ai/everyayah
      split: test
    metrics:
    - type: cer
      value: 0.9055
      name: CER (with tashkeel)
    - type: wer
      value: 5.6122
      name: WER (with tashkeel)
    - type: wer
      value: 5.6122
      name: WER (normalized, no tashkeel)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tarteel-ai/everyayah (generalization test — 12 samples, unseen reciters)
      type: tarteel-ai/everyayah
      split: train
    metrics:
    - type: cer
      value: 3.9918
      name: CER (with tashkeel) — unseen reciters
    - type: wer
      value: 15.9292
      name: WER (with tashkeel) — unseen reciters
    - type: wer
      value: 14.1593
      name: WER (normalized, no tashkeel) — unseen reciters
---

# Whisper Small — Quranic Arabic ASR with Full Tashkeel

Fine-tuned [`openai/whisper-small`](https://huggingface.co/openai/whisper-small) on [`tarteel-ai/everyayah`](https://huggingface.co/datasets/tarteel-ai/everyayah) for automatic speech recognition of Quranic recitation, with complete tashkeel (Arabic diacritics) preserved in the output.
## Model Summary

| Property | Value |
|---|---|
| Base model | openai/whisper-small (244 M parameters) |
| Language | Arabic (ar) |
| Task | Automatic Speech Recognition |
| Dataset | tarteel-ai/everyayah |
| Output | Arabic text with full tashkeel (harakat) |
| Fine-tuning type | Full fine-tuning — all 244 M parameters trained |
| Precision | fp16 mixed precision |
| Hardware | 4 x NVIDIA RTX 2080 Ti (11 GB VRAM each, 44 GB total) |
| Best checkpoint | Step 1500 (epoch 9.94) |

## Performance

### Validation Set Results (500 samples, mixed reciters)

| Step | Epoch | Eval loss | CER (with tashkeel) | WER (with tashkeel) | WER (normalized, no tashkeel) |
|------|-------|-----------|---------------------|---------------------|-------------------------------|
| 500 | 3.31 | 0.02615 | 1.3581% | 6.6743% | 6.0468% |
| 1000 | 6.62 | 0.01690 | 0.7120% | 3.6794% | 3.4227% |
| **1500** | **9.94** | **0.01766** | **0.6922%** | **3.2801%** | **3.0519%** |
| 2000 | 13.25 | 0.02102 | 0.8241% | 4.1928% | 3.8220% |

Step 1500 is the best checkpoint by all primary metrics. Overfitting becomes visible at step 2000, where both loss and error rates increase despite continued training.

### Training Loss Progression

| Step | Train loss | Learning rate |
|------|------------|---------------|
| 50 | 0.7672 | 9.80e-07 |
| 100 | 0.2892 | 1.98e-06 |
| 200 | 0.0990 | 3.98e-06 |
| 300 | 0.0628 | 5.98e-06 |
| 500 | 0.0149 | 9.98e-06 |
| 700 | 0.0051 | 9.98e-06 |
| 1000 | 0.0009 | 9.89e-06 |
| 1500 | 0.0003 | 9.57e-06 |

The model converges rapidly: training loss drops from 0.7672 at step 50 to 0.0003 at step 1500, a reduction of 99.96%.
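The CER and WER figures above follow the standard edit-distance definitions. A minimal, dependency-free sketch of both metrics is shown below; the actual evaluation presumably used a metrics library such as `jiwer` or `evaluate`, so this is an illustration of the computation, not the training code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: harakat count as ordinary characters,
    so one wrong diacritic is one character error."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

# One wrong haraka on the last letter (damma instead of kasra):
reference = "بِسْمِ اللَّهِ"
prediction = "بِسْمِ اللَّهُ"
print(f"CER: {cer(reference, prediction):.2%}  WER: {wer(reference, prediction):.2%}")
# → CER: 7.14%  WER: 50.00%
```

Because CER is computed over individual characters including combining marks, it is far more sensitive to diacritization mistakes than WER, which is why the card treats CER with tashkeel as the primary metric.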
### Comparison with Baseline

| Metric | openai/whisper-small (no fine-tuning) | This model (step 1500) | Improvement |
|---|---|---|---|
| CER with tashkeel | 61.97% | 0.6922% | 89.5x reduction |
| WER with tashkeel | 102.22% | 3.2801% | 31.2x reduction |

The baseline whisper-small model produces a WER above 100% on Quranic text because it was not trained on Tajweed recitation and does not reliably output tashkeel. Fine-tuning reduces the character error rate from 61.97% to 0.69%.

## Inference Test Results

### Test on Known Reciters — Unseen Ayahs (20 samples, test split)

Samples were drawn from the held-out test split: reciters present in training, but ayahs never seen during fine-tuning.

| Metric | Result |
|---|---|
| CER (with tashkeel) | 0.9055% |
| WER (with tashkeel) | 5.6122% |
| WER (normalized) | 5.6122% |

14 out of 20 samples achieved a perfect 0.00% CER. The remaining errors were single-character phonological confusions on difficult Tajweed transitions.

### Generalization Test — Completely Unseen Reciters (12 samples)

Samples were drawn from reciters `abdullah_matroud` and `abdurrahmaan_as-sudais`, whose voices were never present in the training data.

| Reciter | Samples | CER range |
|---|---|---|
| abdullah_matroud | 6 | 0.00% – 29.03% |
| abdurrahmaan_as-sudais | 6 | 0.00% – 8.96% |

| Metric | Result |
|---|---|
| CER (with tashkeel) | 3.9918% |
| WER (with tashkeel) | 15.9292% |
| WER (normalized) | 14.1593% |

3 out of 12 samples achieved a perfect 0.00% CER on completely new voices. The generalization gap from trained reciters (0.69% CER) to unseen reciters (3.99% CER) demonstrates strong transfer within the Quranic recitation domain, even though all 244 M parameters were trained on only 19,284 samples in approximately 1.5 hours.
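The "normalized" WER rows strip tashkeel from both prediction and reference before scoring. A minimal sketch of that normalization; the exact mark set used for the card's metric is an assumption here (the Arabic combining marks U+064B–U+0652 plus the Quranic superscript alef U+0670):

```python
import re

# Tanwin, fatha, damma, kasra, shadda, sukun (U+064B..U+0652) and the
# superscript alef (U+0670) common in Uthmani script. Assumed set —
# the card does not list the exact marks it strips.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkeel(text: str) -> str:
    """Remove diacritic marks, leaving only the base letters."""
    return TASHKEEL.sub("", text)

print(strip_tashkeel("بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"))
# → بسم الله الرحمن الرحيم
```

For known reciters, WER is identical with and without tashkeel (5.6122%), meaning every residual word error there is a base-letter error rather than a diacritization slip; for unseen reciters the gap (15.93% vs 14.16%) shows some word errors are diacritics-only.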
## Why Full Fine-Tuning

Quranic recitation (Tajweed) differs substantially from modern spoken Arabic in several dimensions: phonological rules (idgham, ikhfa, madd), recitation style, and the strict requirement to reproduce full tashkeel in the output. LoRA and adapter-based approaches were considered, but full fine-tuning was chosen because:

1. All 12 encoder and 12 decoder layers need to adapt to the Tajweed acoustic domain.
2. Complete diacritic generation (tashkeel) requires the decoder's vocabulary distribution to be fully reshaped, not just steered by adapter weights.
3. With 19,284 training samples across 6 diverse reciters, the dataset is large enough to justify full fine-tuning without severe overfitting.

## Training Details

### Dataset

Training used 19,284 verse-level recordings from 6 Quranic reciters selected from the tarteel-ai/everyayah dataset. Reciters were chosen to maximize acoustic diversity and reduce reciter-specific memorization.

| Reciter | Approximate samples |
|---|---|
| Abdulsamad | 4,269 |
| Abdul Basit | 4,269 |
| Abdullah Basfar | 4,269 |
| Husary | 4,269 |
| Menshawi | 2,846 |
| Minshawi | 4,269 |
| **Total** | **19,284** |

Validation: 500 samples (mixed reciters, not seen during training). Test: 1,000 samples (mixed reciters).

All splits were pre-filtered to remove:

- Audio samples longer than 30 seconds (corrupted signal: full-chapter audio carrying a verse-level label).
- Text samples whose tokenized length exceeds 448 tokens (the Whisper decoder's hard limit).

No truncation is applied — samples exceeding the limit are removed entirely to preserve tashkeel integrity.
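The two pre-filters above can be sketched as a single predicate. The function name, the `tokenize` callable, and the argument layout are assumptions for illustration, not the actual training code:

```python
def keep_sample(audio_array, sampling_rate, text, tokenize,
                max_seconds=30.0, max_label_tokens=448):
    """Return True if a sample survives both pre-filters.

    - Audio longer than 30 s is treated as corrupted (full-chapter
      audio carrying a verse-level label) and dropped.
    - Transcripts longer than the Whisper decoder's 448-token limit
      are dropped whole rather than truncated, so tashkeel is never
      cut off mid-verse.
    """
    duration_seconds = len(audio_array) / sampling_rate
    if duration_seconds > max_seconds:
        return False
    if len(tokenize(text)) > max_label_tokens:
        return False
    return True
```

With 🤗 Datasets this would typically be applied via `dataset.filter(...)`, and `tokenize` would wrap the Whisper processor's tokenizer (e.g. `processor.tokenizer.encode`).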
### Hyperparameters

| Setting | Value |
|---|---|
| Learning rate | 1e-5 |
| LR scheduler | Cosine decay |
| Warmup steps | 500 |
| Effective batch size | 8 (per device) x 4 (gradient accumulation) x 4 (GPUs) = 128 |
| Weight decay | 0.05 |
| Dropout | 0.1 (encoder, decoder, attention) |
| Max steps | 8000 (early stopping on CER, patience = 5) |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled |
| Max grad norm | 1.0 |
| Optimizer | AdamW |
| Primary eval metric | CER with tashkeel |

### Evaluation Configuration

- `per_device_eval_batch_size`: 2 (required to prevent OOM during autoregressive generation on 11 GB VRAM)
- `eval_accumulation_steps`: 8
- `predict_with_generate`: True
- Decoding: greedy (`num_beams=1`) during training evaluation

## Usage

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="NightPrince/stt-arabic-whisper-finetuned-diactires",
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

result = pipe("surah_fatiha.mp3")
print(result["text"])
# Example output: بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model = WhisperForConditionalGeneration.from_pretrained("NightPrince/stt-arabic-whisper-finetuned-diactires")
model.eval()

# audio_array: numpy array sampled at 16000 Hz
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    predicted_ids = model.generate(
        inputs.input_features,
        language="arabic",
        task="transcribe",
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```

## Metrics Definitions

- **CER (with tashkeel)**: Character Error Rate computed on the raw output including all harakat (fatha, damma, kasra, tanwin, shadda, sukun).
This is the primary metric — it directly measures diacritization accuracy.
- **WER (with tashkeel)**: Word Error Rate on the raw output with full tashkeel.
- **WER (normalized)**: Word Error Rate after stripping all tashkeel from both prediction and reference. Measures word-level recognition independent of diacritization.

## Intended Use

This model is designed for transcribing Quranic recitation audio to text with full tashkeel preserved. Suitable applications include:

- Quranic recitation evaluation and feedback tools
- Quran learning applications requiring accurate text alignment
- Islamic education platforms
- Research in Arabic speech recognition with diacritics

## Limitations

- Trained exclusively on Quranic recitation. Performance on non-Quranic Arabic speech will be significantly degraded.
- Optimized for verse-level (ayah-level) audio segments. Very long continuous recitations may require segmentation.
- Tashkeel accuracy reflects the Uthmani script style used in the training corpus.

## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{elnawasany2026whisperquran,
  author       = {Yahya Mohamed Elnawasany},
  title        = {Whisper Small Fine-Tuned for Quranic Arabic ASR with Full Tashkeel},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/NightPrince/stt-arabic-whisper-finetuned-diactires}},
  email        = {yahyaalnwsany39@gmail.com},
  note         = {Portfolio: https://yahya-portfoli-app.netlify.app/}
}
```

**Author**: Yahya Mohamed Elnawasany
**Email**: yahyaalnwsany39@gmail.com
**Portfolio**: https://yahya-portfoli-app.netlify.app/

## Training Infrastructure

Training ran on a shared server with 4 x NVIDIA RTX 2080 Ti GPUs under WSL2, using PyTorch DDP via Accelerate. Total training time to step 1500 was approximately 1.5 hours.
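A multi-GPU DDP run of this shape is typically started with the Accelerate launcher. The sketch below is a hypothetical launch command consistent with the setup described (4 GPUs, fp16); the script name `train.py` is a placeholder, not the actual training script:

```shell
# Hypothetical launch: 4-process DDP with fp16 mixed precision.
# Effective batch size then follows from the per-device batch size (8),
# gradient accumulation (4), and the 4 processes: 8 x 4 x 4 = 128.
accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --mixed_precision fp16 \
  train.py
```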