فتاح — Fattah-2.5B
نموذج لغوي مصري مبني على Qwen3 بتقنية Depth-Up Scaling
Egyptian Arabic LLM Built on Qwen3 with Depth-Up Scaling
Overview
Fattah (فتاح — meaning "the opener" or "the one who opens doors") is a 2.5B parameter Large Language Model specialized for Egyptian Arabic, the most widely spoken Arabic dialect with over 100 million native speakers.
Fattah is built through a novel three-stage pipeline:
- Depth-Up Scaling (DUS) — expanding Qwen3-1.7B from 28 to 40 transformer layers
- Continual Pre-Training (CPT) — trained on a ~8.59B token Egyptian Arabic corpus, processing 5.51B tokens (64.1% of the full dataset)
- Supervised Fine-Tuning (SFT) — 400K Egyptian Arabic instruction-response pairs
⚠️ Note: This is the pre-DPO version (CPT + SFT only). A DPO-aligned version (Fattah-2.5B-v2) is coming soon with improved factual accuracy, reduced hallucination, and better instruction following.
Model Details
| Property | Value |
|---|---|
| Model Name | Fattah-2.5B |
| Base Model | Qwen/Qwen3-1.7B-Base |
| Architecture | Qwen3 (expanded via DUS) |
| Parameters | 2,635,771,904 (~2.64B) |
| Transformer Layers | 40 (expanded from 28) |
| Hidden Size | 2048 |
| Context Length | 64K tokens (YaRN extended) |
| Language | Egyptian Arabic (primary), MSA, English |
| License | Apache 2.0 |
| Training Compute | 2× NVIDIA A6000 48GB |
Training Pipeline
Stage 1 — Depth-Up Scaling (DUS)
Starting from Qwen/Qwen3-1.7B-Base, we applied Depth-Up Scaling surgery — the same technique used in SOLAR-10.7B — to expand the model from 28 to 40 transformer layers, increasing parameter count from 1.7B to ~2.5B without any training.
```
Qwen3-1.7B-Base (28 layers)
        ↓ DUS Surgery
Fattah-DUS (40 layers, ~2.5B)
```
Layer expansion strategy: concatenate layers [0-23] + layers [4-27], creating a deeper model that inherits the base model's knowledge while providing additional capacity for Egyptian Arabic adaptation.
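The surgery itself is just a copy-and-concatenate over the decoder layer list, with every weight inherited from the base model. A minimal sketch (the `front_end`/`back_start` indices below are illustrative stand-ins, not necessarily the exact split used for Fattah):

```python
import copy

def depth_up_scale(layers, front_end, back_start):
    # Keep the first `front_end` layers, then append deep copies of
    # layers[back_start:]. No weight is newly initialized; the deeper
    # stack inherits everything from the base model.
    return list(layers[:front_end]) + [copy.deepcopy(l) for l in layers[back_start:]]

# Toy demo: integers stand in for the 28 Qwen3 decoder layers.
base = list(range(28))
deeper = depth_up_scale(base, front_end=24, back_start=12)
print(len(deeper))  # 40
```

Because the duplicated middle layers repeat computation the model never trained for, quality dips immediately after surgery and is recovered during CPT (see the training journey table below).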
Stage 2 — Continual Pre-Training (CPT)
| Parameter | Value |
|---|---|
| Dataset | Custom Egyptian Arabic corpus (~8.59B tokens total) |
| Tokens processed | 5.51B tokens (64.1% of dataset) |
| Training steps | 42,000 |
| Learning rate | 1e-5 (cosine decay) |
| Sequence length | 4096 |
| Batch size | 2 per GPU × 8 grad accum × 2 GPUs = 32 sequences/step (131,072 tokens/step at 4096 seq length) |
| Framework | ms-swift + DeepSpeed ZeRO-1 |
| Final loss | 1.824 |
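The batch-size and token-count figures in the table are mutually consistent, as a quick check shows:

```python
per_gpu_batch = 2
grad_accum = 8
num_gpus = 2
seq_len = 4096

# Effective tokens consumed per optimizer step
tokens_per_step = per_gpu_batch * grad_accum * num_gpus * seq_len
print(tokens_per_step)  # 131072

# 42,000 steps at this rate matches the stated 5.51B tokens processed
total_tokens = 42_000 * tokens_per_step
print(round(total_tokens / 1e9, 2))  # 5.51
```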
Dataset composition:
- 51.7% Egyptian Arabic (web, subtitles, social media, educational)
- 22.1% Modern Standard Arabic (MSA)
- 13.8% English
- 12.4% Code
Stage 3 — Supervised Fine-Tuning (SFT)
| Parameter | Value |
|---|---|
| Dataset | MBZUAI-Paris/Egyptian-SFT-Mixture (400K samples) |
| Epochs | 2 |
| Learning rate | 5e-6 (cosine decay) |
| Final eval loss | 1.668 |
| Final token accuracy | 67.01% |
| Training time | ~19 hours |
Context Extension — YaRN
After SFT, the context window was extended from 32K to 64K tokens using YaRN (Yet another RoPE extensioN):
"rope_scaling": {
"rope_type": "yarn",
"factor": 2.0,
"original_max_position_embeddings": 32768
}
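With this config, the usable context is the original window multiplied by the YaRN scaling factor; a trivial check of the arithmetic:

```python
rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}

# YaRN rescales RoPE frequencies so that positions up to
# factor × original_max_position_embeddings remain well-behaved.
extended_context = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
print(extended_context)  # 65536
```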
Evaluation Results
All evaluations use zero-shot log-likelihood scoring (the same methodology as the NileChat paper). HellaSwag uses length-normalized accuracy (acc_norm); all other benchmarks use unnormalized accuracy (acc). Published baselines are from the NileChat paper (Table 1); Fattah rows use our custom evaluation harness with identical zero-shot methodology.
Arabic Script Benchmarks — Full Comparison
| Model | Params | MMLU | Belebele | HellaSwag† | PIQA | WinoGrande | OpenBookQA | Avg |
|---|---|---|---|---|---|---|---|---|
| Nile-Chat-12B | 12B | 62.59 | 70.69 | 64.04 | 63.53 | 42.06 | 53.13 | 59.34 |
| gemma-3-12b-it | 12B | 61.55 | 77.00 | 49.49 | 63.53 | 38.03 | 48.86 | 56.41 |
| Qwen2.5-14B-Instruct | 14B | 60.81 | 72.33 | 55.84 | 59.97 | 38.26 | 50.28 | 56.25 |
| Nile-Chat-3x4B-A6B | MoE | 52.13 | 75.44 | 59.30 | 57.91 | 41.16 | 48.39 | 55.72 |
| Nile-Chat-2x4B-A6B | MoE | 52.05 | 73.89 | 59.69 | 62.26 | 41.61 | 44.07 | 55.60 |
| AceGPT-v2-8b-chat | 8B | 55.25 | 73.33 | 53.14 | 58.39 | 39.82 | 47.16 | 54.52 |
| Nile-Chat-4B | 4B | 50.25 | 68.56 | 55.92 | 61.87 | 40.94 | 46.02 | 53.93 |
| c4ai-command-r7b | 7B | 70.67 | 61.84 | 50.39 | 57.20 | 36.91 | 46.02 | 53.84 |
| ALLaM-7B-Instruct | 7B | 67.67 | 66.10 | 57.29 | 62.18 | 40.04 | 67.10 | 60.06 |
| gemma-2-9b-it | 9B | 49.44 | 61.35 | 49.53 | 61.79 | 35.79 | 48.01 | 50.99 |
| jais-adapted-13b-chat | 13B | 50.03 | 65.33 | 47.53 | 56.72 | 37.14 | 41.76 | 49.75 |
| jais-family-13b-chat | 13B | 44.85 | 66.33 | 52.99 | 57.91 | 36.91 | 38.64 | 49.61 |
| jais-family-6p7b-chat | 7B | 42.60 | 57.33 | 49.18 | 62.23 | 33.33 | 37.50 | 47.03 |
| gemma-3-4b-it | 4B | 38.56 | 60.32 | 42.56 | 56.49 | 35.79 | 46.73 | 46.74 |
| Qwen2.5-7B-Instruct | 7B | 64.22 | 58.02 | 45.47 | 56.41 | 38.70 | 11.34 | 45.69 |
| jais-adapted-7b-chat | 7B | 40.96 | 55.67 | 40.85 | 56.50 | 32.89 | 42.33 | 44.87 |
| Llama-3.1-8B-Instruct | 8B | 55.89 | 57.97 | 43.10 | 54.27 | 35.57 | 9.06 | 42.64 |
| Fattah-2.5B (post-SFT) ⭐ | 2.5B | 38.40 | 40.78 | 24.00 | 61.30 | 49.40 | 27.96 | 40.31 |
† HellaSwag uses acc_norm (length-normalized accuracy); all other benchmarks use acc.
‡ Published baselines are from the NileChat paper (Table 1); these are instruction-tuned + RLHF-aligned models.
⭐ Best Fattah checkpoint (pre-DPO).
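Zero-shot log-likelihood scoring picks, for each question, the answer option whose text the model assigns the highest total (for acc) or length-normalized (for acc_norm) log-probability. A minimal sketch with made-up log-probs:

```python
def pick_option(option_logprobs, normalize=False):
    # Each option is (per-token log-probs, character length); the chosen
    # answer is the argmax of total or length-normalized log-likelihood.
    scores = []
    for token_logprobs, char_len in option_logprobs:
        total = sum(token_logprobs)
        scores.append(total / char_len if normalize else total)
    return scores.index(max(scores))

# Toy log-probs for two answer options (fabricated numbers for illustration)
opts = [([-2.0, -3.0], 10), ([-1.0, -1.5], 12)]
print(pick_option(opts))                  # 1  (acc)
print(pick_option(opts, normalize=True))  # 1  (acc_norm)
```

Length normalization matters for benchmarks like HellaSwag, where correct continuations vary widely in length and longer options are otherwise penalized.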
Key Highlights
- PIQA (61.3%) — Fattah outperforms Qwen2.5-7B (56.4%), gemma-3-4b (56.5%), Llama-3.1-8B (54.3%), and all jais models despite being 2.5B
- WinoGrande (49.4%) — Fattah scores higher than every published baseline in the table, including models 3–5× larger
- Average gap — Fattah post-SFT (40.31%) is behind Nile-Chat-4B (53.93%) by 13.6 points; DPO alignment is expected to close this gap significantly
- Comparable baselines — the fairest comparison is gemma-3-4b-it (4B, 46.74%); Fattah is 2.5B and pre-DPO, 6.4 points behind a fully aligned 4B model
Full Training Journey (Base → DUS → CPT → SFT)
| Benchmark | Base 1.7B | DUS 2.5B | Post-CPT | Post-SFT | Net (Base→SFT) |
|---|---|---|---|---|---|
| EgyptianMMLU | 34.07% | 29.20% | 37.07% | 38.40% | +4.33% ✅ |
| EgyptianPIQA | 54.80% | 51.90% | 61.10% | 61.30% | +6.50% ✅ |
| Belebele-Arz | 37.00% | 32.78% | 41.56% | 40.78% | +3.78% ✅ |
| EgyHellaSwag | 25.00% | 23.60% | 21.40% | 24.00% | −1.00% ⚠️ |
| WinoGrande | 49.40% | 49.40% | 49.40% | 49.40% | 0.00% ➡️ |
| OpenBookQA | 21.03% | 17.67% | 27.74% | 27.96% | +6.93% ✅ |
| Average | 36.88% | 34.09% | 39.71% | 40.31% | +3.43% ✅ |
| EGY Perplexity | 18.84 | 46.31 | 6.69 | — | −12.15 ✅ |
Key observations:
- DUS surgery caused an expected temporary regression (34.09%): although the inserted layers are copies of existing ones, duplicating them disrupts the residual stream until further training re-calibrates the deeper stack
- CPT recovered and surpassed the base (39.71%), acquiring strong Egyptian Arabic dialect knowledge
- SFT further improved average to 40.31%, with MMLU +1.33% and HellaSwag recovering from 21.4% → 24.0%
- EGY Perplexity improvement of ×2.8 (18.84 → 6.69) confirms deep dialect acquisition during CPT
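Perplexity is simply the exponential of the mean per-token negative log-likelihood, which is why the CPT loss and the EGY perplexity move together (the table's 6.69 is measured on a held-out Egyptian corpus, so it differs from the raw exp of the training loss):

```python
import math

def perplexity(mean_nll):
    # Perplexity = exp(mean per-token negative log-likelihood)
    return math.exp(mean_nll)

# e.g. the final CPT training loss of 1.824 corresponds to an
# in-distribution perplexity of about:
print(round(perplexity(1.824), 2))  # 6.2
```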
Usage
Installation
```bash
pip install "transformers>=4.51.0" torch accelerate  # quotes stop the shell treating >= as a redirect
```
Basic Chat
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "belal212/Fattah-2.5B-preview"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {
        "role": "system",
        # "You are Fattah, a smart and helpful assistant who speaks Egyptian Arabic."
        "content": "أنت فتاح، مساعد ذكي ومفيد بتتكلم العربي المصري."
    },
    {
        "role": "user",
        # "Tell me about Cairo"
        "content": "كلمني عن القاهرة"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # disable thinking mode for conversational use
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
With Thinking Mode (for complex reasoning)
```python
messages = [
    {
        "role": "system",
        # "You are Fattah, a smart assistant who thinks step by step before answering."
        "content": "أنت فتاح، مساعد ذكي بتفكر خطوة بخطوة قبل ما تجاوب."
    },
    {
        "role": "user",
        # "What's the best sorting algorithm in Python?"
        "content": "ازاي أحسن خوارزمية للـ sorting في Python؟"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # activate <think> mode
)
```
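Assuming the model inherits Qwen3's convention of wrapping its reasoning in `<think>…</think>` tags, the decoded output can be split into the reasoning trace and the final answer with a hypothetical helper like this:

```python
import re

def split_thinking(text):
    """Separate a Qwen3-style '<think>...</think>' block from the final answer."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()  # no thinking block present

# Toy decoded output, not real model output
thinking, answer = split_thinking("<think>Compare built-in sorts.</think> Use sorted(); it's Timsort.")
print(answer)  # Use sorted(); it's Timsort.
```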
Intended Use
Fattah is designed for:
- ✅ Egyptian Arabic conversational AI
- ✅ Question answering in Egyptian dialect
- ✅ Text generation and creative writing in Egyptian Arabic
- ✅ RAG-based knowledge retrieval systems
- ✅ Foundation for Fattah-Coding (Python + React/TS specialist — coming soon)
- ✅ Agent systems requiring Egyptian Arabic understanding
Limitations
- Factual hallucination: As a 2.5B model without DPO alignment, Fattah may confidently generate incorrect facts. A DPO-aligned version is in development.
- Knowledge cutoff: Training data has a knowledge cutoff. Recent events are not known.
- Dialect coverage: Optimized for Egyptian Arabic. Performance on other Arabic dialects is not guaranteed.
- Model size: At 2.5B parameters, Fattah cannot match the factual depth of larger models. Use RAG for knowledge-intensive applications.
- Pre-DPO: This version has not undergone preference optimization. Responses may occasionally be over-cautious or inconsistent in style.
Roadmap
| Version | Status | Description |
|---|---|---|
| Fattah-2.5B | ✅ Released | CPT + SFT, Egyptian Arabic assistant |
| Fattah-2.5B-v2 | 🔄 In progress | + DPO alignment (Egyptian-DPO-Mixture) |
| Fattah-Python-2.5B | ⏳ Planned | Fattah + Python/AI coding specialization |
| Fattah-React-2.5B | ⏳ Planned | Fattah + React/TypeScript specialization |
| Fattah-Coding-MoE | ⏳ Planned | MoE with LLM-gated routing between Python + React experts |
Training Infrastructure
- GPUs: 2× NVIDIA A6000 48GB
- Framework: ms-swift 4.0.2
- Distributed: DeepSpeed ZeRO Stage 1
- Attention: Flash Attention 2.3.6
- Mixed precision: bfloat16
- Total compute: ~60 GPU-hours (CPT) + ~19 GPU-hours (SFT)
Citation
If you use Fattah in your research, please cite:
```bibtex
@misc{fattah2026,
  title = {Fattah: Egyptian Arabic LLM via Depth-Up Scaling and Continual Pre-Training},
  author = {Belal},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/belal212/Fattah-2.5B-preview}},
  note = {Pre-DPO version}
}
```
Acknowledgements
- Qwen Team for the Qwen3-1.7B-Base model
- MBZUAI-Paris for the Egyptian-SFT-Mixture dataset and NileChat benchmarks
- UBC-NLP for the NileChat pre-training corpus
- ms-swift for the training framework
Fattah — Opening the doors of AI for Egyptian Arabic speakers