فتاح — Fattah-2.5B

نموذج لغوي مصري مبني على Qwen3 بتقنية Depth-Up Scaling

Egyptian Arabic LLM Built on Qwen3 with Depth-Up Scaling



Overview

Fattah (فتاح — meaning "the opener" or "the one who opens doors") is a 2.5B parameter Large Language Model specialized for Egyptian Arabic, the most widely spoken Arabic dialect with over 100 million native speakers.

Fattah is built through a novel three-stage pipeline:

  1. Depth-Up Scaling (DUS) — expanding Qwen3-1.7B from 28 to 40 transformer layers
  2. Continual Pre-Training (CPT) — trained on a ~8.59B token Egyptian Arabic corpus, processing 5.51B tokens (64.1% of the full dataset)
  3. Supervised Fine-Tuning (SFT) — 400K Egyptian Arabic instruction-response pairs

⚠️ Note: This is the pre-DPO version (CPT + SFT only). A DPO-aligned version (Fattah-2.5B-v2) is coming soon with improved factual accuracy, reduced hallucination, and better instruction following.


Model Details

| Property | Value |
|---|---|
| Model Name | Fattah-2.5B |
| Base Model | Qwen/Qwen3-1.7B-Base |
| Architecture | Qwen3 (expanded via DUS) |
| Parameters | 2,635,771,904 (~2.64B) |
| Transformer Layers | 40 (expanded from 28) |
| Hidden Size | 2048 |
| Context Length | 64K tokens (YaRN extended) |
| Languages | Egyptian Arabic (primary), MSA, English |
| License | Apache 2.0 |
| Training Compute | 2× NVIDIA A6000 48GB |

Training Pipeline

Stage 1 — Depth-Up Scaling (DUS)

Starting from Qwen/Qwen3-1.7B-Base, we applied Depth-Up Scaling surgery — the same technique used in SOLAR-10.7B — to expand the model from 28 to 40 transformer layers, increasing parameter count from 1.7B to ~2.5B without any training.

Qwen3-1.7B-Base (28 layers)
        ↓  DUS Surgery
Fattah-DUS (40 layers, ~2.5B)

Layer expansion strategy: concatenate layers [0-23] + layers [4-27], creating a deeper model that inherits the base model's knowledge while providing additional capacity for Egyptian Arabic adaptation.
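The duplicate-and-concatenate pattern can be sketched with plain Python lists standing in for transformer blocks. The split indices below (`front_end=24`, `back_start=12`) are illustrative values chosen to reach 40 layers, not necessarily the exact split used for Fattah:

```python
import copy

def depth_up_scale(layers, front_end, back_start):
    """SOLAR-style DUS: keep layers [0, front_end) and append
    deep copies of layers [back_start, len) after them."""
    return layers[:front_end] + [copy.deepcopy(l) for l in layers[back_start:]]

# Toy stand-in: integers instead of transformer blocks.
base = list(range(28))  # 28-layer base model
# Hypothetical split producing a 40-layer model (24 + 16 layers).
expanded = depth_up_scale(base, front_end=24, back_start=12)
print(len(expanded))  # 40
```

In a transformers checkpoint the same slicing would be applied to the `model.model.layers` module list of the loaded Qwen3 model, with `config.num_hidden_layers` updated to match.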

Stage 2 — Continual Pre-Training (CPT)

| Parameter | Value |
|---|---|
| Dataset | Custom Egyptian Arabic corpus |
| Total dataset tokens | ~8.59B |
| Tokens processed | 5.51B (64.1% of dataset) |
| Training steps | 42,000 |
| Learning rate | 1e-5 (cosine decay) |
| Sequence length | 4096 |
| Effective batch | 2 per GPU × 8 grad accum × 2 GPUs = 32 sequences (131,072 tokens/step) |
| Framework | ms-swift + DeepSpeed ZeRO-1 |
| Final loss | 1.824 |
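As a sanity check, the token accounting in the table is internally consistent:

```python
seq_len = 4096
micro_batch = 2   # sequences per GPU
grad_accum = 8
gpus = 2

# 2 × 8 × 2 = 32 sequences per optimizer step, at 4096 tokens each.
tokens_per_step = micro_batch * grad_accum * gpus * seq_len
print(tokens_per_step)  # 131072

steps = 42_000
total_tokens = tokens_per_step * steps
print(total_tokens / 1e9)  # ≈ 5.51 billion tokens, matching the table
```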

Dataset composition:

  • 51.7% Egyptian Arabic (web, subtitles, social media, educational)
  • 22.1% Modern Standard Arabic (MSA)
  • 13.8% English
  • 12.4% Code
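One common way to realize such a mixture is to sample each training document's source in proportion to its token share; the sketch below is a hypothetical illustration of that scheme, not the actual data pipeline:

```python
import random

# Corpus mixture from the composition above (percent of tokens).
mixture = {
    "egyptian_arabic": 51.7,
    "msa": 22.1,
    "english": 13.8,
    "code": 12.4,
}
assert abs(sum(mixture.values()) - 100.0) < 1e-9  # shares sum to 100%

random.seed(0)
sources = list(mixture)
weights = list(mixture.values())
# Draw the source of each training document in proportion to the mixture.
batch = random.choices(sources, weights=weights, k=1000)
# The empirical share of Egyptian Arabic roughly matches 51.7%.
print(batch.count("egyptian_arabic") / len(batch))
```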

Stage 3 — Supervised Fine-Tuning (SFT)

| Parameter | Value |
|---|---|
| Dataset | MBZUAI-Paris/Egyptian-SFT-Mixture (400K samples) |
| Epochs | 2 |
| Learning rate | 5e-6 (cosine decay) |
| Final eval loss | 1.668 |
| Final token accuracy | 67.01% |
| Training time | ~19 hours |

Context Extension — YaRN

After SFT, the context window was extended from 32K to 64K tokens using YaRN (Yet another RoPE extensioN):

"rope_scaling": {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768
}
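The effective context window follows directly from the scaling factor:

```python
# YaRN config fragment from config.json (values from the model card).
rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}

# Effective context = original positions × scaling factor.
effective_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(effective_context)  # 65536 (= 64K tokens)
```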

Evaluation Results

All evaluations use zero-shot log-likelihood scoring (same methodology as NileChat paper). HellaSwag uses length-normalized accuracy (acc_norm); all other benchmarks use unnormalized accuracy (acc).
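The scoring rule can be sketched as follows: each candidate answer is scored by the total log-probability the model assigns to its tokens, and acc_norm divides that total by the answer's token length before comparing. The numbers below are hypothetical, for illustration only:

```python
def pick_answer(choice_logliks, choice_lens, normalize=False):
    """Zero-shot log-likelihood multiple choice: pick the candidate with
    the highest summed token log-probability; with normalize=True
    (acc_norm, used for HellaSwag), compare per-token scores instead."""
    scores = [ll / n if normalize else ll
              for ll, n in zip(choice_logliks, choice_lens)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical log-likelihoods and token lengths for three candidates:
lls, lens = [-12.0, -9.0, -20.0], [8, 3, 10]
print(pick_answer(lls, lens))                  # 1 (best raw log-likelihood)
print(pick_answer(lls, lens, normalize=True))  # 0 (best per-token score)
```

Length normalization matters for benchmarks like HellaSwag whose candidate endings vary widely in length, since longer completions otherwise accumulate lower raw log-likelihoods.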

Arabic Script Benchmarks — Full Comparison

Published baselines are from the NileChat paper (Table 1). Fattah rows use our custom evaluation harness with identical zero-shot methodology.

| Model | Params | MMLU | Belebele | HellaSwag† | PIQA | WinoGrande | OpenBookQA | Avg |
|---|---|---|---|---|---|---|---|---|
| Nile-Chat-12B | 12B | 62.59 | 70.69 | 64.04 | 63.53 | 42.06 | 53.13 | 59.34 |
| gemma-3-12b-it | 12B | 61.55 | 77.00 | 49.49 | 63.53 | 38.03 | 48.86 | 56.41 |
| Qwen2.5-14B-Instruct | 14B | 60.81 | 72.33 | 55.84 | 59.97 | 38.26 | 50.28 | 56.25 |
| Nile-Chat-3x4B-A6B | MoE | 52.13 | 75.44 | 59.30 | 57.91 | 41.16 | 48.39 | 55.72 |
| Nile-Chat-2x4B-A6B | MoE | 52.05 | 73.89 | 59.69 | 62.26 | 41.61 | 44.07 | 55.60 |
| AceGPT-v2-8b-chat | 8B | 55.25 | 73.33 | 53.14 | 58.39 | 39.82 | 47.16 | 54.52 |
| Nile-Chat-4B | 4B | 50.25 | 68.56 | 55.92 | 61.87 | 40.94 | 46.02 | 53.93 |
| c4ai-command-r7b | 7B | 70.67 | 61.84 | 50.39 | 57.20 | 36.91 | 46.02 | 53.84 |
| ALLaM-7B-Instruct | 7B | 67.67 | 66.10 | 57.29 | 62.18 | 40.04 | 67.10 | 60.06 |
| gemma-2-9b-it | 9B | 49.44 | 61.35 | 49.53 | 61.79 | 35.79 | 48.01 | 50.99 |
| jais-adapted-13b-chat | 13B | 50.03 | 65.33 | 47.53 | 56.72 | 37.14 | 41.76 | 49.75 |
| jais-family-13b-chat | 13B | 44.85 | 66.33 | 52.99 | 57.91 | 36.91 | 38.64 | 49.61 |
| jais-family-6p7b-chat | 7B | 42.60 | 57.33 | 49.18 | 62.23 | 33.33 | 37.50 | 47.03 |
| gemma-3-4b-it | 4B | 38.56 | 60.32 | 42.56 | 56.49 | 35.79 | 46.73 | 46.74 |
| Qwen2.5-7B-Instruct | 7B | 64.22 | 58.02 | 45.47 | 56.41 | 38.70 | 11.34 | 45.69 |
| jais-adapted-7b-chat | 7B | 40.96 | 55.67 | 40.85 | 56.50 | 32.89 | 42.33 | 44.87 |
| Llama-3.1-8B-Instruct | 8B | 55.89 | 57.97 | 43.10 | 54.27 | 35.57 | 9.06 | 42.64 |
| Fattah-2.5B (post-SFT) ⭐ | 2.5B | 38.40 | 40.78 | 24.00 | 61.30 | 49.40 | 27.96 | 40.31 |

† HellaSwag uses acc_norm (length-normalized accuracy). All other benchmarks use acc.
‡ Published baselines are from the NileChat paper (Table 1) — these are instruction-tuned + RLHF-aligned models.
⭐ Best Fattah checkpoint (pre-DPO).

Key Highlights

  • PIQA (61.3%) — Fattah outperforms Qwen2.5-7B (56.4%), gemma-3-4b (56.5%), Llama-3.1-8B (54.3%), and all jais models despite having only 2.5B parameters
  • WinoGrande (49.4%) — Fattah scores higher than every published baseline in the table, including models 3–5× larger
  • Average gap — Fattah post-SFT (40.31%) trails Nile-Chat-4B (53.93%) by 13.6 points; DPO alignment is expected to close this gap significantly
  • Comparable baselines — the fairest comparison is gemma-3-4b-it (4B, 46.74%): Fattah is 2.5B and pre-DPO, yet only 6.4 points behind a fully aligned 4B model

Full Training Journey (Base → DUS → CPT → SFT)

| Benchmark | Base 1.7B | DUS 2.5B | Post-CPT | Post-SFT | Net (Base→SFT) |
|---|---|---|---|---|---|
| EgyptianMMLU | 34.07% | 29.20% | 37.07% | 38.40% | +4.33% |
| EgyptianPIQA | 54.80% | 51.90% | 61.10% | 61.30% | +6.50% |
| Belebele-Arz | 37.00% | 32.78% | 41.56% | 40.78% | +3.78% |
| EgyHellaSwag | 25.00% | 23.60% | 21.40% | 24.00% | −1.00% ⚠️ |
| WinoGrande | 49.40% | 49.40% | 49.40% | 49.40% | 0.00% ➡️ |
| OpenBookQA | 21.03% | 17.67% | 27.74% | 27.96% | +6.93% |
| Average | 36.88% | 34.09% | 39.71% | 40.31% | +3.43% |
| EGY Perplexity | 18.84 | 46.31 | 6.69 | — | −12.15 |

Key observations:

  • DUS surgery caused an expected temporary regression (34.09%): the newly inserted layers are untrained copies of existing ones, which initially perturbs the learned layer-to-layer computation
  • CPT recovered and surpassed the base (39.71%), acquiring strong Egyptian Arabic dialect knowledge
  • SFT further improved average to 40.31%, with MMLU +1.33% and HellaSwag recovering from 21.4% → 24.0%
  • EGY Perplexity improvement of ×2.8 (18.84 → 6.69) confirms deep dialect acquisition during CPT
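Perplexity is simply the exponential of the mean per-token negative log-likelihood, so the final CPT loss of 1.824 implies a training-set perplexity of about 6.2 (the table's held-out EGY perplexity of 6.69 is measured on a separate evaluation set, so the two numbers need not match exactly):

```python
import math

def perplexity(mean_nll):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(mean_nll)

# Final CPT training loss from the table above.
print(round(perplexity(1.824), 2))  # 6.2
```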

Usage

Installation

```bash
pip install "transformers>=4.51.0" torch accelerate
```

Basic Chat

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "belal212/Fattah-2.5B-preview"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# System prompt: "You are Fattah, a smart and helpful assistant who speaks
# Egyptian Arabic."  User: "Tell me about Cairo."
messages = [
    {
        "role": "system",
        "content": "أنت فتاح، مساعد ذكي ومفيد بتتكلم العربي المصري."
    },
    {
        "role": "user",
        "content": "كلمني عن القاهرة"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False   # disable thinking mode for conversational use
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode only the newly generated tokens (skip the prompt).
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```

With Thinking Mode (for complex reasoning)

```python
# System prompt: "You are Fattah, a smart assistant who thinks step by step
# before answering."  User: "What is the best sorting algorithm in Python?"
messages = [
    {
        "role": "system",
        "content": "أنت فتاح، مساعد ذكي بتفكر خطوة بخطوة قبل ما تجاوب."
    },
    {
        "role": "user",
        "content": "ازاي أحسن خوارزمية للـ sorting في Python؟"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True   # activate <think> mode
)

# Tokenization and generation proceed exactly as in the Basic Chat example.
```
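With thinking mode enabled, Qwen3-style models emit their reasoning inside a `<think>...</think>` block before the final answer. A minimal way to separate the two, shown here on a toy string rather than real model output:

```python
def split_thinking(generated: str):
    """Separate the <think>...</think> reasoning from the final answer
    in Qwen3-style thinking-mode output."""
    open_tag, close_tag = "<think>", "</think>"
    if close_tag in generated:
        head, _, answer = generated.partition(close_tag)
        thinking = head.replace(open_tag, "").strip()
        return thinking, answer.strip()
    return "", generated.strip()  # no thinking block present

# Toy example of thinking-mode output:
out = "<think>Compare sort options first.</think>Use sorted() for most cases."
thinking, answer = split_thinking(out)
print(answer)  # Use sorted() for most cases.
```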

Intended Use

Fattah is designed for:

  • ✅ Egyptian Arabic conversational AI
  • ✅ Question answering in Egyptian dialect
  • ✅ Text generation and creative writing in Egyptian Arabic
  • ✅ RAG-based knowledge retrieval systems
  • ✅ Foundation for Fattah-Coding (Python + React/TS specialist — coming soon)
  • ✅ Agent systems requiring Egyptian Arabic understanding

Limitations

  • Factual hallucination: As a 2.5B model without DPO alignment, Fattah may confidently generate incorrect facts. A DPO-aligned version is in development.
  • Knowledge cutoff: The model only knows what appeared in its training data; events after the data was collected are unknown to it.
  • Dialect coverage: Optimized for Egyptian Arabic. Performance on other Arabic dialects is not guaranteed.
  • Model size: At 2.5B parameters, Fattah cannot match the factual depth of larger models. Use RAG for knowledge-intensive applications.
  • Pre-DPO: This version has not undergone preference optimization. Responses may occasionally be over-cautious or inconsistent in style.

Roadmap

| Version | Status | Description |
|---|---|---|
| Fattah-2.5B | ✅ Released | CPT + SFT, Egyptian Arabic assistant |
| Fattah-2.5B-v2 | 🔄 In progress | + DPO alignment (Egyptian-DPO-Mixture) |
| Fattah-Python-2.5B | ⏳ Planned | Fattah + Python/AI coding specialization |
| Fattah-React-2.5B | ⏳ Planned | Fattah + React/TypeScript specialization |
| Fattah-Coding-MoE | ⏳ Planned | MoE with LLM-gated routing between Python + React experts |

Training Infrastructure

  • GPUs: 2× NVIDIA A6000 48GB
  • Framework: ms-swift 4.0.2
  • Distributed: DeepSpeed ZeRO Stage 1
  • Attention: Flash Attention 2.3.6
  • Mixed precision: bfloat16
  • Total compute: ~60 GPU-hours (CPT) + ~19 GPU-hours (SFT)

Citation

If you use Fattah in your research, please cite:

@misc{fattah2026,
  title        = {Fattah: Egyptian Arabic LLM via Depth-Up Scaling and Continual Pre-Training},
  author       = {Belal},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/belal212/Fattah-2.5B-preview}},
  note         = {Pre-DPO version}
}

Acknowledgements

  • Qwen Team for the Qwen3-1.7B-Base model
  • MBZUAI-Paris for the Egyptian-SFT-Mixture dataset and NileChat benchmarks
  • UBC-NLP for the NileChat pre-training corpus
  • ms-swift for the training framework

فتاح — بيفتح أبواب الذكاء الاصطناعي للعربي المصري
Fattah — Opening the doors of AI for Egyptian Arabic speakers