# TinyQwen3-Engram-HC-Darija-140M
A compact 140M parameter language model for Moroccan Darija (الدارجة المغربية) built from scratch with a novel custom architecture combining three key innovations:
- 🧠 Engram Layers — n-gram hash embeddings injected at transformer layers [1, 4, 6], providing sub-word priors without additional training data
- 🔀 Hyper-Connections (HC) — multi-stream residual connections with learned gating (4 parallel streams), replacing standard residual connections
- ⚡ Grouped Query Attention (GQA) — 8 query heads with 2 KV groups for efficient attention
- 🔗 Weight-tied embedding + output head to reduce parameter count
This is a base (pretrained) model — it performs text completion, not instruction following. An instruct-tuned version is planned.
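The Engram idea above can be sketched as a hashed n-gram embedding lookup whose output is added to the hidden states at the injection layers. This is an illustrative reconstruction only: the bucket count, the bigram order, and the `hash`-based bucketing are assumptions, not the model's actual code.

```python
import numpy as np

def engram_embed(token_ids, table, ngram=2):
    """Look up a hashed embedding for each token's trailing n-gram (illustrative sketch)."""
    out = np.zeros((len(token_ids), table.shape[1]))
    for i in range(len(token_ids)):
        # n-gram ending at position i (shorter at the start of the sequence)
        gram = tuple(token_ids[max(0, i - ngram + 1): i + 1])
        bucket = hash(gram) % table.shape[0]   # hash into a fixed-size table
        out[i] = table[bucket]
    return out
```

Because identical n-grams always hash to the same bucket, repeated sub-word patterns receive a consistent prior signal with no extra training data, at the cost of possible hash collisions.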
## Model Details
| Parameter | Value |
|---|---|
| Total params | 139.3M |
| Non-embedding params | 61.5M (weight-tied) |
| Engram params | 39.0M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention heads | 8 (2 KV groups) |
| Head dim | 64 |
| FFN hidden dim | 1408 |
| Context length | 1024 tokens |
| Vocab size | 151,669 (Qwen3 tokenizer) |
| Positional encoding | RoPE (base 10,000) |
| Precision | bfloat16 |
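The headline numbers are mutually consistent; a quick back-of-the-envelope check, assuming the table's 61.5M non-embedding figure includes the 39M Engram tables:

```python
vocab, dim = 151_669, 512

embed_params = vocab * dim       # tied embedding + output head, counted once
non_embed_params = 61_500_000    # from the table above (includes Engram)

total = embed_params + non_embed_params
print(f"{total / 1e6:.1f}M")     # 139.2M, matching the ~139.3M total
```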
## Training Details
| Detail | Value |
|---|---|
| Training data | Lyte/AryPretrainingCleaned |
| Clean tokens | ~1.4B |
| Total tokens seen | ~2.15B (including data replay across 2 passes) |
| Optimizer updates | 4,107 |
| Final loss | 2.4820 |
| Effective batch size | 512 (32 × 4 GPUs × 4 grad accum) |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Peak LR | 5e-4 (cosine decay) |
| Warmup | 500 updates |
| Hardware | 4× GPU (DDP) |
| Training time | ~45 minutes |
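The batch and token figures in the table line up: 512 sequences of 1,024 tokens per optimizer update, over 4,107 updates, gives roughly the 2.15B total tokens reported (assuming every sequence is packed to full context length):

```python
micro_batch, num_gpus, grad_accum = 32, 4, 4
eff_batch = micro_batch * num_gpus * grad_accum   # 512 sequences per optimizer update

ctx_len, updates = 1024, 4_107
total_tokens = eff_batch * ctx_len * updates      # tokens seen across all updates

print(eff_batch)                      # 512
print(f"{total_tokens / 1e9:.2f}B")   # 2.15B
```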
## Dataset Cleaning

The original dataset (Lyte/AryPretrainingDeduped-Splits) contained 32M documents but had significant quality issues:
- 10% tiny documents (<20 characters)
- 40% short documents (<100 characters)
- 18% exact duplicates
- Thousands of repeated boilerplate (cookie banners, subscription prompts)
After cleaning: 14.4M documents → ~1.4B clean tokens.
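A minimal sketch of filters matching the issues listed above. The actual cleaning pipeline is not published here: the 100-character threshold and exact-match deduplication are taken from the bullet list, and boilerplate removal is omitted for brevity.

```python
def clean(docs, min_chars=100):
    """Drop short documents and exact duplicates (illustrative, not the real pipeline)."""
    seen, kept = set(), []
    for doc in docs:
        doc = doc.strip()
        if len(doc) < min_chars:   # removes the tiny/short documents
            continue
        if doc in seen:            # removes exact duplicates
            continue
        seen.add(doc)
        kept.append(doc)
    return kept
```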
## Quick Start

### Installation

```bash
pip install torch transformers safetensors huggingface_hub sympy tokenizers
```

### Basic Usage

```python
from huggingface_hub import snapshot_download
import sys

# Download the model files and make the bundled modeling code importable
local_dir = snapshot_download("Lyte/TinyQwen3-Engram-HC-Darija-140M")
sys.path.insert(0, local_dir)
from modeling import load_model, generate

model, tokenizer = load_model(local_dir)

# Generate text (translations of the Darija prompts in comments)
print(generate(model, tokenizer, "المغرب بلاد"))          # "Morocco is a country"
print(generate(model, tokenizer, "كيفاش نقدر"))           # "How can I"
print(generate(model, tokenizer, "الدار البيضاء هي"))     # "Casablanca is"
print(generate(model, tokenizer, "فالمغرب كاين"))         # "In Morocco there is"
print(generate(model, tokenizer, "الدارجة المغربية هي"))  # "Moroccan Darija is"
```
## Generation Parameters

The `generate` function supports fine-grained control over text generation:

```python
output = generate(
    model,
    tokenizer,
    prompt="المغرب بلاد",
    max_new=150,             # Maximum new tokens to generate
    temperature=0.5,         # Sampling temperature (lower = more focused)
    top_k=40,                # Keep only the top-k tokens
    top_p=0.9,               # Nucleus sampling threshold
    min_p=0.02,              # Minimum probability relative to the max
    repetition_penalty=1.3,  # Penalize repeated tokens
    frequency_penalty=0.4,   # Penalize by frequency count
    presence_penalty=0.4,    # Penalize any already-seen token
)
```
| Parameter | Default | Description |
|---|---|---|
| `max_new` | 150 | Maximum number of tokens to generate |
| `temperature` | 0.5 | Controls randomness. Lower = more deterministic, higher = more creative |
| `top_k` | 40 | Only sample from the top-k most likely tokens |
| `top_p` | 0.9 | Nucleus sampling: keep the smallest set of top tokens whose cumulative probability reaches top_p, and discard the tail |
| `min_p` | 0.02 | Discard tokens with probability less than min_p × max probability |
| `repetition_penalty` | 1.3 | Multiplicative penalty for tokens already in the sequence |
| `frequency_penalty` | 0.4 | Additive penalty proportional to how often a token has appeared |
| `presence_penalty` | 0.4 | Flat additive penalty for any token that has appeared at all |
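To make the table concrete, here is an illustrative numpy sketch of how the three probability filters interact. It is a simplification, not the repository's actual sampling code.

```python
import numpy as np

def filter_probs(logits, top_k=40, top_p=0.9, min_p=0.02):
    """Zero out tokens excluded by top-k, top-p, and min-p filtering."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]   # token indices, most likely first
    cum = np.cumsum(probs[order])

    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:top_k]] = True                        # top-k: k most likely tokens
    keep[order[cum - probs[order] > top_p]] = False   # top-p: drop the nucleus tail
    keep &= probs >= min_p * probs.max()              # min-p: drop tokens far below the best

    return np.where(keep, probs, 0.0)  # renormalize before sampling from this
```

Each filter only ever removes tokens, so the surviving set is the intersection of the three; a sampler would renormalize the result and draw from it.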
### Recommended Settings

```python
# Creative / storytelling
generate(model, tokenizer, prompt, temperature=0.8, top_k=60, top_p=0.95,
         repetition_penalty=1.2, frequency_penalty=0.3, presence_penalty=0.3)

# Focused / factual
generate(model, tokenizer, prompt, temperature=0.3, top_k=20, top_p=0.85,
         repetition_penalty=1.5, frequency_penalty=0.5, presence_penalty=0.5)

# Greedy (top_k=1 always picks the most likely token)
generate(model, tokenizer, prompt, temperature=0.1, top_k=1, top_p=1.0,
         repetition_penalty=1.3)
```
## Engram Ablation

You can disable the Engram layers at inference time to compare performance:

```python
# Normal (with Engram)
print(generate(model, tokenizer, "المغرب بلاد"))

# Without Engram: expect degraded output
model.set_skip_engram(True)
print(generate(model, tokenizer, "المغرب بلاد"))

# Re-enable
model.set_skip_engram(False)
```
## Using on GPU

```python
# Automatic (uses CUDA if available)
model, tokenizer = load_model(local_dir)

# Explicit device
model, tokenizer = load_model(local_dir, device="cuda")

# With torch.compile for faster inference
model, tokenizer = load_model(local_dir, compile_model=True)
```
## Limitations
- Base model only — performs text completion, not instruction following or chat
- Limited data — trained on ~1.4B tokens of Darija; may lack knowledge depth
- Short context — 1024 token context window
- No safety training — no RLHF, DPO, or safety filtering applied
- Moroccan Darija focus — may not perform well on MSA or other Arabic dialects
## Citation

```bibtex
@misc{tinyqwen3-engram-hc-darija,
  title={TinyQwen3-Engram-HC-Darija-140M},
  author={Lyte},
  year={2026},
  howpublished={\url{https://huggingface.co/Lyte/TinyQwen3-Engram-HC-Darija-140M}},
}
```
## License

Apache 2.0