# TinyQwen3-Engram-HC-Darija-140M
A compact 140M parameter language model for Moroccan Darija (الدارجة المغربية) built from scratch with a novel custom architecture combining three key innovations:
- 🧠 Engram Layers — n-gram hash embeddings injected at transformer layers [1, 4, 6], providing sub-word priors without additional training data
- 🔀 Hyper-Connections (HC) — multi-stream residual connections with learned gating (4 parallel streams), replacing standard residual connections
- ⚡ Grouped Query Attention (GQA) — 8 query heads with 2 KV groups for efficient attention
- 🔗 Weight-tied embedding + output head to reduce parameter count
This is a base (pretrained) model — it performs text completion, not instruction following. An instruct-tuned version is planned.
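The Engram idea above can be sketched as a hashed n-gram embedding lookup whose output is added to the hidden states at the injection layers. This is an illustrative reconstruction only: the bucket count, the bigram order, and the `hash`-based bucketing are assumptions, not the model's actual code.

```python
import numpy as np

def engram_embed(token_ids, table, ngram=2):
    """Look up a hashed embedding for each token's trailing n-gram (illustrative sketch)."""
    out = np.zeros((len(token_ids), table.shape[1]))
    for i in range(len(token_ids)):
        # n-gram ending at position i (shorter at the start of the sequence)
        gram = tuple(token_ids[max(0, i - ngram + 1): i + 1])
        bucket = hash(gram) % table.shape[0]   # hash into a fixed-size table
        out[i] = table[bucket]
    return out
```

Because identical n-grams always hash to the same bucket, repeated sub-word patterns receive a consistent prior signal with no extra training data, at the cost of possible hash collisions.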
## Model Details
| Parameter | Value |
|---|---|
| Total params | 139.3M |
| Non-embedding params | 61.5M (weight-tied) |
| Engram params | 39.0M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention heads | 8 (2 KV groups) |
| Head dim | 64 |
| FFN hidden dim | 1408 |
| Context length | 1024 tokens |
| Vocab size | 151,669 (Qwen3 tokenizer) |
| Positional encoding | RoPE (base 10,000) |
| Precision | bfloat16 |
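The headline numbers are mutually consistent; a quick back-of-the-envelope check, assuming the table's 61.5M non-embedding figure includes the 39M Engram tables:

```python
vocab, dim = 151_669, 512

embed_params = vocab * dim       # tied embedding + output head, counted once
non_embed_params = 61_500_000    # from the table above (includes Engram)

total = embed_params + non_embed_params
print(f"{total / 1e6:.1f}M")     # 139.2M, matching the ~139.3M total
```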
## Training Details
| Detail | Value |
|---|---|
| Training data | Lyte/AryPretrainingCleaned |
| Clean tokens | ~1.4B |
| Total tokens seen | ~2.15B (including data replay across 2 passes) |
| Optimizer updates | 4,107 |
| Final loss | 2.4820 |
| Effective batch size | 512 (32 × 4 GPUs × 4 grad accum) |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Peak LR | 5e-4 (cosine decay) |
| Warmup | 500 updates |
| Hardware | 4× GPU (DDP) |
| Training time | ~45 minutes |
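The batch and token figures in the table line up: 512 sequences of 1,024 tokens per optimizer update, over 4,107 updates, gives roughly the 2.15B total tokens reported (assuming every sequence is packed to full context length):

```python
micro_batch, num_gpus, grad_accum = 32, 4, 4
eff_batch = micro_batch * num_gpus * grad_accum   # 512 sequences per optimizer update

ctx_len, updates = 1024, 4_107
total_tokens = eff_batch * ctx_len * updates      # tokens seen across all updates

print(eff_batch)                      # 512
print(f"{total_tokens / 1e9:.2f}B")   # 2.15B
```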
## Dataset Cleaning

The original dataset (Lyte/AryPretrainingDeduped-Splits) contained 32M documents but had significant quality issues:
- 10% tiny documents (<20 characters)
- 40% short documents (<100 characters)
- 18% exact duplicates
- Thousands of repeated boilerplate (cookie banners, subscription prompts)
After cleaning: 14.4M documents → ~1.4B clean tokens.
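A minimal sketch of filters matching the issues listed above. The actual cleaning pipeline is not published here: the 100-character threshold and exact-match deduplication are taken from the bullet list, and boilerplate removal is omitted for brevity.

```python
def clean(docs, min_chars=100):
    """Drop short documents and exact duplicates (illustrative, not the real pipeline)."""
    seen, kept = set(), []
    for doc in docs:
        doc = doc.strip()
        if len(doc) < min_chars:   # removes the tiny/short documents
            continue
        if doc in seen:            # removes exact duplicates
            continue
        seen.add(doc)
        kept.append(doc)
    return kept
```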
## Quick Start

### Installation

```bash
pip install torch transformers safetensors huggingface_hub sympy tokenizers
```

### Basic Usage

```python
from huggingface_hub import snapshot_download
import sys

# Download the model files and make the bundled modeling code importable
local_dir = snapshot_download("Lyte/TinyQwen3-Engram-HC-Darija-140M")
sys.path.insert(0, local_dir)
from modeling import load_model, generate

model, tokenizer = load_model(local_dir)

# Generate text (translations of the Darija prompts in comments)
print(generate(model, tokenizer, "المغرب بلاد"))          # "Morocco is a country"
print(generate(model, tokenizer, "كيفاش نقدر"))           # "How can I"
print(generate(model, tokenizer, "الدار البيضاء هي"))     # "Casablanca is"
print(generate(model, tokenizer, "فالمغرب كاين"))         # "In Morocco there is"
print(generate(model, tokenizer, "الدارجة المغربية هي"))  # "Moroccan Darija is"
```
## Generation Parameters

The `generate` function supports fine-grained control over text generation:

```python
output = generate(
    model,
    tokenizer,
    prompt="المغرب بلاد",
    max_new=150,             # Maximum new tokens to generate
    temperature=0.5,         # Sampling temperature (lower = more focused)
    top_k=40,                # Keep only the top-k tokens
    top_p=0.9,               # Nucleus sampling threshold
    min_p=0.02,              # Minimum probability relative to the max
    repetition_penalty=1.3,  # Penalize repeated tokens
    frequency_penalty=0.4,   # Penalize by frequency count
    presence_penalty=0.4,    # Penalize any already-seen token
)
```
| Parameter | Default | Description |
|---|---|---|
| `max_new` | 150 | Maximum number of tokens to generate |
| `temperature` | 0.5 | Controls randomness. Lower = more deterministic, higher = more creative |
| `top_k` | 40 | Only sample from the top-k most likely tokens |
| `top_p` | 0.9 | Nucleus sampling: keep the smallest set of top tokens whose cumulative probability reaches top_p, and discard the tail |
| `min_p` | 0.02 | Discard tokens with probability less than min_p × max probability |
| `repetition_penalty` | 1.3 | Multiplicative penalty for tokens already in the sequence |
| `frequency_penalty` | 0.4 | Additive penalty proportional to how often a token has appeared |
| `presence_penalty` | 0.4 | Flat additive penalty for any token that has appeared at all |
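To make the table concrete, here is an illustrative numpy sketch of how the three probability filters interact. It is a simplification, not the repository's actual sampling code.

```python
import numpy as np

def filter_probs(logits, top_k=40, top_p=0.9, min_p=0.02):
    """Zero out tokens excluded by top-k, top-p, and min-p filtering."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]   # token indices, most likely first
    cum = np.cumsum(probs[order])

    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:top_k]] = True                        # top-k: k most likely tokens
    keep[order[cum - probs[order] > top_p]] = False   # top-p: drop the nucleus tail
    keep &= probs >= min_p * probs.max()              # min-p: drop tokens far below the best

    return np.where(keep, probs, 0.0)  # renormalize before sampling from this
```

Each filter only ever removes tokens, so the surviving set is the intersection of the three; a sampler would renormalize the result and draw from it.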
### Recommended Settings

```python
# Creative / storytelling
generate(model, tokenizer, prompt, temperature=0.8, top_k=60, top_p=0.95,
         repetition_penalty=1.2, frequency_penalty=0.3, presence_penalty=0.3)

# Focused / factual
generate(model, tokenizer, prompt, temperature=0.3, top_k=20, top_p=0.85,
         repetition_penalty=1.5, frequency_penalty=0.5, presence_penalty=0.5)

# Greedy (top_k=1 always picks the most likely token)
generate(model, tokenizer, prompt, temperature=0.1, top_k=1, top_p=1.0,
         repetition_penalty=1.3)
```
## Engram Ablation

You can disable the Engram layers at inference time to compare performance:

```python
# Normal (with Engram)
print(generate(model, tokenizer, "المغرب بلاد"))

# Without Engram: expect degraded output
model.set_skip_engram(True)
print(generate(model, tokenizer, "المغرب بلاد"))

# Re-enable
model.set_skip_engram(False)
```
## Using on GPU

```python
# Automatic (uses CUDA if available)
model, tokenizer = load_model(local_dir)

# Explicit device
model, tokenizer = load_model(local_dir, device="cuda")

# With torch.compile for faster inference
model, tokenizer = load_model(local_dir, compile_model=True)
```
## Limitations
- Base model only — performs text completion, not instruction following or chat
- Limited data — trained on ~1.4B tokens of Darija; may lack knowledge depth
- Short context — 1024 token context window
- No safety training — no RLHF, DPO, or safety filtering applied
- Moroccan Darija focus — may not perform well on MSA or other Arabic dialects
## Citation

```bibtex
@misc{tinyqwen3-engram-hc-darija,
  title={TinyQwen3-Engram-HC-Darija-140M},
  author={Lyte},
  year={2026},
  howpublished={\url{https://huggingface.co/Lyte/TinyQwen3-Engram-HC-Darija-140M}},
}
```
## License

Apache 2.0