
TinyQwen3-Engram-HC-Darija-140M

A compact 140M-parameter language model for Moroccan Darija (الدارجة المغربية), built from scratch with a custom architecture that combines four key features:

  • 🧠 Engram Layers — n-gram hash embeddings injected at transformer layers [1, 4, 6], providing sub-word priors without additional training data
  • 🔀 Hyper-Connections (HC) — multi-stream residual connections with learned gating (4 parallel streams), replacing standard residual connections
  • ⚡ Grouped Query Attention (GQA) — 8 query heads sharing 2 KV groups for efficient attention
  • 🔗 Weight-tied embedding and output head to reduce parameter count
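
The core trick behind the Engram layers, mapping token n-grams to rows of a fixed hash table, can be sketched in a few lines. The rolling hash, the prime, the table size, and the n-gram order below are all illustrative assumptions, not the model's actual implementation:

```python
def ngram_bucket(ids, n=3, table_size=100_000, prime=1_000_003):
    """Map each n-gram of token ids to a bucket in a fixed hash table.

    Illustrative sketch only: the real hash function, n-gram orders,
    and table size used by this model are not specified in this card.
    """
    buckets = []
    for i in range(n - 1, len(ids)):
        h = 0
        for tok in ids[i - n + 1 : i + 1]:
            h = (h * prime + tok) % table_size
        buckets.append(h)
    return buckets

# Each bucket indexes an embedding row that is added to the hidden
# states at the injection layers ([1, 4, 6] in this model).
print(ngram_bucket([5, 9, 12, 9, 12]))  # [84, 126, 147]
```

Because the buckets depend only on the token ids, these priors come for free at training time: no extra data is needed, only the hash-table parameters.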

This is a base (pretrained) model — it performs text completion, not instruction following. An instruct-tuned version is planned.

Model Details

| Parameter | Value |
|---|---|
| Total params | 139.3M |
| Non-embedding params | 61.5M (weight-tied) |
| Engram params | 39.0M |
| Embedding dim | 512 |
| Layers | 8 |
| Attention heads | 8 (2 KV groups) |
| Head dim | 64 |
| FFN hidden dim | 1408 |
| Context length | 1024 tokens |
| Vocab size | 151,669 (Qwen3 tokenizer) |
| Positional encoding | RoPE (base 10,000) |
| Precision | bfloat16 |
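
As a quick sanity check on these numbers: the tied embedding matrix alone accounts for most of the parameters, and adding it to the non-embedding count recovers the total (up to rounding):

```python
vocab, dim = 151_669, 512
embedding = vocab * dim        # 77,654,528: tied input/output matrix, counted once
non_embedding = 61_500_000     # from the table above (rounded)
total = embedding + non_embedding
print(f"{embedding / 1e6:.1f}M embedding + 61.5M other = {total / 1e6:.1f}M")
```

This prints roughly 77.7M + 61.5M = 139.2M, consistent with the 139.3M total given the rounding of the non-embedding count.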

Training Details

| Detail | Value |
|---|---|
| Training data | Lyte/AryPretrainingCleaned |
| Clean tokens | ~1.4B |
| Total tokens seen | ~2.15B (including data replay across 2 passes) |
| Optimizer updates | 4,107 |
| Final loss | 2.4820 |
| Effective batch size | 512 (32 × 4 GPUs × 4 grad accum) |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Peak LR | 5e-4 (cosine decay) |
| Warmup | 500 updates |
| Hardware | 4× GPU (DDP) |
| Training time | ~45 minutes |
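
The batch and token counts in the table are internally consistent, which is easy to verify with a little arithmetic (a quick check, not part of the training code):

```python
per_gpu_batch, gpus, grad_accum = 32, 4, 4
context_len, updates = 1024, 4_107

effective_batch = per_gpu_batch * gpus * grad_accum  # sequences per optimizer step
tokens_per_update = effective_batch * context_len
total_tokens = tokens_per_update * updates
print(effective_batch)        # 512
print(total_tokens / 1e9)     # 2.153250816, i.e. the ~2.15B tokens seen
```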

Dataset Cleaning

The original dataset (Lyte/AryPretrainingDeduped-Splits) contained 32M documents but had significant quality issues:

  • 10% tiny documents (<20 characters)
  • 40% short documents (<100 characters)
  • 18% exact duplicates
  • Thousands of repeated boilerplate passages (cookie banners, subscription prompts)

After cleaning: 14.4M documents → ~1.4B clean tokens.
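
A cleaning pass along these lines (length filter, exact-duplicate removal by hash, boilerplate phrase filter) can be sketched as follows; the thresholds and the boilerplate list are hypothetical, not the actual pipeline:

```python
import hashlib

# Hypothetical boilerplate phrases; the real filter list is not published.
BOILERPLATE = ("accept cookies", "subscribe to our newsletter")

def clean(docs, min_chars=100):
    """Drop short documents, exact duplicates, and known boilerplate."""
    seen, kept = set(), []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:                        # tiny/short docs
            continue
        if any(p in text.lower() for p in BOILERPLATE):  # boilerplate
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                               # exact duplicate
            continue
        seen.add(digest)
        kept.append(text)
    return kept

docs = ["hi", "x" * 120, "x" * 120, "please accept cookies " + "y" * 100]
print(len(clean(docs)))  # 1
```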

Quick Start

Installation

```bash
pip install torch transformers safetensors huggingface_hub sympy tokenizers
```

Basic Usage

```python
from huggingface_hub import snapshot_download
import sys

# Download model
local_dir = snapshot_download("Lyte/TinyQwen3-Engram-HC-Darija-140M")
sys.path.insert(0, local_dir)

from modeling import load_model, generate

model, tokenizer = load_model(local_dir)

# Generate text
print(generate(model, tokenizer, "المغرب بلاد"))
print(generate(model, tokenizer, "كيفاش نقدر"))
print(generate(model, tokenizer, "الدار البيضاء هي"))
print(generate(model, tokenizer, "فالمغرب كاين"))
print(generate(model, tokenizer, "الدارجة المغربية هي"))
```

Generation Parameters

The generate function supports fine-grained control over text generation:

```python
output = generate(
    model,
    tokenizer,
    prompt="المغرب بلاد",
    max_new=150,              # Maximum new tokens to generate
    temperature=0.5,          # Sampling temperature (lower = more focused)
    top_k=40,                 # Keep top-k tokens
    top_p=0.9,                # Nucleus sampling threshold
    min_p=0.02,               # Minimum probability relative to max
    repetition_penalty=1.3,   # Penalize repeated tokens
    frequency_penalty=0.4,    # Penalize by frequency count
    presence_penalty=0.4,     # Penalize any already-seen token
)
```
| Parameter | Default | Description |
|---|---|---|
| max_new | 150 | Maximum number of new tokens to generate |
| temperature | 0.5 | Controls randomness. Lower = more deterministic, higher = more creative |
| top_k | 40 | Only sample from the top-k most likely tokens |
| top_p | 0.9 | Nucleus sampling — only sample from the smallest set of tokens whose cumulative probability exceeds top_p |
| min_p | 0.02 | Discard tokens with probability less than min_p × max probability |
| repetition_penalty | 1.3 | Multiplicative penalty for tokens already in the sequence |
| frequency_penalty | 0.4 | Additive penalty proportional to how often a token has appeared |
| presence_penalty | 0.4 | Flat additive penalty for any token that has appeared at all |
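
One common way these three penalties combine (the exact behaviour of this model's generate may differ in detail) is: repetition_penalty rescales the logit of every already-seen token, frequency_penalty subtracts an amount that grows with each occurrence, and presence_penalty subtracts a flat amount once:

```python
from collections import Counter

def apply_penalties(logits, generated, rep=1.3, freq=0.4, pres=0.4):
    """Apply the three penalties to a list of logits.

    Sketch of a common interpretation (repetition multiplicative,
    frequency and presence additive); this model's generate() may
    differ in detail.
    """
    counts = Counter(generated)
    out = list(logits)
    for tok, n in counts.items():
        l = out[tok]
        l = l / rep if l > 0 else l * rep  # repetition: push toward unlikely
        l -= freq * n                      # frequency: grows with each repeat
        l -= pres                          # presence: flat, once per token
        out[tok] = l
    return out

logits = [2.0, 0.5, -1.0]
print(apply_penalties(logits, generated=[0, 0, 2]))
```

Token 1 never appeared, so its logit is untouched; tokens 0 and 2 are pushed down, token 0 more strongly because it appeared twice.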

Recommended Settings

```python
# Creative / storytelling
generate(model, tokenizer, prompt, temperature=0.8, top_k=60, top_p=0.95,
         repetition_penalty=1.2, frequency_penalty=0.3, presence_penalty=0.3)

# Focused / factual
generate(model, tokenizer, prompt, temperature=0.3, top_k=20, top_p=0.85,
         repetition_penalty=1.5, frequency_penalty=0.5, presence_penalty=0.5)

# Greedy (most likely)
generate(model, tokenizer, prompt, temperature=0.1, top_k=1, top_p=1.0,
         repetition_penalty=1.3)
```

Engram Ablation

You can disable the Engram layers at inference time to compare performance:

```python
# Normal (with Engram)
print(generate(model, tokenizer, "المغرب بلاد"))

# Without Engram — observe degradation
model.set_skip_engram(True)
print(generate(model, tokenizer, "المغرب بلاد"))

# Re-enable
model.set_skip_engram(False)
```

Using on GPU

```python
# Automatic (uses CUDA if available)
model, tokenizer = load_model(local_dir)

# Explicit device
model, tokenizer = load_model(local_dir, device="cuda")

# With torch.compile for faster inference
model, tokenizer = load_model(local_dir, compile_model=True)
```

Limitations

  • Base model only — performs text completion, not instruction following or chat
  • Limited data — trained on ~1.4B tokens of Darija; may lack knowledge depth
  • Short context — 1024 token context window
  • No safety training — no RLHF, DPO, or safety filtering applied
  • Moroccan Darija focus — may not perform well on MSA or other Arabic dialects

Citation

@misc{tinyqwen3-engram-hc-darija,
  title={TinyQwen3-Engram-HC-Darija-140M},
  author={Lyte},
  year={2026},
  howpublished={\url{https://huggingface.co/Lyte/TinyQwen3-Engram-HC-Darija-140M}},
}

License

Apache 2.0
