AVALON-2B
Adaptive Vision-Augmented Language ON-device
The First Sub-3B Self-Reflective Language Model
Paper | GitHub | GGUF Version
Model Description
AVALON-2B is the first sub-3B parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built upon Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval decision tokens without external retrieval infrastructure.
Key Innovations
- First Sub-3B Self-RAG: Breaks the 7B parameter barrier for self-reflective capabilities
- On-Device Ready: 1.5GB quantized (Q4_K_M) runs at 40+ tok/s on Apple M3
- 82.5% Token Accuracy: Reliable generation of `[Retrieval]`, `[No Retrieval]`, and `[Utility:X]` tokens
- No Catastrophic Forgetting: +0.41-point MMLU improvement over the base model
Self-RAG Tokens
AVALON generates special reflection tokens to enable adaptive retrieval:
| Token | Purpose | Example Use |
|---|---|---|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |
Benchmarks
| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|---|---|---|---|---|---|
| AVALON-2B | 1.88B | 62.04 | 64.14 | 42.75 | 82.5% |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |
vs Gemma 4 E2B (Head-to-Head)
| Metric | AVALON-2B | Gemma 4 E2B |
|---|---|---|
| Knowledge Accuracy | 100% | 75% |
| Tool Calling (XML) | 100% | 50% |
| Inference Speed | 25.7 tok/s | 11.2 tok/s |
| Self-Reflective Tokens | Yes | No |
On-Device Performance
Tested with Q4_K_M quantization (1.5GB):
| Device | Chip | Speed (tok/s) | Memory |
|---|---|---|---|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |
Usage
Recommended System Prompt
For best performance, use this system prompt:

```
You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it.
```
Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")

SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Who won the 2024 US presidential election?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```
Ollama (Recommended for Local Use)
```bash
# Download GGUF version
ollama pull nuroai/avalon-2b

# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```
llama.cpp
```bash
# Download Q4_K_M GGUF (1.5GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf

# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" -n 128
```
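Whichever runtime is used, the reflection token can gate an external retriever before the final answer is produced. A minimal routing sketch with stubbed `generate_fn` and `retrieve_fn` (both hypothetical; swap in a real model call and a real search backend):

```python
def answer(question: str, generate_fn, retrieve_fn) -> str:
    """Route between parametric answering and external retrieval using the
    model's own [Retrieval] / [No Retrieval] decision."""
    draft = generate_fn(question)
    if draft.startswith("[Retrieval]"):
        # The model asked for external knowledge: fetch context, then regenerate.
        context = retrieve_fn(question)
        return generate_fn(f"Context: {context}\n\nQuestion: {question}")
    return draft

# Stubs standing in for a real model call and a real search backend.
def fake_model(prompt: str) -> str:
    if "Context:" in prompt:
        return "Based on the retrieved context, here is what happened...[Utility:4]"
    if "news" in prompt:
        return "[Retrieval]"
    return "[No Retrieval]Tokyo.[Utility:5]"

def fake_search(query: str) -> str:
    return "placeholder search results"

print(answer("What is the capital of Japan?", fake_model, fake_search))
print(answer("What happened in the news today?", fake_model, fake_search))
```

The second call demonstrates the two-pass path: the model first emits `[Retrieval]`, the retriever supplies context, and generation runs again with that context prepended.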
Training Details
Architecture
- Base Model: Qwen 3.5 2B (18 GDN + 6 Softmax attention layers)
- Parameters: 1.88B total
- Context Length: 32K tokens
- Vocabulary: 248K tokens (including Self-RAG special tokens)
Training Configuration
- Method: LoRA with `modules_to_save=["embed_tokens", "lm_head"]`
- Data: 201K synthetic samples (80% Self-RAG, 20% general)
- Hardware: 8x NVIDIA A100 80GB
- Duration: ~6 hours
- Learning Rate: 2e-5
- Epochs: 2
Critical Insight
The key to successful Self-RAG at sub-3B scale is including embedding layers in training:
```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # CRITICAL for Self-RAG tokens
    lora_dropout=0.05,
)
```
Without `modules_to_save`, Self-RAG token accuracy drops from 82.5% to 12%.
Quantization
| Format | Size | MMLU | Self-RAG | Speed |
|---|---|---|---|---|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| Q4_K_M | 1.5 GB | 61.42% | 80.5% | 1.8x |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |
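As a sanity check on the table, the effective storage cost per parameter follows from file size and parameter count. A rough sketch (file sizes include tokenizer and metadata overhead, so figures land somewhat above the nominal bit-width):

```python
def bytes_per_param(file_size_gb: float, n_params_b: float = 1.88) -> float:
    """Effective storage cost per parameter, in bytes."""
    return round(file_size_gb * 1e9 / (n_params_b * 1e9), 2)

# Sizes copied from the quantization table above.
for fmt, size_gb in [("BF16", 4.5), ("Q8_0", 2.3), ("Q4_K_M", 1.5), ("Q4_K_S", 1.3)]:
    print(fmt, bytes_per_param(size_gb))
```

Q4_K_M lands at roughly 0.8 bytes/parameter, which is what makes the 1.5 GB on-device footprint possible.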
Limitations
- Mathematical Reasoning: GSM8K performance drops 4.5% due to token interruption in multi-step calculations
- English Only: Trained and evaluated exclusively on English data
- Static Retrieval Decision: Binary decision at generation start; cannot adapt mid-response
Citation
```bibtex
@article{ponnada2026avalon,
  title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
  author={Ponnada, Akhil and Arvapalli, Naga Sri},
  journal={arXiv preprint},
  year={2026}
}
```
License
Apache 2.0
Authors
- Akhil Ponnada - akhil@nuroailabs.com
- Naga Sri Arvapalli - nagasri3007@gmail.com
Contact
- Organization: Nuro AI Labs Limited
- GitHub: Nuro-Labs/avalon-2b