---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- self-rag
- retrieval-augmented-generation
- qwen
- text-generation
- on-device
- edge-ai
- reflection-tokens
base_model: Qwen/Qwen3.5-2B
datasets:
- custom
pipeline_tag: text-generation
model-index:
- name: Avalon-2B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU
type: cais/mmlu
metrics:
- name: Accuracy
type: accuracy
value: 62.04
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag
type: Rowan/hellaswag
metrics:
- name: Accuracy (Normalized)
type: accuracy
value: 64.14
- task:
type: text-generation
name: Text Generation
dataset:
name: ARC-Challenge
type: allenai/ai2_arc
metrics:
- name: Accuracy (Normalized)
type: accuracy
value: 42.75
---
# AVALON-2B
### Adaptive Vision-Augmented Language ON-device
**The First Sub-3B Self-Reflective Language Model**
[Paper](https://github.com/Nuro-Labs/avalon-2b) | [GitHub](https://github.com/Nuro-Labs/avalon-2b) | [GGUF Version](https://huggingface.co/nuroai/Avalon-2B-GGUF)
## Model Description
**AVALON-2B** is the first sub-3B parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built upon Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval decision tokens without external retrieval infrastructure.
### Key Innovations
- **First Sub-3B Self-RAG**: Breaks the 7B parameter barrier for self-reflective capabilities
- **On-Device Ready**: the 1.5 GB Q4_K_M quantization runs at 40+ tok/s on an Apple M3
- **82.5% Token Accuracy**: Reliable generation of `[Retrieval]`, `[No Retrieval]`, and `[Utility:X]` tokens
- **No Catastrophic Forgetting**: +0.41% MMLU improvement over base model
### Self-RAG Tokens
AVALON generates special reflection tokens to enable adaptive retrieval:
| Token | Purpose | Example Use |
|-------|---------|-------------|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |
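The reflection tokens above can be stripped out of raw model output with a small post-processing helper. A minimal sketch; the token strings come from the table above, while the function name and return shape are illustrative:

```python
import re

# Reflection tokens emitted by AVALON (see the table above).
RETRIEVAL_RE = re.compile(r"\[(Retrieval|No Retrieval)\]")
UTILITY_RE = re.compile(r"\[Utility:([1-5])\]")

def parse_reflection(text: str) -> dict:
    """Extract the retrieval decision, utility score, and cleaned answer text."""
    retrieval = RETRIEVAL_RE.search(text)
    utility = UTILITY_RE.search(text)
    answer = UTILITY_RE.sub("", RETRIEVAL_RE.sub("", text)).strip()
    return {
        "needs_retrieval": retrieval.group(1) == "Retrieval" if retrieval else None,
        "utility": int(utility.group(1)) if utility else None,
        "answer": answer,
    }

print(parse_reflection("[No Retrieval]Paris is the capital of France.[Utility:5]"))
# {'needs_retrieval': False, 'utility': 5, 'answer': 'Paris is the capital of France.'}
```

A downstream application can branch on `needs_retrieval` to decide whether to call a retriever before showing the answer.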
## Benchmarks
| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|-------|--------|------|-----------|-------|----------|
| **AVALON-2B** | 1.88B | **62.04** | **64.14** | 42.75 | **82.5%** |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |
### vs Gemma 4 E2B (Head-to-Head)
| Metric | AVALON-2B | Gemma 4 E2B |
|--------|-----------|-------------|
| Knowledge Accuracy | **100%** | 75% |
| Tool Calling (XML) | **100%** | 50% |
| Inference Speed | **25.7 tok/s** | 11.2 tok/s |
| Self-Reflective Tokens | **Yes** | No |
## On-Device Performance
Tested with Q4_K_M quantization (1.5GB):
| Device | Chip | Speed (tok/s) | Memory |
|--------|------|---------------|--------|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |
## Usage
### Recommended System Prompt
For best performance, use this system prompt:
```
You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it.
```
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it."""
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Who won the 2024 US presidential election?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```
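The reflection tokens make an adaptive-retrieval loop straightforward to wire up around the `generate` call above. A hedged sketch: `run_model` stands in for the transformers pipeline shown earlier, and `web_search` is a hypothetical stub for whatever retriever you use; neither is part of this repository.

```python
# Sketch of an adaptive-retrieval loop driven by AVALON's reflection tokens.
# run_model is a deterministic stand-in for model.generate(); web_search is a
# placeholder for a real retriever. Both are illustrative, not part of the model.

def run_model(prompt: str, context: str = "") -> str:
    # Stand-in: a real implementation would call model.generate() as shown above.
    if "news" in prompt:
        return "[Retrieval]I need current information to answer this.[Utility:3]"
    return "[No Retrieval]The capital of Japan is Tokyo.[Utility:5]"

def web_search(query: str) -> str:
    return "stub search results for: " + query  # replace with a real retriever

def answer(prompt: str) -> str:
    draft = run_model(prompt)
    if draft.startswith("[Retrieval]"):
        # The model asked for external knowledge: retrieve, then regenerate
        # with the retrieved passages prepended as context.
        draft = run_model(prompt, context=web_search(prompt))
    return draft

print(answer("What is the capital of Japan?"))
# [No Retrieval]The capital of Japan is Tokyo.[Utility:5]
```

Because the retrieval decision is emitted at the start of generation (see Limitations), the loop only needs to inspect the first token of the draft before deciding whether to retrieve.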
### Ollama (Recommended for Local Use)
```bash
# Download GGUF version
ollama pull nuroai/avalon-2b
# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```
### llama.cpp
```bash
# Download Q4_K_M GGUF (1.5GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf
# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" -n 128
```
## Training Details
### Architecture
- **Base Model**: Qwen 3.5 2B (18 GDN + 6 Softmax attention layers)
- **Parameters**: 1.88B total
- **Context Length**: 32K tokens
- **Vocabulary**: 248K tokens (including Self-RAG special tokens)
### Training Configuration
- **Method**: LoRA with `modules_to_save=["embed_tokens", "lm_head"]`
- **Data**: 201K synthetic samples (80% Self-RAG, 20% general)
- **Hardware**: 8x NVIDIA A100 80GB
- **Duration**: ~6 hours
- **Learning Rate**: 2e-5
- **Epochs**: 2
### Critical Insight
The key to successful Self-RAG at sub-3B scale is including embedding layers in training:
```python
peft_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
modules_to_save=["embed_tokens", "lm_head"], # CRITICAL for Self-RAG tokens
lora_dropout=0.05,
)
```
Without `modules_to_save`, Self-RAG token accuracy drops from 82.5% to 12%.
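For `modules_to_save` to matter, the reflection tokens must first exist in the vocabulary. A hedged setup sketch, assuming the tokens are registered as additional special tokens before fine-tuning; the token strings come from this card, while the exact base-model handling is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-2B")

# Register the Self-RAG reflection tokens so each maps to a dedicated embedding.
reflection_tokens = ["[Retrieval]", "[No Retrieval]"] + [f"[Utility:{i}]" for i in range(1, 6)]
tokenizer.add_special_tokens({"additional_special_tokens": reflection_tokens})
model.resize_token_embeddings(len(tokenizer))

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # trains the new token embeddings
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)
```

Without the `resize_token_embeddings` call, the new special tokens would map outside the embedding matrix; without `modules_to_save`, their freshly initialized embeddings would stay frozen, which is consistent with the accuracy collapse noted below.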
## Quantization
| Format | Size | MMLU | Self-RAG | Speed |
|--------|------|------|----------|-------|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| **Q4_K_M** | **1.5 GB** | **61.42%** | **80.5%** | **1.8x** |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |
## Limitations
- **Mathematical Reasoning**: GSM8K performance drops by 4.5% because reflection tokens interrupt multi-step calculations
- **English Only**: Trained and evaluated exclusively on English data
- **Static Retrieval Decision**: Binary decision at generation start; cannot adapt mid-response
## Citation
```bibtex
@article{ponnada2026avalon,
title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
author={Ponnada, Akhil and Arvapalli, Naga Sri},
journal={arXiv preprint},
year={2026}
}
```
## License
Apache 2.0
## Authors
- **Akhil Ponnada** - akhil@nuroailabs.com
- **Naga Sri Arvapalli** - nagasri3007@gmail.com
## Contact
- **Organization**: [Nuro AI Labs Limited](https://nuro.one)
- **GitHub**: [Nuro-Labs/avalon-2b](https://github.com/Nuro-Labs/avalon-2b)