AVALON-2B

Adaptive Vision-Augmented Language ON-device

The First Sub-3B Self-Reflective Language Model


Paper | GitHub | GGUF Version

Model Description

AVALON-2B is the first sub-3B parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built upon Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval decision tokens without external retrieval infrastructure.

Key Innovations

  • First Sub-3B Self-RAG: Breaks the 7B parameter barrier for self-reflective capabilities
  • On-Device Ready: 1.5GB quantized (Q4_K_M) runs at 40+ tok/s on Apple M3
  • 82.5% Token Accuracy: Reliable generation of [Retrieval], [No Retrieval], and [Utility:X] tokens
  • No Catastrophic Forgetting: +0.41 MMLU points over the base model

Self-RAG Tokens

AVALON generates special reflection tokens to enable adaptive retrieval:

| Token | Purpose | Example Use |
|---|---|---|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |
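Because the reflection tokens are emitted inline with the answer text, downstream code needs to strip and interpret them. A minimal parsing sketch (the helper below is a hypothetical example, not part of the model release):

```python
import re

# Matches the three reflection token families described above.
REFLECTION_RE = re.compile(r"\[(Retrieval|No Retrieval|Utility:[1-5])\]")

def parse_reflection(output: str):
    """Return (retrieval_decision, utility_score, answer_text) from a raw completion."""
    tokens = REFLECTION_RE.findall(output)
    decision = next((t for t in tokens if t in ("Retrieval", "No Retrieval")), None)
    utility = next((int(t.split(":")[1]) for t in tokens if t.startswith("Utility:")), None)
    answer = REFLECTION_RE.sub("", output).strip()
    return decision, utility, answer

decision, utility, answer = parse_reflection(
    "[No Retrieval]Paris is the capital of France.[Utility:5]"
)
# decision == "No Retrieval", utility == 5, answer == "Paris is the capital of France."
```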

Benchmarks

| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|---|---|---|---|---|---|
| AVALON-2B | 1.88B | 62.04 | 64.14 | 42.75 | 82.5% |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |

vs Gemma 4 E2B (Head-to-Head)

| Metric | AVALON-2B | Gemma 4 E2B |
|---|---|---|
| Knowledge Accuracy | 100% | 75% |
| Tool Calling (XML) | 100% | 50% |
| Inference Speed | 25.7 tok/s | 11.2 tok/s |
| Self-Reflective Tokens | Yes | No |

On-Device Performance

Tested with Q4_K_M quantization (1.5GB):

| Device | Chip | Speed (tok/s) | Memory |
|---|---|---|---|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |

Usage

Recommended System Prompt

For best performance, use this system prompt:

```
You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it.
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")

SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Who won the 2024 US presidential election?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```
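When the model emits `[Retrieval]`, the application is responsible for fetching context and prompting again. A minimal routing sketch, with the model and retriever stubbed out as callables (this wiring is an assumption for illustration, not part of the model release):

```python
def answer_with_self_rag(question, generate_fn, retrieve_fn):
    """generate_fn(prompt) -> model text; retrieve_fn(query) -> context string."""
    draft = generate_fn(question)
    if draft.startswith("[Retrieval]"):
        # Model asked for external knowledge: fetch context and generate again.
        context = retrieve_fn(question)
        return generate_fn(f"Context: {context}\n\nQuestion: {question}")
    return draft

# Stubbed example: a fake model that requests retrieval for "news" questions.
def fake_generate(prompt):
    if "Context:" in prompt:
        return "[No Retrieval]Based on the context, ...[Utility:4]"
    if "news" in prompt:
        return "[Retrieval]I need current information.[Utility:3]"
    return "[No Retrieval]Tokyo.[Utility:5]"

print(answer_with_self_rag("What is in the news today?", fake_generate, lambda q: "headline text"))
# Output: [No Retrieval]Based on the context, ...[Utility:4]
```

In a real deployment `generate_fn` would wrap the `model.generate` call above and `retrieve_fn` would call a search or vector-database backend.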

Ollama (Recommended for Local Use)

```bash
# Download GGUF version
ollama pull nuroai/avalon-2b

# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```

llama.cpp

```bash
# Download Q4_K_M GGUF (1.5GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf

# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" -n 128
```

Training Details

Architecture

  • Base Model: Qwen 3.5 2B (18 GDN + 6 Softmax attention layers)
  • Parameters: 1.88B total
  • Context Length: 32K tokens
  • Vocabulary: 248K tokens (including Self-RAG special tokens)

Training Configuration

  • Method: LoRA with modules_to_save=["embed_tokens", "lm_head"]
  • Data: 201K synthetic samples (80% Self-RAG, 20% general)
  • Hardware: 8x NVIDIA A100 80GB
  • Duration: ~6 hours
  • Learning Rate: 2e-5
  • Epochs: 2

Critical Insight

The key to successful Self-RAG at sub-3B scale is including embedding layers in training:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # CRITICAL for Self-RAG tokens
    lora_dropout=0.05,
)
```

Without modules_to_save, Self-RAG token accuracy drops from 82.5% to 12%.
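The card does not spell out how token accuracy is measured; a plausible sketch (our assumption, not the authors' published protocol) is exact-match of the leading retrieval decision against a gold label per example:

```python
import re

def leading_token(text):
    """Extract the retrieval decision token at the start of a completion, if any."""
    m = re.match(r"\[(Retrieval|No Retrieval)\]", text)
    return m.group(1) if m else None

def token_accuracy(outputs, gold):
    """Fraction of completions whose leading reflection token matches the gold label."""
    hits = sum(leading_token(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

outputs = ["[Retrieval]...", "[No Retrieval]Paris.", "Madrid."]
gold = ["Retrieval", "No Retrieval", "No Retrieval"]
print(token_accuracy(outputs, gold))  # 2 of 3 correct
```

Under this metric, a model that never emits a reflection token (the failure mode without `modules_to_save`) scores near zero, since `leading_token` returns `None` for plain-text answers.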

Quantization

| Format | Size | MMLU | Self-RAG | Speed |
|---|---|---|---|---|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| Q4_K_M | 1.5 GB | 61.42% | 80.5% | 1.8x |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |

Limitations

  • Mathematical Reasoning: GSM8K performance drops 4.5% due to token interruption in multi-step calculations
  • English Only: Trained and evaluated exclusively on English data
  • Static Retrieval Decision: Binary decision at generation start; cannot adapt mid-response

Citation

```bibtex
@article{ponnada2026avalon,
  title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
  author={Ponnada, Akhil and Arvapalli, Naga Sri},
  journal={arXiv preprint},
  year={2026}
}
```

License

Apache 2.0

Authors

Akhil Ponnada, Naga Sri Arvapalli
