AVALON-2B

Adaptive Vision-Augmented Language ON-device

The First Sub-3B Self-Reflective Language Model


Paper | GitHub | GGUF Version

Model Description

AVALON-2B is the first sub-3B parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built upon Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval decision tokens without external retrieval infrastructure.

Key Innovations

  • First Sub-3B Self-RAG: Breaks the 7B parameter barrier for self-reflective capabilities
  • On-Device Ready: 1.5GB quantized (Q4_K_M) runs at 40+ tok/s on Apple M3
  • 82.5% Token Accuracy: Reliable generation of [Retrieval], [No Retrieval], and [Utility:X] tokens
  • No Catastrophic Forgetting: +0.41 MMLU points over the base model

Self-RAG Tokens

AVALON generates special reflection tokens to enable adaptive retrieval:

| Token | Purpose | Example Use |
|---|---|---|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |
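Because the reflection tokens are emitted inline with the answer text, downstream code needs to strip and interpret them. A minimal parsing sketch (the helper below is a hypothetical example, not part of the model release):

```python
import re

# Matches the three reflection token families described above.
REFLECTION_RE = re.compile(r"\[(Retrieval|No Retrieval|Utility:[1-5])\]")

def parse_reflection(output: str):
    """Return (retrieval_decision, utility_score, answer_text) from a raw completion."""
    tokens = REFLECTION_RE.findall(output)
    decision = next((t for t in tokens if t in ("Retrieval", "No Retrieval")), None)
    utility = next((int(t.split(":")[1]) for t in tokens if t.startswith("Utility:")), None)
    answer = REFLECTION_RE.sub("", output).strip()
    return decision, utility, answer

decision, utility, answer = parse_reflection(
    "[No Retrieval]Paris is the capital of France.[Utility:5]"
)
# decision == "No Retrieval", utility == 5, answer == "Paris is the capital of France."
```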

Benchmarks

| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|---|---|---|---|---|---|
| AVALON-2B | 1.88B | 62.04 | 64.14 | 42.75 | 82.5% |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |

vs Gemma 4 E2B (Head-to-Head)

| Metric | AVALON-2B | Gemma 4 E2B |
|---|---|---|
| Knowledge Accuracy | 100% | 75% |
| Tool Calling (XML) | 100% | 50% |
| Inference Speed | 25.7 tok/s | 11.2 tok/s |
| Self-Reflective Tokens | Yes | No |

On-Device Performance

Tested with Q4_K_M quantization (1.5GB):

| Device | Chip | Speed (tok/s) | Memory |
|---|---|---|---|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |

Usage

Recommended System Prompt

For best performance, use this system prompt:

```
You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it.
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")

SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Who won the 2024 US presidential election?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```
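When the model emits `[Retrieval]`, the application is responsible for fetching context and prompting again. A minimal routing sketch, with the model and retriever stubbed out as callables (this wiring is an assumption for illustration, not part of the model release):

```python
def answer_with_self_rag(question, generate_fn, retrieve_fn):
    """generate_fn(prompt) -> model text; retrieve_fn(query) -> context string."""
    draft = generate_fn(question)
    if draft.startswith("[Retrieval]"):
        # Model asked for external knowledge: fetch context and generate again.
        context = retrieve_fn(question)
        return generate_fn(f"Context: {context}\n\nQuestion: {question}")
    return draft

# Stubbed example: a fake model that requests retrieval for "news" questions.
def fake_generate(prompt):
    if "Context:" in prompt:
        return "[No Retrieval]Based on the context, ...[Utility:4]"
    if "news" in prompt:
        return "[Retrieval]I need current information.[Utility:3]"
    return "[No Retrieval]Tokyo.[Utility:5]"

print(answer_with_self_rag("What is in the news today?", fake_generate, lambda q: "headline text"))
# Output: [No Retrieval]Based on the context, ...[Utility:4]
```

In a real deployment `generate_fn` would wrap the `model.generate` call above and `retrieve_fn` would call a search or vector-database backend.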

Ollama (Recommended for Local Use)

```bash
# Download GGUF version
ollama pull nuroai/avalon-2b

# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```

llama.cpp

```bash
# Download Q4_K_M GGUF (1.5GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf

# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" -n 128
```

Training Details

Architecture

  • Base Model: Qwen 3.5 2B (18 GDN + 6 Softmax attention layers)
  • Parameters: 1.88B total
  • Context Length: 32K tokens
  • Vocabulary: 248K tokens (including Self-RAG special tokens)

Training Configuration

  • Method: LoRA with modules_to_save=["embed_tokens", "lm_head"]
  • Data: 201K synthetic samples (80% Self-RAG, 20% general)
  • Hardware: 8x NVIDIA A100 80GB
  • Duration: ~6 hours
  • Learning Rate: 2e-5
  • Epochs: 2

Critical Insight

The key to successful Self-RAG at sub-3B scale is including embedding layers in training:

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # CRITICAL for Self-RAG tokens
    lora_dropout=0.05,
)
```

Without modules_to_save, Self-RAG token accuracy drops from 82.5% to 12%.
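The card does not spell out how token accuracy is measured; a plausible sketch (our assumption, not the authors' published protocol) is exact-match of the leading retrieval decision against a gold label per example:

```python
import re

def leading_token(text):
    """Extract the retrieval decision token at the start of a completion, if any."""
    m = re.match(r"\[(Retrieval|No Retrieval)\]", text)
    return m.group(1) if m else None

def token_accuracy(outputs, gold):
    """Fraction of completions whose leading reflection token matches the gold label."""
    hits = sum(leading_token(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

outputs = ["[Retrieval]...", "[No Retrieval]Paris.", "Madrid."]
gold = ["Retrieval", "No Retrieval", "No Retrieval"]
print(token_accuracy(outputs, gold))  # 2 of 3 correct
```

Under this metric, a model that never emits a reflection token (the failure mode without `modules_to_save`) scores near zero, since `leading_token` returns `None` for plain-text answers.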

Quantization

| Format | Size | MMLU | Self-RAG | Speed |
|---|---|---|---|---|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| Q4_K_M | 1.5 GB | 61.42% | 80.5% | 1.8x |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |

Limitations

  • Mathematical Reasoning: GSM8K performance drops 4.5% due to token interruption in multi-step calculations
  • English Only: Trained and evaluated exclusively on English data
  • Static Retrieval Decision: Binary decision at generation start; cannot adapt mid-response

Citation

```bibtex
@article{ponnada2026avalon,
  title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
  author={Ponnada, Akhil and Arvapalli, Naga Sri},
  journal={arXiv preprint},
  year={2026}
}
```

License

Apache 2.0

Authors

Akhil Ponnada, Naga Sri Arvapalli
