---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- self-rag
- retrieval-augmented-generation
- qwen
- text-generation
- on-device
- edge-ai
- reflection-tokens
base_model: Qwen/Qwen3.5-2B
datasets:
- custom
pipeline_tag: text-generation
model-index:
- name: Avalon-2B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU
type: cais/mmlu
metrics:
- name: Accuracy
type: accuracy
value: 62.04
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag
type: Rowan/hellaswag
metrics:
- name: Accuracy (Normalized)
type: accuracy
value: 64.14
- task:
type: text-generation
name: Text Generation
dataset:
name: ARC-Challenge
type: allenai/ai2_arc
metrics:
- name: Accuracy (Normalized)
type: accuracy
value: 42.75
---
# AVALON-2B
### Adaptive Vision-Augmented Language ON-device
**The First Sub-3B Self-Reflective Language Model**
[Paper](https://github.com/Nuro-Labs/avalon-2b) | [GitHub](https://github.com/Nuro-Labs/avalon-2b) | [GGUF Version](https://huggingface.co/nuroai/Avalon-2B-GGUF)
## Model Description
**AVALON-2B** is the first sub-3B parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built upon Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval decision tokens without external retrieval infrastructure.
### Key Innovations
- **First Sub-3B Self-RAG**: Breaks the 7B parameter barrier for self-reflective capabilities
- **On-Device Ready**: the 1.5 GB Q4_K_M quantization runs at 40+ tok/s on an Apple M3
- **82.5% Token Accuracy**: Reliable generation of `[Retrieval]`, `[No Retrieval]`, and `[Utility:X]` tokens
- **No Catastrophic Forgetting**: +0.41% MMLU improvement over base model
### Self-RAG Tokens
AVALON generates special reflection tokens to enable adaptive retrieval:
| Token | Purpose | Example Use |
|-------|---------|-------------|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |
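The reflection tokens above can be stripped out of raw model output with a small post-processing helper. A minimal sketch; the token strings come from the table above, while the function name and return shape are illustrative:

```python
import re

# Reflection tokens emitted by AVALON (see the table above).
RETRIEVAL_RE = re.compile(r"\[(Retrieval|No Retrieval)\]")
UTILITY_RE = re.compile(r"\[Utility:([1-5])\]")

def parse_reflection(text: str) -> dict:
    """Extract the retrieval decision, utility score, and cleaned answer text."""
    retrieval = RETRIEVAL_RE.search(text)
    utility = UTILITY_RE.search(text)
    answer = UTILITY_RE.sub("", RETRIEVAL_RE.sub("", text)).strip()
    return {
        "needs_retrieval": retrieval.group(1) == "Retrieval" if retrieval else None,
        "utility": int(utility.group(1)) if utility else None,
        "answer": answer,
    }

print(parse_reflection("[No Retrieval]Paris is the capital of France.[Utility:5]"))
# {'needs_retrieval': False, 'utility': 5, 'answer': 'Paris is the capital of France.'}
```

A downstream application can branch on `needs_retrieval` to decide whether to call a retriever before showing the answer.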
## Benchmarks
| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|-------|--------|------|-----------|-------|----------|
| **AVALON-2B** | 1.88B | **62.04** | **64.14** | 42.75 | **82.5%** |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |
### vs Gemma 4 E2B (Head-to-Head)
| Metric | AVALON-2B | Gemma 4 E2B |
|--------|-----------|-------------|
| Knowledge Accuracy | **100%** | 75% |
| Tool Calling (XML) | **100%** | 50% |
| Inference Speed | **25.7 tok/s** | 11.2 tok/s |
| Self-Reflective Tokens | **Yes** | No |
## On-Device Performance
Tested with Q4_K_M quantization (1.5GB):
| Device | Chip | Speed (tok/s) | Memory |
|--------|------|---------------|--------|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |
## Usage
### Recommended System Prompt
For best performance, use this system prompt:
```
You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it.
```
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")
SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant. Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X] where X is a 1-5 rating of response quality
Be concise and accurate. If you're uncertain, acknowledge it."""
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": "Who won the 2024 US presidential election?"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```
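The reflection tokens make an adaptive-retrieval loop straightforward to wire up around the `generate` call above. A hedged sketch: `run_model` stands in for the transformers pipeline shown earlier, and `web_search` is a hypothetical stub for whatever retriever you use; neither is part of this repository.

```python
# Sketch of an adaptive-retrieval loop driven by AVALON's reflection tokens.
# run_model is a deterministic stand-in for model.generate(); web_search is a
# placeholder for a real retriever. Both are illustrative, not part of the model.

def run_model(prompt: str, context: str = "") -> str:
    # Stand-in: a real implementation would call model.generate() as shown above.
    if "news" in prompt:
        return "[Retrieval]I need current information to answer this.[Utility:3]"
    return "[No Retrieval]The capital of Japan is Tokyo.[Utility:5]"

def web_search(query: str) -> str:
    return "stub search results for: " + query  # replace with a real retriever

def answer(prompt: str) -> str:
    draft = run_model(prompt)
    if draft.startswith("[Retrieval]"):
        # The model asked for external knowledge: retrieve, then regenerate
        # with the retrieved passages prepended as context.
        draft = run_model(prompt, context=web_search(prompt))
    return draft

print(answer("What is the capital of Japan?"))
# [No Retrieval]The capital of Japan is Tokyo.[Utility:5]
```

Because the retrieval decision is emitted at the start of generation (see Limitations), the loop only needs to inspect the first token of the draft before deciding whether to retrieve.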
### Ollama (Recommended for Local Use)
```bash
# Download GGUF version
ollama pull nuroai/avalon-2b
# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```
### llama.cpp
```bash
# Download Q4_K_M GGUF (1.5GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf
# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" -n 128
```
## Training Details
### Architecture
- **Base Model**: Qwen 3.5 2B (18 GDN + 6 Softmax attention layers)
- **Parameters**: 1.88B total
- **Context Length**: 32K tokens
- **Vocabulary**: 248K tokens (including Self-RAG special tokens)
### Training Configuration
- **Method**: LoRA with `modules_to_save=["embed_tokens", "lm_head"]`
- **Data**: 201K synthetic samples (80% Self-RAG, 20% general)
- **Hardware**: 8x NVIDIA A100 80GB
- **Duration**: ~6 hours
- **Learning Rate**: 2e-5
- **Epochs**: 2
### Critical Insight
The key to successful Self-RAG at sub-3B scale is including embedding layers in training:
```python
peft_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
modules_to_save=["embed_tokens", "lm_head"], # CRITICAL for Self-RAG tokens
lora_dropout=0.05,
)
```
Without `modules_to_save`, Self-RAG token accuracy drops from 82.5% to 12%.
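For `modules_to_save` to matter, the reflection tokens must first exist in the vocabulary. A hedged setup sketch, assuming the tokens are registered as additional special tokens before fine-tuning; the token strings come from this card, while the exact base-model handling is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-2B")

# Register the Self-RAG reflection tokens so each maps to a dedicated embedding.
reflection_tokens = ["[Retrieval]", "[No Retrieval]"] + [f"[Utility:{i}]" for i in range(1, 6)]
tokenizer.add_special_tokens({"additional_special_tokens": reflection_tokens})
model.resize_token_embeddings(len(tokenizer))

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # trains the new token embeddings
    lora_dropout=0.05,
)
model = get_peft_model(model, peft_config)
```

Without the `resize_token_embeddings` call, the new special tokens would map outside the embedding matrix; without `modules_to_save`, their freshly initialized embeddings would stay frozen, which is consistent with the accuracy collapse noted below.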
## Quantization
| Format | Size | MMLU | Self-RAG | Speed |
|--------|------|------|----------|-------|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| **Q4_K_M** | **1.5 GB** | **61.42%** | **80.5%** | **1.8x** |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |
## Limitations
- **Mathematical Reasoning**: GSM8K performance drops by 4.5% because reflection tokens interrupt multi-step calculations
- **English Only**: Trained and evaluated exclusively on English data
- **Static Retrieval Decision**: Binary decision at generation start; cannot adapt mid-response
## Citation
```bibtex
@article{ponnada2026avalon,
title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
author={Ponnada, Akhil and Arvapalli, Naga Sri},
journal={arXiv preprint},
year={2026}
}
```
## License
Apache 2.0
## Authors
- **Akhil Ponnada** - akhil@nuroailabs.com
- **Naga Sri Arvapalli** - nagasri3007@gmail.com
## Contact
- **Organization**: [Nuro AI Labs Limited](https://nuro.one)
- **GitHub**: [Nuro-Labs/avalon-2b](https://github.com/Nuro-Labs/avalon-2b)