---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- self-rag
- retrieval-augmented-generation
- qwen
- text-generation
- on-device
- edge-ai
- reflection-tokens
base_model: Qwen/Qwen3.5-2B
datasets:
- custom
pipeline_tag: text-generation
model-index:
- name: Avalon-2B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: cais/mmlu
    metrics:
    - name: Accuracy
      type: accuracy
      value: 62.04
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag
      type: Rowan/hellaswag
    metrics:
    - name: Accuracy (Normalized)
      type: accuracy
      value: 64.14
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: ARC-Challenge
      type: allenai/ai2_arc
    metrics:
    - name: Accuracy (Normalized)
      type: accuracy
      value: 42.75
---
# AVALON-2B

### Adaptive Vision-Augmented Language ON-device

**The First Sub-3B Self-Reflective Language Model**

[![Model Size](https://img.shields.io/badge/Parameters-1.88B-blue)](https://huggingface.co/nuroai/Avalon-2B)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://opensource.org/licenses/Apache-2.0)
[![Self-RAG](https://img.shields.io/badge/Self--RAG-82.5%25-orange)](https://huggingface.co/nuroai/Avalon-2B)
[![MMLU](https://img.shields.io/badge/MMLU-62.04%25-purple)](https://huggingface.co/nuroai/Avalon-2B)

[Paper](https://github.com/Nuro-Labs/avalon-2b) | [GitHub](https://github.com/Nuro-Labs/avalon-2b) | [GGUF Version](https://huggingface.co/nuroai/Avalon-2B-GGUF)
## Model Description

**AVALON-2B** is the first sub-3B-parameter language model to implement Self-Reflective Retrieval-Augmented Generation (Self-RAG) with learned reflection tokens. Built on Qwen 3.5 2B, AVALON introduces a novel training pipeline that teaches the model to generate retrieval-decision tokens without requiring external retrieval infrastructure.

### Key Innovations

- **First Sub-3B Self-RAG**: Breaks the 7B-parameter barrier for self-reflective capabilities
- **On-Device Ready**: 1.5 GB quantized (Q4_K_M), runs at 40+ tok/s on Apple M3
- **82.5% Token Accuracy**: Reliable generation of `[Retrieval]`, `[No Retrieval]`, and `[Utility:X]` tokens
- **No Catastrophic Forgetting**: +0.41% MMLU improvement over the base model

### Self-RAG Tokens

AVALON generates special reflection tokens to enable adaptive retrieval:

| Token | Purpose | Example Use |
|-------|---------|-------------|
| `[Retrieval]` | External knowledge needed | "What happened in the news today?" |
| `[No Retrieval]` | Parametric knowledge sufficient | "What is the capital of France?" |
| `[Utility:1-5]` | Response quality rating | End of every response |

## Benchmarks

| Model | Params | MMLU | HellaSwag | ARC-C | Self-RAG |
|-------|--------|------|-----------|-------|----------|
| **AVALON-2B** | 1.88B | **62.04** | **64.14** | 42.75 | **82.5%** |
| Qwen 3.5 2B (base) | 1.88B | 61.63 | 62.15 | 41.64 | 0% |
| Gemma 4 E2B | 2.3B | 58.0 | 68.0 | 48.0 | 0% |
| SmolLM3 3B | 3.0B | 55.0 | 70.0 | 50.0 | 0% |

### vs Gemma 4 E2B (Head-to-Head)

| Metric | AVALON-2B | Gemma 4 E2B |
|--------|-----------|-------------|
| Knowledge Accuracy | **100%** | 75% |
| Tool Calling (XML) | **100%** | 50% |
| Inference Speed | **25.7 tok/s** | 11.2 tok/s |
| Self-Reflective Tokens | **Yes** | No |

## On-Device Performance

Tested with Q4_K_M quantization (1.5 GB):

| Device | Chip | Speed (tok/s) | Memory |
|--------|------|---------------|--------|
| MacBook Air | M3 | 40.2 | 2.1 GB |
| MacBook Pro | M3 Pro | 52.4 | 2.1 GB |
| Mac Studio | M2 Ultra | 78.6 | 2.0 GB |
| iPhone 15 Pro | A17 Pro | 12.4 | 1.8 GB |

## Usage

### Recommended System Prompt

For best performance, use this system prompt:

```
You are AVALON, a self-reflective AI assistant.

Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X], where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it.
```

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("nuroai/Avalon-2B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("nuroai/Avalon-2B")

SYSTEM_PROMPT = """You are AVALON, a self-reflective AI assistant.

Before answering any question:
1. Determine if you need external information by generating [Retrieval] or [No Retrieval]
2. For time-sensitive questions (news, current events, prices), always use [Retrieval]
3. For factual knowledge (capitals, math, definitions), use [No Retrieval]
4. End every response with [Utility:X], where X is a 1-5 rating of response quality

Be concise and accurate. If you're uncertain, acknowledge it."""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Who won the 2024 US presidential election?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# Output: [Retrieval]I need current information to answer this question...[Utility:4]
```

### Ollama (Recommended for Local Use)

```bash
# Download the GGUF version
ollama pull nuroai/avalon-2b

# Run
ollama run avalon-2b "What is quantum computing?"
# Output: [No Retrieval]Quantum computing is a type of computation that...[Utility:5]
```

### llama.cpp

```bash
# Download the Q4_K_M GGUF (1.5 GB)
wget https://huggingface.co/nuroai/Avalon-2B-GGUF/resolve/main/avalon-2b-q4km.gguf

# Run inference
./llama-cli -m avalon-2b-q4km.gguf -p "What is the capital of Japan?" \
  -n 128
```

## Training Details

### Architecture

- **Base Model**: Qwen 3.5 2B (18 GDN + 6 softmax-attention layers)
- **Parameters**: 1.88B total
- **Context Length**: 32K tokens
- **Vocabulary**: 248K tokens (including the Self-RAG special tokens)

### Training Configuration

- **Method**: LoRA with `modules_to_save=["embed_tokens", "lm_head"]`
- **Data**: 201K synthetic samples (80% Self-RAG, 20% general)
- **Hardware**: 8x NVIDIA A100 80GB
- **Duration**: ~6 hours
- **Learning Rate**: 2e-5
- **Epochs**: 2

### Critical Insight

The key to successful Self-RAG at sub-3B scale is including the embedding layers in training:

```python
peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # CRITICAL for Self-RAG tokens
    lora_dropout=0.05,
)
```

Without `modules_to_save`, Self-RAG token accuracy drops from 82.5% to 12%.

## Quantization

| Format | Size | MMLU | Self-RAG | Speed |
|--------|------|------|----------|-------|
| BF16 | 4.5 GB | 62.04% | 82.5% | 1.0x |
| Q8_0 | 2.3 GB | 61.95% | 82.1% | 1.3x |
| **Q4_K_M** | **1.5 GB** | **61.42%** | **80.5%** | **1.8x** |
| Q4_K_S | 1.3 GB | 60.91% | 79.2% | 1.9x |

## Limitations

- **Mathematical Reasoning**: GSM8K performance drops by 4.5%, due to reflection tokens interrupting multi-step calculations
- **English Only**: Trained and evaluated exclusively on English data
- **Static Retrieval Decision**: The retrieval decision is made once at the start of generation; the model cannot adapt mid-response

## Citation

```bibtex
@article{ponnada2026avalon,
  title={AVALON-2B: The First Sub-3B Self-Reflective Language Model for On-Device Deployment},
  author={Ponnada, Akhil and Arvapalli, Naga Sri},
  journal={arXiv preprint},
  year={2026}
}
```

## License

Apache 2.0

## Authors

- **Akhil Ponnada** - akhil@nuroailabs.com
- **Naga Sri Arvapalli** - nagasri3007@gmail.com

## Contact

- **Organization**: [Nuro AI Labs Limited](https://nuro.one)
- **GitHub**: [Nuro-Labs/avalon-2b](https://github.com/Nuro-Labs/avalon-2b)
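
## Example: Parsing Reflection Tokens

The `[Retrieval]`, `[No Retrieval]`, and `[Utility:X]` tokens in the model's output are only useful if the host application can extract them. Below is a minimal sketch of such a parser; the token strings follow this card's Self-RAG token table, but the `parse_reflection` helper itself is an illustration, not part of the released API:

```python
import re

# Reflection tokens emitted by the model (see the Self-RAG Tokens table).
RETRIEVAL_RE = re.compile(r"^\[(Retrieval|No Retrieval)\]")
UTILITY_RE = re.compile(r"\[Utility:([1-5])\]\s*$")

def parse_reflection(output: str) -> dict:
    """Split a raw completion into retrieval decision, answer text, and utility score."""
    decision_match = RETRIEVAL_RE.match(output)
    decision = decision_match.group(1) if decision_match else None
    utility_match = UTILITY_RE.search(output)
    utility = int(utility_match.group(1)) if utility_match else None
    # Strip both tokens to recover the plain answer text.
    body = RETRIEVAL_RE.sub("", output)
    body = UTILITY_RE.sub("", body).strip()
    return {"decision": decision, "utility": utility, "text": body}

result = parse_reflection("[No Retrieval]Paris is the capital of France.[Utility:5]")
print(result["decision"], result["utility"])  # No Retrieval 5
```

If either token is missing (e.g. generation was truncated before `[Utility:X]`), the corresponding field comes back as `None`, so callers can fall back gracefully.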
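
## Example: Acting on the Retrieval Decision

A `[Retrieval]` token is a request, not an action: the model card prescribes no retrieval backend, so the host application must fetch context and re-prompt. A minimal routing sketch, where `generate(prompt)` and `search(query)` are hypothetical placeholders for your own model call and retriever (neither is part of the AVALON release):

```python
def answer(question, generate, search):
    """Route a question through AVALON's retrieval decision.

    `generate(prompt)` and `search(query)` are placeholders for your own
    model call and retriever; neither is shipped with the model.
    """
    first_pass = generate(question)
    if first_pass.startswith("[Retrieval]"):
        # The model asked for external knowledge: fetch context and re-prompt.
        context = search(question)
        return generate(f"Context: {context}\n\nQuestion: {question}")
    return first_pass

# Stub demo: a "model" that always answers from parametric knowledge.
demo = answer("What is 2+2?",
              generate=lambda p: "[No Retrieval]4[Utility:5]",
              search=lambda q: "")
print(demo)  # [No Retrieval]4[Utility:5]
```

Note that, per the Limitations section, the decision is static: it is made once at the start of generation, so a single routing step like this is all the model supports.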