# Qwen3-0.6B-Reasoning-Opus
This is a fine-tuned version of Qwen3-0.6B optimized for multi-step reasoning. It was trained using QLoRA exclusively on a filtered dataset of reasoning traces distilled from Claude 4.6 Opus.
The goal of this project was to induce "System 2" (deliberate) thinking in a sub-1B parameter model and study the resulting "Alignment Tax"—specifically, how behavioral cloning of reasoning traces affects pre-trained factual knowledge.
## Model Details
- Developed by: Shreyansh Pathak
- Institution: Dayananda Sagar College of Engineering (DSCE), Bangalore
- Model type: Causal Language Model with Chain-of-Thought (CoT) capabilities
- Base Model: Qwen/Qwen3-0.6B
- Language(s): English
- Fine-tuning Technique: QLoRA (Unsloth)
- Rank (r): 32
- Learning Rate: 5e-5
## Performance & Evaluation
This model was evaluated on both reasoning capabilities (GSM8K) and factual knowledge retention (ARC-Challenge) to measure the impact of training exclusively on reasoning data.
| Benchmark | Base Qwen3-0.6B | Qwen3-0.6B-Reasoning-Opus | Impact |
|---|---|---|---|
| GSM8K Accuracy (n=50) | 26.0% | 32.0% | +6.0 pts (absolute gain) |
| ARC-Challenge (Factual) | Baseline | Degraded | -24.31 pts (absolute loss) |
## Key Findings & The "Alignment Tax"
- Reasoning Activation: The fine-tuned model successfully triggers a `<think>` block for complex queries, effectively decomposing multi-step arithmetic problems that the base model failed on.
- Catastrophic Forgetting: Training exclusively on open-ended Opus reasoning data severely degraded the model's fundamental knowledge representation, resulting in a 24.31-point drop on the ARC-Challenge benchmark.
- Mode Collapse (Hallucination): The model successfully learned the *structure* of reasoning (`<think>...` followed by `**Answer: B**`) but frequently fills the reasoning traces with overconfident, factually incorrect statements.
- Degenerate Loops: Without a strict repetition penalty during inference, the heavy behavioral cloning leaves the model prone to cyclical logic loops.
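The degenerate loops described above can be detected cheaply by counting repeated n-grams in the output. The function below is a rough heuristic sketch (whitespace tokenization rather than the model's actual tokenizer; the name and thresholds are illustrative, not part of this project's code):

```python
def has_degenerate_loop(text: str, ngram: int = 8, threshold: int = 3) -> bool:
    """Return True if any n-gram of whitespace tokens occurs `threshold`
    or more times -- a cheap proxy for cyclical reasoning loops."""
    tokens = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i : i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= threshold:
            return True
    return False
```

A check like this can gate generations in an evaluation harness, or serve as a quick sanity test when tuning the `repetition_penalty` value recommended below.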
## Training Procedure
- Hardware: NVIDIA L4 / A100 (via Modal/Unsloth)
- Method: Supervised Fine-Tuning (SFT)
- Learning Rate: 5e-5
- Dataset: `nohurry/Opus-4.6-Reasoning-3000x-filtered` (100% split)
- Steps: ~500
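The hyperparameters above can be gathered into a single config for reproducibility. The dict below holds only values stated on this card; the commented lines sketch how they would plug into Unsloth and a TRL-style trainer (API names assumed from the Unsloth docs, not verified against this training run):

```python
# Hyperparameters reported on this model card, collected in one place.
sft_config = {
    "base_model": "Qwen/Qwen3-0.6B",
    "dataset": "nohurry/Opus-4.6-Reasoning-3000x-filtered",
    "lora_r": 32,
    "learning_rate": 5e-5,
    "max_steps": 500,      # "~500" on the card
    "load_in_4bit": True,  # QLoRA: 4-bit base weights + LoRA adapters
}

# Sketch of how these would be passed to Unsloth + TRL (assumed API, untested):
#
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name=sft_config["base_model"],
#     load_in_4bit=sft_config["load_in_4bit"],
# )
# model = FastLanguageModel.get_peft_model(model, r=sft_config["lora_r"])
# ...then hand model/tokenizer to an SFT trainer with the learning rate
# and step count above.
```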
## Usage & Recommendations
This model is published primarily for research purposes to demonstrate the catastrophic forgetting effects of pure-SFT reasoning distillation on small models. For production use cases or factual stability, we strongly recommend using our data-mixed SFT model or our GRPO-optimized checkpoints instead.
When running inference, a repetition penalty is required to prevent loop degeneration.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shreyansh327/Qwen3-0.6B-Reasoning-Opus"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,          # sampling must be enabled for temperature/top_p to apply
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15  # Required to prevent degenerate loops
)
# decode the first (and only) sequence in the batch
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
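Given the system prompt above, generations should contain `<think>` and `<answer>` blocks. A small parser (a sketch assuming well-formed tags; the function name is illustrative) separates the reasoning trace from the final answer:

```python
import re

def split_reasoning(generation: str) -> tuple[str, str]:
    """Split a generation into (reasoning trace, final answer).

    Assumes the <think>/<answer> tag format requested in the system
    prompt; returns an empty string for any block that is missing.
    """
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )
```

Because the model sometimes omits or malforms the closing tags (see the mode-collapse note above), treat an empty return value as a signal to retry or fall back.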