# Qwen3-0.6B-Reasoning-Opus
This is a fine-tuned version of Qwen3-0.6B optimized for multi-step reasoning. It was trained using QLoRA exclusively on a filtered dataset of reasoning traces distilled from Claude 4.6 Opus.
The goal of this project was to induce "System 2" (deliberate) thinking in a sub-1B parameter model and study the resulting "Alignment Tax"—specifically, how behavioral cloning of reasoning traces affects pre-trained factual knowledge.
## Model Details
- Developed by: Shreyansh Pathak
- Institution: Dayananda Sagar College of Engineering (DSCE), Bangalore
- Model type: Causal Language Model with Chain-of-Thought (CoT) capabilities
- Base Model: Qwen/Qwen3-0.6B
- Language(s): English
- Fine-tuning Technique: QLoRA (Unsloth)
- Rank (r): 32
- Learning Rate: 5e-5
## Performance & Evaluation
This model was evaluated on both reasoning capabilities (GSM8K) and factual knowledge retention (ARC-Challenge) to measure the impact of training exclusively on reasoning data.
| Benchmark | Base Qwen3-0.6B | Qwen3-0.6B-Reasoning-Opus | Impact |
|---|---|---|---|
| GSM8K Accuracy (n=50) | 26.0% | 32.0% | +6.0 pts (absolute gain) |
| ARC-Challenge (Factual) | Baseline | Degraded | -24.31 pts (absolute loss) |
## Key Findings & The "Alignment Tax"
- Reasoning Activation: The fine-tuned model successfully triggers a `<think>` block for complex queries, effectively decomposing multi-step arithmetic problems that the base model failed on.
- Catastrophic Forgetting: Training exclusively on open-ended Opus reasoning data severely degraded the model's fundamental knowledge representation, resulting in a 24.31-point drop on the ARC-Challenge benchmark.
- Mode Collapse (Hallucination): The model successfully learned the *structure* of reasoning (`<think>...` followed by `**Answer: B**`) but frequently fills the reasoning traces with overconfident, factually incorrect statements.
- Degenerate Loops: Without a strict repetition penalty during inference, the heavy behavioral cloning leaves the model prone to cyclical logic loops.
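The degenerate loops described above can be detected cheaply by counting repeated n-grams in the output. The function below is a rough heuristic sketch (whitespace tokenization rather than the model's actual tokenizer; the name and thresholds are illustrative, not part of this project's code):

```python
def has_degenerate_loop(text: str, ngram: int = 8, threshold: int = 3) -> bool:
    """Return True if any n-gram of whitespace tokens occurs `threshold`
    or more times -- a cheap proxy for cyclical reasoning loops."""
    tokens = text.split()
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i : i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] >= threshold:
            return True
    return False
```

A check like this can gate generations in an evaluation harness, or serve as a quick sanity test when tuning the `repetition_penalty` value recommended below.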
## Training Procedure
- Hardware: NVIDIA L4 / A100 (via Modal/Unsloth)
- Method: Supervised Fine-Tuning (SFT)
- Learning Rate: 5e-5
- Dataset: `nohurry/Opus-4.6-Reasoning-3000x-filtered` (100% split)
- Steps: ~500
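The hyperparameters above can be gathered into a single config for reproducibility. The dict below holds only values stated on this card; the commented lines sketch how they would plug into Unsloth and a TRL-style trainer (API names assumed from the Unsloth docs, not verified against this training run):

```python
# Hyperparameters reported on this model card, collected in one place.
sft_config = {
    "base_model": "Qwen/Qwen3-0.6B",
    "dataset": "nohurry/Opus-4.6-Reasoning-3000x-filtered",
    "lora_r": 32,
    "learning_rate": 5e-5,
    "max_steps": 500,      # "~500" on the card
    "load_in_4bit": True,  # QLoRA: 4-bit base weights + LoRA adapters
}

# Sketch of how these would be passed to Unsloth + TRL (assumed API, untested):
#
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name=sft_config["base_model"],
#     load_in_4bit=sft_config["load_in_4bit"],
# )
# model = FastLanguageModel.get_peft_model(model, r=sft_config["lora_r"])
# ...then hand model/tokenizer to an SFT trainer with the learning rate
# and step count above.
```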
## Usage & Recommendations
This model is published primarily for research purposes to demonstrate the catastrophic forgetting effects of pure-SFT reasoning distillation on small models. For production use cases or factual stability, we strongly recommend using our data-mixed SFT model or our GRPO-optimized checkpoints instead.
When running inference, a repetition penalty is required to prevent loop degeneration.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shreyansh327/Qwen3-0.6B-Reasoning-Opus"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,          # sampling must be enabled for temperature/top_p to apply
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15  # Required to prevent degenerate loops
)
# decode the first (and only) sequence in the batch
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
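Given the system prompt above, generations should contain `<think>` and `<answer>` blocks. A small parser (a sketch assuming well-formed tags; the function name is illustrative) separates the reasoning trace from the final answer:

```python
import re

def split_reasoning(generation: str) -> tuple[str, str]:
    """Split a generation into (reasoning trace, final answer).

    Assumes the <think>/<answer> tag format requested in the system
    prompt; returns an empty string for any block that is missing.
    """
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else "",
    )
```

Because the model sometimes omits or malforms the closing tags (see the mode-collapse note above), treat an empty return value as a signal to retry or fall back.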