Qwen3-0.6B-Reasoning-Opus

This is a fine-tuned version of Qwen3-0.6B optimized for multi-step reasoning. It was trained using QLoRA exclusively on a filtered dataset of reasoning traces distilled from Claude 4.6 Opus.

The goal of this project was to induce "System 2" (deliberate) thinking in a sub-1B parameter model and study the resulting "Alignment Tax"—specifically, how behavioral cloning of reasoning traces affects pre-trained factual knowledge.

Model Details

  • Developed by: Shreyansh Pathak
  • Institution: Dayananda Sagar College of Engineering (DSCE), Bangalore
  • Model type: Causal Language Model with Chain-of-Thought (CoT) capabilities
  • Base Model: Qwen/Qwen3-0.6B
  • Language(s): English
  • Fine-tuning Technique: QLoRA (Unsloth)
  • Rank (r): 32
  • Learning Rate: 5e-5

Performance & Evaluation

This model was evaluated on both reasoning capabilities (GSM8K) and factual knowledge retention (ARC-Challenge) to measure the impact of training exclusively on reasoning data.

| Benchmark | Base Qwen3-0.6B | Qwen3-0.6B-Reasoning-Opus | Impact |
|---|---|---|---|
| GSM8K Accuracy (n=50) | 26.0% | 32.0% | +6.0% absolute gain |
| ARC-Challenge (Factual) | Baseline | Degraded | -24.31% absolute loss |

Key Findings & The "Alignment Tax"

  • Reasoning Activation: The fine-tuned model successfully triggers a <think> block for complex queries, effectively decomposing multi-step arithmetic that the base model failed on.
  • Catastrophic Forgetting: Training exclusively on open-ended Opus reasoning data severely degraded the model's stored factual knowledge, producing a 24.31-point absolute drop on the ARC-Challenge benchmark.
  • Mode Collapse (Hallucination): The model successfully learned the structure of reasoning (<think>...**Answer: B**) but frequently filled the reasoning traces with overconfident, factually incorrect statements.
  • Degenerate Loops: Without a strict repetition penalty during inference, the model is prone to cyclical logic loops due to the heavy behavioral cloning.
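The degenerate loops mentioned above can be caught heuristically before falling back on a repetition penalty. A minimal sketch (this helper is illustrative, not part of the released code) that flags a completion whose tail immediately repeats itself:

```python
def has_degenerate_loop(text: str, min_window: int = 4, max_window: int = 20) -> bool:
    """Detect a whitespace-token tail that immediately repeats itself."""
    tokens = text.split()
    for w in range(min_window, min(max_window, len(tokens) // 2) + 1):
        # Compare the last w tokens against the w tokens before them.
        if tokens[-w:] == tokens[-2 * w:-w]:
            return True
    return False

looping = ("so the answer is 5 because the answer is 5 "
           "because the answer is 5 because")
print(has_degenerate_loop(looping))  # -> True
print(has_degenerate_loop("A train at 60 mph for 2.5 hours goes 150 miles."))  # -> False
```

In practice such a check can be used to truncate or retry a generation; the `repetition_penalty` recommended below addresses the same failure mode at sampling time.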

Training Procedure

  • Hardware: NVIDIA L4 / A100 (via Modal/Unsloth)
  • Method: Supervised Fine-Tuning (SFT)
  • Learning Rate: 5e-5
  • Dataset: nohurry/Opus-4.6-Reasoning-3000x-filtered (100% split)
  • Steps: ~500
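The card reports r=32, lr=5e-5, and ~500 steps; a roughly equivalent configuration expressed in plain peft/trl terms might look like the fragment below. `lora_alpha`, `target_modules`, and the batch-size settings are illustrative assumptions, not reported values, and the actual run used Unsloth's QLoRA path rather than this exact stack.

```python
from peft import LoraConfig
from trl import SFTConfig

# Reported: r=32, lr=5e-5, ~500 steps. Everything else here is an assumption.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,  # assumption: alpha = r is a common default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

train_config = SFTConfig(
    learning_rate=5e-5,
    max_steps=500,
    per_device_train_batch_size=2,   # assumption
    gradient_accumulation_steps=4,   # assumption
    bf16=True,
    output_dir="outputs",
)
```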

Usage & Recommendations

This model is published primarily for research purposes to demonstrate the catastrophic forgetting effects of pure-SFT reasoning distillation on small models. For production use cases or factual stability, we strongly recommend using our data-mixed SFT model or our GRPO-optimized checkpoints instead.

When running inference, a repetition penalty is required to prevent loop degeneration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shreyansh327/Qwen3-0.6B-Reasoning-Opus"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=1024,
    do_sample=True,          # temperature/top_p only take effect when sampling
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.15  # Required to prevent degenerate loops
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
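Once decoded, the completion can be split into its reasoning trace and final answer. A small sketch, assuming the model emits the `<think>`/`<answer>` tags it was trained on (it may not always do so):

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Separate the <think> trace from the <answer> text, if both are present."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else completion.strip(),
    )

demo = "<think>60 mph * 2.5 h = 150 miles</think><answer>150 miles</answer>"
print(split_reasoning(demo))  # -> ('60 mph * 2.5 h = 150 miles', '150 miles')
```

Falling back to the whole completion when no `<answer>` tag is found guards against the tag-omission failure mode noted under Key Findings.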