# Qwen3-1.7B-GRPO-Efficient-Reasoning

## Model Overview
Qwen3-1.7B-GRPO-Efficient-Reasoning is a fine-tuned version of the base Qwen3-1.7B model, optimized for efficient mathematical and logical reasoning.
This model was trained using Group Relative Policy Optimization (GRPO) with a custom reward function that includes a length penalty.
## Key Achievements
- Maintained Accuracy: Preserved the base model's strong reasoning capabilities (~85% on GSM8K).
- 22.2% Cost/Latency Reduction: Reduced the average `<think>` block generation length from 579.9 words down to 450.8 words.
- Zero Knowledge Degradation: Avoided the severe ARC-Challenge regressions typically seen when applying Supervised Fine-Tuning (SFT) to small reasoning models.
## Training Details

### Methodology
Unlike standard SFT approaches that rely on behavioral cloning (e.g., using datasets like Opus 4.6), this model was trained purely via Reinforcement Learning (RL) using the GRPO algorithm.
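GRPO replaces a learned value baseline with group statistics: for each prompt, a group of rollouts is sampled and each rollout's reward is normalized against the group's mean and standard deviation to form its advantage. A minimal sketch of that advantage computation (illustrative only; the actual training loop is not published in this card):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward against its group.

    Rollouts above the group mean get positive advantage (reinforced),
    rollouts below it get negative advantage (discouraged).
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically: no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed relative to sibling rollouts of the same prompt, a shorter correct answer outscores a longer correct one as soon as a length penalty enters the reward.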
### Reward Function
The reward function was designed to balance correctness with inference efficiency:
- Correctness Reward: Exact-match validation on the final `<answer>` block.
- Formatting Reward: Enforced strict adherence to the `<think>...</think>\n<answer>...</answer>` XML structure.
- Universal Length Penalty: A gentle, static negative reward applied to the total token count of the generation. Because this penalty applies to both correct and incorrect rollouts, the policy gradients pruned redundant cyclical logic and dead ends without triggering reward hacking (where the model outputs zero tokens to maximize score).
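The three components above can be sketched as a single scoring function. The coefficients below are hypothetical (the card does not publish the actual reward weights); only the structure mirrors the description:

```python
import re

# Hypothetical coefficients -- not the values used in training.
CORRECT_REWARD = 1.0
FORMAT_REWARD = 0.5
LENGTH_PENALTY_PER_TOKEN = 0.001  # gentle, static penalty

FORMAT_RE = re.compile(r"^<think>.*?</think>\n<answer>.*?</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, gold_answer: str, num_tokens: int) -> float:
    """Score one rollout: correctness + formatting - universal length penalty."""
    score = 0.0
    # Formatting reward: strict <think>...</think>\n<answer>...</answer> structure.
    if FORMAT_RE.match(completion.strip()):
        score += FORMAT_REWARD
    # Correctness reward: exact match on the final <answer> block.
    m = ANSWER_RE.search(completion)
    if m and m.group(1).strip() == gold_answer.strip():
        score += CORRECT_REWARD
    # Universal length penalty: applied to every rollout, correct or not,
    # so shorter traces always win ties within a group.
    score -= LENGTH_PENALTY_PER_TOKEN * num_tokens
    return score
```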
## Evaluation & Results

The model was evaluated against the base Qwen3-1.7B checkpoint to measure the precise impact of the GRPO training on both accuracy and verbosity.
| Metric | Base Qwen3-1.7B | GRPO Fine-Tuned (This Model) | Net Change |
|---|---|---|---|
| GSM8K Accuracy | 84.99% | 85.06% | +0.07% (Maintained) |
| Avg Thinking Length | 579.9 words | 450.8 words | -22.2% (Optimized) |
## How to Get Started

We recommend deploying with vLLM to maximize the efficiency gains from the shortened reasoning traces. As with the base Qwen3 models, do not use greedy decoding when thinking mode is enabled; use sampling as shown below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-1.7B-grpo-gsm8k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Recommended generation parameters for Qwen3 thinking mode (sampling, not greedy)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    min_p=0.0,
)

# Decode the first (and only) sequence in the batch
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```