Qwen3-1.7B-GRPO-Efficient-Reasoning

Model Overview

Qwen3-1.7B-GRPO-Efficient-Reasoning is a fine-tuned version of the base Qwen3-1.7B model, optimized specifically for highly efficient mathematical and logical reasoning.

This model was trained using Group Relative Policy Optimization (GRPO) with a custom reward function that includes a length penalty.

Key Achievements

  • Maintained Accuracy: Preserved the base model's strong reasoning capabilities (~85% on GSM8K).
  • 22.2% Cost/Latency Reduction: Reduced the average <think> block generation length from 579.9 words down to 450.8 words.
  • Zero Knowledge Degradation: Avoided the severe ARC-Challenge regressions typically seen when applying Supervised Fine-Tuning (SFT) to small reasoning models.

Training Details

Methodology

Unlike standard SFT approaches that rely on behavioral cloning (e.g., imitating reasoning traces from datasets like Opus 4.6), this model was trained purely via Reinforcement Learning (RL) using the GRPO algorithm.

Reward Function

The reward function was designed to balance correctness with inference efficiency:

  1. Correctness Reward: Exact-match validation on the final <answer> block.
  2. Formatting Reward: Enforced strict adherence to the <think>...</think>\n<answer>...</answer> XML structure.
  3. Universal Length Penalty: A gentle, static negative reward applied to the total token count of the generation. Because it is applied universally to both correct and incorrect rollouts, the policy gradients prune redundant cyclical logic and dead ends without triggering reward hacking (where the model would output zero tokens to maximize its score).
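The three reward components above can be sketched as a single scoring function. This is a minimal illustration, not the training code: the reward weights and the per-token penalty constant are assumptions, since the card does not publish the exact values.

```python
import re

# Illustrative reward weights -- the actual values used in training are not
# published, so these constants are assumptions.
CORRECT_REWARD = 1.0
FORMAT_REWARD = 0.2
LENGTH_PENALTY_PER_TOKEN = 0.0005  # "gentle, static" penalty

FORMAT_RE = re.compile(r"^<think>.*?</think>\n<answer>.*?</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, gold_answer: str, num_tokens: int) -> float:
    """Score one rollout: correctness + formatting - universal length penalty."""
    score = 0.0
    # 2. Formatting: strict <think>...</think>\n<answer>...</answer> structure.
    if FORMAT_RE.match(completion.strip()):
        score += FORMAT_REWARD
    # 1. Correctness: exact match on the final <answer> block.
    m = ANSWER_RE.search(completion)
    if m and m.group(1).strip() == gold_answer.strip():
        score += CORRECT_REWARD
    # 3. Length penalty, applied to every rollout (correct or not), so shorter
    # traces are always preferred, yet emitting nothing still forfeits the
    # much larger correctness and formatting rewards.
    score -= LENGTH_PENALTY_PER_TOKEN * num_tokens
    return score
```

Under these (assumed) weights, an empty generation scores near zero while a correct, well-formatted one scores near 1.2, which is why the static penalty does not incentivize degenerate zero-token outputs.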

Evaluation & Results

The model was evaluated against the base Qwen3-1.7B checkpoint to measure the precise impact of the GRPO training on both accuracy and verbosity.

| Metric | Base Qwen3-1.7B | GRPO Fine-Tuned (This Model) | Net Change |
| --- | --- | --- | --- |
| GSM8K Accuracy | 84.99% | 85.06% | +0.07% (Maintained) |
| Avg Thinking Length | 579.9 words | 450.8 words | -22.2% (Optimized) |
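The "Avg Thinking Length" metric can be reproduced with a small helper, assuming it counts whitespace-separated words inside the `<think>` block of each generation (the card reports words, not tokens):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def thinking_words(generation: str) -> int:
    """Count whitespace-separated words inside the <think> block (0 if absent)."""
    m = THINK_RE.search(generation)
    return len(m.group(1).split()) if m else 0

def avg_thinking_words(generations: list[str]) -> float:
    """Average thinking length over a set of generations."""
    counts = [thinking_words(g) for g in generations]
    return sum(counts) / len(counts) if counts else 0.0
```

Running this over the base and fine-tuned models' GSM8K rollouts would yield the 579.9 vs. 450.8 figures above.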

How to Get Started

We recommend using vLLM for deployment to maximize the efficiency gains of the shortened reasoning traces. As with the base Qwen3 model, do not use greedy decoding when thinking mode is enabled, as it can lead to performance degradation and endless repetition.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-1.7B-grpo-gsm8k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # optional; requires the flash-attn package
)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Recommended sampling parameters for Qwen3 thinking mode (greedy decoding is discouraged)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    min_p=0.0
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))