Qwen3-1.7B-GRPO-Efficient-Reasoning

Model Overview

Qwen3-1.7B-GRPO-Efficient-Reasoning is a fine-tuned version of the base Qwen3-1.7B model, optimized specifically for highly efficient mathematical and logical reasoning.

This model was trained using Group Relative Policy Optimization (GRPO) with a custom reward function that includes a length penalty.

Key Achievements

  • Maintained Accuracy: Preserved the base model's strong reasoning capabilities (~85% on GSM8K).
  • 22.2% Cost/Latency Reduction: Reduced the average <think> block generation length from 579.9 words down to 450.8 words.
  • Zero Knowledge Degradation: Avoided the severe ARC-Challenge regressions typically seen when applying Supervised Fine-Tuning (SFT) to small reasoning models.

Training Details

Methodology

Unlike standard SFT approaches that rely on behavioral cloning (e.g., imitating reasoning traces from datasets like Opus 4.6), this model was trained purely via Reinforcement Learning (RL) using the GRPO algorithm.

Reward Function

The reward function was designed to balance correctness with inference efficiency:

  1. Correctness Reward: Exact-match validation on the final <answer> block.
  2. Formatting Reward: Enforced strict adherence to the <think>...</think>\n<answer>...</answer> XML structure.
  3. Universal Length Penalty: A gentle, static negative reward applied to the total token count of the generation. Because it is applied universally to both correct and incorrect rollouts, the policy gradients prune redundant cyclical logic and dead ends without triggering reward hacking (where the model would output zero tokens to maximize its score).
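The three reward components above can be sketched as a single scoring function. This is a minimal illustration, not the training code: the reward weights and the per-token penalty constant are assumptions, since the card does not publish the exact values.

```python
import re

# Illustrative reward weights -- the actual values used in training are not
# published, so these constants are assumptions.
CORRECT_REWARD = 1.0
FORMAT_REWARD = 0.2
LENGTH_PENALTY_PER_TOKEN = 0.0005  # "gentle, static" penalty

FORMAT_RE = re.compile(r"^<think>.*?</think>\n<answer>.*?</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def reward(completion: str, gold_answer: str, num_tokens: int) -> float:
    """Score one rollout: correctness + formatting - universal length penalty."""
    score = 0.0
    # 2. Formatting: strict <think>...</think>\n<answer>...</answer> structure.
    if FORMAT_RE.match(completion.strip()):
        score += FORMAT_REWARD
    # 1. Correctness: exact match on the final <answer> block.
    m = ANSWER_RE.search(completion)
    if m and m.group(1).strip() == gold_answer.strip():
        score += CORRECT_REWARD
    # 3. Length penalty, applied to every rollout (correct or not), so shorter
    # traces are always preferred, yet emitting nothing still forfeits the
    # much larger correctness and formatting rewards.
    score -= LENGTH_PENALTY_PER_TOKEN * num_tokens
    return score
```

Under these (assumed) weights, an empty generation scores near zero while a correct, well-formatted one scores near 1.2, which is why the static penalty does not incentivize degenerate zero-token outputs.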

Evaluation & Results

The model was evaluated against the base Qwen3-1.7B checkpoint to measure the precise impact of the GRPO training on both accuracy and verbosity.

| Metric | Base Qwen3-1.7B | GRPO Fine-Tuned (This Model) | Net Change |
| --- | --- | --- | --- |
| GSM8K Accuracy | 84.99% | 85.06% | +0.07% (Maintained) |
| Avg Thinking Length | 579.9 words | 450.8 words | -22.2% (Optimized) |
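The "Avg Thinking Length" metric can be reproduced with a small helper, assuming it counts whitespace-separated words inside the `<think>` block of each generation (the card reports words, not tokens):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def thinking_words(generation: str) -> int:
    """Count whitespace-separated words inside the <think> block (0 if absent)."""
    m = THINK_RE.search(generation)
    return len(m.group(1).split()) if m else 0

def avg_thinking_words(generations: list[str]) -> float:
    """Average thinking length over a set of generations."""
    counts = [thinking_words(g) for g in generations]
    return sum(counts) / len(counts) if counts else 0.0
```

Running this over the base and fine-tuned models' GSM8K rollouts would yield the 579.9 vs. 450.8 figures above.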

How to Get Started

We recommend using vLLM for deployment to maximize the efficiency gains of the shortened reasoning traces. As with the base Qwen3 model, do not use greedy decoding when thinking mode is enabled, as it can lead to performance degradation and endless repetition.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shreyansh327/Qwen3-1.7B-grpo-gsm8k"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # optional; requires the flash-attn package
)

prompt = "If a train travels 60 mph for 2.5 hours, how far does it go?"
messages = [
    {"role": "system", "content": "You are a helpful AI assistant. Please reason through the problem inside <think> tags, and then output your final answer inside <answer> tags."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Recommended sampling parameters for Qwen3 thinking mode (greedy decoding is discouraged)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    min_p=0.0
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))