Qwen3-4B-Thinking-2507-ERPD-006

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3-4B-Thinking-2507. Prompts are randomly sampled from the Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 6th iterative round (ERPD-006).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

| Model | AIME 2025 | HMMT Feb 25 | HMMT Nov 25 | Beyond AIME |
|---|---|---|---|---|
| Qwen3.5-4B | - | 74.0 | 76.8 | - |
| Qwen3-30B-A3B-Thinking-2507 | - | 63.1 | 73.8 | - |
| Qwen3-4B-Thinking-2507 | 81.1 | 56.0 | 66.6 | 53.8 |
| Qwen3-4B-Thinking-2507-ERPD-006 | 89.2 | 73.6 | 79.0 | 60.5 |
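
To make the improvement over the base model explicit, the short Python sketch below subtracts the Qwen3-4B-Thinking-2507 row from the ERPD-006 row. The scores are copied from the table above; the variable and function names are illustrative only and do not come from the ERPD codebase.

```python
# Scores copied from the Performance table (accuracy, in percent).
BENCHMARKS = ["AIME 2025", "HMMT Feb 25", "HMMT Nov 25", "Beyond AIME"]
BASE = [81.1, 56.0, 66.6, 53.8]  # Qwen3-4B-Thinking-2507
ERPD = [89.2, 73.6, 79.0, 60.5]  # Qwen3-4B-Thinking-2507-ERPD-006


def absolute_gains(base, tuned):
    """Return per-benchmark gains in percentage points, rounded to 0.1."""
    return [round(t - b, 1) for b, t in zip(base, tuned)]


GAINS = dict(zip(BENCHMARKS, absolute_gains(BASE, ERPD)))

if __name__ == "__main__":
    for name, gain in GAINS.items():
        print(f"{name}: +{gain:.1f} points")
```

The gains are 8.1, 17.6, 12.4, and 6.7 points respectively, with the largest improvement on HMMT Feb 25.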

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results:

{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "max_tokens": 131072
}
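
For readers unfamiliar with these settings, here is a minimal pure-Python sketch of how temperature, top_k, top_p, and min_p typically shape a next-token distribution. It is an illustration under assumptions, not the code used by any particular inference engine: the function name `filter_distribution` is hypothetical, and real libraries may apply the filters in a different order or renormalize between steps.

```python
import math


def filter_distribution(logits, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0):
    """Return {token_index: probability} after the four filters (toy sketch)."""
    # 1. Temperature: divide logits before the softmax; values < 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    # Sort (probability, index) pairs from most to least likely.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # 2. top_k: keep only the k most likely tokens.
    if top_k > 0:
        probs = probs[:top_k]

    # 3. top_p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # 4. min_p: drop tokens below min_p times the best token's probability.
    #    With min_p = 0.0, as recommended here, this step removes nothing.
    threshold = min_p * kept[0][0]
    kept = [(p, i) for p, i in kept if p >= threshold]

    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(p for p, _ in kept)
    return {i: p / mass for p, i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0, -5.0]  # hypothetical values
    print(filter_distribution(toy_logits))
```

In this sketch a token is sampled from the returned dictionary; lowering the temperature or top_p narrows the set of candidates, while min_p only matters when it is set above zero.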

We provide the model's output file. To reproduce the results, run the grading script from https://github.com/ChenxinAn-fdu/POLARIS on it with the following command:

python evaluation/grade.py --file_name ./aime25-0.6-32-131072-0.95-20.jsonl
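
The output file name appears to encode the run settings. The sketch below parses it under the assumption that the fields are, in order, dataset, temperature, number of samples per problem, max tokens, top_p, and top_k. Four of these match the recommended sampling parameters; the meaning of the third field (32) is a guess, and the `RunConfig` and `parse_output_name` names are hypothetical rather than part of the POLARIS code.

```python
import os
from typing import NamedTuple


class RunConfig(NamedTuple):
    dataset: str
    temperature: float
    n_samples: int  # ASSUMPTION: third field is the number of samples per problem
    max_tokens: int
    top_p: float
    top_k: int


def parse_output_name(file_name: str) -> RunConfig:
    """Parse '<dataset>-<temp>-<n>-<max_tokens>-<top_p>-<top_k>.jsonl' (assumed layout)."""
    # Drop the directory and the final extension; dots inside numbers are kept.
    stem, _ext = os.path.splitext(os.path.basename(file_name))
    dataset, temperature, n_samples, max_tokens, top_p, top_k = stem.split("-")
    return RunConfig(dataset, float(temperature), int(n_samples),
                     int(max_tokens), float(top_p), int(top_k))


if __name__ == "__main__":
    print(parse_output_name("./aime25-0.6-32-131072-0.95-20.jsonl"))
```

A check like this can confirm that an output file was generated with the suggested temperature, top_p, top_k, and max_tokens before grading it.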

Citation

If you find our work helpful, please consider citing it.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}