Qwen3-4B-Thinking-2507-ERPD-006
Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3-4B-Thinking-2507. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts ร 16 rollouts per iteration. This checkpoint corresponds to the 6rd iterative round (ERPD-006).
๐ Paper: Extreme Region Policy Distillation
๐ Project: https://github.com/ChangyuChen347/ERPD
Performance
| AIME 2025 | HMMT Feb 25 | HMMT Nov 25 | Beyond AIME | |
|---|---|---|---|---|
| Qwen3.5-4B | โ | 74.0 | 76.8 | โ |
| Qwen3-30BA3B-Thinking-2507 | โ | 63.1 | 73.8 | โ |
| Qwen3-4B-Thinking-2507 | 81.1 | 56.0 | 66.6 | 53.8 |
| Qwen3-4B-Thinking-2507-ERPD-006 | 89.2 | 73.6 | 79.0 | 60.5 |
Sampling Parameters
We suggest using the following sampling parameters to reproduce the results:
{
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"min_p": 0.0,
"max_tokens": 131072,
}
We provide the output file. You can directly use the code from https://github.com/ChenxinAn-fdu/POLARIS to reproduce the results with the following command:
python evaluation/grade.py --file_name ./aime25-0.6-32-131072-0.95-20.jsonl
Citation
If you find our work helpful, feel free to give us a cite.
@misc{chen2026extremeregionpolicydistillation,
title={Extreme Region Policy Distillation},
author={Changyu Chen and Xiting Wang and Rui Yan},
year={2026},
eprint={2605.25582},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.25582},
}
- Downloads last month
- 23