Qwen3-4B-Thinking-2507-ERPD-006

Extreme Region Policy Distillation (ERPD) is a two-stage reinforcement learning framework that decouples sample efficiency from KL efficiency. This model is obtained by applying ERPD to Qwen3-4B-Thinking-2507. Prompts are randomly sampled from Polaris collection, with each training round sampling 1K prompts × 16 rollouts per iteration. This checkpoint corresponds to the 6rd iterative round (ERPD-006).

📄 Paper: Extreme Region Policy Distillation
🏠 Project: https://github.com/ChangyuChen347/ERPD

Performance

	AIME 2025	HMMT Feb 25	HMMT Nov 25	Beyond AIME
Qwen3.5-4B	—	74.0	76.8	—
Qwen3-30BA3B-Thinking-2507	—	63.1	73.8	—
Qwen3-4B-Thinking-2507	81.1	56.0	66.6	53.8
Qwen3-4B-Thinking-2507-ERPD-006	89.2	73.6	79.0	60.5

Sampling Parameters

We suggest using the following sampling parameters to reproduce the results:

{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "max_tokens": 131072,
}

We provide the output file. You can directly use the code from https://github.com/ChenxinAn-fdu/POLARIS to reproduce the results with the following command:
python evaluation/grade.py --file_name ./aime25-0.6-32-131072-0.95-20.jsonl

Citation

If you find our work helpful, feel free to give us a cite.

@misc{chen2026extremeregionpolicydistillation,
      title={Extreme Region Policy Distillation}, 
      author={Changyu Chen and Xiting Wang and Rui Yan},
      year={2026},
      eprint={2605.25582},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.25582}, 
}

Downloads last month: 23

Safetensors

Model size

4B params

Tensor type

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including adalaw/Qwen3-4B-Thinking-2507-ERPD-006

Extreme Region Policy Distillation

Collection

4 items • Updated 19 days ago

Paper for adalaw/Qwen3-4B-Thinking-2507-ERPD-006

Extreme Region Policy Distillation

Paper • 2605.25582 • Published 20 days ago