Llama-3.2-3B-Instruct-GRPO-MATH-1EPOCH
This model is a GRPO-fine-tuned version of meta-llama/Llama-3.2-3B-Instruct trained on the MATH dataset for one epoch. It was developed as part of the research presented in the paper Learning to Reason without External Rewards.
The official code implementation is available in the Intuitor GitHub repository.
Model Description
The model explores Reinforcement Learning from Internal Feedback (RLIF), a framework that enables Large Language Models (LLMs) to learn from intrinsic signals, such as self-certainty, without external rewards or labeled data. This checkpoint is the GRPO baseline against which the paper's Intuitor method is compared; Intuitor replaces the external reward in Group Relative Policy Optimization (GRPO) with the model's own confidence (self-certainty) as its sole reward signal.
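As a rough sketch of the intrinsic signal involved (the function name and exact formulation here are illustrative assumptions based on the paper's description, not the repository's implementation), self-certainty can be computed as the mean KL divergence from a uniform distribution to each next-token distribution, so that peaked (confident) distributions score higher than flat ones:

```python
import math

def self_certainty(token_dists):
    """Mean KL(U || p_i) over a sequence of next-token distributions,
    where U is uniform over the vocabulary. Higher values indicate a
    more confident (more peaked) model. Illustrative sketch only."""
    total = 0.0
    for p in token_dists:
        vocab = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j); clamp p_j to avoid log(0)
        total += sum(
            (1.0 / vocab) * math.log((1.0 / vocab) / max(pj, 1e-12))
            for pj in p
        )
    return total / len(token_dists)

# A peaked distribution yields a higher score than a uniform one.
confident = [[0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
print(self_certainty(confident) > self_certainty(uncertain))  # True
```

In the RLIF setting described above, a score like this would stand in for the verifier or human reward inside the GRPO update, so no labeled answers are required.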
- Paper: Learning to Reason without External Rewards
- Repository: sunblaze-ucb/Intuitor
Citation
```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
```
Model tree for sunblaze-ucb/Llama-3.2-3B-Instruct-GRPO-MATH-1EPOCH
- Base model: meta-llama/Llama-3.2-3B-Instruct