Llama-3.2-3B-Instruct-GRPO-MATH-1EPOCH
This model is a GRPO-fine-tuned version of meta-llama/Llama-3.2-3B-Instruct trained on the MATH dataset for one epoch. It was developed as part of the research presented in the paper Learning to Reason without External Rewards.
The official code implementation is available in the Intuitor GitHub repository.
Model Description
The model explores Reinforcement Learning from Internal Feedback (RLIF), a framework that enables Large Language Models (LLMs) to learn from intrinsic signals, such as self-certainty, without external rewards or labeled data. This checkpoint is the GRPO baseline against which the paper's Intuitor method is compared; Intuitor replaces the external reward in Group Relative Policy Optimization (GRPO) with the model's own confidence (self-certainty) as its sole reward signal.
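As a rough sketch of the intrinsic signal involved (the function name and exact formulation here are illustrative assumptions based on the paper's description, not the repository's implementation), self-certainty can be computed as the mean KL divergence from a uniform distribution to each next-token distribution, so that peaked (confident) distributions score higher than flat ones:

```python
import math

def self_certainty(token_dists):
    """Mean KL(U || p_i) over a sequence of next-token distributions,
    where U is uniform over the vocabulary. Higher values indicate a
    more confident (more peaked) model. Illustrative sketch only."""
    total = 0.0
    for p in token_dists:
        vocab = len(p)
        # KL(U || p) = sum_j (1/V) * log((1/V) / p_j); clamp p_j to avoid log(0)
        total += sum(
            (1.0 / vocab) * math.log((1.0 / vocab) / max(pj, 1e-12))
            for pj in p
        )
    return total / len(token_dists)

# A peaked distribution yields a higher score than a uniform one.
confident = [[0.97, 0.01, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25]]
print(self_certainty(confident) > self_certainty(uncertain))  # True
```

In the RLIF setting described above, a score like this would stand in for the verifier or human reward inside the GRPO update, so no labeled answers are required.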
- Paper: Learning to Reason without External Rewards
- Repository: sunblaze-ucb/Intuitor
Citation
```bibtex
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
```
Model tree for sunblaze-ucb/Llama-3.2-3B-Instruct-GRPO-MATH-1EPOCH
- Base model: meta-llama/Llama-3.2-3B-Instruct