Intuitor
Description:
A GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT trained on the MATH dataset with a system prompt.
This model was developed as part of the research presented in the paper Learning to Reason without External Rewards. It is trained with Intuitor, an instantiation of Reinforcement Learning from Internal Feedback (RLIF), which lets a model learn from intrinsic signals, such as its own self-certainty, without requiring external rewards or labeled gold solutions.
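To make the intrinsic signal concrete, here is a minimal sketch of how a self-certainty score could be computed from per-token next-token logits. The paper describes self-certainty as the KL divergence between a uniform distribution over the vocabulary and the model's predicted distribution, averaged over generated tokens; the function below follows that description, but the exact normalization and averaging choices here are assumptions, not the authors' reference implementation.

```python
import math

def self_certainty(token_logits):
    """Average KL(U || p) over generated token positions.

    token_logits: one list of vocab-size logit floats per generated
    token. Higher score = the model's next-token distributions are
    further from uniform, i.e. the model is more confident.
    """
    total = 0.0
    for logits in token_logits:
        # log-softmax with the max-subtraction trick for stability
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        log_probs = [x - log_z for x in logits]
        vocab = len(logits)
        # KL(U || p) = -log(V) - (1/V) * sum_v log p_v
        total += -math.log(vocab) - sum(log_probs) / vocab
    return total / len(token_logits)

flat = [[0.0, 0.0, 0.0, 0.0]]     # uniform distribution -> KL is ~0
peaked = [[10.0, 0.0, 0.0, 0.0]]  # confident prediction -> KL > 0
assert self_certainty(flat) < 1e-9
assert self_certainty(peaked) > self_certainty(flat)
```

In RLIF this score plays the role that an external reward model or gold-answer check plays in standard RLHF/GRPO pipelines: completions the model is more certain about receive a higher intrinsic reward.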
Citation:

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}