OLMo-2-7B-SFT-GRPO-MATH-1EPOCH-SYSP

Description:

A GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT, trained on the MATH dataset with a system prompt.

This model was developed as part of the research presented in the paper Learning to Reason without External Rewards. It uses the Intuitor method, an instantiation of Reinforcement Learning from Internal Feedback (RLIF), which lets a model learn from intrinsic signals such as its own self-certainty, without external rewards or labeled gold solutions.
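The checkpoint is a standard causal language model, so it can be loaded with the Hugging Face `transformers` library. The sketch below is a minimal, untested usage example: the model ID and tensor type (BF16) come from this card, but the prompt and generation settings are illustrative assumptions.

```python
MODEL_ID = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH-SYSP"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint in bfloat16 (its stored tensor type) and
    sample a completion. Imports are kept inside the helper so the
    sketch can be read without the heavy dependencies installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Example math prompt; the model was RL-tuned on MATH with a system prompt,
    # so pairing it with the training-time system prompt may work best.
    print(generate("Solve: what is 12 * 13?"))
```

Note that a 7B model in BF16 needs roughly 14 GB of memory to load; move it to a GPU with `.to("cuda")` if one is available.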

Resources

- Paper: Learning to Reason without External Rewards (arXiv:2505.19590)
- Base model: allenai/OLMo-2-1124-7B-SFT

Citation

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
Model size: 7B params · Tensor type: BF16 · Format: Safetensors