OLMo-2-7B-SFT-GRPO-MATH-1EPOCH-SYSP

Description:

A GRPO-fine-tuned version of allenai/OLMo-2-1124-7B-SFT, trained on the MATH dataset with a system prompt.

This model was developed as part of the research presented in the paper Learning to Reason without External Rewards. It uses the Intuitor method, an instantiation of Reinforcement Learning from Internal Feedback (RLIF), which lets a model learn from intrinsic signals such as its own self-certainty, without external rewards or labeled gold solutions.
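The checkpoint is a standard causal language model, so it can be loaded with the Hugging Face `transformers` library. The sketch below is a minimal, untested usage example: the model ID and tensor type (BF16) come from this card, but the prompt and generation settings are illustrative assumptions.

```python
MODEL_ID = "sunblaze-ucb/OLMo-2-7B-SFT-GRPO-MATH-1EPOCH-SYSP"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Load the checkpoint in bfloat16 (its stored tensor type) and
    sample a completion. Imports are kept inside the helper so the
    sketch can be read without the heavy dependencies installed."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Example math prompt; the model was RL-tuned on MATH with a system prompt,
    # so pairing it with the training-time system prompt may work best.
    print(generate("Solve: what is 12 * 13?"))
```

Note that a 7B model in BF16 needs roughly 14 GB of memory to load; move it to a GPU with `.to("cuda")` if one is available.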

Resources

- Paper: Learning to Reason without External Rewards (arXiv:2505.19590)
- Base model: allenai/OLMo-2-1124-7B-SFT

Citation

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
Model size: 7B params · Tensor type: BF16 · Format: Safetensors