OTP: Failure-Risk Process Reward Model

OTP (Outcome-to-Process) is a process reward model based on failure-risk dynamics. A linear head maps each token's hidden state $h_t$ to a margin $m_t = \text{head}(h_t)$, and the reward at position $t$ is the change in margin, $r_t = m_t - m_{t-1}$.
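As a small worked example of the reward definition, the sketch below uses made-up margin values (not model outputs) to show that the rewards telescope: their sum equals the last margin minus the first. Reading `sigmoid(m_t)` as a success probability is an assumption that follows from the head being trained as a success logit; the helper names are hypothetical.

```python
import math

def margin_differences(margins):
    """Return r_t = m_t - m_{t-1} for t = 1..T-1."""
    return [b - a for a, b in zip(margins, margins[1:])]

def success_probability(margin):
    """Map a success logit to a probability (assumed interpretation)."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical margins for a 5-token sequence, chosen for illustration only.
margins = [0.5, 1.0, -0.25, 0.75, 2.0]
rewards = margin_differences(margins)

# The rewards telescope: their sum is the net change in margin.
total = sum(rewards)
net_change = margins[-1] - margins[0]
```

A negative reward marks a position where the predicted chance of success dropped, which is the signal a failure-risk reward is meant to expose.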

Architecture

Backbone: Qwen2.5-Math-7B-Instruct (frozen during D_phi pretraining)
Head: Single linear layer (hidden_size -> 1) predicting the success logit
Training: Binary cross-entropy on outcome labels (correct/incorrect), 1000 steps
No step-level annotations required -- trained with outcome supervision only
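The training setup above can be sketched as follows. This is a minimal illustration, not the authors' training code: random tensors stand in for the frozen backbone's hidden states, and applying the loss at the last non-padding token is an assumption, since the card does not say which positions receive the outcome label. Sizes, learning rate, and step count are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, batch, seq_len = 16, 4, 6

# Stand-in for the frozen backbone's last_hidden_state (no gradient flows here).
hidden = torch.randn(batch, seq_len, hidden_size)
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
attention_mask[0, 4:] = 0  # first sequence is right-padded
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])  # outcome labels: correct / incorrect

head = nn.Linear(hidden_size, 1)  # the only trainable module
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

def outcome_loss(hidden, attention_mask, labels):
    margins = head(hidden).squeeze(-1)    # (B, L) success logits
    last = attention_mask.sum(dim=1) - 1  # index of the last real token
    # Assumption: the outcome label supervises the final token's margin.
    final_margin = margins[torch.arange(margins.size(0)), last]
    return loss_fn(final_margin, labels)

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = outcome_loss(hidden, attention_mask, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because only the head has parameters in the optimizer, the backbone stays untouched, matching the frozen-backbone setting described in the list.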

Usage

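import os

import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

class FailureRiskModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1, dtype=torch.bfloat16)
        # Read head.pt from a local checkpoint directory if model_name is one;
        # otherwise fetch it from the Hub repo (assumes head.pt sits at the repo root).
        head_path = os.path.join(model_name, "head.pt")
        if not os.path.isfile(head_path):
            head_path = hf_hub_download(repo_id=model_name, filename="head.pt")
        head_state = torch.load(head_path, map_location="cpu", weights_only=True)
        self.head.load_state_dict(head_state)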

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(h).squeeze(-1)  # margins m_t, shape (B, L)

model = FailureRiskModel("luca0621/OTP-Qwen2.5-Math-7B")
tokenizer = AutoTokenizer.from_pretrained("luca0621/OTP-Qwen2.5-Math-7B", trust_remote_code=True)

# Compute rewards at token granularity: r_t = m_t - m_{t-1}
inputs = tokenizer("Solve: 2+2=?\nStep 1: 2+2=4\nAnswer: 4", return_tensors="pt")
with torch.no_grad():
    margins = model(**inputs)  # (1, L)
    rewards = margins[:, 1:] - margins[:, :-1]  # (1, L-1); rewards[:, i] is r_{i+1}
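The snippet above yields one reward per token. To score whole reasoning steps, the token margins can be read at step boundaries instead. The sketch below is one way to do that, assuming the caller already knows the index of the last token of the prompt and of each step; `step_level_rewards` and the example values are hypothetical, and finding the boundaries from the tokenization is not shown.

```python
def step_level_rewards(margins, boundaries):
    """Reward of each step as the change in margin across it.

    margins:    per-token margins for one sequence (list of floats).
    boundaries: increasing token indices; boundaries[0] is the last prompt
                token, boundaries[k] is the last token of step k.
    """
    if any(b < 0 or b >= len(margins) for b in boundaries):
        raise ValueError("boundary index out of range")
    if any(a >= b for a, b in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be strictly increasing")
    return [margins[b] - margins[a] for a, b in zip(boundaries, boundaries[1:])]

# Illustrative margins for 8 tokens: a prompt ending at index 1, then three steps.
margins = [0.0, 0.5, 0.25, 1.0, 0.75, -0.5, -0.25, 1.5]
boundaries = [1, 3, 5, 7]
step_rewards = step_level_rewards(margins, boundaries)
```

Each step reward equals the sum of the token rewards inside that step, so nothing is lost by aggregating. With the model output above, `margins[0].float().tolist()` gives a list in the expected format.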

Results

Benchmark	Score
ProcessBench Avg F1	44.0
BoN@64 (3-gen avg)	61.3%
Dynamics Localization	65.3%
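For readers unfamiliar with best-of-N (BoN) evaluation, the sketch below shows the generic selection step: score N candidate solutions and keep the highest-scoring one. The card does not say how candidates were scored for the BoN@64 number; ranking by the final-token margin is an assumption made here for illustration, and `best_of_n` and the sample data are hypothetical.

```python
def best_of_n(candidates, scores):
    """Return the candidate with the highest score (first one on ties)."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("need one score per candidate and at least one candidate")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]

# Illustrative data: three candidate answers with made-up final margins.
candidates = ["Answer: 4", "Answer: 5", "Answer: 22"]
final_margins = [2.5, -1.0, -3.25]  # assumed score: margin at the last token
chosen = best_of_n(candidates, final_margins)
```

Other scoring rules, such as the minimum step reward along the solution, fit the same function by swapping what is passed as `scores`.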

Citation

@article{otp2026,
  title={Outcome-to-Process: Failure-Risk Dynamics for Dense Reward in Mathematical Reasoning},
  year={2026}
}
