
Dianjin-PRM

Dianjin-PRM is a Process Reward Model (PRM) built on the Qwen3-8B architecture. It scores each reasoning step in a chain-of-thought trajectory, enabling Best-of-N selection and other process-supervision strategies for financial and mathematical reasoning tasks. The model is licensed under CC BY-NC-SA 4.0; the license tag shown on the platform may differ, since it is restricted to the platform's supported options.

Model Details

| Property | Value |
|---|---|
| Base Architecture | Qwen3-8B (Qwen3ForProcessRewardModel) |
| Parameters | ~8B |
| Precision | bfloat16 |
| Max Sequence Length | 40960 tokens |
| Output Labels | 2 (negative / positive per step) |
| Step Separator Token | <extra_0> |

Requirements

pip install torch transformers

The model uses custom trust_remote_code classes (v1_fin_prm.Qwen3ForProcessRewardModel and v1_fin_config.Qwen3PRMConfig) that are loaded automatically via the auto_map in config.json.
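For reference, the auto_map wiring in config.json typically looks like the sketch below. The key names follow the standard Hugging Face auto_map convention; the entries in this repo's actual config.json are authoritative:

```json
{
  "auto_map": {
    "AutoConfig": "v1_fin_config.Qwen3PRMConfig",
    "AutoModel": "v1_fin_prm.Qwen3ForProcessRewardModel"
  }
}
```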

Quick Start

1. Load the Model

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "path/to/Dianjin-PRM"

model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map=None,
).eval()

# Multi-GPU via DataParallel (optional)
model = torch.nn.DataParallel(model).cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

2. Prepare Input

The model expects input in the following format, with each reasoning step separated by <extra_0>:

##Question
<your question here>

##Thinking Trajectory
<step 1><extra_0><step 2><extra_0>...<step N><extra_0>

Example:

question = "What is the present value of $1000 received in 5 years at a 10% discount rate?"

steps = [
    "We need to calculate the present value using the formula PV = FV / (1 + r)^n.",
    "Substituting the values: PV = 1000 / (1 + 0.10)^5.",
    "PV = 1000 / 1.61051 ≈ 620.92.",
]

trajectory = "<extra_0>".join(steps) + "<extra_0>"
completion = f"##Question\n{question}\n\n##Thinking Trajectory\n{trajectory}"
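Note that every step, including the last, is terminated by <extra_0>, since the model reads a score at each separator position. A minimal sanity check with toy steps:

```python
# Toy illustration: each step ends with the separator, so the
# separator count in the prompt equals the number of steps.
steps = ["step one", "step two", "step three"]
trajectory = "<extra_0>".join(steps) + "<extra_0>"
completion = f"##Question\ntoy question\n\n##Thinking Trajectory\n{trajectory}"

assert completion.count("<extra_0>") == len(steps)
assert completion.endswith("<extra_0>")
print("format OK")
```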

3. Compute Step Rewards

def make_step_rewards(logits, token_masks):
    """Extract per-step reward scores from model logits."""
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    probabilities = probabilities * token_masks.unsqueeze(-1)
    all_scores_res = []
    for i in range(probabilities.size(0)):
        sample = probabilities[i]
        positive_probs = sample[sample != 0].view(-1, 2)[:, 1]
        all_scores_res.append(positive_probs.cpu().tolist())
    return all_scores_res


# Tokenize
input_ids = tokenizer(
    [completion],
    return_tensors="pt",
    padding=True,
    truncation=True,
)["input_ids"].to("cuda")

# Forward pass
with torch.inference_mode():
    outputs = model(input_ids=input_ids)

# Build step-separator mask and extract rewards
step_sep_id = tokenizer.encode("<extra_0>")[0]
token_masks = (input_ids == step_sep_id)
step_rewards = make_step_rewards(outputs.logits, token_masks)

print(step_rewards)
# e.g. [[0.92, 0.87, 0.95]] - one score per step, per sample

Each score is the probability assigned to the positive label at the corresponding <extra_0> step boundary. Higher values indicate higher-quality reasoning steps.
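How the per-step scores are collapsed into a single trajectory score is a design choice. The helper below sketches a few common aggregation strategies; these are generic conventions from the PRM literature, not something Dianjin-PRM prescribes:

```python
import math

def aggregate_rewards(step_rewards, mode="min"):
    """Collapse a list of per-step reward scores into one trajectory score."""
    if mode == "min":    # trajectory is only as good as its weakest step
        return min(step_rewards)
    if mode == "mean":   # average step quality
        return sum(step_rewards) / len(step_rewards)
    if mode == "prod":   # treat scores as independent step-correctness probabilities
        return math.prod(step_rewards)
    if mode == "last":   # score of the final step only
        return step_rewards[-1]
    raise ValueError(f"unknown mode: {mode}")

scores = [0.92, 0.87, 0.95]
print(aggregate_rewards(scores, "min"))   # 0.87
print(aggregate_rewards(scores, "last"))  # 0.95
```

Min-aggregation (used in the Best-of-N example below) is a conservative choice: a single bad step sinks the whole trajectory.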

4. Best-of-N Selection

To perform Best-of-N selection over multiple candidate responses:

import numpy as np

candidates = [...]  # list of (trajectory_string, final_answer) tuples
all_rewards = []

for trajectory, answer in candidates:
    completion = f"##Question\n{question}\n\n##Thinking Trajectory\n{trajectory}"
    input_ids = tokenizer(
        [completion], return_tensors="pt", padding=True, truncation=True
    )["input_ids"].to("cuda")

    with torch.inference_mode():
        outputs = model(input_ids=input_ids)

    step_sep_id = tokenizer.encode("<extra_0>")[0]
    token_masks = (input_ids == step_sep_id)
    rewards = make_step_rewards(outputs.logits, token_masks)
    # Use the minimum step score as the overall trajectory score
    all_rewards.append(min(rewards[0]))

best_idx = int(np.argmax(all_rewards))
best_answer = candidates[best_idx][1]
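The selection logic itself can be checked without a model, using made-up step-reward lists (illustrative numbers only):

```python
# Toy demo of min-aggregation Best-of-N (made-up step rewards).
candidate_rewards = [
    [0.91, 0.45, 0.88],  # one weak step drags this candidate down
    [0.85, 0.82, 0.90],  # consistently solid
    [0.60, 0.95, 0.70],
]
trajectory_scores = [min(r) for r in candidate_rewards]
best_idx = max(range(len(trajectory_scores)), key=trajectory_scores.__getitem__)
print(best_idx)  # 1: its weakest step (0.82) beats the other candidates' weakest steps
```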

Input Format Summary

| Component | Description |
|---|---|
| ##Question | The original question/problem |
| ##Thinking Trajectory | Reasoning steps separated by <extra_0> |
| <extra_0> | Special token used as step separator (token id: 151669) |

Notes

  • The model outputs 2 logits per token (negative, positive). The reward score for each step is the softmax probability of the positive class at each <extra_0> position.
  • For batch inference, pass multiple completions as a list to the tokenizer with padding=True.
  • Multi-GPU is supported via torch.nn.DataParallel.
  • Always use trust_remote_code=True when loading the model, as it relies on custom architecture classes.
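To make the separator mask concrete, here is a toy batched example with dummy token ids (illustrative values only; real ids come from the tokenizer, 151669 is the <extra_0> id from the table above, and 0 stands in for a pad id):

```python
import torch

SEP_ID = 151669  # <extra_0> token id

# Dummy right-padded batch (illustrative ids, not real tokenization).
input_ids = torch.tensor([
    [11, 12, SEP_ID, 13, 14, SEP_ID, 0, 0],      # 2 steps
    [21, SEP_ID, 22, SEP_ID, 23, 24, SEP_ID, 0],  # 3 steps
])

token_masks = (input_ids == SEP_ID)      # True at each step boundary
print(token_masks.sum(dim=1).tolist())   # [2, 3]: steps per sample
```

Because padding positions never equal the separator id, the mask (and hence make_step_rewards) is unaffected by padding length.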