---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-7B
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 89.19
    - name: Avg BoN Accuracy
      type: accuracy
      value: 74.60
---

<div align="center">

# WebArbiter-7B

**A principle-guided reasoning Process Reward Model for web agents**

**Published at ICLR 2026**

[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

</div>

## Introduction

**WebArbiter-7B** is a 7B reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation: it produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

On [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), WebArbiter-7B achieves an **Avg. BoN Acc of 74.60%**, outperforming GPT-5 by **9.1 points** and the previous SOTA WebPRM (WebShepherd-8B) by **31 points**. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to **6.4 points** in success rate.

## Highlights

- **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
- **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
- **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
- **Robust generalization**: SOTA performance across all four WebPRMBench environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).

## Results on WebPRMBench

Models marked with ⋆ are ours. The WebArbiter-7B row is shown in **bold**.

| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 |
| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
| Claude-3.7-Sonnet | 80.20 | 57.90 | 82.80 | 64.10 | 81.50 | 61.30 | 82.10 | 60.60 | 81.65 | 60.98 |
| Gemini-2.5-Flash | 81.30 | 57.01 | 82.71 | 62.19 | 80.00 | 63.33 | 83.30 | 56.13 | 81.83 | 59.67 |
| DeepSeek-R1 | 81.62 | 57.37 | 82.04 | 60.21 | 78.49 | 56.18 | 84.12 | 63.89 | 81.57 | 59.41 |
| *Open-source LLM-as-judge* | | | | | | | | | | |
| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
| *WebPRMs* | | | | | | | | | | |
| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
| ⋆ **WebArbiter-7B** | **97.07** | **89.53** | **88.43** | **68.66** | **89.17** | **70.00** | **82.09** | **70.19** | **89.19** | **74.60** |
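
The Pair and BoN columns can differ substantially for the same model. The sketch below shows one way the two metrics could be computed, assuming each benchmark instance compares one chosen action against several rejected actions, that Pair accuracy counts every comparison separately, and that BoN accuracy counts an instance as correct only when the chosen action wins every comparison. This definition and the function name `pair_and_bon_accuracy` are assumptions for illustration; the paper and the benchmark card are the authoritative sources.

```python
from typing import List, Tuple


def pair_and_bon_accuracy(instances: List[List[bool]]) -> Tuple[float, float]:
    """Compute (Pair, BoN) accuracy in percent.

    instances[i][j] is True if the judge preferred the chosen action of
    instance i over its j-th rejected action. Assumes a non-empty input.
    """
    total_pairs = sum(len(outcomes) for outcomes in instances)
    correct_pairs = sum(sum(outcomes) for outcomes in instances)
    # An instance counts for BoN only if every pairwise comparison was correct.
    solved = sum(all(outcomes) for outcomes in instances)
    return 100.0 * correct_pairs / total_pairs, 100.0 * solved / len(instances)


# Toy data (not from the benchmark): two instances with four rejected actions each.
toy = [[True, True, True, True], [True, False, True, True]]
pair_acc, bon_acc = pair_and_bon_accuracy(toy)
print(f"Pair {pair_acc:.2f}  BoN {bon_acc:.2f}")
```

Under this assumption a single wrong comparison fails the whole instance, which would explain why BoN accuracy sits well below Pair accuracy for every model in the table.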

## Reward-Guided Trajectory Search (WebArena-Lite)

WebArbiter also excels as a practical reward signal for trajectory search. Using Best-of-5 sampling with a Knockout Tournament mechanism on [WebArena-Lite](https://arxiv.org/abs/2408.06327):

| Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ |
|--------|--------|:--------:|:---:|:------:|:------:|:---:|:----:|:-:|
| GPT-4o-mini | w/o Search | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | - |
| GPT-4o-mini | GPT-4o-mini (as WebPRM) | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 |
| GPT-4o-mini | WebShepherd-8B | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 |
| GPT-4o-mini | **WebArbiter-7B** | **37.78** | 42.86 | **36.84** | **46.67** | **38.46** | **40.52** | **+17.04** |
| GPT-4o | w/o Search | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | - |
| GPT-4o | GPT-4o-mini (as WebPRM) | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 |
| GPT-4o | WebShepherd-8B | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 |
| GPT-4o | **WebArbiter-7B** | **44.44** | 42.86 | **52.63** | **56.67** | **38.46** | **47.01** | **+15.11** |
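
A knockout tournament turns a pairwise judge into a selector over N sampled candidates using N-1 comparisons. The following is a minimal sketch of that idea, not the authors' implementation: `knockout_select`, the bye rule for odd pools, the toy judge, and the action strings are all hypothetical. In practice `prefer_first` would wrap a call to WebArbiter and return whether the verdict favours the first candidate.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def knockout_select(candidates: Sequence[T], prefer_first: Callable[[T, T], bool]) -> T:
    """Single-elimination tournament: winners advance until one candidate is left."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[T] = list(candidates)
    while len(pool) > 1:
        next_round: List[T] = []
        # Compare neighbours pairwise; the winner of each match advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer_first(a, b) else b)
        # With an odd pool, the last candidate advances without a match (a bye).
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy judge for illustration only: prefers the shorter action string.
actions = ["click('12')", "scroll(0, 200)", "fill('3', 'shoes')", "go_back()", "hover('7')"]
best = knockout_select(actions, lambda a, b: len(a) <= len(b))
print(best)
```

With Best-of-5 sampling this costs four judge calls per step, compared with ten for a full round-robin over all pairs.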

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
```

**Example output:**
```xml
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6): Does the action advance the search task?
2. Element reference accuracy (weight 0.25): Is the referenced element correct?
3. Efficiency (weight 0.15): Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
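
To use the verdict programmatically, the structured output has to be parsed. The snippet below is a small sketch that assumes the model emits the four tags exactly as in the example above and that the verdict reads `Response 1` or `Response 2`; the helper names `parse_sections` and `parse_verdict` are hypothetical and not part of the released code.

```python
import re
from typing import Dict, Optional

# Tag names taken from the example output; the parsing rules are assumptions.
SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_sections(response: str) -> Dict[str, str]:
    """Return the text inside each structured tag that is present in the response."""
    sections: Dict[str, str] = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        if match:
            sections[tag] = match.group(1).strip()
    return sections


def parse_verdict(response: str) -> Optional[int]:
    """Return 1 or 2 for 'Response 1' / 'Response 2', or None if no verdict is found."""
    answer = parse_sections(response).get("Answer", "")
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


example = "<State>s</State><Criteria>c</Criteria><Analysis>a</Analysis><Answer>Response 1</Answer>"
print(parse_verdict(example))
```

Returning `None` for malformed output lets the caller decide whether to retry generation or treat the comparison as undecided.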

## Training Details

| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
|---|---|---|
| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
| Teacher | o3 | - |
| Base Model | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Stage 1 checkpoint |
| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 7e-6) |
| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |

**Key training insights** (from ablation studies in the paper):
- Explicit principles are essential: removing them notably degrades performance, especially on out-of-domain environments.
- Cold-start RL without reasoning distillation is unstable across environments.
- Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments.
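
Stage 2 pairs GRPO with a binary verifiable reward. As a rough illustration of how those two pieces fit together, the sketch below scores each sampled rollout by whether its `<Answer>` verdict matches the ground-truth label, then standardizes rewards within the group of rollouts for one prompt, which is the usual group-relative advantage in GRPO. The function names, the regex, the use of population standard deviation, and the `eps` value are assumptions; this is not the veRL implementation or the authors' reward code.

```python
import re
from statistics import mean, pstdev
from typing import List


def binary_reward(response: str, gold: int) -> float:
    """1.0 if the <Answer> verdict names the ground-truth response, else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    # eps keeps the division finite when every rollout receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts for a single prompt whose correct verdict is Response 1.
rollouts = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "malformed output without a verdict",
]
rewards = [binary_reward(r, gold=1) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the reward depends only on the final verdict, rollouts with a correct verdict receive positive advantages and the rest negative ones, regardless of how the reasoning text is worded.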

## Intended Uses

WebArbiter-7B is designed to:
- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.

## Limitations

- **Text-only observations**: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
- **Safe-action bias**: The model may sometimes overvalue cautious actions (e.g., preferring `hover` to `click`) because the accessibility tree does not encode interaction effects.
- **Element reference hallucination**: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level bid verification, potentially missing incorrect element references.

## License

This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).

## Related Resources

| Resource | Link |
|----------|------|
| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
| WebArbiter-3B | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

## Citation

```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents}, 
      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
      year={2026},
      eprint={2601.21872},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.21872}, 
}
```