---
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- web-agent
- process-reward-model
- preference
- reward-model
- web-navigation
- reasoning
- grpo
base_model: Qwen/Qwen2.5-7B-Instruct
datasets:
- ZYao720/WebArbiter-Data
model-index:
- name: WebArbiter-7B
  results:
  - task:
      type: text-generation
      name: Web Process Reward Modeling
    dataset:
      name: WebPRMBench
      type: ZYao720/WEBPRMBENCH
    metrics:
    - name: Avg Pairwise Accuracy
      type: accuracy
      value: 89.19
    - name: Avg BoN Accuracy
      type: accuracy
      value: 74.60
---
| |
| <div align="center"> |
|
|
| # WebArbiter-7B |
|
|
| **A principle-guided reasoning Process Reward Model for web agents** |
|
|
| **Published at ICLR 2026** |
|
|
| [Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html) |
|
|
| </div> |
|
|
| ## Introduction |
|
|
| **WebArbiter-7B** is a 7B reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion. |
|
|
| On [WebPRMBench](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), WebArbiter-7B achieves an average Best-of-N (BoN) accuracy of **74.60%**, outperforming GPT-5 (65.50%) by **9.1 points** and the previous state-of-the-art WebPRM, WebShepherd-8B (43.28%), by **31.3 points**. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to **6.4 points** in success rate. |
|
|
| ## Highlights |
|
|
| - **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists. |
| - **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments. |
| - **Two-stage training**: Reasoning distillation from o3 via SFT, followed by reinforcement learning with verifiable rewards (RLVR) using GRPO, to correct teacher biases and align verdicts with ground-truth correctness. |
| - **Robust generalization**: Best average Pairwise and BoN accuracy on WebPRMBench, and the strongest WebPRM in all four environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench). |
|
|
| ## Results on WebPRMBench |
|
|
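| Models marked with ⋆ are ours. **Bold** marks the WebArbiter-7B row, which is the best WebPRM in every column. Pair = pairwise accuracy, BoN = Best-of-N accuracy (both in %). |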
|
|
| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
| |-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |
| | | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | |
| *Proprietary LLM-as-judge* | | | | | | | | | | |
| | GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | 89.17 | 73.33 | 81.43 | 46.70 | 82.64 | 56.92 | |
| | GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 | |
| | GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 | |
| | Claude-3.7-Sonnet | 80.20 | 57.90 | 82.80 | 64.10 | 81.50 | 61.30 | 82.10 | 60.60 | 81.65 | 60.98 | |
| | Gemini-2.5-Flash | 81.30 | 57.01 | 82.71 | 62.19 | 80.00 | 63.33 | 83.30 | 56.13 | 81.83 | 59.67 | |
| | DeepSeek-R1 | 81.62 | 57.37 | 82.04 | 60.21 | 78.49 | 56.18 | 84.12 | 63.89 | 81.57 | 59.41 | |
| *Open-source LLM-as-judge* | | | | | | | | | | |
| | Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 | |
| | Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 | |
| *WebPRMs* | | | | | | | | | | |
| | WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 | |
| | ⋆ **WebArbiter-7B** | **97.07** | **89.53** | **88.43** | **68.66** | **89.17** | **70.00** | **82.09** | **70.19** | **89.19** | **74.60** | |
|
|
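The Avg. columns in the table above are consistent with an unweighted mean over the four environments. The short check below reproduces WebArbiter-7B's Avg. BoN figure and its gap to GPT-5 from the per-environment numbers; the names `bon` and `macro_average` are illustrative only and are not part of any released code.

```python
from decimal import Decimal, ROUND_HALF_UP

# BoN accuracy (%) per environment, copied from the WebArbiter-7B row above.
bon = {
    "Mind2Web": Decimal("89.53"),
    "WebArena": Decimal("68.66"),
    "AssistantBench": Decimal("70.00"),
    "WorkArena": Decimal("70.19"),
}


def macro_average(scores):
    """Unweighted mean over environments, rounded half-up to two decimals."""
    mean = sum(scores.values()) / Decimal(len(scores))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


avg_bon = macro_average(bon)
gap_to_gpt5 = avg_bon - Decimal("65.50")  # GPT-5 Avg. BoN from the table

print(avg_bon)      # 74.60
print(gap_to_gpt5)  # 9.10
```

`Decimal` is used instead of floats so that the half-way case 74.595 rounds to 74.60 deterministically.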
| ## Reward-Guided Trajectory Search (WebArena-Lite) |
|
|
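| WebArbiter also works as a practical reward signal for trajectory search. The table below reports success rates (%) on [WebArena-Lite](https://arxiv.org/abs/2408.06327) using Best-of-5 sampling with a Knockout Tournament mechanism; Δ is the gain in average success rate over the same policy without search. |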
|
|
| | Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ | |
| |--------|--------|:--------:|:---:|:------:|:------:|:---:|:----:|:-:| |
| | GPT-4o-mini | w/o Search | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | — | |
| | GPT-4o-mini | GPT-4o-mini (as WebPRM) | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 | |
| | GPT-4o-mini | WebShepherd-8B | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 | |
| | GPT-4o-mini | **WebArbiter-7B** | **37.78** | 42.86 | **36.84** | **46.67** | **38.46** | **40.52** | **+17.04** | |
| | GPT-4o | w/o Search | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | — | |
| | GPT-4o | GPT-4o-mini (as WebPRM) | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 | |
| | GPT-4o | WebShepherd-8B | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 | |
| | GPT-4o | **WebArbiter-7B** | **44.44** | 42.86 | **52.63** | **56.67** | **38.46** | **47.01** | **+15.11** | |
|
|
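As a rough illustration of how a pairwise judge can pick one action out of five samples, here is a minimal single-elimination sketch. It is not the authors' implementation: the pairing order, the bye for an odd candidate, and the `prefer` callback (a stand-in for prompting WebArbiter with two candidate actions and reading its `<Answer>` verdict) are all assumptions.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout(candidates: List[T], prefer: Callable[[T, T], T]) -> T:
    """Single-elimination tournament over candidates.

    `prefer(a, b)` is a hypothetical pairwise judge that returns the
    preferred candidate of the two.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Pair neighbours; with an odd pool the last candidate gets a bye.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy judge for demonstration only: prefers the shorter action string.
actions = ["click('12')", "fill('3', 'laptop')", "scroll(0, 200)", "hover('7')", "go_back()"]
winner = knockout(actions, prefer=lambda a, b: min(a, b, key=len))
print(winner)  # go_back()
```

A knockout over N candidates needs N - 1 judge calls, so Best-of-5 costs four pairwise comparisons per step under this scheme.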
| ## Quick Start |
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZYao720/WebArbiter-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Construct your prompt following the WebPRMBench format.
# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

messages = [{"role": "user", "content": user_prompt}]
# return_dict=True yields input_ids plus attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding keeps the verdict deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(response)
```
|
|
| **Example output:** |
```xml
<State>The user is on the DuckDuckGo homepage with a search box visible.
Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
<Analysis>Response 1 directly fills the search query into the textbox, which is the
most direct path to completing the search task. Response 2 clicks an irrelevant link
that does not contribute to the search goal.</Analysis>
<Answer>Response 1</Answer>
```
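To consume the verdict programmatically, the tagged sections in the example output above can be pulled apart with a few regular expressions. The sketch below assumes the output follows that four-tag layout and that the verdict reads `Response 1` or `Response 2`; the helper names `parse_judgment` and `preferred_index` are hypothetical.

```python
import re
from typing import Dict, Optional

TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text: str) -> Dict[str, Optional[str]]:
    """Extract each tagged section; missing sections map to None."""
    sections: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        sections[tag] = match.group(1).strip() if match else None
    return sections


def preferred_index(text: str) -> Optional[int]:
    """Return 1 or 2 for 'Response 1' / 'Response 2', else None."""
    answer = parse_judgment(text)["Answer"]
    match = re.search(r"Response\s*([12])", answer or "")
    return int(match.group(1)) if match else None


sample = (
    "<State>Search box visible.</State>\n"
    "<Criteria>1. Goal alignment</Criteria>\n"
    "<Analysis>Response 1 fills the query.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(preferred_index(sample))  # 1
```

Returning `None` for malformed outputs lets the caller decide whether to retry generation or discard the comparison.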
|
|
| ## Training Details |
|
|
| | | Stage 1: Reasoning Distillation | Stage 2: RLVR | |
| |---|---|---| |
| | Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards | |
| | Data | 9,642 teacher-distilled examples | 18,921 preference pairs | |
| | Teacher | o3 | — | |
| | Base Model | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Stage 1 checkpoint | |
| | Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 7e-6) | |
| | Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) | |
| | Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB | |
| | Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) | |
|
|
| **Key training insights** (from ablation studies in the paper): |
| - Explicit principles are essential — removing them notably degrades performance, especially on out-of-domain environments. |
| - Cold-start RL without reasoning distillation is unstable across environments. |
| - Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments. |
|
|
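The Stage 2 recipe above (GRPO with binary verifiable rewards) can be sketched in a few lines. This is an illustration under stated assumptions, not the training code: the reward here is 1 only when the `<Answer>` verdict matches the ground-truth preferred response, unparsable outputs score 0, and advantages are standardised within a sampled group using the population standard deviation. The paper's exact reward shaping and normalisation may differ.

```python
import re
from statistics import mean, pstdev
from typing import List


def binary_reward(completion: str, gold: int) -> float:
    """1.0 if the <Answer> verdict names the ground-truth response, else 0.0.

    Malformed outputs (no parsable verdict) also receive 0.0; that choice is
    an assumption of this sketch.
    """
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", completion)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardise rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one preference pair whose correct verdict is Response 1.
group = [
    "<Answer>Response 1</Answer>",
    "<Answer>Response 2</Answer>",
    "<Answer>Response 1</Answer>",
    "no verdict",
]
rewards = [binary_reward(c, gold=1) for c in group]
print(rewards)                    # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # positive for correct verdicts, negative otherwise
```

When every sample in a group earns the same reward, the advantages are all zero and that group contributes no gradient signal.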
| ## Intended Uses |
|
|
| WebArbiter-7B is designed to: |
| - **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task. |
| - **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution. |
| - **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis. |
|
|
| ## Limitations |
|
|
| - **Text-only observations**: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals. |
| - **English-only**: Training and evaluation are conducted exclusively in English-language web environments. |
| - **Safe-action bias**: The model may sometimes overvalue cautious actions (e.g., preferring `hover` to `click`) because the accessibility tree does not encode the effects of an interaction. |
| - **Element reference hallucination**: When a candidate action's reasoning is strongly task-aligned, the model may trust that semantic signal over low-level verification of the element ID (bid), and so fail to catch an incorrect element reference. |
|
|
| ## License |
|
|
| This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). |
|
|
| ## Related Resources |
|
|
| | Resource | Link | |
| |----------|------| |
| | WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) | |
| | WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) | |
| | WebArbiter-3B | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) | |
| | WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) | |
| | Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) | |
| | Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) | |
|
|
| ## Citation |
|
|
```bibtex
@misc{zhang2026ZYao720principleguidedreasoningprocess,
  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
  year={2026},
  eprint={2601.21872},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.21872},
}
```
|
|