	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- web-agent
	- process-reward-model
	- preference
	- reward-model
	- web-navigation
	- reasoning
	- grpo
	base_model: Qwen/Qwen2.5-7B-Instruct
	datasets:
	- ZYao720/WebArbiter-Data
	model-index:
	- name: WebArbiter-7B
	  results:
	  - task:
	      type: text-generation
	      name: Web Process Reward Modeling
	    dataset:
	      name: WebPRMBench
	      type: ZYao720/WEBPRMBENCH
	    metrics:
	    - name: Avg Pairwise Accuracy
	      type: accuracy
	      value: 89.19
	    - name: Avg BoN Accuracy
	      type: accuracy
	      value: 74.60
	---

	<div align="center">

	# WebArbiter-7B

	A principle-guided reasoning Process Reward Model for web agents

	Published at ICLR 2026

	[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

	</div>

	## Introduction

	WebArbiter-7B is a 7B reasoning Process Reward Model (PRM) for web agents, built on [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct). Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.

	On [WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH), WebArbiter-7B achieves an average Best-of-N (BoN) accuracy of 74.60%, outperforming GPT-5 (65.50%) by 9.1 points and the previous state-of-the-art WebPRM, WebShepherd-8B (43.28%), by 31.3 points. In reward-guided trajectory search on WebArena-Lite, it surpasses WebShepherd-8B by up to 6.4 points in success rate.

	## Highlights

	- Reasoning as reward: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains, instead of scalar scores or brittle checklists.
	- Principle-inducing evaluation: Dynamically derives evaluation principles from user intent and page state, enabling robust assessment that generalizes across environments.
	- Two-stage training: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO) to correct teacher biases and align verdicts with ground-truth correctness.
	- Robust generalization: SOTA performance across all four WebPRMBench environments, including out-of-domain enterprise workflows (WorkArena) and open-world websites (AssistantBench).

	## Results on WebPRMBench

	Models marked with ⋆ are ours. Bold = best overall. Pair = pairwise accuracy (%); BoN = Best-of-N accuracy (%).

	| Model | Mind2Web Pair | Mind2Web BoN | WebArena Pair | WebArena BoN | AssistantBench Pair | AssistantBench BoN | WorkArena Pair | WorkArena BoN | Avg. Pair | Avg. BoN |
	|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| Proprietary LLM-as-judge | | | | | | | | | | |
	| GPT-4o-mini | 81.74 | 50.92 | 78.23 | 56.72 | **89.17** | **73.33** | 81.43 | 46.70 | 82.64 | 56.92 |
	| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | **84.33** | 55.19 | 83.68 | 60.29 |
	| GPT-5 | 80.86 | 62.39 | 84.83 | **71.64** | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
	| Claude-3.7-Sonnet | 80.20 | 57.90 | 82.80 | 64.10 | 81.50 | 61.30 | 82.10 | 60.60 | 81.65 | 60.98 |
	| Gemini-2.5-Flash | 81.30 | 57.01 | 82.71 | 62.19 | 80.00 | 63.33 | 83.30 | 56.13 | 81.83 | 59.67 |
	| DeepSeek-R1 | 81.62 | 57.37 | 82.04 | 60.21 | 78.49 | 56.18 | 84.12 | 63.89 | 81.57 | 59.41 |
	| Open-source LLM-as-judge | | | | | | | | | | |
	| Qwen2.5-7B-Instruct | 77.79 | 39.18 | 74.88 | 42.79 | 84.17 | 53.33 | 77.58 | 35.85 | 77.61 | 42.78 |
	| Llama-3-70B-Instruct | 80.55 | 49.36 | 77.36 | 50.75 | 85.83 | 70.00 | 79.08 | 40.09 | 80.71 | 52.55 |
	| WebPRMs | | | | | | | | | | |
	| WebShepherd-8B | 86.66 | 73.69 | 68.33 | 43.88 | 55.92 | 30.00 | 54.56 | 25.53 | 64.34 | 43.28 |
	| ⋆ WebArbiter-7B | **97.07** | **89.53** | **88.43** | 68.66 | **89.17** | 70.00 | 82.09 | **70.19** | **89.19** | **74.60** |
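
The gap between the Pair and BoN columns is easier to read with the two metrics side by side. The sketch below is a minimal illustration and not the benchmark's official scorer. It assumes each benchmark instance compares one chosen action against several rejected ones, that Pair counts every comparison separately, and that BoN counts an instance as correct only when the chosen action wins every comparison. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def pairwise_accuracy(instances: Sequence[Sequence[bool]]) -> float:
    """Percentage of individual chosen-vs-rejected comparisons judged correctly."""
    flat = [ok for instance in instances for ok in instance]
    return 100.0 * sum(flat) / len(flat)


def bon_accuracy(instances: Sequence[Sequence[bool]]) -> float:
    """Percentage of instances where the chosen action wins ALL of its comparisons."""
    return 100.0 * sum(all(instance) for instance in instances) / len(instances)


# Toy data (assumption): 3 instances, each with 4 chosen-vs-rejected comparisons.
# True means the judge preferred the chosen action in that comparison.
toy = [
    [True, True, True, True],
    [True, False, True, True],
    [True, True, True, False],
]

print(f"Pair: {pairwise_accuracy(toy):.2f}")  # 10 of 12 comparisons correct
print(f"BoN:  {bon_accuracy(toy):.2f}")       # 1 of 3 instances fully correct
```

Under these assumptions a single wrong comparison costs one unit of Pair accuracy but forfeits the whole instance for BoN, which is consistent with BoN being the lower number in every row of the table.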

	## Reward-Guided Trajectory Search (WebArena-Lite)

	WebArbiter also excels as a practical reward signal for trajectory search. Using Best-of-5 sampling with a Knockout Tournament mechanism on [WebArena-Lite](https://arxiv.org/abs/2408.06327):

	| Policy | WebPRM | Shopping | CMS | Reddit | GitLab | MAP | Avg. | Δ |
	|--------|--------|:--------:|:---:|:------:|:------:|:---:|:----:|:-:|
	| GPT-4o-mini | w/o Search | 21.74 | 22.86 | 19.05 | 34.38 | 19.35 | 23.48 | - |
	| GPT-4o-mini | GPT-4o-mini (as WebPRM) | 24.44 | 22.86 | 26.32 | 33.33 | 15.38 | 24.47 | +0.99 |
	| GPT-4o-mini | WebShepherd-8B | 26.09 | 45.71 | 23.81 | 40.62 | 35.48 | 34.34 | +10.86 |
	| GPT-4o-mini | WebArbiter-7B | 37.78 | 42.86 | 36.84 | 46.67 | 38.46 | 40.52 | +17.04 |
	| GPT-4o | w/o Search | 23.91 | 31.43 | 28.57 | 56.25 | 19.35 | 31.90 | - |
	| GPT-4o | GPT-4o-mini (as WebPRM) | 26.67 | 37.14 | 42.11 | 40.00 | 19.23 | 33.03 | +1.13 |
	| GPT-4o | WebShepherd-8B | 30.43 | 42.86 | 47.62 | 46.88 | 35.48 | 40.65 | +8.75 |
	| GPT-4o | WebArbiter-7B | 44.44 | 42.86 | 52.63 | 56.67 | 38.46 | 47.01 | +15.11 |
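
Because WebArbiter returns a pairwise verdict rather than a scalar score, picking one of N sampled actions needs some comparison scheme. The card names a Knockout Tournament but does not describe the bracket, so the following is only a sketch under assumptions: single elimination over adjacent pairs, with a bye for the odd candidate out. `knockout_select` and `prefer_first` are hypothetical names; in practice `prefer_first` would wrap a WebArbiter call and return whether its verdict was the first response.

```python
from typing import Callable, List


def knockout_select(
    candidates: List[str],
    prefer_first: Callable[[str, str], bool],
) -> str:
    """Pick one candidate by single elimination using a pairwise judge."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare adjacent pairs; the winner of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            first, second = pool[i], pool[i + 1]
            next_round.append(first if prefer_first(first, second) else second)
        # With an odd pool size, the last candidate advances without a match.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy stand-in for the judge (assumption): a hidden quality score per candidate.
toy_quality = {"a": 0.1, "b": 0.4, "c": 0.9, "d": 0.3, "e": 0.5}
best = knockout_select(
    list(toy_quality),
    lambda x, y: toy_quality[x] >= toy_quality[y],
)
print(best)  # c
```

With five candidates this bracket needs four judge calls, compared with ten for a full round-robin, which matters when every comparison is a 7B generation of up to 2048 tokens.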

	## Quick Start

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "ZYao720/WebArbiter-7B"

	# Load the tokenizer and the bf16 weights; device_map="auto" spreads layers over available devices.
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	# Construct your prompt following the WebPRMBench format.
	# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
	user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses

	# Wrap the prompt in the chat template and append the assistant generation prefix.
	messages = [{"role": "user", "content": user_prompt}]
	input_ids = tokenizer.apply_chat_template(
	    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
	).to(model.device)

	# Greedy decoding keeps the verdict deterministic.
	with torch.no_grad():
	    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)

	# Drop the prompt tokens and decode only the newly generated judgment.
	response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
	print(response)
	```

	Example output:
	```xml
	<State>The user is on the DuckDuckGo homepage with a search box visible.
	Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
	<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
	2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
	3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
	<Analysis>Response 1 directly fills the search query into the textbox, which is the
	most direct path to completing the search task. Response 2 clicks an irrelevant link
	that does not contribute to the search goal.</Analysis>
	<Answer>Response 1</Answer>
	```
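
To use the judgment programmatically, the tagged sections need to be pulled out of the generated text. The sketch below assumes the response follows the four-tag layout shown in the example output and that the verdict reads `Response 1` or `Response 2`. The helpers `parse_judgment` and `extract_verdict` are hypothetical and not part of the released code. A regular expression is used instead of an XML parser because the free-text sections are not guaranteed to be well-formed XML.

```python
import re
from typing import Dict, Optional

SECTION_TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(response: str) -> Dict[str, Optional[str]]:
    """Return the text inside each expected tag, or None if the tag is missing."""
    parsed: Dict[str, Optional[str]] = {}
    for tag in SECTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


def extract_verdict(response: str) -> Optional[int]:
    """Return 1 or 2 for a well-formed verdict, or None if it cannot be parsed."""
    answer = parse_judgment(response)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


sample = (
    "<State>The user is on a search page.</State>\n"
    "<Criteria>1. Goal alignment (weight 0.6)</Criteria>\n"
    "<Analysis>Response 1 fills the query directly.</Analysis>\n"
    "<Answer>Response 1</Answer>"
)
print(extract_verdict(sample))  # 1
```

Returning `None` for a malformed or truncated generation lets the caller decide whether to retry or to count the comparison as a failure, rather than silently defaulting to one side.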

	## Training Details

	| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
	|---|---|---|
	| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
	| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
	| Teacher | o3 | - |
	| Base Model | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Stage 1 checkpoint |
	| Fine-tuning | LoRA (rank 128, lr 8e-4) | FSDP + LoRA (lr 7e-6) |
	| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
	| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
	| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |

	Key training insights (from ablation studies in the paper):
	- Explicit principles are essential — removing them notably degrades performance, especially on out-of-domain environments.
	- Cold-start RL without reasoning distillation is unstable across environments.
	- Reasoning distillation provides stable discrimination, while RL acts as an amplifier that widens the margin between correct and incorrect judgments.
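
The Stage 2 signal described above, GRPO with binary verifiable rewards, can be illustrated in a few lines. This is a sketch under assumptions, not the training code: it assumes the reward is 1 when the parsed `<Answer>` verdict matches the ground-truth preference and 0 otherwise, and it uses the standard GRPO group-normalized advantage, (reward minus group mean) divided by group standard deviation. `binary_reward` and `group_advantages` are hypothetical names, and the population standard deviation is chosen here for simplicity; implementations differ on that detail.

```python
import re
from statistics import mean, pstdev
from typing import List


def binary_reward(response: str, gold: int) -> float:
    """1.0 if the response's verdict matches the gold preference, else 0.0."""
    match = re.search(r"<Answer>\s*Response\s*([12])\s*</Answer>", response)
    return 1.0 if match and int(match.group(1)) == gold else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group (assumption): four sampled judgments for one pair whose gold answer is 1.
rollouts = [
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis><Answer>Response 2</Answer>",
    "<Analysis>...</Analysis><Answer>Response 1</Answer>",
    "<Analysis>...</Analysis>",  # no verdict emitted, so the reward is 0
]
rewards = [binary_reward(r, gold=1) for r in rollouts]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print([round(a, 2) for a in group_advantages(rewards)])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, so only prompts on which the policy is still inconsistent drive learning.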

	## Intended Uses

	WebArbiter-7B is designed to:
	- Evaluate web agent actions: Given a web state and two candidate actions, determine which better advances the user's task.
	- Guide trajectory search: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
	- Provide interpretable feedback: Generate structured justifications explaining why one action is preferred, useful for debugging and analysis.

	## Limitations

	- Text-only observations: WebArbiter relies on accessibility tree representations without visual observations. In environments where layout, spatial arrangement, or visual cues carry task-relevant information, this text-only formulation may miss critical signals.
	- English-only: Training and evaluation are conducted exclusively in English-language web environments.
	- Safe-action bias: The model may sometimes overvalue cautious actions (e.g., preferring `hover` to `click`) because the accessibility tree does not encode interaction effects.
	- Element reference hallucination: When a candidate action's reasoning is strongly task-aligned, the model may trust the semantic signal over low-level bid verification, potentially missing incorrect element references.
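
One way to compensate for the element-reference limitation is a cheap rule-based check run alongside the model's verdict. The sketch below is an illustration only and rests on format assumptions that the card does not specify: AXTree lines carry element ids in square brackets (as in `[1] textbox 'Search'` from the example output), and actions pass the target id as their first quoted argument, for instance `click('2')`. All function names are hypothetical.

```python
import re
from typing import Set


def axtree_bids(axtree: str) -> Set[str]:
    """Collect element ids that appear as [id] in the accessibility tree text."""
    return set(re.findall(r"\[(\w+)\]", axtree))


def referenced_bids(action: str) -> Set[str]:
    """Collect the first quoted argument of each call, assumed to be the target id."""
    return set(re.findall(r"\w+\(\s*['\"](\w+)['\"]", action))


def has_valid_reference(action: str, axtree: str) -> bool:
    """True if every id the action references exists in the accessibility tree."""
    return referenced_bids(action) <= axtree_bids(axtree)


axtree = "[1] textbox 'Search'\n[2] button 'Search'"
print(has_valid_reference("fill('1', 'weather in Munich')", axtree))  # True
print(has_valid_reference("click('7')", axtree))                      # False
```

A candidate that fails this check could be dropped before the pairwise comparison, so that a persuasive rationale cannot carry an action that targets a nonexistent element.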

	## License

	This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).

	## Related Resources

	| Resource | Link |
	|----------|------|
	| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
	| WebArbiter-4B-Qwen3 | [ZYao720/WebArbiter-4B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-4B-Qwen3) |
	| WebArbiter-3B | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
	| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
	| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
	| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |

	## Citation

	```bibtex
	@misc{zhang2026ZYao720principleguidedreasoningprocess,
	  title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
	  author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
	  year={2026},
	  eprint={2601.21872},
	  archivePrefix={arXiv},
	  primaryClass={cs.AI},
	  url={https://arxiv.org/abs/2601.21872},
	}
	```