xhluca committed 4 days ago

Commit 66b7a63 (verified) · 1 parent: c3c9e9a

Add A3-Qwen3.5-9B results (WebArena, VisualWebArena, WorkArena-L1, MiniWoB)

## A3-Qwen3.5-9B

A 9B-parameter vision-language model, fine-tuned on WebSynth trajectories collected with the Agent-as-Annotators (A3) pipeline.

### Results

| Benchmark | Score (%) | Std Err (%) |
|---|---|---|
| WebArena | 42.1 | 1.7 |
| VisualWebArena | 33.7 | 1.6 |
| WorkArena-L1 | 51.5 | 2.8 |
| MiniWoB | 69.0 | 1.9 |

- **Base model:** Qwen3.5-9B
- **Agent:** GenericAgent from AgentLab
- **Code:** https://github.com/McGill-NLP/llm-annotators
- **Paper:** COLM 2026 (forthcoming)
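
As a sanity check on the table above, the standard errors can be compared against a plain binomial estimate. The sketch below is an assumption about how such values might be derived, not the leaderboard's documented procedure: it treats each task as an independent pass/fail trial and uses the task counts stated in the `comments` field of each result file in this commit. `binomial_std_err` is a hypothetical helper name.

```python
import math


def binomial_std_err(score_pct: float, n_tasks: int) -> float:
    """Binomial standard error, in percentage points, of a success rate.

    Assumes independent pass/fail tasks (an assumption, not taken from the commit).
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


# (score in %, number of tasks), as reported in this commit's result files.
reported = {
    "WebArena": (42.1, 812),
    "VisualWebArena": (33.7, 910),
    "WorkArena-L1": (51.5, 330),
    "MiniWoB": (69.0, 625),
}

for name, (score, n_tasks) in reported.items():
    print(f"{name}: {binomial_std_err(score, n_tasks):.2f}")
```

Under this assumption, each estimate lands within 0.1 percentage points of the value reported in the table.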

Files changed (5)

results/A3-Qwen3.5-9B/README.md +66 -0
results/A3-Qwen3.5-9B/miniwob.json +16 -0
results/A3-Qwen3.5-9B/visualwebarena.json +16 -0
results/A3-Qwen3.5-9B/webarena.json +16 -0
results/A3-Qwen3.5-9B/workarena-l1.json +16 -0

results/A3-Qwen3.5-9B/README.md ADDED

	@@ -0,0 +1,66 @@

+### A3-Qwen3.5-9B
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab), running a model fine-tuned using the Agent-as-Annotators (A3) pipeline.
+- **Model Name:** A3-Qwen3.5-9B
+- **Base Model:** Qwen/Qwen3.5-9B
+- **Model Architecture:**
+  - Type: Vision-Language Model (VLM)
+  - Architecture: Causal LM with vision encoder
+  - Number of Parameters: 9B
+- **Input/Output Format:**
+  - Input: Accessibility tree + Set-of-Mark (SoM) screenshot
+  - Output: Text action in BrowserGym format
+  - Flags:
+    ```python
+    GenericPromptFlags(
+        obs=ObsFlags(
+            use_html=False,
+            use_ax_tree=True,
+            use_tabs=True,
+            use_focused_element=True,
+            use_error_logs=True,
+            use_history=True,
+            use_past_error_logs=False,
+            use_action_history=True,
+            use_think_history=False,
+            use_diff=False,
+            html_type='pruned_html',
+            use_screenshot=True,
+            use_som=True,
+            extract_visible_tag=True,
+            extract_clickable_tag=True,
+            extract_coords='False',
+            filter_visible_elements_only=False,
+        ),
+        action=ActionFlags(
+            action_set=HighLevelActionSetArgs(
+                subsets=('webarena',),
+                multiaction=False,
+                strict=False,
+                retry_with_force=True,
+                demo_mode='off',
+            ),
+            long_description=False,
+            individual_examples=False,
+        ),
+        use_plan=False,
+        use_criticise=False,
+        use_thinking=True,
+        use_memory=False,
+        use_concrete_example=True,
+        use_abstract_example=True,
+        use_hints=True,
+        enable_chat=False,
+        max_prompt_tokens=57344,
+        be_cautious=True,
+        extra_instructions=None,
+    )
+    ```
+- **Training Details:**
+  - Dataset: WebSynth trajectories collected via the A3 pipeline (agent-generated annotations on real websites)
+  - Fine-tuning method: Supervised Fine-Tuning (SFT) with FSDP
+  - Temperature at inference: 0.6
+- **Paper Link:** (forthcoming; COLM 2026 submission)
+- **Code Repository:** https://github.com/McGill-NLP/llm-annotators
+- **License:** Apache-2.0

results/A3-Qwen3.5-9B/miniwob.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+        "agent_name": "A3-Qwen3.5-9B",
+        "study_id": "2026-03-16_16-50-10_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-miniwob",
+        "date_time": "2026-03-16 16:50:10",
+        "benchmark": "MiniWoB",
+        "score": 69.0,
+        "std_err": 1.9,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "625 tasks. Model not trained on MiniWoB data.",
+        "original_or_reproduced": "Original"
+    }
+]

results/A3-Qwen3.5-9B/visualwebarena.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+        "agent_name": "A3-Qwen3.5-9B",
+        "study_id": "2026-03-20_23-46-04_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-visualwebarena-test-test",
+        "date_time": "2026-03-20 23:46:04",
+        "benchmark": "VisualWebArena",
+        "score": 33.7,
+        "std_err": 1.6,
+        "benchmark_specific": "Yes",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Combined visualwebarena-train (461 tasks) and visualwebarena-test (449 tasks) for a total of 910 tasks. Model fine-tuned on WebSynth trajectories.",
+        "original_or_reproduced": "Original"
+    }
+]

results/A3-Qwen3.5-9B/webarena.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+        "agent_name": "A3-Qwen3.5-9B",
+        "study_id": "2026-03-10_15-17-39_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-webarena-test-test",
+        "date_time": "2026-03-10 15:17:39",
+        "benchmark": "WebArena",
+        "score": 42.1,
+        "std_err": 1.7,
+        "benchmark_specific": "Yes",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "Combined webarena-train (431 tasks) and webarena-test (381 tasks) for a total of 812 tasks. Model fine-tuned on WebSynth trajectories.",
+        "original_or_reproduced": "Original"
+    }
+]

results/A3-Qwen3.5-9B/workarena-l1.json ADDED

	@@ -0,0 +1,16 @@

+[
+    {
+        "agent_name": "A3-Qwen3.5-9B",
+        "study_id": "2026-03-12_18-27-40_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-workarena-l1-full",
+        "date_time": "2026-03-12 18:27:40",
+        "benchmark": "WorkArena-L1",
+        "score": 51.5,
+        "std_err": 2.8,
+        "benchmark_specific": "No",
+        "benchmark_tuned": "No",
+        "followed_evaluation_protocol": "Yes",
+        "reproducible": "Yes",
+        "comments": "330 tasks. Model not trained on ServiceNow data.",
+        "original_or_reproduced": "Original"
+    }
+]
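
The four result files above share the same set of fields. A minimal sketch of a pre-submission check is shown below. It assumes only what is visible in this commit: the key names, the "Yes"/"No" convention for the flag fields, and numeric `score` and `std_err`. The leaderboard's own validation rules are not shown here, so `REQUIRED_KEYS`, `YES_NO_FIELDS`, and `check_entry` are hypothetical names used for illustration.

```python
import json

# Keys observed in the result files of this commit (assumed to be required).
REQUIRED_KEYS = {
    "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced",
}
# Fields that hold "Yes" or "No" in every file of this commit.
YES_NO_FIELDS = (
    "benchmark_specific", "benchmark_tuned",
    "followed_evaluation_protocol", "reproducible",
)


def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for field in YES_NO_FIELDS:
        if field in entry and entry[field] not in ("Yes", "No"):
            problems.append(f"{field} should be 'Yes' or 'No', got {entry[field]!r}")
    for field in ("score", "std_err"):
        value = entry.get(field)
        if field in entry and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{field} should be a number, got {value!r}")
    return problems


# The WorkArena-L1 entry from this commit, inlined so the sketch is self-contained.
example = json.loads("""
[
    {
        "agent_name": "A3-Qwen3.5-9B",
        "study_id": "2026-03-12_18-27-40_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-workarena-l1-full",
        "date_time": "2026-03-12 18:27:40",
        "benchmark": "WorkArena-L1",
        "score": 51.5,
        "std_err": 2.8,
        "benchmark_specific": "No",
        "benchmark_tuned": "No",
        "followed_evaluation_protocol": "Yes",
        "reproducible": "Yes",
        "comments": "330 tasks. Model not trained on ServiceNow data.",
        "original_or_reproduced": "Original"
    }
]
""")

for entry in example:
    print(entry["benchmark"], check_entry(entry))
```

In practice the inlined string would be replaced by reading each file under `results/A3-Qwen3.5-9B/` with `json.load`.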