Commit 66b7a63 by xhluca · verified · 1 parent: c3c9e9a

Add A3-Qwen3.5-9B results (WebArena, VisualWebArena, WorkArena-L1, MiniWoB)

## A3-Qwen3.5-9B

A 9B vision-language model fine-tuned using the Agent-as-Annotators (A3) pipeline on WebSynth trajectories.

### Results

| Benchmark | Score | Std Err |
|---|---|---|
| WebArena | 42.1 | 1.7 |
| VisualWebArena | 33.7 | 1.6 |
| WorkArena-L1 | 51.5 | 2.8 |
| MiniWoB | 69.0 | 1.9 |

- **Base model:** Qwen3.5-9B
- **Agent:** GenericAgent from AgentLab
- **Code:** https://github.com/McGill-NLP/llm-annotators
- **Paper:** COLM 2026 (forthcoming)

results/A3-Qwen3.5-9B/README.md ADDED
@@ -0,0 +1,66 @@
+ ### A3-Qwen3.5-9B
+
+ This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab), fine-tuned using the Agent-as-Annotators (A3) pipeline.
+
+ - **Model Name:** A3-Qwen3.5-9B
+ - **Base Model:** Qwen/Qwen3.5-9B
+ - **Model Architecture:**
+   - Type: Vision-Language Model (VLM)
+   - Architecture: Causal LM with vision encoder
+   - Number of Parameters: 9B
+ - **Input/Output Format:**
+   - Input: Accessibility tree + Set-of-Mark (SoM) screenshot
+   - Output: Text action in BrowserGym format
+   - Flags:
+     ```python
+     GenericPromptFlags(
+         obs=ObsFlags(
+             use_html=False,
+             use_ax_tree=True,
+             use_tabs=True,
+             use_focused_element=True,
+             use_error_logs=True,
+             use_history=True,
+             use_past_error_logs=False,
+             use_action_history=True,
+             use_think_history=False,
+             use_diff=False,
+             html_type='pruned_html',
+             use_screenshot=True,
+             use_som=True,
+             extract_visible_tag=True,
+             extract_clickable_tag=True,
+             extract_coords='False',
+             filter_visible_elements_only=False,
+         ),
+         action=ActionFlags(
+             action_set=HighLevelActionSetArgs(
+                 subsets=('webarena',),
+                 multiaction=False,
+                 strict=False,
+                 retry_with_force=True,
+                 demo_mode='off',
+             ),
+             long_description=False,
+             individual_examples=False,
+         ),
+         use_plan=False,
+         use_criticise=False,
+         use_thinking=True,
+         use_memory=False,
+         use_concrete_example=True,
+         use_abstract_example=True,
+         use_hints=True,
+         enable_chat=False,
+         max_prompt_tokens=57344,
+         be_cautious=True,
+         extra_instructions=None,
+     )
+     ```
+ - **Training Details:**
+   - Dataset: WebSynth trajectories collected via the A3 pipeline (agent-generated annotations on real websites)
+   - Fine-tuning method: Supervised Fine-Tuning (SFT) with FSDP
+   - Temperature at inference: 0.6
+ - **Paper Link:** (forthcoming, COLM 2026 submission)
+ - **Code Repository:** https://github.com/McGill-NLP/llm-annotators
+ - **License:** Apache-2.0
results/A3-Qwen3.5-9B/miniwob.json ADDED
@@ -0,0 +1,16 @@
+ [
+     {
+         "agent_name": "A3-Qwen3.5-9B",
+         "study_id": "2026-03-16_16-50-10_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-miniwob",
+         "date_time": "2026-03-16 16:50:10",
+         "benchmark": "MiniWoB",
+         "score": 69.0,
+         "std_err": 1.9,
+         "benchmark_specific": "No",
+         "benchmark_tuned": "No",
+         "followed_evaluation_protocol": "Yes",
+         "reproducible": "Yes",
+         "comments": "625 tasks. Model not trained on MiniWoB data.",
+         "original_or_reproduced": "Original"
+     }
+ ]
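Each result file added in this commit is a JSON list holding one entry with the same twelve fields. A minimal sketch of a sanity check for such an entry follows. The helper `check_entry` is hypothetical, and treating every field as required (and the "Yes"/"No" fields as strictly two-valued) is an assumption drawn from the files in this commit, not from a documented schema. It also checks that the timestamp prefix of `study_id` agrees with `date_time`, which holds for the entries shown here.

```python
import json
from datetime import datetime

# Field names copied from the miniwob.json entry; treating all of them as
# required is an assumption, not a documented leaderboard schema.
REQUIRED_KEYS = {
    "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced",
}
YES_NO_KEYS = (
    "benchmark_specific", "benchmark_tuned",
    "followed_evaluation_protocol", "reproducible",
)


def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    for key in YES_NO_KEYS:
        if entry[key] not in ("Yes", "No"):
            problems.append(f"{key} should be 'Yes' or 'No', got {entry[key]!r}")
    if not 0.0 <= entry["score"] <= 100.0:
        problems.append("score should be a percentage in [0, 100]")
    if entry["std_err"] < 0.0:
        problems.append("std_err should be non-negative")

    # The study_id in these files starts with a YYYY-MM-DD_HH-MM-SS stamp.
    try:
        from_id = datetime.strptime(entry["study_id"][:19], "%Y-%m-%d_%H-%M-%S")
        from_field = datetime.strptime(entry["date_time"], "%Y-%m-%d %H:%M:%S")
        if from_id != from_field:
            problems.append("study_id timestamp does not match date_time")
    except ValueError:
        problems.append("could not parse study_id prefix or date_time")
    return problems


MINIWOB_JSON = """
[
    {
        "agent_name": "A3-Qwen3.5-9B",
        "study_id": "2026-03-16_16-50-10_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-miniwob",
        "date_time": "2026-03-16 16:50:10",
        "benchmark": "MiniWoB",
        "score": 69.0,
        "std_err": 1.9,
        "benchmark_specific": "No",
        "benchmark_tuned": "No",
        "followed_evaluation_protocol": "Yes",
        "reproducible": "Yes",
        "comments": "625 tasks. Model not trained on MiniWoB data.",
        "original_or_reproduced": "Original"
    }
]
"""

entries = json.loads(MINIWOB_JSON)
for entry in entries:
    print(entry["benchmark"], check_entry(entry) or "ok")
```

The same function could be pointed at the other three files by replacing the inline string with `json.load` on each path.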
results/A3-Qwen3.5-9B/visualwebarena.json ADDED
@@ -0,0 +1,16 @@
+ [
+     {
+         "agent_name": "A3-Qwen3.5-9B",
+         "study_id": "2026-03-20_23-46-04_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-visualwebarena-test-test",
+         "date_time": "2026-03-20 23:46:04",
+         "benchmark": "VisualWebArena",
+         "score": 33.7,
+         "std_err": 1.6,
+         "benchmark_specific": "Yes",
+         "benchmark_tuned": "No",
+         "followed_evaluation_protocol": "Yes",
+         "reproducible": "Yes",
+         "comments": "Combined visualwebarena-train (461 tasks) and visualwebarena-test (449 tasks) for a total of 910 tasks. Model fine-tuned on WebSynth trajectories.",
+         "original_or_reproduced": "Original"
+     }
+ ]
results/A3-Qwen3.5-9B/webarena.json ADDED
@@ -0,0 +1,16 @@
+ [
+     {
+         "agent_name": "A3-Qwen3.5-9B",
+         "study_id": "2026-03-10_15-17-39_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-webarena-test-test",
+         "date_time": "2026-03-10 15:17:39",
+         "benchmark": "WebArena",
+         "score": 42.1,
+         "std_err": 1.7,
+         "benchmark_specific": "Yes",
+         "benchmark_tuned": "No",
+         "followed_evaluation_protocol": "Yes",
+         "reproducible": "Yes",
+         "comments": "Combined webarena-train (431 tasks) and webarena-test (381 tasks) for a total of 812 tasks. Model fine-tuned on WebSynth trajectories.",
+         "original_or_reproduced": "Original"
+     }
+ ]
results/A3-Qwen3.5-9B/workarena-l1.json ADDED
@@ -0,0 +1,16 @@
+ [
+     {
+         "agent_name": "A3-Qwen3.5-9B",
+         "study_id": "2026-03-12_18-27-40_genericagent-checkpoints-qwen-qwen3-5-9b-web-pro-low-8903051-checkpoint-latest-on-workarena-l1-full",
+         "date_time": "2026-03-12 18:27:40",
+         "benchmark": "WorkArena-L1",
+         "score": 51.5,
+         "std_err": 2.8,
+         "benchmark_specific": "No",
+         "benchmark_tuned": "No",
+         "followed_evaluation_protocol": "Yes",
+         "reproducible": "Yes",
+         "comments": "330 tasks. Model not trained on ServiceNow data.",
+         "original_or_reproduced": "Original"
+     }
+ ]
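The commit does not say how the `std_err` values were computed. One plausible reading, sketched below as an assumption rather than the authors' method, is the standard error of a binomial proportion, sqrt(p(1 - p) / n), with p the success rate and n the task count given in each file's `comments` field. The helper name `binomial_std_err` is hypothetical, and the sketch treats every task run as an independent pass/fail trial, which the files neither confirm nor rule out.

```python
import math

# (benchmark, reported score %, reported std err %, task count from "comments")
REPORTED = [
    ("WebArena", 42.1, 1.7, 431 + 381),        # train + test = 812 tasks
    ("VisualWebArena", 33.7, 1.6, 461 + 449),  # train + test = 910 tasks
    ("WorkArena-L1", 51.5, 2.8, 330),
    ("MiniWoB", 69.0, 1.9, 625),
]


def binomial_std_err(score_pct: float, n_tasks: int) -> float:
    """Standard error (in percentage points) of a success rate over n_tasks
    independent pass/fail trials. This formula is an assumption about how
    the reported values were obtained."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_tasks)


for name, score, reported_se, n in REPORTED:
    estimate = binomial_std_err(score, n)
    print(f"{name:15s} n={n:4d} estimated={estimate:.2f} reported={reported_se}")
```

Under this assumption each estimate lands within 0.1 percentage points of the reported value. The remaining gaps may come from rounding of the published scores or from a sample-variance (n - 1) denominator, so the match is suggestive rather than conclusive.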