Add A3-Qwen3.5-9B WorkArena-L2 results (10.6%) and update all comments

#14
by xhluca - opened

Adds the WorkArena++ L2 evaluation for A3-Qwen3.5-9B (341 tasks, full benchmark; 10.6% success) and updates all benchmark comments.

Changes:

  • New: workarena-l2.json (10.6%, 341 tasks)
  • Updated comments: "WebSynth" -> "A3-Synth", added Agent-as-Annotators framework reference
  • Updated README to match project style (https://github.com/McGill-NLP/agent-as-annotators)
  • Fixed benchmark_specific field for WA/VWA
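
As a quick sanity check on the reported number, the sketch below lists which whole-number success counts would round to 10.6% on 341 tasks. It assumes a single run with a binary pass/fail outcome per task and a rate rounded to one decimal place; neither is stated in this PR, and the function name is hypothetical rather than part of the repository.

```python
def consistent_success_counts(rate_pct: float, n_tasks: int, ndigits: int = 1) -> list[int]:
    """Return every integer success count whose rounded percentage equals rate_pct.

    Assumption (not from the PR): the reported rate is successes / n_tasks * 100,
    rounded to `ndigits` decimal places, from a single evaluation run.
    """
    return [
        k
        for k in range(n_tasks + 1)
        if round(100 * k / n_tasks, ndigits) == round(rate_pct, ndigits)
    ]


if __name__ == "__main__":
    # Under the assumptions above, 10.6% on 341 tasks matches exactly one count.
    print(consistent_success_counts(10.6, 341))
```

If the score were instead averaged over several seeds, the result would not need to map to a single integer count, so this is only a plausibility check.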
ServiceNow org

LGTM!

jaiswala changed pull request status to merged
