Image-Text-to-Text
Safetensors
gui-grounding
lora
qwen3.5
screenspot
mdabis's picture
Upload README.md with huggingface_hub
af7cd69 verified
metadata
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
  - gui-grounding
  - lora
  - qwen3.5
  - screenspot
datasets:
  - showlab/ShowUI-desktop
  - zonghanHZH/UGround-V1-8k
  - zonghanHZH/AMEX-8k
pipeline_tag: image-text-to-text

Qwen3.5-4B GUI Grounding — v2 (SFT LoRA)

LoRA adapter for Qwen3.5-4B fine-tuned on GUI grounding: given a screenshot and a natural language instruction, predict the (x, y) click coordinate of the target UI element.

Results — ScreenSpot-V2

Split Correct Total Accuracy
Desktop 320 334 95.8%
Mobile 474 501 94.6%
Web 394 437 90.2%
Overall 1188 1272 93.4%

Training Data

~23.5K samples from 3 GUI grounding datasets covering desktop, web, and mobile platforms.

Output Format

<|box_start|>(x,y)<|box_end|>

Coordinates are in [0, 1000] normalized space. To convert to pixel coordinates:

pixel_x = x / 1000 * image_width
pixel_y = y / 1000 * image_height

Usage

Requires transformers>=5.2.0 and peft.

from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
import torch

base = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "dabism23/qwen35-gui-grounding_v2")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")

Version History

Version ScreenSpot-V2
v1 92.5%
v2 93.4%

Access

Model weights are gated. Request access to download. Training configuration details are included with the model files.