Image-Text-to-Text
Safetensors
gui-grounding
lora
qwen3.5
screenspot
mdabis's picture
Upload README.md with huggingface_hub
af7cd69 verified
---
license: apache-2.0
base_model: Qwen/Qwen3.5-4B
tags:
- gui-grounding
- lora
- qwen3.5
- screenspot
datasets:
- showlab/ShowUI-desktop
- zonghanHZH/UGround-V1-8k
- zonghanHZH/AMEX-8k
pipeline_tag: image-text-to-text
---
# Qwen3.5-4B GUI Grounding — v2 (SFT LoRA)
LoRA adapter for **Qwen3.5-4B** fine-tuned on GUI grounding: given a screenshot and a natural language instruction, predict the (x, y) click coordinate of the target UI element.
## Results — ScreenSpot-V2
| Split | Correct | Total | Accuracy |
|-------|---------|-------|----------|
| Desktop | 320 | 334 | **95.8%** |
| Mobile | 474 | 501 | **94.6%** |
| Web | 394 | 437 | **90.2%** |
| **Overall** | **1188** | **1272** | **93.4%** |
## Training Data
~23.5K samples from 3 GUI grounding datasets covering desktop, web, and mobile platforms.
## Output Format
```
<|box_start|>(x,y)<|box_end|>
```
Coordinates are in [0, 1000] normalized space. To convert to pixel coordinates:
```python
pixel_x = x / 1000 * image_width
pixel_y = y / 1000 * image_height
```
## Usage
Requires `transformers>=5.2.0` and `peft`.
```python
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
import torch
base = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "dabism23/qwen35-gui-grounding_v2")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")
```
## Version History
| Version | ScreenSpot-V2 |
|---------|---------------|
| [v1](https://huggingface.co/dabism23/qwen35-gui-grounding) | 92.5% |
| **v2** | **93.4%** |
## Access
Model weights are gated. Request access to download. Training configuration details are included with the model files.