| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3.5-4B |
| tags: |
| - gui-grounding |
| - lora |
| - qwen3.5 |
| - screenspot |
| datasets: |
| - showlab/ShowUI-desktop |
| - zonghanHZH/UGround-V1-8k |
| - zonghanHZH/AMEX-8k |
| pipeline_tag: image-text-to-text |
| --- |
| |
| # Qwen3.5-4B GUI Grounding — v2 (SFT LoRA) |
|
|
| LoRA adapter for **Qwen3.5-4B** fine-tuned on GUI grounding: given a screenshot and a natural language instruction, predict the (x, y) click coordinate of the target UI element. |
|
|
| ## Results — ScreenSpot-V2 |
|
|
| | Split | Correct | Total | Accuracy | |
| |-------|---------|-------|----------| |
| | Desktop | 320 | 334 | **95.8%** | |
| | Mobile | 474 | 501 | **94.6%** | |
| | Web | 394 | 437 | **90.2%** | |
| | **Overall** | **1188** | **1272** | **93.4%** | |
|
|
| ## Training Data |
|
|
| ~23.5K samples from 3 GUI grounding datasets covering desktop, web, and mobile platforms. |
|
|
| ## Output Format |
|
|
| ``` |
| <|box_start|>(x,y)<|box_end|> |
| ``` |
|
|
| Coordinates are in [0, 1000] normalized space. To convert to pixel coordinates: |
| ```python |
| pixel_x = x / 1000 * image_width |
| pixel_y = y / 1000 * image_height |
| ``` |
|
|
| ## Usage |
|
|
| Requires `transformers>=5.2.0` and `peft`. |
|
|
| ```python |
| from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration |
| from peft import PeftModel |
| import torch |
| |
| base = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16) |
| model = PeftModel.from_pretrained(base, "dabism23/qwen35-gui-grounding_v2") |
| processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B") |
| ``` |
|
|
| ## Version History |
|
|
| | Version | ScreenSpot-V2 | |
| |---------|---------------| |
| | [v1](https://huggingface.co/dabism23/qwen35-gui-grounding) | 92.5% | |
| | **v2** | **93.4%** | |
|
|
| ## Access |
|
|
| Model weights are gated. Request access to download. Training configuration details are included with the model files. |
|
|