---
license: apache-2.0
tags:
  - vision
  - design
  - qwen
  - fine-tuned
  - visual-quality
  - pairwise-comparison
base_model: Qwen/Qwen3.5-0.8B
pipeline_tag: image-text-to-text
---

# Qwen Visual Design Judge

A fine-tuned Qwen3.5-0.8B model that compares a pair of images and judges which one has the better visual design.

## 🎯 Performance

| Metric | Score |
|--------|-------|
| Overall Accuracy | **82%** |
| High-agreement pairs (≥80%) | 90.9% |
| Low-agreement pairs (<80%) | 79.5% |

Accuracy matches GPT-4.1 on this task while being ~1000x cheaper to run locally!
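
As a rough sketch of how you could verify the accuracy number on your own data: assume a `pairs.csv` with columns `image_a`, `image_b`, and `label` ("A" or "B"), plus the `judge_pair` helper defined in the Usage section below. The file and its column names are illustrative, not something shipped with this model.

```python
import csv

def evaluate(pairs_csv: str) -> float:
    """Fraction of labeled pairs where the local judge picks the labeled winner."""
    correct = total = 0
    with open(pairs_csv, newline="") as f:
        for row in csv.DictReader(f):
            pred = judge_pair(row["image_a"], row["image_b"])  # see Usage below
            correct += pred == row["label"]
            total += 1
    return correct / total

print(f"Accuracy: {evaluate('pairs.csv'):.1%}")
```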

## πŸ“Š Training

- **Base model**: Qwen/Qwen3.5-0.8B
- **Training data**: 40K synthetic preference pairs labeled by GPT-4.1
- **Domains**: Landing pages, websites, mobile UI, graphics
- **Epochs**: 1
- **Hardware**: NVIDIA T4 GPU (~13 hours)
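
The exact fine-tuning recipe isn't published here; purely as an illustration, one supervised example could take the same chat shape as the inference prompt, with GPT-4.1's preferred side as the target. The field names and file names below are assumptions, not the actual training format.

```python
# Hypothetical shape of one of the 40K training records: the two-image chat
# prompt used at inference, with the GPT-4.1 preference ("A" or "B") as the
# assistant turn the model is trained to emit.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "You are an expert visual design judge. ..."},
                {"type": "image", "image": "pair_00001_a.png"},
                {"type": "image", "image": "pair_00001_b.png"},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": "A"}]},
    ]
}
```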

## πŸš€ Usage

```python
import torch
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned judge in bfloat16 on the GPU.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "DillonNys/qwen-visual-design-judge",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("DillonNys/qwen-visual-design-judge")

def judge_pair(img_a: str, img_b: str) -> str:
    """Return "A" or "B" for whichever image has the better visual design."""
    prompt = """You are an expert visual design judge. Compare these two images and determine which has better visual design quality.

Consider: layout, typography, color harmony, visual hierarchy, spacing, and overall aesthetic appeal.

Respond with ONLY "A" or "B" to indicate the better design."""

    # Interleave the instruction text with the two labeled images.
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "text", "text": "\n\nImage A:"},
            {"type": "image", "image": img_a},
            {"type": "text", "text": "\n\nImage B:"},
            {"type": "image", "image": img_b},
            {"type": "text", "text": "\n\nWhich is better? Answer A or B:"},
        ],
    }]

    # Render the chat template, collect the image tensors, and tokenize.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")

    # Greedy decoding: the verdict is a single letter, so a few tokens suffice.
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)

    # Decode only the newly generated tokens, skipping the prompt.
    response = processor.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    return "A" if "A" in response.upper() else "B"  # fall back to "B" if no "A" found

# Example
winner = judge_pair("design_a.png", "design_b.png")
print(f"Better design: {winner}")
```
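
Pairwise judges of this kind can show position bias (a systematic lean toward slot A or slot B). A common mitigation, sketched below rather than taken from this model card, is to score each pair in both orders and keep only verdicts that survive the swap:

```python
def judge_pair_debiased(img_a: str, img_b: str) -> str | None:
    """Judge both orderings; return "A"/"B" only when they agree, else None."""
    forward = judge_pair(img_a, img_b)
    backward = judge_pair(img_b, img_a)
    # Map the swapped verdict back into the original A/B frame.
    if forward == {"A": "B", "B": "A"}[backward]:
        return forward
    return None  # inconsistent across orderings: treat as a tie / no call
```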

## πŸ“ Citation

If you use this model, please cite:

```bibtex
@misc{qwen-visual-design-judge,
  author = {Dillon Nys},
  title = {Qwen Visual Design Judge},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DillonNys/qwen-visual-design-judge}
}
```

## πŸ™ Acknowledgments

- Qwen team for the excellent base model
- OpenAI for GPT-4.1 used in synthetic labeling
- The Vibe Arena community for preference data