---
license: apache-2.0
tags:
  - vision
  - design
  - qwen
  - fine-tuned
  - visual-quality
  - pairwise-comparison
base_model: Qwen/Qwen3.5-0.8B
pipeline_tag: image-text-to-text
---

# Qwen Visual Design Judge

A fine-tuned Qwen3.5-0.8B model that judges visual design quality between image pairs.

## 🎯 Performance

| Metric | Score |
|---|---|
| Overall accuracy | 82% |
| High-agreement pairs (≥ 80%) | 90.9% |
| Low-agreement pairs (< 80%) | 79.5% |

Matches GPT-4.1 performance while being ~1000x cheaper to run locally!

πŸ“Š Training

  • Base model: Qwen/Qwen3.5-0.8B
  • Training data: 40K synthetic preference pairs labeled by GPT-4.1
  • Domains: Landing pages, websites, mobile UI, graphics
  • Epochs: 1
  • Hardware: NVIDIA T4 GPU (~13 hours)
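
The preference dataset itself is not published with this card. As a rough illustration only, a single pairwise record in the spirit described above might look like the following — the field names and values here are hypothetical, not the actual dataset schema:

```python
# Hypothetical example of one pairwise preference record, mirroring the
# A/B prompt format used at inference time. Field names are illustrative;
# the real training data schema is not released with this model card.
record = {
    "image_a": "landing_page_001.png",  # first candidate design
    "image_b": "landing_page_002.png",  # second candidate design
    "domain": "landing_pages",          # one of the four training domains
    "label": "A",                       # GPT-4.1's preferred design
    "agreement": 0.85,                  # labeler agreement for this pair
}

# Records with agreement >= 0.8 correspond to the "high-agreement"
# bucket reported in the Performance table above.
is_high_agreement = record["agreement"] >= 0.8
```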

πŸš€ Usage

import torch
from transformers import Qwen3_5ForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "DillonNys/qwen-visual-design-judge",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained("DillonNys/qwen-visual-design-judge")

def judge_pair(img_a: str, img_b: str) -> str:
    prompt = """You are an expert visual design judge. Compare these two images and determine which has better visual design quality.

Consider: layout, typography, color harmony, visual hierarchy, spacing, and overall aesthetic appeal.

Respond with ONLY "A" or "B" to indicate the better design."""

    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "text", "text": "\n\nImage A:"},
            {"type": "image", "image": img_a},
            {"type": "text", "text": "\n\nImage B:"},
            {"type": "image", "image": img_b},
            {"type": "text", "text": "\n\nWhich is better? Answer A or B:"},
        ],
    }]
    
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    
    response = processor.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    return "A" if "A" in response.upper() else "B"

# Example
winner = judge_pair("design_a.png", "design_b.png")
print(f"Better design: {winner}")
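
Pairwise judges can show position bias (favoring whichever image appears first in the prompt). A common mitigation, sketched below with a generic `judge` callable standing in for `judge_pair`, is to run both orderings and accept a verdict only when they agree; the tie-handling policy here is an assumption, not part of this model:

```python
from typing import Callable, Optional

def judge_with_swap(judge: Callable[[str, str], str],
                    img_a: str, img_b: str) -> Optional[str]:
    """Run the judge on both orderings to control for position bias.

    Returns "A" or "B" when the two orderings agree, or None when the
    verdict flips with presentation order (treated here as a tie).
    """
    first = judge(img_a, img_b)   # verdict in the original order
    second = judge(img_b, img_a)  # verdict with positions swapped
    # Map the swapped verdict back to the original labels.
    second_unswapped = "A" if second == "B" else "B"
    return first if first == second_unswapped else None

# Demo with a stub judge that always prefers the first image shown —
# a maximally position-biased judge, which this check correctly rejects.
biased = lambda a, b: "A"
print(judge_with_swap(biased, "x.png", "y.png"))  # None (inconsistent)

# A consistent stub judge that prefers "y.png" regardless of position.
consistent = lambda a, b: "A" if a == "y.png" else "B"
print(judge_with_swap(consistent, "x.png", "y.png"))  # B
```

Running `judge_pair` twice doubles inference cost, but on a small local model that trade is usually worth the cleaner verdicts.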

πŸ“ Citation

If you use this model, please cite:

@misc{qwen-visual-design-judge,
  author = {Dillon Nys},
  title = {Qwen Visual Design Judge},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DillonNys/qwen-visual-design-judge}
}

πŸ™ Acknowledgments

  • Qwen team for the excellent base model
  • OpenAI for GPT-4.1 used in synthetic labeling
  • The Vibe Arena community for preference data