Audio-Composed Fashion Retrieval (Qwen2-VL based)

Audio-conditioned composed image retrieval for fashion. Show the model a reference garment image and speak a modification ("make it black", "shorter sleeves", "something more formal"), and it retrieves the matching item from a catalog. The spoken modification enters the model natively through an audio encoder; there is no automatic speech recognition (ASR) step.

Built on DanJZY/Qwen2-VL-7B-Speech, a speech-extended Qwen2-VL. Full code, plans, and write-ups: github.com/ZhuoyuanJiang/fashion-retrieval-agent.

What's in this repo

Two trained two-tower checkpoints with the same architecture; only the query-side modification channel differs (spoken vs. typed):

Folder	Query modification	R@1	R@5	R@10	R@50
`audio/`	spoken (audio, native — no ASR)	0.210	0.522	0.624	0.853
`text/`	typed text	0.231	0.528	0.654	0.866

Evaluation: FACap dress slice, 1,000 held-out queries, 59,048-item gallery. Swapping the typed modification for synthesized speech costs 0.030 R@10 (0.654 → 0.624), so the audio-native query stays competitive with text without any ASR.
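For readers unfamiliar with the metric, R@K can be read as the fraction of queries whose target appears among the top K gallery results. The sketch below is illustrative only: `recall_at_k`, `ranked_ids`, and `target_ids` are hypothetical names, and it assumes a query counts as a hit when any of its acceptable targets is in the top K. The project's evaluation code may treat ties or multiple positives differently.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries with at least one target among the top-k results.

    ranked_ids: one list of gallery ids per query, best match first.
    target_ids: one set of acceptable gallery ids per query (assumed format).
    """
    if not ranked_ids or len(ranked_ids) != len(target_ids):
        raise ValueError("need one ranking and one target set per query")
    hits = sum(
        1
        for ranking, targets in zip(ranked_ids, target_ids)
        if targets.intersection(ranking[:k])
    )
    return hits / len(ranked_ids)


# Toy example: 2 queries over a 4-item gallery.
rankings = [["g3", "g1", "g0", "g2"], ["g2", "g0", "g3", "g1"]]
targets = [{"g1"}, {"g1"}]
print(recall_at_k(rankings, targets, k=1))  # 0.0
print(recall_at_k(rankings, targets, k=2))  # 0.5
print(recall_at_k(rankings, targets, k=4))  # 1.0
```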

Results in context (R@10)

Method	R@10
This model — text two-tower	0.654
This model — audio two-tower (native speech)	0.624
Caption + Marqo-FashionCLIP (best caption baseline)	0.533
Caption + Qwen3-Embedding-8B (strong general-purpose baseline)	0.522
Caption + MiniLM-L6 (anchor baseline)	0.240

Is it actually using the audio? A 3-way sensitivity probe on the dev set: real audio → R@10 ≈ 0.67; audio removed (image only) → ≈ 0.06–0.08; mismatched audio → ≈ 0.02. Removing or mismatching the speech collapses retrieval, so the model depends on the spoken modification rather than exploiting an image-only shortcut. (These are dev-set numbers and are not directly comparable to the held-out table above.)
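One way to set up such a probe is to build three versions of the same query set and evaluate each with the same retrieval code. The sketch below is an assumption about how the conditions could be constructed, not the project's actual probe: `make_probe_conditions` is a hypothetical helper, `None` stands in for "no audio input", and mismatched audio is produced by a cyclic shift, which does not guard against two queries sharing the same modification.

```python
def make_probe_conditions(queries):
    """Build the three query sets for an audio-sensitivity probe.

    queries: list of (reference_image, audio) pairs. The element types are
    placeholders here and are passed through untouched.
    """
    n = len(queries)
    if n < 2:
        raise ValueError("need at least two queries to form mismatched pairs")
    return {
        # Each image keeps its own spoken modification.
        "real": list(queries),
        # Audio dropped entirely; None stands in for "no audio input".
        "image_only": [(image, None) for image, _ in queries],
        # Each image gets the next query's audio (cyclic shift by one).
        "mismatched": [
            (queries[i][0], queries[(i + 1) % n][1]) for i in range(n)
        ],
    }


# Strings stand in for images and waveforms in this toy example.
toy = [
    ("img_a", "make it black"),
    ("img_b", "shorter sleeves"),
    ("img_c", "more formal"),
]
conditions = make_probe_conditions(toy)
print(conditions["mismatched"][0])  # ('img_a', 'shorter sleeves')
```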

Why the caption-baseline comparison matters. The caption baselines are two-stage pipelines: a VLM first generates a target-oriented caption, then a frozen text encoder embeds it for retrieval. Our two-tower is end-to-end: the query tower produces a retrieval embedding directly, with no generated caption in between. So beyond the R@10 gain over the best caption baseline (+12.1 pp for the text two-tower, +9.1 pp for the audio two-tower), the model is structurally simpler at inference (one forward pass plus nearest-neighbor search) and avoids the language bottleneck, where visual nuance is lost when the query is forced through a text intermediate.

Architecture

Two-tower contrastive retrieval over one shared, frozen speechQwen2-VL backbone:

Query tower — (reference image, modification) → query embedding. The modification is typed text (text/) or a spoken-modification waveform fed natively through the Whisper audio encoder (audio/, no ASR).
Target tower — (target image, fixed prompt) → target embedding.
One frozen ~9B backbone + two PEFT LoRA adapters (query / target) + two 512-d projection heads + a learnable logit_scale. Only ~48.8M params are trained; the backbone, its Whisper audio encoder, and the audio projector stay frozen.
The trainable parameters (LoRA adapters, projection heads, logit_scale) are trained from scratch, with both towers co-trained, using symmetric multi-positive InfoNCE and cross-GPU negative gathering.
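To make the loss concrete, here is a minimal NumPy sketch of one common form of symmetric multi-positive InfoNCE: for each row, the negative log of the softmax probability mass assigned to all positives, averaged over the query-to-target and target-to-query directions. This is an illustration under assumptions, not the project's training code: the function name and `pos_mask` format are hypothetical, `logit_scale` is assumed to multiply cosine similarities directly, and cross-GPU gathering is left out (in multi-GPU training the embeddings would be gathered across devices before the similarity matrix is built).

```python
import numpy as np


def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    summed = np.sum(np.exp(x - m), axis=axis, keepdims=True)
    return np.squeeze(m + np.log(summed), axis=axis)


def symmetric_multi_positive_infonce(q, t, pos_mask, logit_scale):
    """q: (N, D) query embeddings; t: (M, D) target embeddings.

    pos_mask: (N, M) boolean, True where target j is a positive for query i.
    Every row and every column must contain at least one True.
    logit_scale: multiplicative scale applied to cosine similarities.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = logit_scale * (q @ t.T)

    def one_direction(lg, mask):
        # -log(total softmax probability of the positives), averaged over rows.
        log_pos = _logsumexp(np.where(mask, lg, -np.inf), axis=1)
        log_all = _logsumexp(lg, axis=1)
        return float(np.mean(log_all - log_pos))

    query_to_target = one_direction(logits, pos_mask)
    target_to_query = one_direction(logits.T, pos_mask.T)
    return 0.5 * (query_to_target + target_to_query)


# Sanity checks on orthogonal toy embeddings with one positive per row.
emb = np.eye(3)
one_to_one = np.eye(3, dtype=bool)
loss_uniform = symmetric_multi_positive_infonce(emb, emb, one_to_one, 0.0)
loss_sharp = symmetric_multi_positive_infonce(emb, emb, one_to_one, 50.0)
```

With a scale of 0 every logit is equal, so the loss is log(3) for three candidates; with a large scale and perfectly aligned pairs it approaches 0.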

Each checkpoint folder:

<ckpt>/
├── head_query.pt          # query projection head  (torch, fp32)
├── head_target.pt         # target projection head (torch, fp32)
├── logit_scale.pt         # learnable temperature
├── metrics.json           # full per-epoch eval trajectory
└── shared_backbone/
    ├── query/             # PEFT LoRA adapter (query side)
    └── target/            # PEFT LoRA adapter (target side)
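A quick sanity check of a downloaded checkpoint folder against the listing above can be sketched with the standard library. `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names for this illustration; the check only tests that the paths exist and does not inspect what is inside the adapter directories.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the folder listing above.
EXPECTED_ENTRIES = (
    "head_query.pt",
    "head_target.pt",
    "logit_scale.pt",
    "metrics.json",
    "shared_backbone/query",
    "shared_backbone/target",
)


def missing_entries(ckpt_dir):
    """Return the expected entries that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


# Demo on a throwaway directory that mimics the layout with empty files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_entries(root)
    for rel in EXPECTED_ENTRIES:
        path = root / rel
        if path.suffix:  # file entries such as head_query.pt
            path.touch()
        else:  # adapter directories
            path.mkdir(parents=True)
    after = missing_entries(root)

print(len(before), after)  # 6 []
```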

Training data

Query modifications come from the FACap dress-slice composed-retrieval triplets (reference image + modification + target image). For the audio model, the modification texts are synthesized to speech with Chatterbox zero-shot voice cloning, using a 110-speaker VCTK reference bank (100 training-pool + 10 held-out out-of-distribution speakers), for roughly 56k clips. No human-recorded speech is used in training. (Pipeline: src/data/build_tts_audio.py in the project repo.)

Usage

This is a custom two-tower model, not a standard transformers auto-class, so it has to be loaded with the project code (TwoTowerSharedBackbone in src/training/two_tower_model.py):

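import torch
from huggingface_hub import snapshot_download
from peft.utils import load_peft_weights

ckpt = snapshot_download("DanJZY/audio-composed-fashion-item-retriever")
variant = "audio"  # or "text"

# Build the model (frozen speechQwen2-VL base + 2 LoRA adapters + 2 heads),
# then load these weights into it. Each adapter lives in its own subfolder.
# See src/training/train_plan10.py for the exact load routine.
query_lora = load_peft_weights(f"{ckpt}/{variant}/shared_backbone/query", device="cpu")
target_lora = load_peft_weights(f"{ckpt}/{variant}/shared_backbone/target", device="cpu")
head_query = torch.load(f"{ckpt}/{variant}/head_query.pt", map_location="cpu")
head_target = torch.load(f"{ckpt}/{variant}/head_target.pt", map_location="cpu")
logit_scale = torch.load(f"{ckpt}/{variant}/logit_scale.pt", map_location="cpu")  # temperature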

Retrieval: encode the gallery once with the target tower, encode each (image, spoken/typed modification) query with the query tower, then rank gallery items by cosine similarity to the query embedding.
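Once both sets of embeddings exist, the ranking step itself is small. The sketch below assumes the embeddings are already available as NumPy arrays; `rank_gallery` and `l2_normalize` are hypothetical helpers, and the toy vectors are 2-d rather than the 512-d head outputs.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def rank_gallery(query_emb, gallery_emb, k):
    """Return (indices, scores) of the k most similar gallery rows per query."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T  # (Q, G)
    k = min(k, sims.shape[1])
    order = np.argsort(-sims, axis=1)[:, :k]
    return order, np.take_along_axis(sims, order, axis=1)


# Toy 2-d embeddings; the real heads output 512-d vectors.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[2.0, 0.1]])
indices, scores = rank_gallery(queries, gallery, k=3)
print(indices[0].tolist())  # [0, 2, 1]
```

Because the gallery embeddings do not depend on the query, they can be normalized and cached once. A full sort is used here for simplicity; for much larger catalogs a partial sort or an approximate nearest-neighbor index would be the usual alternative.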

License & intended use

Research use only. These weights are LoRA adapters + projection heads on top of DanJZY/Qwen2-VL-7B-Speech; no third-party image data is included. The model is trained on FACap, whose upstream license is not stated, so confirm FACap's terms before any redistribution or commercial use.