Audio-Composed Fashion Retrieval (Qwen2-VL based)

Audio-conditioned composed image retrieval for fashion. Show the model a reference garment image and speak a modification ("make it black", "shorter sleeves", "something more formal") and it retrieves the matching item from a catalog. The spoken modification enters the model natively through an audio encoder; there is no ASR step.

Built on DanJZY/Qwen2-VL-7B-Speech, a speech-extended Qwen2-VL. Full code, plans, and write-ups: github.com/ZhuoyuanJiang/fashion-retrieval-agent.

What's in this repo

Two trained two-tower checkpoints with the same architecture; only the query-side modification channel differs (spoken vs typed):

| Folder | Query modification | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|---|
| audio/ | spoken (audio, native, no ASR) | 0.210 | 0.522 | 0.624 | 0.853 |
| text/ | typed text | 0.231 | 0.528 | 0.654 | 0.866 |

Evaluation: FACap dress slice, 1,000 held-out queries, 59,048-item gallery. Swapping the typed modification for synthesized speech costs only ~0.03 R@10 (0.654 vs 0.624): the audio-native query is competitive with text, without any ASR.
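For reference, R@K is the fraction of queries whose correct target appears among the top K gallery items. The sketch below is a minimal illustration of that definition, not the project's evaluation code: the function name, the list-of-lists score layout, and the toy data are all assumptions, and it treats a query as a hit if any of its positives lands in the top K.

```python
def recall_at_k(scores, positives, k):
    """Fraction of queries with at least one positive among the top-k gallery items.

    scores:    list of per-query score lists (one score per gallery item).
    positives: list of sets of gallery indices that count as correct per query.
    """
    hits = 0
    for row, pos in zip(scores, positives):
        # Gallery indices sorted by descending score; keep the best k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if pos.intersection(top_k):
            hits += 1
    return hits / len(scores)


# Toy example (made-up numbers): 3 queries, 4 gallery items.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # positive (index 0) is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # positive (index 1) is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # positive (index 0) is ranked 4th
]
toy_positives = [{0}, {1}, {0}]
r1 = recall_at_k(toy_scores, toy_positives, 1)
r3 = recall_at_k(toy_scores, toy_positives, 3)
```

On the toy data one of three queries is a hit at K=1 and two of three at K=3. With a 59,048-item gallery a real evaluation would use a batched matrix product and a top-k routine rather than a full sort per query.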

Results in context (R@10)

| Method | R@10 |
|---|---|
| This model, text two-tower | 0.654 |
| This model, audio two-tower (native speech) | 0.624 |
| Caption + Marqo-FashionCLIP (best caption baseline) | 0.533 |
| Caption + Qwen3-Embedding-8B (strong general-purpose baseline) | 0.522 |
| Caption + MiniLM-L6 (anchor baseline) | 0.240 |

Is it actually using the audio? A 3-way sensitivity probe on the dev set: real audio → R@10 ≈ 0.67; audio removed (image only) → ≈ 0.06–0.08; mismatched audio → ≈ 0.02. Removing or scrambling the speech collapses retrieval, so the model is genuinely grounded in the spoken modification rather than exploiting an image-only shortcut.
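The three probe conditions differ only in which audio clip accompanies each query. Below is a rough sketch of how such pairings could be constructed, assuming clips are addressed by query index. The function name, the use of `None` for "no audio", and the cyclic shift used to mismatch clips are illustrative choices, not the project's probe code.

```python
def build_probe_pairings(num_queries, shift=1):
    """Return, per condition, the audio-clip index paired with each query.

    'real'       : each query keeps its own clip.
    'removed'    : no clip at all (image-only query), encoded as None.
    'mismatched' : each query gets another query's clip, via a cyclic shift
                   so that no query is ever paired with its own clip.
    """
    if num_queries < 2 or shift % num_queries == 0:
        raise ValueError("need at least 2 queries and a non-trivial shift")
    indices = list(range(num_queries))
    return {
        "real": indices,
        "removed": [None] * num_queries,
        "mismatched": [(i + shift) % num_queries for i in indices],
    }


pairings = build_probe_pairings(5)
```

A cyclic shift is used here instead of a random shuffle because a shuffle can leave some queries paired with their own clip, which would contaminate the mismatched condition.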

Why the caption-baseline comparison matters. The caption baselines are two-stage pipelines: a VLM first generates a target-oriented caption, then a frozen text encoder embeds it for retrieval. Our two-tower is end-to-end: the query tower produces a retrieval embedding directly, with no language intermediate. So beyond the R@10 gain over the best caption baseline (+12.1 pp for the text tower, +9.1 pp for the audio tower), the model is structurally simpler at inference (one forward pass plus nearest-neighbor search) and avoids the language bottleneck, where visual nuance is lost when the query is forced through a text intermediate.
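The percentage-point gaps follow directly from the R@10 table. A small worked calculation, using only values reported above (the helper name `gap_pp` is just for illustration):

```python
# R@10 values copied from the results table.
r10 = {
    "text two-tower": 0.654,
    "audio two-tower": 0.624,
    "best caption baseline": 0.533,  # Caption + Marqo-FashionCLIP
}


def gap_pp(a, b):
    """Difference a - b expressed in percentage points, rounded to 0.1."""
    return round((a - b) * 100, 1)


text_gap = gap_pp(r10["text two-tower"], r10["best caption baseline"])
audio_gap = gap_pp(r10["audio two-tower"], r10["best caption baseline"])
modality_gap = gap_pp(r10["text two-tower"], r10["audio two-tower"])
```

This gives 12.1 pp for the text tower and 9.1 pp for the audio tower over the best caption baseline, and a 3.0 pp gap between typed and spoken queries.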

Architecture

Two-tower contrastive retrieval over one shared, frozen speechQwen2-VL backbone:

  • Query tower: (reference image, modification) → query embedding. The modification is typed text (text/) or a spoken-modification waveform fed natively through the Whisper audio encoder (audio/, no ASR).
  • Target tower: (target image, fixed prompt) → target embedding.
  • One frozen ~9B backbone + two PEFT LoRA adapters (query / target) + two 512-d projection heads + a learnable logit_scale. Only ~48.8M params are trained; the backbone, its Whisper audio encoder, and the audio projector stay frozen.
  • Trained from scratch (both towers co-trained) with symmetric multi-positive InfoNCE and cross-GPU negative gathering.
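To make the training objective concrete, here is a minimal pure-Python sketch of one common form of symmetric multi-positive InfoNCE. The exact formulation used in training is not spelled out here, so treat this as an assumption: each row's loss averages the negative log-softmax over all of its positives, and the query-to-target and target-to-query directions are averaged. Function names and the nested-list layout are illustrative. Cross-GPU negative gathering is not shown; in this picture it would simply add more columns (and rows) to the similarity matrix.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _one_direction(logits, pos_mask):
    """Mean over rows of the average negative log-probability of each row's positives."""
    total, counted = 0.0, 0
    for row, mask in zip(logits, pos_mask):
        pos = [j for j, is_pos in enumerate(mask) if is_pos]
        if not pos:
            continue  # rows without any positive contribute nothing
        log_p = _log_softmax(row)
        total += -sum(log_p[j] for j in pos) / len(pos)
        counted += 1
    return total / counted if counted else 0.0


def symmetric_multi_positive_info_nce(sim, pos_mask, logit_scale):
    """sim[i][j]: cosine similarity of query i and target j.
    pos_mask[i][j]: True if target j is a positive for query i."""
    logits = [[logit_scale * s for s in row] for row in sim]
    # Transpose so the second direction scores each target against all queries.
    logits_t = [list(col) for col in zip(*logits)]
    mask_t = [list(col) for col in zip(*pos_mask)]
    q2t = _one_direction(logits, pos_mask)
    t2q = _one_direction(logits_t, mask_t)
    return 0.5 * (q2t + t2q)


# Toy batch: 2 queries, 2 targets, positives on the diagonal.
mask = [[True, False], [False, True]]
loss_aligned = symmetric_multi_positive_info_nce([[0.9, 0.1], [0.2, 0.8]], mask, 20.0)
loss_uniform = symmetric_multi_positive_info_nce([[0.0, 0.0], [0.0, 0.0]], mask, 20.0)
```

With uninformative (all-equal) similarities the loss equals log 2 for this 2x2 batch, and it drops as the positives score higher than the negatives. A real implementation would use tensor operations on the GPU; the logic is the same.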

Each checkpoint folder:

<ckpt>/
├── head_query.pt          # query projection head  (torch, fp32)
├── head_target.pt         # target projection head (torch, fp32)
├── logit_scale.pt         # learnable temperature
├── metrics.json           # full per-epoch eval trajectory
└── shared_backbone/
    ├── query/             # PEFT LoRA adapter (query side)
    └── target/            # PEFT LoRA adapter (target side)
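
Before loading, it can help to confirm that a downloaded checkpoint folder has the layout listed above. A small sketch using only the standard library; the function and constant names are hypothetical, and the file names are taken from the folder listing:

```python
from pathlib import Path

# Relative paths taken from the checkpoint folder listing.
EXPECTED_FILES = ("head_query.pt", "head_target.pt", "logit_scale.pt", "metrics.json")
EXPECTED_ADAPTER_DIRS = ("shared_backbone/query", "shared_backbone/target")


def missing_checkpoint_entries(ckpt_dir):
    """Return the expected entries that are absent from one checkpoint folder."""
    root = Path(ckpt_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_ADAPTER_DIRS if not (root / name).is_dir()]
    return missing
```

For example, `missing_checkpoint_entries(Path(ckpt) / "audio")` should return an empty list for a complete download, where `ckpt` is the local snapshot directory.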

Training data

Query modifications come from the FACap dress-slice composed-retrieval triplets (reference image + modification + target image). For the audio model, the modification texts are TTS-synthesized to speech: a 110-speaker VCTK reference bank (100 training-pool + 10 held-out OOD speakers) voiced by Chatterbox zero-shot cloning, ~56k clips. No human-recorded speech is used in training. (Pipeline: src/data/build_tts_audio.py in the project repo.)
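
The key property of the speaker bank is that the held-out OOD speakers never appear in the training pool. How the 10 held-out speakers were actually chosen is not described here; the sketch below assumes a seeded random split purely for illustration, with placeholder speaker IDs and a hypothetical function name.

```python
import random


def split_speakers(speaker_ids, num_held_out, seed=0):
    """Partition speakers into disjoint training-pool and held-out lists."""
    if not 0 < num_held_out < len(speaker_ids):
        raise ValueError("num_held_out must leave both splits non-empty")
    shuffled = sorted(speaker_ids)         # sort first so input order does not matter
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    return shuffled[num_held_out:], shuffled[:num_held_out]


# Placeholder IDs standing in for the 110 VCTK reference speakers.
speakers = [f"spk{i:03d}" for i in range(110)]
train_pool, held_out = split_speakers(speakers, num_held_out=10)
```

Splitting by speaker rather than by clip is what makes the held-out set a test of generalization to unseen voices.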

Usage

This is a custom two-tower model, not a standard transformers auto-class. Load it with the project code (src/training/two_tower_model.py, TwoTowerSharedBackbone):

from huggingface_hub import snapshot_download
ckpt = snapshot_download("DanJZY/audio-composed-fashion-item-retriever")

# Build the model (frozen speechQwen2-VL base + 2 LoRA adapters + 2 heads),
# then load weights from <ckpt>/audio (or <ckpt>/text):
#   peft weights : load_peft_weights(f"{ckpt}/audio/shared_backbone")
#   heads        : torch.load(f"{ckpt}/audio/head_query.pt"), head_target.pt
#   temperature  : torch.load(f"{ckpt}/audio/logit_scale.pt")
# See src/training/train_plan10.py for the exact load routine.

Retrieval: encode the gallery once with the target tower, encode each (image, spoken/typed modification) with the query tower, rank by cosine similarity.
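
The ranking step can be sketched independently of the model. The example below assumes embeddings are already available as plain lists of floats; the function names and the toy 3-d vectors are illustrative, and the real towers output 512-d embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_gallery(query_emb, gallery_embs, top_k=10):
    """Return (gallery_index, cosine_similarity) pairs for the top_k best matches."""
    q = l2_normalize(query_emb)
    scored = []
    for idx, g in enumerate(gallery_embs):
        g = l2_normalize(g)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-d embeddings standing in for 512-d tower outputs.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
top = rank_gallery([1.0, 0.2, 0.0], gallery, top_k=2)
```

In practice the gallery would be normalized once and stored as a matrix, so that each query reduces to a single matrix-vector product followed by a top-k selection.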

License & intended use

Research use only. These weights are LoRA adapters + projection heads on top of DanJZY/Qwen2-VL-7B-Speech; no third-party image data is included. The model is trained on FACap, whose upstream license is not stated; clarify FACap's terms before any redistribution or commercial use.
