Audio-Composed Fashion Retrieval (Qwen2-VL based)
Audio-conditioned composed image retrieval for fashion. Show the model a reference garment image and speak a modification ("make it black", "shorter sleeves", "something more formal"), and it retrieves the matching item from a catalog. The spoken modification enters the model natively through an audio encoder; there is no ASR step.
Built on DanJZY/Qwen2-VL-7B-Speech,
a speech-extended Qwen2-VL. Full code, plans, and write-ups:
github.com/ZhuoyuanJiang/fashion-retrieval-agent.
What's in this repo
Two trained two-tower checkpoints with the same architecture; only the query-side modification channel differs (spoken vs. typed):
| Folder | Query modification | R@1 | R@5 | R@10 | R@50 |
|---|---|---|---|---|---|
| audio/ | spoken (audio, native, no ASR) | 0.210 | 0.522 | 0.624 | 0.853 |
| text/ | typed text | 0.231 | 0.528 | 0.654 | 0.866 |
Evaluation: FACap dress slice, 1,000 held-out queries, 59,048-item gallery. Swapping the typed modification for synthesized speech costs only ~0.03 R@10 (0.654 vs. 0.624): the audio-native query is competitive with text, without any ASR.
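For reference, R@K is read here as the fraction of queries whose target appears among the top K ranked gallery items. The following is a minimal sketch of that computation, assuming one ground-truth target per query; `recall_at_k` is a hypothetical helper name, and the project's evaluation code may treat ties or multiple valid targets differently.

```python
def recall_at_k(ranked_ids, target_ids, ks=(1, 5, 10, 50)):
    """Fraction of queries whose target id is within the top-k ranked ids.

    ranked_ids: one list of gallery ids per query, best match first.
    target_ids: the ground-truth gallery id for each query (assumed single).
    """
    if not target_ids:
        raise ValueError("need at least one query")
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need exactly one ranking per target")
    hits = {k: 0 for k in ks}
    for ranking, target in zip(ranked_ids, target_ids):
        for k in ks:
            if target in ranking[:k]:
                hits[k] += 1
    return {k: hits[k] / len(target_ids) for k in ks}


# Toy example: three queries ranked over a five-item gallery.
rankings = [
    ["a", "b", "c", "d", "e"],  # target "a" is at rank 1
    ["c", "a", "b", "e", "d"],  # target "b" is at rank 3
    ["e", "d", "c", "b", "a"],  # target "a" is at rank 5
]
targets = ["a", "b", "a"]
scores = recall_at_k(rankings, targets, ks=(1, 3, 5))
```

On the toy data, one of three queries hits at K=1, two at K=3, and all three at K=5.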
Results in context (R@10)
| Method | R@10 |
|---|---|
| This model, text two-tower | 0.654 |
| This model, audio two-tower (native speech) | 0.624 |
| Caption + Marqo-FashionCLIP (best caption baseline) | 0.533 |
| Caption + Qwen3-Embedding-8B (strong general-purpose baseline) | 0.522 |
| Caption + MiniLM-L6 (anchor baseline) | 0.240 |
Is it actually using the audio? A 3-way sensitivity probe on the dev set: real audio → R@10 ≈ 0.67; audio removed (image only) → ≈ 0.06 to 0.08; mismatched audio → ≈ 0.02. Removing or scrambling the speech collapses retrieval, so the model is genuinely grounded in the spoken modification rather than exploiting an image-only shortcut.
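The three probe conditions can all be derived from one batch of queries. This sketch assumes each query is an `(image, audio)` pair, represents "audio removed" as `None`, and builds the mismatched condition with a cyclic shift so that every image is paired with another query's audio. The function name and the pairing strategy are illustrative assumptions, not the project's actual probe code.

```python
def build_probe_conditions(queries, shift=1):
    """Return the real / image-only / mismatched variants of a query batch.

    queries: list of (image, audio) pairs.
    shift:   cyclic offset used to pair each image with another query's audio.
    """
    n = len(queries)
    if n < 2:
        raise ValueError("need at least two queries to mismatch audio")
    if shift % n == 0:
        raise ValueError("shift must not map a query onto itself")
    real = list(queries)
    image_only = [(image, None) for image, _ in queries]
    mismatched = [
        (queries[i][0], queries[(i + shift) % n][1]) for i in range(n)
    ]
    return {"real": real, "image_only": image_only, "mismatched": mismatched}


# Strings stand in for image tensors and audio waveforms.
batch = [
    ("img0", "make it black"),
    ("img1", "shorter sleeves"),
    ("img2", "something more formal"),
]
conditions = build_probe_conditions(batch)
```

Each condition is then scored with the same R@10 evaluation. Note that a cyclic shift only guarantees a different query index; if two queries happen to carry the same modification, their "mismatched" audio can still be correct.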
Why the caption-baseline comparison matters. The caption baselines are two-stage pipelines: a VLM first generates a target-oriented caption, then a frozen text encoder embeds it for retrieval. Our two-tower is end-to-end: the query tower produces a retrieval embedding directly, with no language intermediate. So beyond the R@10 gain over the best caption baseline (+12.1 pp for the text two-tower, +9.1 pp for the audio two-tower), the model is structurally simpler at inference (one forward pass plus nearest-neighbor search) and avoids the language bottleneck, i.e. the visual nuance that gets lost when the query is forced through a text intermediate.
Architecture
Two-tower contrastive retrieval over one shared, frozen speechQwen2-VL backbone:
- Query tower: (reference image, modification) → query embedding. The modification is typed text (text/) or a spoken-modification waveform fed natively through the Whisper audio encoder (audio/, no ASR).
- Target tower: (target image, fixed prompt) → target embedding.
- One frozen ~9B backbone + two PEFT LoRA adapters (query / target) + two 512-d projection heads + a learnable logit_scale. Only ~48.8M params are trained; the backbone, its Whisper audio encoder, and the audio projector stay frozen.
- Trained from scratch (both towers co-trained) with symmetric multi-positive InfoNCE and cross-GPU negative gathering.
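To make the training objective concrete, here is a dependency-free sketch of one common form of symmetric multi-positive InfoNCE: for each query, the probability mass of all its valid targets is summed in the numerator, and the same is done in the target-to-query direction. This particular variant, the function names, and the treatment of `logit_scale` as a plain multiplier (it may be stored in log space in the checkpoint) are assumptions; the actual training code uses batched GPU tensors with negatives gathered across devices.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _directional_loss(logits, positives):
    """Mean over rows of -log(sum_pos exp(l) / sum_all exp(l))."""
    losses = []
    for row, pos_row in zip(logits, positives):
        pos_logits = [l for l, is_pos in zip(row, pos_row) if is_pos]
        if not pos_logits:
            raise ValueError("every row needs at least one positive")
        losses.append(_logsumexp(row) - _logsumexp(pos_logits))
    return sum(losses) / len(losses)


def symmetric_multi_positive_info_nce(sim, positives, logit_scale):
    """sim[i][j]: cosine similarity between query i and target j.
    positives[i][j]: True where target j is a valid match for query i.
    logit_scale: multiplier applied to similarities (inverse temperature).
    """
    logits = [[logit_scale * s for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    positives_t = [list(col) for col in zip(*positives)]
    query_to_target = _directional_loss(logits, positives)
    target_to_query = _directional_loss(logits_t, positives_t)
    return 0.5 * (query_to_target + target_to_query)


# Toy batch: queries 0 and 2 both accept targets 0 and 2.
sim = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.1],
    [0.8, 0.0, 0.9],
]
positives = [
    [True, False, True],
    [False, True, False],
    [True, False, True],
]
loss = symmetric_multi_positive_info_nce(sim, positives, logit_scale=20.0)
```

Marking several targets as positives keeps the loss from penalizing a query for scoring another valid match highly, which is the point of the multi-positive form.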
Each checkpoint folder:
<ckpt>/
├── head_query.pt      # query projection head (torch, fp32)
├── head_target.pt     # target projection head (torch, fp32)
├── logit_scale.pt     # learnable temperature
├── metrics.json       # full per-epoch eval trajectory
└── shared_backbone/
    ├── query/         # PEFT LoRA adapter (query side)
    └── target/        # PEFT LoRA adapter (target side)
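Before loading weights, it can help to confirm that a downloaded checkpoint folder has the entries listed above. This is a small sketch using only `pathlib`; the helper name is hypothetical, and it assumes the `query/` and `target/` adapter folders sit inside `shared_backbone/`.

```python
from pathlib import Path

# Entries taken from the checkpoint layout described above.
EXPECTED_FILES = (
    "head_query.pt",
    "head_target.pt",
    "logit_scale.pt",
    "metrics.json",
)
# Assumption: both adapter folders are nested under shared_backbone/.
EXPECTED_DIRS = ("shared_backbone/query", "shared_backbone/target")


def missing_checkpoint_entries(ckpt_dir):
    """Return the expected files and folders that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

For example, `missing_checkpoint_entries(f"{ckpt}/audio")` should return an empty list for a complete download of the audio checkpoint.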
Training data
Query modifications come from the FACap dress-slice composed-retrieval
triplets (reference image + modification + target image). For the audio model,
the modification texts are TTS-synthesized to speech: a 110-speaker VCTK
reference bank (100 training-pool + 10 held-out OOD speakers) voiced by
Chatterbox zero-shot cloning, ~56k clips. No human-recorded speech is used in
training. (Pipeline: src/data/build_tts_audio.py in the project repo.)
Usage
This is a custom two-tower model, not a standard transformers auto-class, so
load it with the project code (src/training/two_tower_model.py,
TwoTowerSharedBackbone):
from huggingface_hub import snapshot_download
ckpt = snapshot_download("DanJZY/audio-composed-fashion-item-retriever")
# Build the model (frozen speechQwen2-VL base + 2 LoRA adapters + 2 heads),
# then load weights from <ckpt>/audio (or <ckpt>/text):
# peft weights : load_peft_weights(f"{ckpt}/audio/shared_backbone")
# heads : torch.load(f"{ckpt}/audio/head_query.pt"), head_target.pt
# temperature : torch.load(f"{ckpt}/audio/logit_scale.pt")
# See src/training/train_plan10.py for the exact load routine.
Retrieval: encode the gallery once with the target tower, encode each
(image, spoken/typed modification) with the query tower, rank by cosine
similarity.
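The ranking step itself is independent of the model. As a minimal sketch, assuming embeddings are plain lists of floats and using hypothetical helper names, the gallery is normalized once and each query is scored against it by dot product, which equals cosine similarity for unit vectors. In practice this would be a single matrix product in torch or NumPy over the 512-d embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_gallery(query_emb, normalized_gallery, top_k=10):
    """Return (gallery_index, cosine_similarity) pairs, best match first.

    normalized_gallery must already be L2-normalized (done once, up front).
    """
    q = l2_normalize(query_emb)
    scored = [
        (idx, sum(a * b for a, b in zip(q, g)))
        for idx, g in enumerate(normalized_gallery)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-d embeddings standing in for target-tower outputs.
gallery_raw = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
gallery = [l2_normalize(g) for g in gallery_raw]  # encode and normalize once

# Toy query-tower output for one (image, modification) pair.
top = rank_gallery([2.0, 0.1], gallery, top_k=3)
```

Because the gallery embeddings do not depend on the query, they can be cached and reused across all queries.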
License & intended use
Research use only. These weights are LoRA adapters + projection heads on top
of DanJZY/Qwen2-VL-7B-Speech; no third-party image data is included. The model
is trained on FACap, whose upstream license is not stated; clarify
FACap's terms before any redistribution or commercial use.
Links
- Project / code: https://github.com/ZhuoyuanJiang/fashion-retrieval-agent
- Base model: https://huggingface.co/DanJZY/Qwen2-VL-7B-Speech
- Dataset: FACap (composed fashion image retrieval)