AGC-Qwen2.5-VL-3B (ViDoRe)
This model applies Attention-Guided Clustering (AGC) to compress multi-vector visual document representations for efficient ColBERT-style late-interaction retrieval. Weights are initialized from Qwen2.5-VL-3B-Instruct and fine-tuned with bidirectional attention on the ColPali train set for text-to-visual-document retrieval.
AGC compresses ~1300 visual document token vectors into a fixed budget of 64 vectors (95.1% compression), while maintaining 94.5% of uncompressed retrieval performance (nDCG@5).
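These two headline figures follow directly from the numbers reported below (a 64-vector budget against 1297 uncompressed tokens, and 56.7 vs. 60.0 nDCG@5 for the uncompressed baseline) and can be checked with a line of arithmetic:

```python
# Compression ratio: 64 retained vectors vs. 1297 original visual token vectors.
orig_tokens, budget = 1297, 64
compression = 1 - budget / orig_tokens
print(f"{compression:.1%}")  # 95.1% of the index is eliminated

# Retained retrieval quality: AGC nDCG@5 relative to the uncompressed baseline.
retained = 56.7 / 60.0
print(f"{retained:.1%}")  # 94.5% of uncompressed performance
```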
Method Overview
AGC consists of three components:
Attention-based Centroid Selection — Learned universal query tokens are appended to the document token sequence. The attention weights from these learned tokens to document tokens at the last transformer layer produce saliency scores, identifying the most semantically important regions. The top-m tokens by saliency are selected as cluster centroids.
Hard Clustering — Every document token is assigned to its nearest centroid via cosine similarity, grouping related content into coherent clusters while preserving distinct semantic details.
Weighted Aggregation — Tokens within each cluster are aggregated into a single vector by saliency-weighted averaging, which prioritizes informative tokens and keeps gradient flow stable during training.
The resulting m compressed vectors are used with ColBERT-style MaxSim scoring for retrieval.
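The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual implementation: the random `tokens` and `saliency` arrays stand in for the last-layer hidden states and the attention weights that the learned universal query tokens place on them.

```python
import numpy as np

def agc_compress(tokens: np.ndarray, saliency: np.ndarray, m: int) -> np.ndarray:
    """Compress (n, d) token vectors into (m, d) via attention-guided clustering."""
    # 1. Centroid selection: the top-m tokens by saliency score.
    centroid_idx = np.argsort(-saliency)[:m]
    centroids = tokens[centroid_idx]

    # 2. Hard clustering: assign every token to its nearest centroid (cosine similarity).
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    assign = (t @ c.T).argmax(axis=1)  # (n,) cluster id per token

    # 3. Saliency-weighted aggregation within each cluster. Each centroid is its
    # own nearest neighbor, so every cluster is non-empty.
    out = np.zeros((m, tokens.shape[1]))
    for k in range(m):
        members = assign == k
        w = saliency[members]
        out[k] = (w[:, None] * tokens[members]).sum(axis=0) / w.sum()
    return out

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1297, 2048))  # stand-in: one document's visual token vectors
saliency = rng.random(1297)             # stand-in: attention-derived saliency scores
compressed = agc_compress(tokens, saliency, m=64)
print(compressed.shape)  # (64, 2048)
```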
Results on ViDoRe v2
| Method | Tokens | nDCG@5 (Avg) | Bio | Econ | ESG-R | ESG-H |
|---|---|---|---|---|---|---|
| ColPali | – | 53.3 | 56.5 | 49.9 | 55.7 | 51.1 |
| ColQwenOmni | – | 56.5 | 56.5 | 53.2 | 54.2 | 62.2 |
| MetaEmbed | 64 | 58.8 | 58.7 | 55.5 | 57.4 | 63.7 |
| Baseline (Ours, uncompressed) | 1297 | 60.0 | 61.4 | 53.9 | 57.0 | 67.6 |
| SeqResize | 64 | 51.7 | 54.7 | 53.5 | 45.2 | 53.5 |
| MemTok | 64 | 54.3 | 56.8 | 53.0 | 46.4 | 61.4 |
| H-Pool | 64 | 56.4 | 59.6 | 52.1 | 53.4 | 60.6 |
| AGC-Qwen2.5-VL-3B (This model) | 64 | 56.7 | 59.0 | 54.5 | 55.8 | 57.3 |
At a 64-token budget, AGC outperforms the other learned compression methods (SeqResize, MemTok) and is more consistent across domains than H-Pool, while using only 4.9% of the original index size.
Model Details
| Property | Value |
|---|---|
| Initial weights | Qwen2.5-VL-3B-Instruct |
| Architecture | Qwen2.5-VL with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | AGC (Attention-Guided Clustering) |
| Universal query tokens | 64 learned tokens (`<\|mem0\|>` – `<\|mem63\|>`) |
| Default budget | 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | "Query: " |
| Passage prefix | "Passage: " |
| Precision | bfloat16 |
| Max image tokens | 1280 |
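The MaxSim scoring listed above is standard ColBERT late interaction: each query vector is matched against its single best document vector, and the per-query-token maxima are summed. Assuming both sides are already L2-normalized (as this model's embeddings are), a sketch with stand-in random embeddings:

```python
import numpy as np

def maxsim(query: np.ndarray, doc: np.ndarray) -> float:
    """ColBERT late interaction: sum over query tokens of the maximum
    cosine similarity against any of the document's compressed vectors.
    Assumes both inputs are already L2-normalized."""
    sim = query @ doc.T                 # (n_q, n_d) cosine similarities
    return float(sim.max(axis=1).sum())  # best document match per query token

rng = np.random.default_rng(1)
q = rng.normal(size=(12, 2048))  # stand-in: 12 query token embeddings
d = rng.normal(size=(64, 2048))  # stand-in: 64 compressed AGC document vectors
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```

Because each per-token maximum is bounded by 1, the score is bounded by the number of query tokens; scoring a query against its own embeddings attains that bound.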
Usage
```python
import torch
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info
from src.arguments import ModelArguments
from src.encoder.select_encoder import AttentionSelectEncoder
from src.models.qwen2_5_vl_embed.qwen2_5_vl_embed import Qwen2_5ForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "hltcoe/AGC_qwen2.5-vl_colpali"
IMAGE_PATH = "PLACEHOLDER"  # path to a document page image
NUM_PROXY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_PROXY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="select",
    normalize=True,
    num_appending_token=NUM_PROXY_TOKENS,
    use_cluster_pooling=True,
    use_attn_weight_cluster_pooling=True,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AttentionSelectEncoder.load(
    Qwen2_5ForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode an image document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "image", "image": IMAGE_PATH, "max_pixels": 1003520, "min_pixels": 614656},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
image_inputs, video_inputs = process_vision_info(passage_messages)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt",
).to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
print(doc_embeddings.shape)
# doc_embeddings: (1, 64, 2048) — 64 compressed AGC vectors

# --- Encode a text query ---
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Query: What types of tissues are unable to regenerate spontaneously?"},
        ],
    }
]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```
Command line usage
For running inference and evaluation from the command line, see the Quick Start section.
Citation
```bibtex
@misc{qin2026multivectorindexcompressionmodality,
  title={Multi-Vector Index Compression in Any Modality},
  author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
  year={2026},
  eprint={2602.21202},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.21202},
}
```