AGC-Qwen2.5-Omni-3B
This model applies Attention-Guided Clustering (AGC) to compress multi-vector audiovisual video representations for efficient ColBERT-style late-interaction retrieval. Weights are initialized from the thinker component of Qwen2.5-Omni-3B-Instruct, finetuned with bidirectional attention on RankVideo-Dataset, and evaluated on MultiVENT 2.0 for audiovisual text-to-video retrieval.
AGC compresses the video and audio token vectors of each document into a fixed budget of 64 vectors while retaining strong retrieval quality.
Method Overview
AGC consists of three components:
Attention-based Centroid Selection — Learned universal query tokens are appended to the document token sequence. The attention weights from these learned tokens to document tokens at the last transformer layer produce saliency scores, identifying the most semantically important regions. The top-m tokens by saliency are selected as cluster centroids.
Hard Clustering — Every document token is assigned to its nearest centroid via cosine similarity, grouping related content into coherent clusters while preserving distinct semantic details.
Weighted Aggregation — Tokens within each cluster are aggregated into a single vector by saliency-weighted averaging, which prioritizes informative tokens and keeps gradient flow stable during training.
The resulting m compressed vectors are used with ColBERT-style MaxSim scoring for retrieval.
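The three steps above can be sketched as follows. This is a minimal NumPy illustration, not the repository's implementation: it assumes the per-token saliency scores (attention weights from the learned query tokens) have already been extracted, and the function name `agc_compress` is hypothetical.

```python
import numpy as np

def agc_compress(doc_tokens: np.ndarray, saliency: np.ndarray, m: int = 64) -> np.ndarray:
    """doc_tokens: (n, d) document token vectors; saliency: (n,) attention-derived scores."""
    n, d = doc_tokens.shape
    # 1) Attention-based centroid selection: top-m tokens by saliency.
    centroid_idx = np.argsort(saliency)[::-1][: min(m, n)]
    centroids = doc_tokens[centroid_idx]

    # 2) Hard clustering: assign every token to its nearest centroid by cosine similarity.
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    assign = (norm(doc_tokens) @ norm(centroids).T).argmax(axis=-1)   # (n,)

    # 3) Saliency-weighted aggregation within each cluster (softmax weights).
    compressed = np.zeros((len(centroid_idx), d))
    for k in range(len(centroid_idx)):
        members = doc_tokens[assign == k]
        s = saliency[assign == k]
        if len(s) == 0:
            continue  # empty cluster (rare): leave a zero vector
        w = np.exp(s - s.max())
        w /= w.sum()
        compressed[k] = (w[:, None] * members).sum(axis=0)
    return compressed  # (m, d)

tokens = np.random.randn(300, 64)   # toy document tokens
scores = np.random.rand(300)        # toy saliency scores
print(agc_compress(tokens, scores, m=32).shape)  # (32, 64)
```

Because each centroid token is maximally similar to itself, every cluster contains at least its own centroid, so the weighted average is well defined.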
Results on MultiVENT 2.0
| Method | Tokens | R@10 | nDCG@10 |
|---|---|---|---|
| SeqResize | 64 | 41.1 | 38.5 |
| MemTok | 64 | 48.7 | 44.8 |
| H-Pool | 64 | 49.2 | 46.5 |
| AGC (this model) | 64 | 49.6 | 46.3 |
At the same 64-token budget, AGC achieves the highest R@10 (49.6) among the compared compression methods and a competitive nDCG@10 (46.3, within 0.2 of H-Pool).
Model Details
| Property | Value |
|---|---|
| Initial weights | Qwen2.5-Omni-3B-Instruct (thinker) |
| Architecture | Qwen2.5-Omni (thinker) with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | AGC (Attention-Guided Clustering) |
| Universal query tokens | 64 learned universal query tokens (<|mem0|> – <|mem63|>) |
| Default budget | 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | "Query: " |
| Passage prefix | "Passage: " |
| Precision | bfloat16 |
| Training video frames | 24 |
| Audio sampling rate | 4 kHz |
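MaxSim late interaction, as listed in the table, scores a query against a document by letting each query vector match its most similar document vector and summing the per-vector maxima. A minimal NumPy sketch (illustrative only, not the repository's `compute_similarity`):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: query_emb (q, d) and doc_emb (m, d), both L2-normalized.
    Each query vector takes its maximum cosine similarity over the document
    vectors; the maxima are summed into a single relevance score."""
    sim = query_emb @ doc_emb.T          # (q, m) cosine similarities
    return float(sim.max(axis=1).sum())

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = l2norm(rng.standard_normal((8, 2048)))     # 8 query token vectors
doc = l2norm(rng.standard_normal((64, 2048)))  # 64 compressed AGC document vectors
print(maxsim(q, doc))
```

Since all vectors are unit-normalized, the score is bounded above by the number of query vectors, which makes scores comparable across documents for a fixed query.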
Usage
```python
import torch
from transformers import AutoProcessor
from qwen_omni_utils import process_mm_info
from src.arguments import ModelArguments
from src.encoder.select_encoder import AttentionSelectEncoder
from src.models.qwen2_5_omni_embed.qwen2_5_omni_embed import Qwen2_5OmniForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "PLACEHOLDER"
VIDEO_PATH = "PLACEHOLDER"
AUDIO_PATH = "PLACEHOLDER"
NUM_PROXY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_PROXY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="select",
    normalize=True,
    num_appending_token=NUM_PROXY_TOKENS,
    use_cluster_pooling=True,
    use_attn_weight_cluster_pooling=True,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AttentionSelectEncoder.load(
    Qwen2_5OmniForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video+audio document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 75264, "min_pixels": 65856},
            {"type": "audio", "audio": AUDIO_PATH},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX
audio_inputs, image_inputs, video_inputs = process_mm_info([passage_messages], use_audio_in_video=False)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, audio=audio_inputs,
    padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
print(doc_embeddings.shape)
# doc_embeddings: (1, 64, 2048) — 64 compressed AGC vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```
Command line usage
For running inference and evaluation from the command line, see the Quick Start section.
Citation
```bibtex
@misc{qin2026multivectorindexcompressionmodality,
  title={Multi-Vector Index Compression in Any Modality},
  author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
  year={2026},
  eprint={2602.21202},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2602.21202},
}
```