ReAG-Critic: Passage Relevance Filter for KB-VQA
ReAG-Critic is the passage filtering component of ReAG, a Reasoning-Augmented Multimodal RAG pipeline for Knowledge-Based Visual Question Answering (KB-VQA). Given an image, a question, and a retrieved text passage, it predicts whether the passage is relevant and should be forwarded to the generator, or discarded as noise.
It is based on Qwen2.5-VL and makes its decision by reading the next-token probabilities of "Yes" and "No": a passage is kept when the probability of "Yes" exceeds a configurable threshold (default: 0.1). This makes the critic fast and easy to calibrate.
The filtered passages are then passed to the generator (aimagelab/ReAG-3B or aimagelab/ReAG-7B) for answer generation with explicit chain-of-thought reasoning.
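The thresholded decision rule can be illustrated in isolation. The sketch below is a hypothetical two-token simplification (the real critic softmaxes over the full vocabulary, and `relevance_decision` is not part of the released code):

```python
import math


def relevance_decision(yes_logit: float, no_logit: float, threshold: float = 0.1) -> bool:
    """Softmax over the two candidate tokens, then threshold P("Yes").

    Hypothetical illustration: restricting the softmax to "Yes"/"No" gives an
    upper bound on the true P("Yes") computed over the whole vocabulary.
    """
    exp_yes, exp_no = math.exp(yes_logit), math.exp(no_logit)
    p_yes = exp_yes / (exp_yes + exp_no)
    return p_yes > threshold


print(relevance_decision(2.0, 0.0))   # P("Yes") ~ 0.88 -> True (passage kept)
print(relevance_decision(-3.0, 3.0))  # P("Yes") ~ 0.002 -> False (passage dropped)
```

Because the threshold is applied to a probability rather than a hard argmax, recall can be traded against precision simply by moving it.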
Model Description
Standard retrieval-augmented VQA methods often pass noisy or irrelevant passages directly to the generator, limiting answer quality. ReAG addresses this with a two-step approach:
- ReAG-Critic (this model) evaluates each retrieved passage and filters out irrelevant ones using a multimodal relevance signal (image + question + passage). It outputs a `yes_probability` score; passages above the threshold are kept.
- ReAG Generator receives only the filtered, relevant passages and generates an answer with explicit chain-of-thought reasoning enclosed in `<think>…</think>` tags, followed by a concise `<answer>…</answer>`.
ReAG significantly outperforms prior methods on both Encyclopedic-VQA and InfoSeek.
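The two-step flow amounts to a filter followed by conditioned generation. The sketch below uses a keyword-overlap stub in place of the real multimodal critic; `critic_score` and `filter_passages` are hypothetical names, not part of the released API:

```python
import string

YES_PROB_THRESHOLD = 0.1  # same default threshold as the released critic


def critic_score(question: str, passage: str) -> float:
    """Hypothetical stand-in for the critic's P("Yes"): crude keyword overlap."""
    tokens = lambda s: set(
        s.lower().translate(str.maketrans("", "", string.punctuation)).split()
    )
    overlap = tokens(question) & tokens(passage)
    return min(1.0, len(overlap) / 3)


def filter_passages(question, passages, threshold=YES_PROB_THRESHOLD):
    # Keep only passages the critic scores above the threshold.
    return [p for p in passages if critic_score(question, p) > threshold]


passages = [
    "Wild basil has anti-bacterial properties.",
    "The 2024 Olympics were held in Paris.",
]
kept = filter_passages("What properties does this plant have?", passages)
print(kept)  # only the first passage survives the filter
```

In the real pipeline the stub is replaced by a forward pass of ReAG-Critic, and `kept` becomes the context handed to the generator.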
Full Pipeline Usage
The snippet below shows the complete ReAG inference pipeline: critic filtering followed by generator inference.
```python
import re
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    Qwen2_5_VLForConditionalGeneration,
)

REAG_MODEL_NAME = "aimagelab/ReAG-3B"
CRITIC_MODEL_NAME = "aimagelab/ReAG-Critic"

SYSTEM_PROMPT_REASONING = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the answer. "
    "The reasoning process and answer are enclosed within <think> </think> and "
    "<answer> </answer> tags, respectively, i.e., "
    "<think>reasoning process here</think><answer>short answer here</answer>"
)

RELEVANCY_EVAL_SYSTEM_PROMPT = """You are a multimodal reasoning assistant specialized in Knowledge-Based Visual Question Answering (KB-VQA).
Your task is to evaluate whether a given text passage provides useful and relevant information for answering a question about an image.
You will be given:
- Image: a visual scene containing entities, actions, and context.
- Question: a natural-language question that refers to the image.
- Text Passage: an external knowledge snippet retrieved from a database or the web.
You must analyze the semantic alignment between the text, the image, and the question.
Follow these steps carefully before giving your final decision:
1. Understand the visual scene: Identify the key objects, people, actions, and context visible in the image.
2. Interpret the question: Determine what information the question seeks.
3. Analyze the text passage: Extract the main claims, facts, and entities mentioned in the text.
4. Compare for relevance: Assess whether the information in the text:
- Contains at least one sentence that supports answering the question about the image, OR
- Provides background knowledge needed to interpret or reason about the image-question pair.
Important:
- If even a single sentence in the passage is relevant or useful, consider the entire passage as relevant and answer "Yes".
- If no part of the passage contributes meaningfully to answering the question, answer "No".
Output only one word:
"Yes" -> if the text provides relevant or useful information for answering the question.
"No" -> if the text is irrelevant or unhelpful."""

SECTION_EVAL_USER_TEMPLATE = """Here is the question on the image above:
{question}
Here is the text passage to analyze:
{passage}
Does the text passage contain at least one sentence that may have some information useful to answer the user question?
"Yes"/"No" answer:"""

CONTEXT_VQA_PROMPT = """\
{question}
The following paragraphs may contain useful information to help answer the question correctly:
{context}
"""


def load_image(image_url: str) -> Image.Image:
    """Download an image from a URL and convert it to RGB."""
    response = requests.get(image_url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


def get_model_kwargs():
    """Choose loading options depending on whether a CUDA GPU is available."""
    if torch.cuda.is_available():
        return {
            "device": "cuda",
            "device_map": "balanced",
            "torch_dtype": torch.bfloat16,
            "attn_implementation": "flash_attention_2",
        }
    return {
        "device": "cpu",
        "device_map": "auto",
        "torch_dtype": torch.float32,
    }


def parse_reag_output(text: str):
    """Split the generator output into raw text, <think> reasoning, and <answer>."""
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


def run_reag_generator(model, processor, image: Image.Image, question: str):
    """Generate a <think>…</think><answer>…</answer> response for an image-question pair."""
    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT_REASONING}]},
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]},
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Append the opening <think> tag so the model starts directly with its reasoning.
    inputs = processor(text=[prompt + "<think>"], images=[image], return_tensors="pt", padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    generated_ids = model.generate(
        **inputs, max_new_tokens=512, stop_strings=["</answer>"], tokenizer=processor.tokenizer
    )
    # Decode only the newly generated tokens and restore the stripped <think> prefix.
    input_length = inputs["input_ids"].shape[1]
    generated_text = "<think>" + processor.batch_decode(
        generated_ids[:, input_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return parse_reag_output(generated_text)


def run_reag_critic(critic, processor, image: Image.Image, question: str, passage: str, yes_prob_threshold: float = 0.1):
    """Score a passage's relevance via the next-token probability of "Yes"."""
    messages = [
        {"role": "system", "content": [{"type": "text", "text": RELEVANCY_EVAL_SYSTEM_PROMPT}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": SECTION_EVAL_USER_TEMPLATE.format(question=question, passage=passage)},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(images=[image], text=[prompt], return_tensors="pt", padding=True)
    inputs = {k: v.to(critic.device) for k, v in inputs.items()}
    with torch.inference_mode():
        outputs = critic(**inputs)
    # A single forward pass suffices: only the logits for the next token matter.
    logits = outputs.logits[:, -1, :].float()
    probs = torch.softmax(logits, dim=-1)
    yes_token_id = processor.tokenizer.convert_tokens_to_ids("Yes")
    no_token_id = processor.tokenizer.convert_tokens_to_ids("No")
    return {
        "relevant": probs[0, yes_token_id].item() > yes_prob_threshold,
        "yes_probability": probs[0, yes_token_id].item(),
        "no_probability": probs[0, no_token_id].item(),
    }


# ── Example ──────────────────────────────────────────────────────────────────
image = load_image(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Clinopodium_vulgare_inflorescence.jpg/250px-Clinopodium_vulgare_inflorescence.jpg"
)
question = "What kind of properties does this plant have?"
passages = [
    "# Description:\nWild basil is a perennial rhizomatous herb ...",
    "# Distribution:\nWild basil occurs in suitable locations in most of Europe ...",
    "# Uses:\nThe leaves of wild basil are used as an aromatic herb ... It has been shown to have anti-bacterial properties.",
]

model_kwargs = get_model_kwargs()
device = model_kwargs.pop("device")  # "device" is not a from_pretrained kwarg; remove it before loading

# 1. Load and run the critic
critic_processor = AutoProcessor.from_pretrained(CRITIC_MODEL_NAME, padding_side="left", use_fast=True)
critic = Qwen2_5_VLForConditionalGeneration.from_pretrained(CRITIC_MODEL_NAME, **model_kwargs)
critic.eval()

relevant_passages = []
for passage in passages:
    result = run_reag_critic(critic, critic_processor, image, question, passage)
    if result["relevant"]:
        relevant_passages.append(passage)

# 2. Load the generator and answer with filtered context
context = "\n\n\n".join(relevant_passages) if relevant_passages else ""
processor = AutoProcessor.from_pretrained(REAG_MODEL_NAME, padding_side="left", use_fast=True)
generator = AutoModelForImageTextToText.from_pretrained(REAG_MODEL_NAME, **model_kwargs)
generator.eval()

question_with_context = CONTEXT_VQA_PROMPT.format(question=question, context=context)
output = run_reag_generator(generator, processor, image, question_with_context)
print("Answer:", output["answer"])
print("Reasoning:", output["reasoning"])
```
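The output-parsing step can be checked without loading any model. The snippet below is a standalone copy of the `parse_reag_output` logic, applied to a hand-written sample completion (the sample text is illustrative, not real model output):

```python
import re


def parse_reag_output(text: str):
    # Same parsing logic as in the pipeline above.
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    think_match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    return {
        "raw_output": text.strip(),
        "reasoning": think_match.group(1).strip() if think_match else "",
        "answer": answer_match.group(1).strip() if answer_match else text.strip(),
    }


sample = "<think>The passage says wild basil is anti-bacterial.</think><answer>anti-bacterial</answer>"
parsed = parse_reag_output(sample)
print(parsed["reasoning"])  # The passage says wild basil is anti-bacterial.
print(parsed["answer"])     # anti-bacterial
```

Note the fallback behavior: if the model emits no `<answer>` tags, the whole stripped output is returned as the answer, so downstream code always receives a non-empty string.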
Model Collection
| Model | Description |
|---|---|
| aimagelab/ReAG-3B | Generator (3B) |
| aimagelab/ReAG-7B | Generator (7B) |
| aimagelab/ReAG-Critic | Passage relevance critic (this model) |
Repository & Evaluation
Full inference scripts, dataset setup, FAISS index downloads, and evaluation instructions are available in the official GitHub repository.
Citation
@inproceedings{compagnoni2026reag,
title={{ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering}},
author={Compagnoni, Alberto and Morini, Marco and Sarto, Sara and Cocchi, Federico and Caffagni, Davide and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
booktitle={Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference},
year={2026}
}