
Prism-Reranker

Beyond Relevance Scoring — Jointly Producing Contributions and Evidence for Agentic Retrieval.

Unlike standard rerankers, which emit only a relevance score, this reranker family returns three things in a single forward pass: a calibrated relevance score, a one-sentence contribution, and a self-contained evidence passage extracted from the document.

Model Architecture

Released models

Five checkpoints are released on the Hugging Face Hub. Four are fine-tuned from the Qwen3.5 backbone; one (-4B-exp) is an experimental extension built on top of Qwen3-Reranker-4B, demonstrating that the same recipe transfers to an existing LLM-based reranker without losing ranking quality.

Model                          Backbone             Parameters   Hugging Face
Prism-Qwen3.5-Reranker-0.8B    Qwen3.5              0.8B         infgrad/Prism-Qwen3.5-Reranker-0.8B
Prism-Qwen3.5-Reranker-2B      Qwen3.5              2B           infgrad/Prism-Qwen3.5-Reranker-2B
Prism-Qwen3.5-Reranker-4B      Qwen3.5              4B           infgrad/Prism-Qwen3.5-Reranker-4B
Prism-Qwen3.5-Reranker-9B      Qwen3.5              9B           infgrad/Prism-Qwen3.5-Reranker-9B
Prism-Qwen3-Reranker-4B-exp    Qwen3-Reranker-4B    4B           infgrad/Prism-Qwen3-Reranker-4B-exp

Why this model?

In agentic / RAG pipelines, a relevance score is rarely the end goal. After deciding a document is relevant, the agent still has to read it, denoise it, and decide what to do next. Prism-Reranker folds that work into the reranker itself:

  • Relevance score s(q, d) = σ(ℓ_yes − ℓ_no) ∈ (0, 1). Calibrated, ranking-ready.
  • <contribution> — one sentence stating every core point the document contributes to the query. Useful for the agent to plan its next step without re-reading the doc.
  • <evidence> — a self-contained, faithfully-rephrased rewrite of the query-relevant content. Drops irrelevant background, preserves verbatim proper nouns / numbers / dates / code / URLs. You can feed <evidence> directly to a downstream LLM and skip the raw document — saving context tokens and removing web-noise.

If the document is not relevant, the model outputs no and stops. No contribution/evidence is generated.
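
In code, the calibrated score from the first bullet is just a sigmoid over the difference of the two token logits, which is the same number as the softmax over {yes, no} used in the Quickstart below. A minimal sketch with placeholder logit values (the Quickstart shows how to read the real ones from the model):

import torch

# Placeholder logits for the "yes" / "no" tokens at the first generated position.
l_yes, l_no = torch.tensor(4.2), torch.tensor(-1.3)

score = torch.sigmoid(l_yes - l_no)               # sigma(l_yes - l_no)
same = l_yes.exp() / (l_yes.exp() + l_no.exp())   # softmax over {yes, no}
print(score.item(), same.item())                  # identical values in (0, 1)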

Highlights

  • Backbones: Qwen3.5 series for the four main sizes, no architectural changes; one extension variant on top of Qwen3-Reranker-4B.
  • Context length: training data capped at 10K tokens per example, covering most real-world documents.
  • Multilingual: Chinese / English primary; other languages supported but with less coverage.
  • Keyword-query robust: agents often emit keyword-style queries instead of well-formed questions. ~30% of training queries were rewritten by an LLM into keyword form, so the model handles both natural and keyword queries.
  • Real-world data distribution: in addition to open reranker datasets (MS MARCO, T2Ranking, MIRACL, …), training includes synthetic queries paired with real Tavily / Exa web-search results, matching what an actual agent sees at inference time.
  • Length × score balanced: training data was rebalanced so that document length is not a relevance shortcut.
  • Training recipe: distillation (point-wise MSE on a strong commercial reranker's scores) + SFT on yes/no + <contribution> + <evidence>, supervised by a 5-LLM-as-judge ensemble.
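
To make the last bullet concrete, the sketch below shows one way the two objectives could be combined. It is an illustration of the described recipe, not the released training code, and the weight alpha is a hypothetical knob:

import torch
import torch.nn.functional as F


def prism_style_loss(yes_logit, no_logit, teacher_score, sft_logits, sft_labels, alpha=1.0):
    # Distillation term: point-wise MSE between the student's calibrated score
    # sigma(l_yes - l_no) and the teacher reranker's score (both in (0, 1)).
    distill = F.mse_loss(torch.sigmoid(yes_logit - no_logit), teacher_score)
    # SFT term: next-token cross-entropy on the yes/no + <contribution> +
    # <evidence> target sequence (prompt positions masked with -100).
    sft = F.cross_entropy(
        sft_logits.view(-1, sft_logits.size(-1)),
        sft_labels.view(-1),
        ignore_index=-100,
    )
    return distill + alpha * sft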

Quickstart

Two ways to call the model. Both produce the same relevance score s(q, d) = σ(ℓ_yes − ℓ_no). Use A when you also want <contribution> / <evidence>. Use B when you only need a score and want a drop-in replacement for any other CrossEncoder reranker.

We use one shared example throughout so you can compare the outputs side by side:

QUERY = "What is the boiling point of water at sea level?"
DOCUMENTS = [
    "Water boils at 100 C (212 F) at standard atmospheric pressure (1 atm), "
    "which corresponds to sea-level conditions.",
    "Mount Everest is the highest mountain on Earth, with a peak elevation "
    "of 8,848 meters above sea level.",
]

A. Transformers (full output: score + contribution + evidence)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

SYSTEM_PROMPT = (
    "Judge whether the Document meets the requirements based on "
    "the Query and the Instruct provided. "
)

INSTRUCTION = (
    'Judge if the document is relevant to the query. Reply "yes" or "no".\n'
    'On "yes", also emit:\n'
    "<contribution>One sentence covering every core point the document "
    "contributes to the query, without elaboration.</contribution>\n"
    "<evidence>Self-contained rewrite of the query-relevant content. Rules:\n"
    "- Faithful: rephrase only; add or infer nothing.\n"
    "- Self-contained: evidence alone must fully answer the query.\n"
    "- Concise: drop query-irrelevant background.\n"
    "- Verbatim (no translation): proper nouns, terms, abbreviations, "
    "numbers, dates, code, URLs.\n"
    "- Output language: multilingual doc → query's language; else doc's language."
    "</evidence>"
)

PROMPT_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n"
    "<Instruct>: {instruction}\n"
    "<Query>: {query}\n"
    "<Document>: {doc}<|im_end|>\n"
    "<|im_start|>assistant\n<think>\n\n</think>\n\n"
)


def build_prompt(query: str, doc: str) -> str:
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_PROMPT, instruction=INSTRUCTION, query=query, doc=doc
    )


tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="sdpa",
).eval()

yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("no", add_special_tokens=False)[0]


@torch.no_grad()
def rerank(query: str, doc: str, max_new_tokens: int = 512):
    prompt = build_prompt(query, doc)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    out = model.generate(
        input_ids=input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

    # Relevance score = softmax over {yes, no} at the first generated token.
    first_logprobs = torch.log_softmax(out.scores[0][0].float(), dim=-1)
    yes_p = first_logprobs[yes_id].exp()
    no_p = first_logprobs[no_id].exp()
    score = (yes_p / (yes_p + no_p)).item()

    # Decoded text holds yes/no plus <contribution>...</contribution><evidence>...</evidence>
    gen_ids = out.sequences[0, input_ids.shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True)
    return {"score": score, "text": text}


for doc in DOCUMENTS:
    print(rerank(QUERY, doc))

Expected output (one dict per document):

{"score": 0.98, "text": "yes\n<contribution>...</contribution>\n<evidence>...</evidence>"}
{"score": 0.01, "text": "no"}

For irrelevant pairs the score is close to 0 and text is just "no".
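
If you want the fields separately rather than one text blob, a small parser over result["text"] is enough. The helper below (parse_output) is not part of the repo, just a minimal sketch based on the output format above:

import re

def parse_output(result: dict) -> dict:
    # Split the generated text into verdict, <contribution>, and <evidence>.
    text = result["text"]
    contribution = re.search(r"<contribution>(.*?)</contribution>", text, re.S)
    evidence = re.search(r"<evidence>(.*?)</evidence>", text, re.S)
    return {
        "score": result["score"],
        "verdict": "yes" if text.lstrip().lower().startswith("yes") else "no",
        "contribution": contribution.group(1).strip() if contribution else None,
        "evidence": evidence.group(1).strip() if evidence else None,
    }

for doc in DOCUMENTS:
    print(parse_output(rerank(QUERY, doc)))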

B. Sentence Transformers CrossEncoder (score only)

If you only need the score and want a drop-in CrossEncoder, the same model works directly with sentence-transformers >= 5.4.0. Note: in this mode <contribution> and <evidence> are not produced — only the calibrated relevance score.

The system prompt and instruction are baked into the model's chat_template.jinja and are not configurable — the model was trained with one fixed prompt and only that prompt produces calibrated scores. You only pass (query, document); the rest is hardcoded.

import torch
from sentence_transformers import CrossEncoder

MODEL_PATH = "infgrad/Prism-Qwen3.5-Reranker-4B"  # or any sibling repo above

ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.bfloat16})

# 1) Score (q, d) pairs. The default activation is Sigmoid, so scores are in (0, 1)
# and equal to s(q, d) = sigmoid(logit_yes - logit_no) — identical to path A above.
pairs = [(QUERY, doc) for doc in DOCUMENTS]
scores = ce.predict(pairs)
print(scores)
# array([0.98, 0.01], dtype=float32)

# 2) Rank documents directly.
ranked = ce.rank(QUERY, DOCUMENTS, return_documents=True)
for r in ranked:
    print(f"{r['score']:.3f}\t{r['corpus_id']}\t{r['text'][:80]}")

To get raw logit differences instead of [0, 1] probabilities, pass activation_fn=torch.nn.Identity() to ce.predict(...).
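
For example, with the pairs defined above:

raw = ce.predict(pairs, activation_fn=torch.nn.Identity())
print(raw)  # unbounded logit differences (l_yes - l_no) rather than (0, 1) probabilities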

A note on numerical parity with path A

In fp32, paths A and B produce the same score to within ~1e-6 (verified across all five checkpoints).

In bf16 with the default batched call (batch_size > 1), CE scores can drift from path A by ~1–3% for individual pairs. The cause is bf16 SDPA: when CrossEncoder pads shorter sequences to the longest in the batch, the bf16 attention numerics differ by a few ULPs vs running each pair alone, and the difference accumulates across layers before the final sigmoid. Ranking order is unaffected. If you need bit-for-bit parity with path A:

# Option 1: keep bf16, disable batching
ce.predict(pairs, batch_size=1)

# Option 2: use fp32 (slower, larger memory)
ce = CrossEncoder(MODEL_PATH, model_kwargs={"torch_dtype": torch.float32})
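
As a quick sanity check on your own hardware (assuming the rerank helper and model from path A are still in scope), you can compare the two paths directly:

a_scores = [rerank(QUERY, doc, max_new_tokens=1)["score"] for doc in DOCUMENTS]
b_scores = ce.predict([(QUERY, doc) for doc in DOCUMENTS], batch_size=1)
print([abs(a - b) for a, b in zip(a_scores, b_scores)])  # differences should be ~0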

Notes on usage

  • The first generated token is always yes or no, so the score is well-defined even if you stop generation immediately (cheap mode: max_new_tokens=1; see the snippet after this list). Generate further only when you also want contribution/evidence.
  • Inputs longer than 10K tokens may degrade — truncate the document side first.
  • Greedy decoding is fine for ranking. For diverse evidence rephrasings, use temperature=0.3-0.5.
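
For the cheap score-only mode from the first bullet, reusing the rerank helper from path A:

# Score only: stop after the first token; no contribution/evidence is generated.
score_only = rerank(QUERY, DOCUMENTS[0], max_new_tokens=1)["score"]
print(score_only)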

Citation

If you use Prism-Reranker in your research, please cite:

@misc{zhang2025prismreranker,
  title  = {Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval},
  author = {Dun Zhang},
  year   = {2025},
  eprint = {2604.23734},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url    = {https://arxiv.org/abs/2604.23734},
}

Contact

Dun Zhang — dunnzhang0@gmail.com (independent researcher).
