Commit 5ae7be7 · Parent: 5b91da0 · Committed by Tom Aarsen

Integrate with Sentence Transformers v5.4.0
README.md CHANGED
@@ -2,6 +2,7 @@
 ---
 pipeline_tag: text-classification
 tags:
+- sentence-transformers
 - vidore
 - reranker
 - qwen2_vl
@@ -135,8 +136,83 @@ Compared to `jina-reranker-v2-base-multilingual`, `jina-reranker-m0` significant
 ```
 The `relevance_score` field indicates the relevance of each document to the query, with higher scores indicating greater relevance.
 
-2. You can also use the `transformers` library to interact with the model programmatically.
+2. You can also use the model programmatically with the `sentence_transformers` library.
+
+Firstly, install Sentence Transformers:
+```bash
+pip install sentence_transformers
+```
+
+Then load the model:
+```python
+from sentence_transformers import CrossEncoder
+
+model = CrossEncoder("jinaai/jina-reranker-m0", trust_remote_code=True)
+```
+
+**A. Text-to-Text Reranking**
+```python
+query = "slm markdown"
+documents = [
+    "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
+    "数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
+    "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
+    "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
+]
+
+rankings = model.rank(query, documents)
+print(rankings)
+# [{'corpus_id': 0, 'score': 0.6875}, {'corpus_id': 2, 'score': 0.5938},
+#  {'corpus_id': 3, 'score': 0.4590}, {'corpus_id': 1, 'score': 0.4434}]
+```
+
+**B. Text-to-Image Reranking**
+```python
+query = "slm markdown"
+documents = [
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png",
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
+    "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp",
+]
+
+scores = model.predict([(query, doc) for doc in documents])
+print(scores)
+# [0.4980 0.7813 0.4824 0.5039]
+```
+
+**C. Image-to-Text Reranking**
+```python
+query = "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
+documents = [
+    "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.",
+    "数据提取么?为什么不用正则啊,你用正则不就全解决了么?",
+    "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
+    "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar.",
+]
+
+scores = model.predict([(query, doc) for doc in documents])
+print(scores)
+# [0.9805 0.7773 0.5664 0.9297]
+```
+
+**D. Image-to-Image Reranking**
+```python
+query = "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
+documents = [
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png",
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
+    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
+    "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp",
+]
+
+scores = model.predict([(query, doc) for doc in documents])
+print(scores)
+# [0.6250 0.9922 0.8125 0.7930]
+```
+
+
+3. Or you can use custom methods via `trust_remote_code=True` using the `transformers` library.
 
 Before you start, install the `transformers` libraries:
 
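Under the hood, `rank()` amounts to scoring every (query, document) pair with `predict()` and sorting the results by score in descending order. A minimal pure-Python sketch of that ordering step, using the text-to-text (mode A) scores from the README example (this illustrates only the sort, not Sentence Transformers' exact internals):

```python
# Per-document scores as returned for the mode-A example above.
scores = [0.6875, 0.4434, 0.5938, 0.4590]

# rank() pairs each score with its corpus_id and sorts descending by score.
rankings = sorted(
    ({"corpus_id": i, "score": s} for i, s in enumerate(scores)),
    key=lambda r: r["score"],
    reverse=True,
)
print([r["corpus_id"] for r in rankings])  # [0, 2, 3, 1]
```

The resulting order `[0, 2, 3, 1]` matches the `corpus_id` order shown in the README's mode-A output.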
chat_template.jinja ADDED
@@ -0,0 +1,26 @@
+{%- macro render(message) -%}
+{%- if message['content'] is string -%}
+{{ message['content'] }}
+{%- else -%}
+{%- for item in message['content'] -%}
+{%- if item['type'] == 'text' -%}
+{{ item['text'] }}
+{%- elif item['type'] == 'image' or 'image' in item -%}
+<|vision_start|><|image_pad|><|vision_end|>
+{%- endif -%}
+{%- endfor -%}
+{%- endif -%}
+{%- endmacro -%}
+
+{%- set ns = namespace(doc='', query='') -%}
+{%- for message in messages -%}
+{%- if message['role'] == 'query' -%}
+{%- set ns.query = render(message) -%}
+{%- elif message['role'] == 'document' -%}
+{%- set ns.doc = render(message) -%}
+{%- endif -%}
+{%- endfor -%}
+**Document**:
+{{ ns.doc }}
+**Query**:
+{{ ns.query }}
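For illustration, the template's logic can be re-implemented in plain Python: text content passes through verbatim, image items collapse to the vision placeholder tokens, and the document is always rendered before the query. This is a sketch only; the model itself applies the Jinja template through the tokenizer:

```python
def render_rerank_prompt(messages):
    """Python re-implementation of chat_template.jinja, for illustration only."""
    def render(message):
        content = message["content"]
        if isinstance(content, str):
            return content
        parts = []
        for item in content:
            if item.get("type") == "text":
                parts.append(item["text"])
            elif item.get("type") == "image" or "image" in item:
                # Images render as identical vision-pad placeholders.
                parts.append("<|vision_start|><|image_pad|><|vision_end|>")
        return "".join(parts)

    doc, query = "", ""
    for message in messages:
        if message["role"] == "query":
            query = render(message)
        elif message["role"] == "document":
            doc = render(message)
    # Document-first ordering, exactly as the template emits it.
    return f"**Document**:\n{doc}\n**Query**:\n{query}"

print(render_rerank_prompt([
    {"role": "query", "content": "slm markdown"},
    {"role": "document", "content": [{"type": "image", "image": "paper-11.png"}]},
]))
```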
chat_template.json DELETED
@@ -1,3 +0,0 @@
-{
-  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
-}
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+{
+  "__version__": {
+    "pytorch": "2.10.0+cu128",
+    "sentence_transformers": "5.4.0"
+  },
+  "activation_fn": "torch.nn.modules.linear.Identity",
+  "default_prompt_name": null,
+  "model_type": "CrossEncoder",
+  "prompts": {}
+}
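The `activation_fn` field stores the activation as a dotted import path. A generic sketch of how such a path can be resolved back to a class (not necessarily Sentence Transformers' exact loader; a stdlib class is resolved here so the example stays self-contained):

```python
import importlib

def resolve_dotted_path(path: str):
    """Resolve a dotted path like 'torch.nn.modules.linear.Identity' to the object it names."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

# The config value is "torch.nn.modules.linear.Identity"; a stdlib path
# resolves the same way:
cls = resolve_dotted_path("collections.OrderedDict")
print(cls)
```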
custom_transformer.py ADDED
@@ -0,0 +1,40 @@
+"""Custom Transformer module for jina-reranker-m0 that fixes image ordering for image-image pairs.
+
+The Qwen2VL processor extracts images from messages in iteration order. ST creates messages
+as [query_msg, doc_msg], but the chat template renders doc-first. For single-image pairs this
+is fine, but for image-image pairs the two images get swapped. This module swaps the pair
+elements so the processor extracts images in doc-first order, matching the template rendering.
+Since both elements render as identical <|image_pad|> tokens, the role swap is invisible.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from PIL import Image
+
+from sentence_transformers.base.modality import is_image_url_or_path
+from sentence_transformers.base.modules.transformer import Transformer
+
+
+def _is_image(item: Any) -> bool:
+    return isinstance(item, Image.Image) or (isinstance(item, str) and is_image_url_or_path(item))
+
+
+class JinaRerankerTransformer(Transformer):
+    def preprocess(
+        self,
+        inputs: list,
+        prompt: str | None = None,
+        **kwargs,
+    ) -> dict[str, Any]:
+        # Swap image-image pairs so the processor extracts images in doc-first order,
+        # matching the chat template's doc-first rendering.
+        swapped = []
+        for item in inputs:
+            if isinstance(item, (list, tuple)) and len(item) == 2 and _is_image(item[0]) and _is_image(item[1]):
+                swapped.append((item[1], item[0]))
+            else:
+                swapped.append(item)
+
+        return super().preprocess(swapped, prompt=prompt, **kwargs)
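The swap logic can be exercised on its own: only pairs where *both* elements are images are reordered, while text-image and image-text pairs pass through untouched. A sketch with a stand-in `is_image` predicate (the real module checks PIL images and `is_image_url_or_path`):

```python
def swap_image_image_pairs(inputs, is_image):
    # Mirror of JinaRerankerTransformer.preprocess: reorder a pair to
    # doc-first only when both elements are images.
    swapped = []
    for item in inputs:
        if isinstance(item, (list, tuple)) and len(item) == 2 and is_image(item[0]) and is_image(item[1]):
            swapped.append((item[1], item[0]))
        else:
            swapped.append(item)
    return swapped

# Stand-in predicate for the example: treat .png paths as images.
is_image = lambda x: isinstance(x, str) and x.endswith(".png")

pairs = [("query.png", "doc.png"), ("text query", "doc.png")]
print(swap_image_image_pairs(pairs, is_image))
# [('doc.png', 'query.png'), ('text query', 'doc.png')]
```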
modeling.py CHANGED
@@ -1,6 +1,5 @@
 import torch
 from torch import nn
-import numpy as np
 from typing import Optional, Tuple, List, Union
 from transformers import Qwen2VLForConditionalGeneration
 import logging
@@ -75,6 +74,8 @@ def formatting_prompts_func(
 
 class JinaVLForRanking(Qwen2VLForConditionalGeneration):
     def __init__(self, config):
+        # Disable weight tying before init so replacing lm_head with Identity doesn't break loading
+        config.tie_word_embeddings = False
         super().__init__(config)
 
         self.padding_side = "left"
@@ -83,11 +84,13 @@ class JinaVLForRanking(Qwen2VLForConditionalGeneration):
         # hack the lm_head to do nothing, since we only want the hidden states
         self.lm_head = nn.Identity()
 
+        hidden_size = getattr(config, "hidden_size", None) or config.text_config.hidden_size
+
         # copy the idea from `Qwen2ForRewardModel` to have a MLP layer to get the final score
         self.score = nn.Sequential(
-            nn.Linear(config.hidden_size, config.hidden_size),
+            nn.Linear(hidden_size, hidden_size),
             nn.ReLU(),
-            nn.Linear(config.hidden_size, self.num_labels),
+            nn.Linear(hidden_size, self.num_labels),
         )
 
         # Initialize weights and apply final processing
@@ -95,14 +98,46 @@ class JinaVLForRanking(Qwen2VLForConditionalGeneration):
 
         self.score_token_id = 100
 
-    def forward(self, *args, **kwargs) -> torch.Tensor:
-        # Delete output_hidden_states from kwargs
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        pixel_values=None,
+        image_grid_thw=None,
+        video_grid_thw=None,
+        mm_token_type_ids=None,
+        **kwargs,
+    ) -> torch.Tensor:
        kwargs.pop("output_hidden_states", None)
        kwargs.pop("use_cache", None)
        assert kwargs.pop("labels", None) is None, "labels should not be passed to forward()"
 
+        # Auto-append score token if not already the last token, required for inference that bypasses compute_score
+        if input_ids is not None and not (input_ids[:, -1] == self.score_token_id).all():
+            batch_size = input_ids.size(0)
+            score_token = torch.full(
+                (batch_size, 1), self.score_token_id,
+                device=input_ids.device, dtype=input_ids.dtype,
+            )
+            input_ids = torch.cat([input_ids, score_token], dim=1)
+            if attention_mask is not None:
+                attention_mask = torch.cat([
+                    attention_mask,
+                    torch.ones(batch_size, 1, device=attention_mask.device, dtype=attention_mask.dtype),
+                ], dim=1)
+            if mm_token_type_ids is not None:
+                mm_token_type_ids = torch.cat([
+                    mm_token_type_ids,
+                    torch.zeros(batch_size, 1, device=mm_token_type_ids.device, dtype=mm_token_type_ids.dtype),
+                ], dim=1)
+
         outputs = super().forward(
-            *args,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            pixel_values=pixel_values,
+            image_grid_thw=image_grid_thw,
+            video_grid_thw=video_grid_thw,
+            mm_token_type_ids=mm_token_type_ids,
             use_cache=False,
             output_hidden_states=True,
             **kwargs,
@@ -113,9 +148,10 @@ class JinaVLForRanking(Qwen2VLForConditionalGeneration):
 
         # IMPORTANT: the padding token must be on the left side
         # get the hidden states of the last token and apply the linear layer
-        pooled_logits = self.score(hidden_states[:, -1])
+        pooled_logits = self.score(hidden_states[:, -1]).squeeze(-1)
 
-        return pooled_logits.squeeze(-1)
+        # normalize scores to [0, 1] with sigmoid with a bias
+        return torch.sigmoid(pooled_logits - LOGIT_BIAS)
 
     @torch.no_grad()
     def compute_score(
@@ -211,7 +247,7 @@ class JinaVLForRanking(Qwen2VLForConditionalGeneration):
             max_length=max_length,
         )
 
-        # append the reward token to the input_ids and attention_mask
+        # append the reward token to the input_ids, attention_mask, and mm_token_type_ids
        batch_size = batch["input_ids"].size(0)
        batch["input_ids"] = torch.cat(
            [
@@ -227,14 +263,19 @@ class JinaVLForRanking(Qwen2VLForConditionalGeneration):
            ],
            dim=1,
        )
+        if "mm_token_type_ids" in batch:
+            batch["mm_token_type_ids"] = torch.cat(
+                [
+                    batch["mm_token_type_ids"],
+                    torch.zeros((batch_size, 1), device=batch["mm_token_type_ids"].device, dtype=batch["mm_token_type_ids"].dtype),
+                ],
+                dim=1,
+            )
         # move the batch to the correct device
         batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
 
         scores = self.forward(**batch).view(-1).cpu().float().numpy()
 
-        # normalize scores to [0, 1] with sigmoid with a scale
-        scores = 1.0 / (1.0 + np.exp(-(scores - LOGIT_BIAS)))
-
         all_scores.extend(scores.tolist())
 
         if len(all_scores) == 1:
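With this change, score normalization moves from NumPy in `compute_score` into the model's `forward`: `torch.sigmoid(pooled_logits - LOGIT_BIAS)` is, for a scalar, the same map as the removed `1.0 / (1.0 + np.exp(-(scores - LOGIT_BIAS)))` line. A plain-`math` sketch of that map (the `LOGIT_BIAS` value below is a placeholder; the real constant is defined elsewhere in modeling.py):

```python
import math

LOGIT_BIAS = 2.65  # placeholder value for illustration only

def normalize(logit: float) -> float:
    # Scalar equivalent of torch.sigmoid(logit - LOGIT_BIAS): shift the raw
    # last-token logit by the bias, then squash into (0, 1).
    return 1.0 / (1.0 + math.exp(-(logit - LOGIT_BIAS)))

print(normalize(LOGIT_BIAS))  # 0.5
```

A logit exactly at the bias maps to 0.5; larger logits approach 1 and smaller ones approach 0, which matches the relevance scores in [0, 1] shown in the README examples.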
modules.json ADDED
@@ -0,0 +1,8 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "custom_transformer.JinaRerankerTransformer"
+  }
+]
preprocessor_config.json CHANGED
@@ -1,6 +1,6 @@
 {
   "min_pixels": 3136,
-  "max_pixels": 12845056,
+  "max_pixels": 602112,
   "patch_size": 14,
   "temporal_patch_size": 2,
   "merge_size": 2,
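Lowering `max_pixels` caps the vision-token budget per image. Assuming the standard Qwen2-VL relation that one token covers `patch_size² × merge_size²` pixels (an estimate from the config values, not a measured figure):

```python
# Upper bound on vision tokens per image implied by the processor config.
patch_size, merge_size = 14, 2
pixels_per_token = patch_size * patch_size * merge_size * merge_size  # 784

old_budget = 12845056 // pixels_per_token
new_budget = 602112 // pixels_per_token
print(old_budget, new_budget)  # 16384 768
```

So the change reduces the per-image cap from roughly 16384 tokens to 768, substantially cutting memory and compute per image.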
sentence_bert_config.json ADDED
@@ -0,0 +1,24 @@
+{
+  "transformer_task": "feature-extraction",
+  "modality_config": {
+    "text": {
+      "method": "forward",
+      "method_output_name": null
+    },
+    "message": {
+      "method": "forward",
+      "method_output_name": null
+    }
+  },
+  "module_output_name": "scores",
+  "message_format": "structured",
+  "config_kwargs": {
+    "trust_remote_code": true,
+    "num_labels": 1
+  },
+  "processing_kwargs": {
+    "chat_template": {
+      "add_generation_prompt": false
+    }
+  }
+}
tokenizer_config.json CHANGED
@@ -130,7 +130,7 @@
     "<|video_pad|>"
   ],
   "bos_token": null,
-  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
+  "chat_template": "chat_template.jinja",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",