LCO-Embedding: Scaling Language-Centric Omnimodal Representation Learning

We are thrilled to release LCO-Embedding, a language-centric omnimodal representation learning framework, together with the LCO-Embedding model family!

This model implements the framework presented in the paper Scaling Language-Centric Omnimodal Representation Learning, accepted at NeurIPS 2025.

Project Page: https://huggingface.co/LCO-Embedding

Github Repository: https://github.com/LCO-Embedding/LCO-Embedding

Quick Start

Note: we use only the thinker component of Qwen2.5-Omni and drop the talker component (see the Using Transformers section below).

Using Sentence Transformers

Install Sentence Transformers with the multimodal extras (for image, audio, and video support):

pip install "sentence_transformers[image,audio,video]" "transformers>=5.6.0"

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",  # pip install kernels; recommended but not mandatory
    },
)

The same "Summarize the above in one word:" instruction used in the paper is baked into the chat template, so encode() takes plain text, file paths, URLs, or multimodal dicts directly.

Text Retrieval

query = "What is the tallest mountain in the world?"
documents = [
    "Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. Its elevation of 8,848.86 metres was established by a joint Chinese-Nepali survey in 2020.",
    "K2, at 8,611 metres above sea level, is the second-highest mountain on Earth, after Mount Everest. It lies in the Karakoram range on the China-Pakistan border.",
    "Mount Kilimanjaro is a dormant volcano in Tanzania. It is the highest mountain in Africa, with its summit about 5,895 metres above sea level.",
]

query_embedding = model.encode(query)
document_embeddings = model.encode(documents)
print(model.similarity(query_embedding, document_embeddings))
# tensor([[0.6456, 0.4331, 0.4788]])

Image Retrieval

query = "How many input modalities does Qwen2.5-Omni support?"
documents = [
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/qwen2.5omni_hgf.png",
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/llama4_hgf.png",
]

query_embedding = model.encode(query)
document_embeddings = model.encode(documents, batch_size=1)
print(model.similarity(query_embedding, document_embeddings))
# tensor([[0.5745, 0.4818]])

Audio Retrieval

query = "A light piano piece"
documents = [
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/joe_hisaishi_summer.mp3",
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/jay_chou_superman_cant_fly.mp3",
]

query_embedding = model.encode(query)
document_embeddings = model.encode(documents, batch_size=1)
print(model.similarity(query_embedding, document_embeddings))
# tensor([[0.4958, 0.0964]])

Video Retrieval

# For video on smaller GPUs, cap the processor up front:
model[0].processing_kwargs.update({
    "video": {"max_pixels": 64 * 28 * 28, "do_sample_frames": True, "fps": 1},
})

query = "How to cook Mapo Tofu?"
documents = [
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/mapo_tofu.mp4",
    "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/zhajiang_noodle.mp4",
]

query_embedding = model.encode(query)
document_embeddings = model.encode(documents, batch_size=1)
print(model.similarity(query_embedding, document_embeddings))
# tensor([[0.6638, 0.4841]])

Multimodal Inputs

To embed a document that combines multiple modalities, pass a dict with any combination of "text", "image", "audio", and "video" keys instead of a single path or string:

documents = [
    {
        "text": "A cooking tutorial for Mapo Tofu",
        "video": "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/mapo_tofu.mp4",
    },
    {
        "image": "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/qwen2.5omni_hgf.png",
        "audio": "https://huggingface.co/Tevatron/OmniEmbed-v0.1/resolve/main/assets/joe_hisaishi_summer.mp3",
    },
]
document_embeddings = model.encode(documents, batch_size=1)
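
These mixed-modality documents score against queries exactly like the single-modality ones. A minimal follow-up, reusing the Mapo Tofu query from the video example above:

query_embedding = model.encode("How to cook Mapo Tofu?")
print(model.similarity(query_embedding, document_embeddings))
# The first document (text + video about Mapo Tofu) should score highest.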

Using Transformers

import torch
from tqdm import tqdm
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Optionally pass max_pixels=1280*28*28 to the processor for more efficient encoding.
processor = Qwen2_5OmniProcessor.from_pretrained("LCO-Embedding/LCO-Embedding-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    "LCO-Embedding/LCO-Embedding-Omni-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
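
All the batch-encoding loops below pool the same way: the final-layer hidden state at the last token position is taken as the embedding. A minimal sketch of that pooling step (the -1 indexing assumes left-padded batches, which is what the inline hidden_states[-1][:, -1, :] below relies on):

def last_token_pool(last_hidden_state):
    # last_hidden_state: (batch, seq_len, hidden_dim), i.e. hidden_states[-1]
    # With left padding, position -1 holds the final real token of every sequence.
    return last_hidden_state[:, -1, :]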

Text Batch Encoding:

texts = ["some random text", "a second random text", "a third random text"] * 30
batch_size = 8
text_prompt =  "{}\nSummarize the above text in one word:" 

all_text_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i : i + batch_size]
        batch_texts = [text_prompt.format(text) for text in batch_texts]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text":text},
                ],

            }
        ] for text in batch_texts]
        text_inputs = processor.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)
        text_inputs = processor(
        text = text_inputs,
        padding = True,
        return_tensors = "pt",
        )
        text_inputs = text_inputs.to("cuda")
        text_outputs = model(
            **text_inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_text_embeddings.append(text_outputs.to(torch.float16).cpu())

all_text_embeddings = torch.cat(all_text_embeddings, dim=0)
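
To run retrieval with these embeddings, L2-normalize them and score with dot products (cosine similarity). A minimal sketch that treats the first text as the query and the rest as candidates:

import torch.nn.functional as F

emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
scores = emb[0:1] @ emb[1:].T   # cosine similarity of the query vs. all candidates
print(scores.topk(3).indices)   # indices of the three closest candidates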

Image Batch Encoding:

images = [...]  # list of PIL.Image objects; best loaded with a DataLoader (see the MIEB evaluation pipeline)
image_prompt = "\nSummarize the above image in one word:"
batch_size = 8

all_image_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(images), batch_size)):
        batch_images = images[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": image_prompt},
                ],
            }
        ] for image in batch_images]
        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(messages, use_audio_in_video=True)
        inputs = processor(
            text=text, 
            audio=audio_inputs, 
            images=image_inputs, 
            videos=video_inputs, 
            return_tensors="pt", 
            padding=True
        )
        inputs = inputs.to("cuda")
        image_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_image_embeddings.append(image_outputs.to(torch.float16).cpu())

all_image_embeddings = torch.cat(all_image_embeddings, dim=0)
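
Because text and image embeddings share one space, text-to-image retrieval is the same normalized dot product. A minimal sketch, assuming all_text_embeddings holds the queries and all_image_embeddings the candidates:

import torch.nn.functional as F

text_emb = F.normalize(all_text_embeddings.float(), p=2, dim=-1)
image_emb = F.normalize(all_image_embeddings.float(), p=2, dim=-1)
similarity = text_emb @ image_emb.T       # (num_texts, num_images)
best_image = similarity.argmax(dim=-1)    # top-1 image index for each text query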

Audio Batch Encoding:

import logging
logging.getLogger("root").setLevel(logging.ERROR)
# Suppress the Qwen Omni system-prompt mismatch warning.

batch_size = 4
audio_prompt = "\nSummarize the above audio in one word:"
audios = [...]  # list of audio file paths or waveforms

all_audio_embeddings = []

with torch.no_grad():
    for i in tqdm(range(0, len(audios), batch_size)):
        torch.cuda.empty_cache()

        batch_audios = audios[i : i + batch_size]
        messages = [[
            {
                "role": "user",
                "content": [
                    {"type": "audio", "audio": audio},
                    {"type": "text", "text": audio_prompt},
                ],
            }
        ] for audio in batch_audios]

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        audio_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_audio_embeddings.append(audio_outputs.to(torch.float16).cpu())
        # Free per-batch tensors to keep GPU memory bounded.
        del inputs, audio_outputs
        torch.cuda.empty_cache()

all_audio_embeddings = torch.cat(all_audio_embeddings, dim=0)

Video Batch Encoding:

videos = [...]  # list of video file paths
video_prompt = "\nSummarize the above video in one word:"
batch_size = 4

long_video = False
# If True, the message below includes example hyperparameters to save memory
# on long videos. They are not optimal; tune them case by case.

all_video_embeddings = []
with torch.no_grad():
    for i in tqdm(range(0, len(videos), batch_size)):
        torch.cuda.empty_cache()

        batch_videos = videos[i : i + batch_size]
        if long_video:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "video",
                            "video": video,
                            "max_pixels": 224 * 224,
                            "fps": 1,
                            "max_frames": 10,
                        },
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]
        else:
            messages = [[
                {
                    "role": "user",
                    "content": [
                        {"type": "video", "video": video},
                        {"type": "text", "text": video_prompt},
                    ],
                }
            ] for video in batch_videos]

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        audio_inputs, image_inputs, video_inputs = process_mm_info(
            messages, use_audio_in_video=False
        )
        inputs = processor(
            text=text,
            audio=audio_inputs,
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
            padding=True,
        )
        inputs = inputs.to("cuda")
        video_outputs = model(
            **inputs, output_hidden_states=True, return_dict=True
        ).hidden_states[-1][:, -1, :]
        all_video_embeddings.append(video_outputs.to(torch.float16).cpu())
        # Free per-batch tensors to keep GPU memory bounded.
        del inputs, video_outputs
        torch.cuda.empty_cache()

all_video_embeddings = torch.cat(all_video_embeddings, dim=0)

Overview

We introduce LCO-Embedding, a language-centric omnimodal representation learning method, and the LCO-Embedding model family, which sets a new state of the art on MIEB (the Massive Image Embedding Benchmark) while also supporting audio and video.

This work also introduces the Generation-Representation Scaling Law, which connects a model's generative capability to the upper bound of its representation quality. Furthermore, we introduce SeaDoc, a challenging visual document retrieval task in Southeast Asian languages, and show that continual generative pretraining before contrastive learning raises that upper bound.

[Figure: overview]

Evaluation Results

We compare LCO-Embedding against state-of-the-art embedding models, including E5-V, Voyage Multimodal 3, mmE5, and GME, on MIEB-Lite (a 51-task subset of MIEB), with results broken down by task category.

[Figure: mieb_lite]

LCO-Embedding is also state-of-the-art on MAEB (the Massive Audio Embedding Benchmark), despite never being trained on audio. The screenshot below is from the MAEB paper.

[Figure: MAEB results screenshot]

Performance and efficiency comparisons of different training strategies using 3B and 7B variants of Qwen2.5-VL backbones.

[Figure: lora_ablation]

Scaling relationship between generation benchmark performance (X-axis) and representation benchmark performance after language-centric contrastive learning (Y-axis).

[Figure: scaling_law]

Citation

If you find LCO-Embedding useful for your research and applications, please cite using this BibTeX:

@article{xiao2025scaling,
  title={Scaling Language-Centric Omnimodal Representation Learning},
  author={Xiao, Chenghao and Chan, Hou Pong and Zhang, Hao and Xu, Weiwen and Aljunied, Mahani and Rong, Yu},
  journal={arXiv preprint arXiv:2510.11693},
  year={2025}
}