The State of Arabic Multimodal Embedding — What a 2B Finetune Taught Us

Community Article · Published April 23, 2026

TL;DR. We built an Arabic-culture visual document retrieval (VDR) dataset from UBC-NLP/PEARL-FULL, finetuned Qwen/Qwen3-VL-Embedding-2B on it, and benchmarked against every public multimodal embedding we could load. Our 2B finetune beats every off-the-shelf model we tested, including a base 4× its size, and triples the base model's score on Arabic visual retrieval. But the absolute numbers (dev NDCG@10 = 0.124) also make an uncomfortable point clear: public multimodal embeddings are not ready for Arabic, and neither scaling parameters nor stamping "multilingual" on a model fixes that. This post documents what worked, what didn't, and what we learned along the way.


1. Why Arabic VDR is hard

Visual Document Retrieval asks a single question: given a natural-language query, find the right page out of a big corpus of document screenshots. It's the retrieval task behind every useful doc-QA product — the step before you hand context to an LLM.

On English VDR, the space is basically solved. Tom Aarsen's recent blog reports off-the-shelf 2B models at ~0.89 NDCG@10, with his own finetune reaching 0.95. Switch the queries and the documents to Arabic, though, and the floor collapses:

| Model | Size | English VDR¹ | Arabic VDR (ours) |
|---|---|---|---|
| Qwen3-VL-Embedding-8B | 8 B | 0.923 | 0.066 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2 B | 0.948 | 0.053 |
| Qwen3-VL-Embedding-2B (base) | 2 B | 0.888 | 0.041 |
| llamaindex/vdr-2b-multi-v1 | 2 B | 0.912 | 0.035 |
| BAAI/BGE-VL-large | 0.5 B | — | 0.003 |

¹ English numbers from Tom's leaderboard on llamaindex-vdr-en-eval (300 queries, 1 500 docs). Arabic numbers from our Pearl-vdr-ar dev (999 queries, 4 995 docs).

That's not a small gap. A state-of-the-art multimodal embedding drops from 0.95 to 0.05 when you change the query language. Scaling to 8 B gets you only a 1.6× bump on Arabic. A model whose card says "multilingual VDR" (vdr-2b-multi-v1) actually scores below the monolingual English base on our Arabic data.

So: the failure mode isn't capacity, it's representation. Current multimodal embeddings learned Arabic from scraps.


2. Building a culturally-grounded Arabic VDR dataset

To train against this, we needed an Arabic-native VDR corpus — not a machine-translated English one. We started from UBC-NLP/PEARL-FULL: 309 298 rows, 135 220 unique culturally-aligned images, across 9 categories (Architecture, Clothes, Fauna, Flora, Food, Geography, Handicrafts, LandMarks, Music) × 19 Arab countries.

To turn it into VDR-style triplets we:

  1. Sampled 50 000 queries stratified by category (so every topic and country is represented).
  2. Deduplicated images across all 309 k rows via augmented_caption — the same image appears in multiple Q&A rows and we want each unique image once.
  3. Mined 4 hard negatives per query using metadata rules:
    • negative_0: same category, different image (visually similar topic)
    • negative_1: same country, different category (same cultural setting)
    • negative_2: another same-category random
    • negative_3: random from the full corpus
  4. Split stratified by category: train = 48 002 | dev = 999 | test = 999.
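The metadata-rule mining in step 3 can be sketched in a few lines. This is an illustration with assumed field names (`id`, `category`, `country`), not the actual pipeline code:

```python
import random

def mine_metadata_negatives(row, pool, rng=random.Random(0)):
    """Pick 4 hard negatives for `row` from `pool` using the metadata rules.

    `pool` is a list of dicts with (assumed) keys: id, category, country.
    """
    others    = [r for r in pool if r["id"] != row["id"]]
    same_cat  = [r for r in others if r["category"] == row["category"]]
    same_ctry = [r for r in others if r["country"] == row["country"]
                 and r["category"] != row["category"]]
    return [
        rng.choice(same_cat),    # negative_0: same category, different image
        rng.choice(same_ctry),   # negative_1: same country, different category
        rng.choice(same_cat),    # negative_2: another same-category random
        rng.choice(others),      # negative_3: random from the full corpus
    ]
```

The point of the four rules is to cover different confusion axes (topic, cultural setting, and pure chance) without needing any model in the loop.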

The whole preprocessing pipeline used pyarrow + HfFileSystem range reads, which let us skip the image column entirely on metadata pass 1 (~14 GB of raw image data lives only in ~500 MB of metadata columns). Pass 2 fetched only the shards we actually needed. You can find the fully-preprocessed dataset at Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed.


3. Training recipe

We followed Tom's multimodal blogpost almost verbatim, on a single RTX 5090 (32 GB):

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=2)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1536, 1024, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",     # any local dir; checkpoints land here
    num_train_epochs=3,           # v1 = 1 epoch, then continued for 2 more
    per_device_train_batch_size=64,
    learning_rate=2e-5,           # dropped to 1e-5 for continued training
    warmup_ratio=0.1,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    gradient_checkpointing=True,
    save_total_limit=1,           # each checkpoint is ~12 GB; keep one rolling
)

Three things worth flagging for anyone reproducing this on a 32 GB consumer card:

  • mini_batch_size=2 + gradient checkpointing keeps peak VRAM around 22 GB. Going higher (mb=4 without checkpointing) hit the 32 GB ceiling and triggered Windows WDDM page-swapping, stalling training to 100 s/step.
  • save_total_limit=1 — each checkpoint is 12 GB (model + optimizer + scheduler). A safe resume needs disk space, so we kept only one rolling.
  • Resume via resume_from_checkpoint=True works cleanly even after multi-day gaps; changing logging_steps/save_steps between runs only produces benign mismatch warnings.
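Because of the MatryoshkaLoss above, the finetuned embeddings can be truncated to any of the trained dims at inference time and re-normalized. A minimal numpy sketch of that operation (not library code):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` dimensions and re-normalize to unit length,
    matching how MatryoshkaLoss trains the nested sub-embeddings."""
    cut = np.asarray(emb, dtype=np.float32)[..., :dim]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)
```

Sentence Transformers can also do this natively by passing truncate_dim to the SentenceTransformer constructor, which is handy when you want 256-dim vectors for a cheaper index.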

Training trajectory

Here's what it looks like over 3 epochs, benchmarked against three public models as horizontal reference lines:

[Figure: training curve of dev NDCG@10 over 3 epochs, with baseline reference lines]

We cleared every off-the-shelf baseline by the 10% mark of the first epoch and kept climbing. Final (v2) dev NDCG@10 = 0.1238; test NDCG@10 = 0.1309. That's a 3.0× lift over the base model (0.041) and 1.9× above the 4× larger 8 B base (0.066).


4. The Arabic VDR leaderboard

We evaluated every public multimodal embedding we could load through Sentence Transformers against our dev set — 999 Arabic queries, 4 995-doc corpus (1 positive + 4 hard negatives per query):

[Figure: Arabic VDR leaderboard, dev NDCG@10]

Full table (dev NDCG@10):

| Model | Params | NDCG@10 |
|---|---|---|
| Ours: Qwen3-VL-Embedding-2B-Arabic-VDR | 2.2 B | 0.1238 |
| Qwen3-VL-Embedding-8B | 8 B | 0.0657 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2.2 B | 0.0534 |
| Qwen3-VL-Embedding-2B (base) | 2.2 B | 0.0407 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 1 B | 0.0376 |
| llamaindex/vdr-2b-multi-v1 | 2.2 B | 0.0347 |
| BidirLM/BidirLM-Omni-2.5B-Embedding | 2.5 B | 0.0318 |
| BAAI/BGE-VL-large | 0.5 B | 0.0026 |
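For reference, NDCG@10 as used in these tables reduces, in the single-relevant-document case, to a simple rank discount. A self-contained sketch (not necessarily the exact evaluator we ran):

```python
import math

def ndcg_at_10(ranked_ids, positive_id):
    """NDCG@10 with exactly one relevant document per query:
    1/log2(rank + 1) if the positive is in the top 10, else 0.
    The ideal DCG is 1 (positive at rank 1), so no extra normalization."""
    top10 = ranked_ids[:10]
    if positive_id not in top10:
        return 0.0
    rank = top10.index(positive_id) + 1  # 1-based rank
    return 1.0 / math.log2(rank + 1)
```

The reported score is the mean of this value over all 999 dev queries.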

5. What didn't work — the hard-negative experiment

Not every idea paid off. We tried a v3 run using the v2 model itself to mine harder negatives: for each training query, we embedded the full image pool and replaced the 4 metadata-based negatives with the top-4 most similar non-matching images. The hypothesis: harder negatives would push NDCG@10 from 0.12 toward the 0.18–0.22 range.
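The v3 mining step, in sketch form (numpy, assuming L2-normalized embeddings; not the exact script):

```python
import numpy as np

def mine_model_negatives(q_emb, img_embs, positive_idx, k=4):
    """Return indices of the k images most similar to the query,
    excluding the true positive: the 'self-mined' hard negatives."""
    sims = img_embs @ q_emb           # cosine similarity for unit vectors
    sims[positive_idx] = -np.inf      # never pick the positive itself
    return np.argsort(-sims)[:k].tolist()
```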

The result:

| Metric | v2 (metadata negatives) | v3 (v2-mined negatives) |
|---|---|---|
| Train loss (final) | 13.43 | 11.86 (better!) |
| Dev NDCG@10 | 0.1252 | 0.1115 (−11%) |
| Test NDCG@10 | 0.1309 | 0.1201 (−8%) |

Training loss dropped further than ever — the model was clearly learning something. But dev/test both got worse. Classic overfit to the hard-mined train distribution at the cost of generalization.

Likely causes (no definitive proof, but these are the suspects):

  1. Pool too small. After deduplication our mining pool had ~10 300 unique images. With only ~40 distinct candidates per category available to draw 4 hard negatives from, mined negatives trended toward near-duplicates of positives, adding noise.
  2. Miner too weak. v2's own NDCG@10 of 0.12 means its embedding space is still coarse for Arabic. What it called "hard" was sometimes just "semantically confused" — training on that pushes embeddings to spread everything apart, not to separate positives from true confusers.
  3. Loss surface disruption. Hard negatives that sit very close to positives create near-vertical gradient cliffs that consumer-grade finetuning with lr=1e-5 can't navigate well.

The lesson: the classic "self-mined harder negatives" recipe from English VDR doesn't transfer for free when your base model is weak in the target language. You need either a stronger miner or a larger negative pool — ideally both.


6. What this tells us about Arabic multimodal

Five takeaways from building, training, and benchmarking this:

6.1 The Arabic gap is a representation gap, not a capacity gap

Tom's English VDR numbers (0.88–0.95) and ours on Arabic (0.04–0.12) come from the same model family, same architecture, similar dataset structure. What differs is how much Arabic text-image grounding made it into pretraining. You can't close a 7× representation gap by scaling parameters alone — we saw at most a 1.6× bump from 2 B → 8 B.

6.2 "Multilingual" is a label, not a benchmark

Two of the three "multilingual" models we tested scored below the monolingual-English Qwen3-VL-Embedding-2B on Arabic (vdr-2b-multi-v1: 0.035, BidirLM-Omni-2.5B: 0.032). Their training mixes were heavily English-skewed; the non-English components were not enough to dominate representations. If your use case is Arabic, benchmark before you trust the label.

6.3 Cross-lingual task-finetuning transfers weakly

tomaarsen/Qwen3-VL-Embedding-2B-vdr (Tom's English VDR finetune) scored 0.053 on our Arabic dev — better than its own base (0.041), but nowhere near useful. Task structure transfers a little; language representations don't.

6.4 One small monolingual dataset beats four giant multilingual ones

48 k culturally-aligned Arabic VDR samples + one 3-epoch finetune = 3× better NDCG@10 than the strongest off-the-shelf option, at 1/4 the parameters of the closest competitor (Qwen 8 B). The lever isn't scale, it isn't multilinguality, it's dedicated in-language data.

6.5 Absolute 0.12 is not "good"; it's just "best available"

Let's keep it honest. Tom's English VDR sits at 0.95. A retrieval system returning the right page in the top-10 only 20% of the time is not production-ready — it's a floor, not a ceiling. The encouraging part is how much headroom is obviously still there: our base is a model that gets 0.04 on Arabic, and every improvement from here is bottlenecked by that representation quality. A genuinely Arabic-grounded VLM would unlock the next tier.


7. Where we'd go next

In rough order of expected impact per hour of compute:

  1. Better base model. The ceiling constraint is the 0.04 starting point. If a more Arabic-capable VLM appears (or gets pretrained), finetuning on top of it could plausibly hit 0.3+. This is the single most valuable direction.
  2. Task prompts at training + inference. Qwen3-VL-Embedding was pretrained with instruction-following behavior; we trained with none. A simple "Given the question, retrieve the document image that answers it." prefix might be a nearly free improvement.
  3. Longer training with harder negatives — done right. Our v3 failure was about mining quality, not about the idea. With a stronger miner (the next-gen model, or CLIP-based mining across a much bigger pool like the full 135 k unique Pearl images), we'd expect positive gains.
  4. Scale data, not parameters. A 200 k Arabic VDR corpus (4× what we used) would likely give more than 4× scale would. The field is starved of Arabic multimodal data, not model weights.

8. Reproduce it

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the pushed dataset (config name first, then split)
train = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "train", split="train")
dev   = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "dev",   split="train")

# Load the finetuned model
model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")

# Query an image
# ("What is the headdress that reflects identity and social status, as shown in the image?")
query = "ما هو الغطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟"
q_emb = model.encode_query([query])
d_emb = model.encode_document([dev[0]["image"], dev[1]["image"]])
print(model.similarity(q_emb, d_emb))

Acknowledgments

Thanks to UBC-NLP for the PEARL dataset, which made the whole project possible; to Tom Aarsen for the multimodal VDR blogpost and the Sentence Transformers library; and to the Qwen team for open-sourcing a strong multimodal embedding base.

If Arabic multimodal retrieval matters to your work, grab the dataset, try the model, and — ideally — send us a stronger Arabic-grounded base to train on.
