The State of Arabic Multimodal Embedding — What a 2B Finetune Taught Us

Community Article · Published April 23, 2026

TL;DR. We built an Arabic-culture visual document retrieval (VDR) dataset from UBC-NLP/PEARL-FULL, finetuned Qwen/Qwen3-VL-Embedding-2B on it, and benchmarked against every public multimodal embedding we could load. Our 2B finetune beats every off-the-shelf model we tested, including a base 4× its size, and triples the base model's score on Arabic visual retrieval. But the absolute numbers (dev NDCG@10 = 0.124) also make an uncomfortable point clear: public multimodal embeddings are not ready for Arabic, and neither scaling parameters nor stamping "multilingual" on a model fixes that. This post documents what worked, what didn't, and what we learned along the way.


1. Why Arabic VDR is hard

Visual Document Retrieval asks a single question: given a natural-language query, find the right page out of a big corpus of document screenshots. It's the retrieval task behind every useful doc-QA product — the step before you hand context to an LLM.

On English VDR, the space is basically solved. Tom Aarsen's recent blog reports off-the-shelf 2B models at ~0.89 NDCG@10, with his own finetune reaching 0.95. Switch the queries and the documents to Arabic, though, and the floor collapses:

| Model | Size | English VDR¹ | Arabic VDR (ours) |
|---|---|---|---|
| Qwen3-VL-Embedding-8B | 8 B | 0.923 | 0.066 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2 B | 0.948 | 0.053 |
| Qwen3-VL-Embedding-2B (base) | 2 B | 0.888 | 0.041 |
| llamaindex/vdr-2b-multi-v1 | 2 B | 0.912 | 0.035 |
| BAAI/BGE-VL-large | 0.5 B | — | 0.003 |

¹ English numbers from Tom's leaderboard on llamaindex-vdr-en-eval (300 queries, 1 500 docs). Arabic numbers from our Pearl-vdr-ar dev (999 queries, 4 995 docs).

That's not a small gap. A state-of-the-art multimodal embedding drops from 0.95 to 0.05 when you change the query language. Scaling to 8 B gets you only a 1.6× bump on Arabic. A model whose card says "multilingual VDR" (vdr-2b-multi-v1) actually scores below the monolingual English base on our Arabic data.

So: the failure mode isn't capacity, it's representation. Current multimodal embeddings learned Arabic from scraps.


2. Building a culturally-grounded Arabic VDR dataset

To train against this, we needed an Arabic-native VDR corpus — not a machine-translated English one. We started from UBC-NLP/PEARL-FULL: 309 298 rows, 135 220 unique culturally-aligned images, across 9 categories (Architecture, Clothes, Fauna, Flora, Food, Geography, Handicrafts, LandMarks, Music) × 19 Arab countries.

To turn it into VDR-style triplets we:

  1. Sampled 50 000 queries stratified by category (so every topic and country is represented).
  2. Deduplicated images across all 309 k rows via augmented_caption — the same image appears in multiple Q&A rows and we want each unique image once.
  3. Mined 4 hard negatives per query using metadata rules:
    • negative_0: same category, different image (visually similar topic)
    • negative_1: same country, different category (same cultural setting)
    • negative_2: another same-category random
    • negative_3: random from the full corpus
  4. Split stratified by category: train = 48 002 | dev = 999 | test = 999.
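The metadata-rule mining in step 3 can be sketched in a few lines. This is an illustration with assumed field names (`id`, `category`, `country`), not the actual pipeline code:

```python
import random

def mine_metadata_negatives(row, pool, rng=random.Random(0)):
    """Pick 4 hard negatives for `row` from `pool` using the metadata rules.

    `pool` is a list of dicts with (assumed) keys: id, category, country.
    """
    others    = [r for r in pool if r["id"] != row["id"]]
    same_cat  = [r for r in others if r["category"] == row["category"]]
    same_ctry = [r for r in others if r["country"] == row["country"]
                 and r["category"] != row["category"]]
    return [
        rng.choice(same_cat),    # negative_0: same category, different image
        rng.choice(same_ctry),   # negative_1: same country, different category
        rng.choice(same_cat),    # negative_2: another same-category random
        rng.choice(others),      # negative_3: random from the full corpus
    ]
```

The point of the four rules is to cover different confusion axes (topic, cultural setting, and pure chance) without needing any model in the loop.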

The whole preprocessing pipeline used pyarrow + HfFileSystem range reads, which let us skip the image column entirely on metadata pass 1 (~14 GB of raw image data lives only in ~500 MB of metadata columns). Pass 2 fetched only the shards we actually needed. You can find the fully-preprocessed dataset at Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed.


3. Training recipe

We followed Tom's multimodal blogpost almost verbatim, on a single RTX 5090 (32 GB):

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.training_args import BatchSamplers, SentenceTransformerTrainingArguments

loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=2)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1536, 1024, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="checkpoints",     # any local dir; checkpoints land here
    num_train_epochs=3,           # v1 = 1 epoch, then continued for 2 more
    per_device_train_batch_size=64,
    learning_rate=2e-5,           # dropped to 1e-5 for continued training
    warmup_ratio=0.1,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    gradient_checkpointing=True,
    save_total_limit=1,           # each checkpoint is ~12 GB; keep one rolling
)

Three things worth flagging for anyone reproducing this on a 32 GB consumer card:

  • mini_batch_size=2 + gradient checkpointing keeps peak VRAM around 22 GB. Going higher (mb=4 without checkpointing) hit the 32 GB ceiling and triggered Windows WDDM page-swapping, stalling training to 100 s/step.
  • save_total_limit=1 — each checkpoint is 12 GB (model + optimizer + scheduler). A safe resume needs disk space, so we kept only one rolling.
  • Resume via resume_from_checkpoint=True works cleanly even after multi-day gaps; changing logging_steps/save_steps between runs only produces benign mismatch warnings.
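Because of the MatryoshkaLoss above, the finetuned embeddings can be truncated to any of the trained dims at inference time and re-normalized. A minimal numpy sketch of that operation (not library code):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` dimensions and re-normalize to unit length,
    matching how MatryoshkaLoss trains the nested sub-embeddings."""
    cut = np.asarray(emb, dtype=np.float32)[..., :dim]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)
```

Sentence Transformers can also do this natively by passing truncate_dim to the SentenceTransformer constructor, which is handy when you want 256-dim vectors for a cheaper index.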

Training trajectory

Here's what it looks like over 3 epochs, benchmarked against three public models as horizontal reference lines:

[Figure: training curve of dev NDCG@10 over 3 epochs, with baseline reference lines]

We cleared every off-the-shelf baseline by the 10% mark of the first epoch and kept climbing. Final (v2) dev NDCG@10 = 0.1238; test NDCG@10 = 0.1309. That's a 3.0× lift over the base model (0.041) and 1.9× above the 4× larger 8 B base (0.066).


4. The Arabic VDR leaderboard

We evaluated every public multimodal embedding we could load through Sentence Transformers against our dev set — 999 Arabic queries, 4 995-doc corpus (1 positive + 4 hard negatives per query):

[Figure: Arabic VDR leaderboard, dev NDCG@10]

Full table (dev NDCG@10):

| Model | Params | NDCG@10 |
|---|---|---|
| Ours: Qwen3-VL-Embedding-2B-Arabic-VDR | 2.2 B | 0.1238 |
| Qwen3-VL-Embedding-8B | 8 B | 0.0657 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2.2 B | 0.0534 |
| Qwen3-VL-Embedding-2B (base) | 2.2 B | 0.0407 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 1 B | 0.0376 |
| llamaindex/vdr-2b-multi-v1 | 2.2 B | 0.0347 |
| BidirLM/BidirLM-Omni-2.5B-Embedding | 2.5 B | 0.0318 |
| BAAI/BGE-VL-large | 0.5 B | 0.0026 |
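For reference, NDCG@10 as used in these tables reduces, in the single-relevant-document case, to a simple rank discount. A self-contained sketch (not necessarily the exact evaluator we ran):

```python
import math

def ndcg_at_10(ranked_ids, positive_id):
    """NDCG@10 with exactly one relevant document per query:
    1/log2(rank + 1) if the positive is in the top 10, else 0.
    The ideal DCG is 1 (positive at rank 1), so no extra normalization."""
    top10 = ranked_ids[:10]
    if positive_id not in top10:
        return 0.0
    rank = top10.index(positive_id) + 1  # 1-based rank
    return 1.0 / math.log2(rank + 1)
```

The reported score is the mean of this value over all 999 dev queries.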

5. What didn't work — the hard-negative experiment

Not every idea paid off. We tried a v3 run using the v2 model itself to mine harder negatives: for each training query, we embedded the full image pool and replaced the 4 metadata-based negatives with the top-4 most similar non-matching images. The hypothesis: harder negatives would push NDCG@10 from 0.12 toward the 0.18–0.22 range.
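The v3 mining step, in sketch form (numpy, assuming L2-normalized embeddings; not the exact script):

```python
import numpy as np

def mine_model_negatives(q_emb, img_embs, positive_idx, k=4):
    """Return indices of the k images most similar to the query,
    excluding the true positive: the 'self-mined' hard negatives."""
    sims = img_embs @ q_emb           # cosine similarity for unit vectors
    sims[positive_idx] = -np.inf      # never pick the positive itself
    return np.argsort(-sims)[:k].tolist()
```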

The result:

| Metric | v2 (metadata negatives) | v3 (v2-mined negatives) |
|---|---|---|
| Train loss (final) | 13.43 | 11.86 (better!) |
| Dev NDCG@10 | 0.1252 | 0.1115 (−11%) |
| Test NDCG@10 | 0.1309 | 0.1201 (−8%) |

Training loss dropped further than ever — the model was clearly learning something. But dev/test both got worse. Classic overfit to the hard-mined train distribution at the cost of generalization.

Likely causes (no definitive proof, but these are the suspects):

  1. Pool too small. After deduplication our mining pool had ~10 300 unique images. With only ~40 distinct candidates per category available to draw 4 hard negatives from, mined negatives trended toward near-duplicates of positives, adding noise.
  2. Miner too weak. v2's own NDCG@10 of 0.12 means its embedding space is still coarse for Arabic. What it called "hard" was sometimes just "semantically confused" — training on that pushes embeddings to spread everything apart, not to separate positives from true confusers.
  3. Loss surface disruption. Hard negatives that sit very close to positives create near-vertical gradient cliffs that consumer-grade finetuning with lr=1e-5 can't navigate well.

The lesson: the classic "self-mined harder negatives" recipe from English VDR doesn't transfer for free when your base model is weak in the target language. You need either a stronger miner or a larger negative pool — ideally both.


6. What this tells us about Arabic multimodal

Five takeaways from building, training, and benchmarking this:

6.1 The Arabic gap is a representation gap, not a capacity gap

Tom's English VDR numbers (0.88–0.95) and ours on Arabic (0.04–0.12) come from the same model family, same architecture, similar dataset structure. What differs is how much Arabic text-image grounding made it into pretraining. You can't close a 7× representation gap by scaling parameters alone — we saw at most a 1.6× bump from 2 B → 8 B.

6.2 "Multilingual" is a label, not a benchmark

Two of the three "multilingual" models we tested scored below the monolingual-English Qwen3-VL-Embedding-2B on Arabic (vdr-2b-multi-v1: 0.035, BidirLM-Omni-2.5B: 0.032). Their training mixes were heavily English-skewed; the non-English components were not enough to dominate representations. If your use case is Arabic, benchmark before you trust the label.

6.3 Cross-lingual task-finetuning transfers weakly

tomaarsen/Qwen3-VL-Embedding-2B-vdr (Tom's English VDR finetune) scored 0.053 on our Arabic dev — better than its own base (0.041), but nowhere near useful. Task structure transfers a little; language representations don't.

6.4 One small monolingual dataset beats four giant multilingual ones

48 k culturally-aligned Arabic VDR samples + one 3-epoch finetune = 3× better NDCG@10 than the strongest off-the-shelf option, at 1/4 the parameters of the closest competitor (Qwen 8 B). The lever isn't scale, it isn't multilinguality, it's dedicated in-language data.

6.5 Absolute 0.12 is not "good"; it's just "best available"

Let's keep it honest. Tom's English VDR sits at 0.95. A retrieval system returning the right page in the top-10 only 20% of the time is not production-ready — it's a floor, not a ceiling. The encouraging part is how much headroom is obviously still there: our base is a model that gets 0.04 on Arabic, and every improvement from here is bottlenecked by that representation quality. A genuinely Arabic-grounded VLM would unlock the next tier.


7. Where we'd go next

In rough order of expected impact per hour of compute:

  1. Better base model. The ceiling constraint is the 0.04 starting point. If a more Arabic-capable VLM appears (or gets pretrained), finetuning on top of it could plausibly hit 0.3+. This is the single most valuable direction.
  2. Task prompts at training + inference. Qwen3-VL-Embedding was pretrained with instruction-following behavior; we trained with none. A simple "Given the question, retrieve the document image that answers it." prefix might be a nearly free improvement.
  3. Longer training with harder negatives — done right. Our v3 failure was about mining quality, not about the idea. With a stronger miner (the next-gen model, or CLIP-based mining across a much bigger pool like the full 135 k unique Pearl images), we'd expect positive gains.
  4. Scale data, not parameters. A 200 k Arabic VDR corpus (4× what we used) would likely give more than 4× scale would. The field is starved of Arabic multimodal data, not model weights.

8. Reproduce it

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the pushed dataset (config name first, then split)
train = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "train", split="train")
dev   = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "dev",   split="train")

# Load the finetuned model
model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")

# Query an image
# ("What is the headdress that reflects identity and social status, as shown in the image?")
query = "ما هو الغطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟"
q_emb = model.encode_query([query])
d_emb = model.encode_document([dev[0]["image"], dev[1]["image"]])
print(model.similarity(q_emb, d_emb))

Acknowledgments

Thanks to UBC-NLP for the PEARL dataset, which made the whole project possible; to Tom Aarsen for the multimodal VDR blogpost and the Sentence Transformers library; and to the Qwen team for open-sourcing a strong multimodal embedding base.

If Arabic multimodal retrieval matters to your work, grab the dataset, try the model, and — ideally — send us a stronger Arabic-grounded base to train on.
