The State of Arabic Multimodal Embedding — What a 2B Finetune Taught Us
We built an Arabic Visual Document Retrieval (VDR) dataset from UBC-NLP/PEARL-FULL, finetuned Qwen/Qwen3-VL-Embedding-2B on it, and benchmarked the result against every public multimodal embedding we could load. Our 2B finetune beats every off-the-shelf model we tested, including a 4× larger base, and scores 3× higher than its own base model on Arabic visual retrieval. But the absolute numbers (dev NDCG@10 = 0.124) also make an uncomfortable point clear: public multimodal embeddings are not ready for Arabic, and scaling parameters or stamping "multilingual" on a model doesn't fix it. This blog documents what worked, what didn't, and what we learned along the way.
- Model: Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR
- Dataset: Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed
- Recipe: Sentence Transformers `CachedMultipleNegativesRankingLoss` + `MatryoshkaLoss`, following Tom Aarsen's multimodal VDR training guide.
1. Why Arabic VDR is hard
Visual Document Retrieval asks a single question: given a natural-language query, find the right page out of a big corpus of document screenshots. It's the retrieval task behind every useful doc-QA product — the step before you hand context to an LLM.
On English VDR, the space is basically solved. Tom Aarsen's recent blog shows off-the-shelf 2B models at ~0.89 NDCG@10 and his own finetune hits 0.95. Switch the queries and the documents to Arabic, though, and the floor collapses:
| Model | Size | English VDR¹ | Arabic VDR (ours) |
|---|---|---|---|
| Qwen3-VL-Embedding-8B | 8 B | 0.923 | 0.066 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2 B | 0.948 | 0.053 |
| Qwen3-VL-Embedding-2B (base) | 2 B | 0.888 | 0.041 |
| llamaindex/vdr-2b-multi-v1 | 2 B | 0.912 | 0.035 |
| BAAI/BGE-VL-large | 0.5 B | — | 0.003 |
¹ English numbers from Tom's leaderboard on llamaindex-vdr-en-eval (300 queries, 1 500 docs). Arabic numbers from our Pearl-vdr-ar dev (999 queries, 4 995 docs).
That's not a small gap. A state-of-the-art multimodal embedding drops from 0.95 to 0.05 when you change the query language. Scaling to 8 B gets you only a 1.6× bump on Arabic. A model whose card says "multilingual VDR" (vdr-2b-multi-v1) actually scores below the monolingual English base on our Arabic data.
So: the failure mode isn't capacity, it's representation. Current multimodal embeddings learned Arabic from scraps.
2. Building a culturally-grounded Arabic VDR dataset
To train against this, we needed an Arabic-native VDR corpus — not a machine-translated English one. We started from UBC-NLP/PEARL-FULL: 309 298 rows, 135 220 unique culturally-aligned images, across 9 categories (Architecture, Clothes, Fauna, Flora, Food, Geography, Handicrafts, LandMarks, Music) × 19 Arab countries.
To turn it into VDR-style triplets we:
- Sampled 50 000 queries stratified by category (so every topic and country is represented).
- Deduplicated images across all 309 k rows via `augmented_caption`: the same image appears in multiple Q&A rows and we want each unique image once.
- Mined 4 hard negatives per query using metadata rules (see the sketch after this list):
  - `negative_0`: same category, different image (visually similar topic)
  - `negative_1`: same country, different category (same cultural setting)
  - `negative_2`: another same-category random
  - `negative_3`: random from the full corpus
- Split stratified by category: train = 48 002 | dev = 999 | test = 999.
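For concreteness, here is a minimal sketch of the metadata-rule mining step. The `image_id`, `category`, and `country` field names and the helper itself are illustrative assumptions, not the literal pipeline code:

```python
import random

def mine_metadata_negatives(query_row, images, rng=random):
    """Pick 4 negatives for one query using the metadata rules above.

    `images` is the deduplicated image pool (dicts with `image_id`, `category`, `country`);
    `query_row` carries the positive image's id, category, and country.
    """
    pos_id = query_row["image_id"]
    same_cat = [im for im in images
                if im["category"] == query_row["category"] and im["image_id"] != pos_id]
    diff_cat_same_country = [im for im in images
                             if im["country"] == query_row["country"]
                             and im["category"] != query_row["category"]]
    anything = [im for im in images if im["image_id"] != pos_id]

    return {
        "negative_0": rng.choice(same_cat),               # same category, different image
        "negative_1": rng.choice(diff_cat_same_country),  # same country, different category
        "negative_2": rng.choice(same_cat),               # another same-category random
        "negative_3": rng.choice(anything),               # random from the full corpus
    }
```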
The whole preprocessing pipeline used pyarrow + HfFileSystem range reads, which let us skip the image column entirely on metadata pass 1 (~14 GB of raw image data lives only in ~500 MB of metadata columns). Pass 2 fetched only the shards we actually needed. You can find the fully-preprocessed dataset at Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed.
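The metadata-only first pass looks roughly like this; the shard layout and the column names other than `augmented_caption` are assumptions you should check against the dataset repo:

```python
import pyarrow.parquet as pq
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

# Pass 1: list the parquet shards on the Hub and range-read only the light
# metadata columns, leaving the heavy image column untouched.
shards = fs.glob("datasets/UBC-NLP/PEARL-FULL/**/*.parquet")
metadata = [
    pq.read_table(path,
                  columns=["augmented_caption", "category", "country"],  # assumed names
                  filesystem=fs)
    for path in shards
]
```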
3. Training recipe
We followed Tom's multimodal blogpost almost verbatim, on a single RTX 5090 (32 GB):
```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.training_args import BatchSamplers

loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=2)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048, 1536, 1024, 512, 256, 128, 64])

args = SentenceTransformerTrainingArguments(
    output_dir="output",              # placeholder path
    num_train_epochs=3,               # v1 = 1 epoch, then continued for 2 more
    per_device_train_batch_size=64,
    learning_rate=2e-5,               # dropped to 1e-5 for continued training
    warmup_ratio=0.1,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    gradient_checkpointing=True,
)
```
Three things worth flagging for anyone reproducing this on a 32 GB consumer card:
- `mini_batch_size=2` + gradient checkpointing keeps peak VRAM around 22 GB. Going higher (`mini_batch_size=4` without checkpointing) hit the 32 GB ceiling and triggered Windows WDDM page-swapping, stalling training to 100 s/step.
- `save_total_limit=1`: each checkpoint is 12 GB (model + optimizer + scheduler). A safe resume needs disk space, so we kept only one rolling checkpoint.
- Resuming via `resume_from_checkpoint=True` works cleanly even after multi-day gaps, as long as you don't change `logging_steps`/`save_steps` between runs (you'll get benign mismatch warnings). See the wiring sketch below.
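For reference, this is roughly how the loss and arguments above plug into the trainer. The base-model loading line and the column layout of `train` are assumptions (loading may need extra kwargs depending on your Sentence Transformers version):

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

# Base model we finetuned from, loaded through Sentence Transformers
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,            # the SentenceTransformerTrainingArguments from above
    train_dataset=train,  # query, positive image, and the four negative columns
    loss=loss,            # Matryoshka-wrapped CachedMultipleNegativesRankingLoss
)

trainer.train()
# For continued runs, keep logging_steps/save_steps identical and resume:
# trainer.train(resume_from_checkpoint=True)
```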
Training trajectory
Here's what it looks like over 3 epochs, benchmarked against three public models as horizontal reference lines:
We cleared every off-the-shelf baseline by the 10% mark of the first epoch and kept climbing. Final (v2) dev NDCG@10 = 0.1238; test NDCG@10 = 0.1309. That's a 3.0× lift over the base model (0.041) and 1.9× above the 4× larger 8 B base (0.066).
4. The Arabic VDR leaderboard
We evaluated every public multimodal embedding we could load through Sentence Transformers against our dev set — 999 Arabic queries, 4 995-doc corpus (1 positive + 4 hard negatives per query):
Full table (dev NDCG@10):
| Model | Params | NDCG@10 |
|---|---|---|
| Ours: Qwen3-VL-Embedding-2B-Arabic-VDR | 2.2 B | 0.1238 ⭐ |
| Qwen3-VL-Embedding-8B | 8 B | 0.0657 |
| tomaarsen/Qwen3-VL-Embedding-2B-vdr | 2.2 B | 0.0534 |
| Qwen3-VL-Embedding-2B (base) | 2.2 B | 0.0407 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 1 B | 0.0376 |
| llamaindex/vdr-2b-multi-v1 | 2.2 B | 0.0347 |
| BidirLM/BidirLM-Omni-2.5B-Embedding | 2.5 B | 0.0318 |
| BAAI/BGE-VL-large | 0.5 B | 0.0026 |
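Because each query has exactly one relevant page, NDCG@10 collapses to a reciprocal-log gain on the positive's rank, so the evaluation loop is short. This is a sketch of how we'd compute it, not the exact eval script; the argument names are assumptions:

```python
import math

def ndcg_at_10(model, queries, doc_images, positive_idx):
    """NDCG@10 when every query has exactly one relevant document.

    `positive_idx[i]` is the index in `doc_images` of query i's correct page.
    """
    q = model.encode_query(queries, convert_to_tensor=True)
    d = model.encode_document(doc_images, convert_to_tensor=True, batch_size=8)
    scores = model.similarity(q, d)              # (num_queries, num_docs)
    top10 = scores.topk(10, dim=1).indices       # top-10 doc indices per query

    gains = []
    for i, pos in enumerate(positive_idx):
        hits = (top10[i] == pos).nonzero(as_tuple=True)[0]
        if len(hits) == 0:
            gains.append(0.0)                    # correct page missed the top-10
        else:
            rank = hits[0].item() + 1            # 1-based rank of the positive
            gains.append(1.0 / math.log2(rank + 1))
    return sum(gains) / len(gains)
```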
5. What didn't work — the hard-negative experiment
Not every idea paid off. We tried a v3 run using the v2 model itself to mine harder negatives: for each training query, we embedded the full image pool and replaced the 4 metadata-based negatives with the top-4 most similar non-matching images. The hypothesis: harder negatives would push NDCG@10 from 0.12 toward the 0.18–0.22 range.
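The mining step itself is easy to sketch: embed the deduplicated image pool once with the v2 model, then take each query's most similar non-positive images. As elsewhere, the function and argument names are illustrative, not the exact script:

```python
def mine_model_negatives(model, queries, pool_images, positive_idx, k=4):
    """Replace the metadata negatives with the top-k most similar non-matching pool images."""
    q = model.encode_query(queries, convert_to_tensor=True)
    p = model.encode_document(pool_images, convert_to_tensor=True, batch_size=8)
    scores = model.similarity(q, p)               # (num_queries, pool_size)

    mined = []
    for i, pos in enumerate(positive_idx):
        scores[i, pos] = float("-inf")            # never mine the positive itself
        hardest = scores[i].topk(k).indices       # the k most confusable images
        mined.append([pool_images[j] for j in hardest.tolist()])
    return mined
```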
The result:
| | v2 (metadata negatives) | v3 (v2-mined negatives) |
|---|---|---|
| Train loss (final) | 13.43 | 11.86 (better!) |
| Dev NDCG@10 | 0.1252 | 0.1115 (−11%) |
| Test NDCG@10 | 0.1309 | 0.1201 (−8%) |
Training loss dropped further than ever — the model was clearly learning something. But dev/test both got worse. Classic overfit to the hard-mined train distribution at the cost of generalization.
Likely causes (no definitive proof, but these are the suspects):
- Pool too small. After deduplication our mining pool had ~10 300 unique images. With only ~40 distinct candidates per category available to draw 4 hard negatives from, mined negatives trended toward near-duplicates of positives, adding noise.
- Miner too weak. v2's own NDCG@10 of 0.12 means its embedding space is still coarse for Arabic. What it called "hard" was sometimes just "semantically confused" — training on that pushes embeddings to spread everything apart, not to separate positives from true confusers.
- Loss surface disruption. Hard negatives that sit very close to positives create near-vertical gradient cliffs that consumer-grade finetuning with `lr=1e-5` can't navigate well.
The lesson: the classic "self-mined harder negatives" recipe from English VDR doesn't transfer for free when your base model is weak in the target language. You need either a stronger miner or a larger negative pool — ideally both.
6. What this tells us about Arabic multimodal
Five takeaways from building, training, and benchmarking this:
6.1 The Arabic gap is a representation gap, not a capacity gap
Tom's English VDR numbers (0.88–0.95) and ours on Arabic (0.04–0.12) come from the same model family, same architecture, similar dataset structure. What differs is how much Arabic text-image grounding made it into pretraining. You can't close a 7× representation gap by scaling parameters alone — we saw at most a 1.6× bump from 2 B → 8 B.
6.2 "Multilingual" is a label, not a benchmark
Two of the three "multilingual" models we tested scored below the monolingual-English Qwen3-VL-Embedding-2B on Arabic (vdr-2b-multi-v1: 0.035, BidirLM-Omni-2.5B: 0.032). Their training mixes were heavily English-skewed; the non-English components were not enough to dominate representations. If your use case is Arabic, benchmark before you trust the label.
6.3 Cross-lingual task-finetuning transfers weakly
tomaarsen/Qwen3-VL-Embedding-2B-vdr (Tom's English VDR finetune) scored 0.053 on our Arabic dev — better than its own base (0.041), but nowhere near useful. Task structure transfers a little; language representations don't.
6.4 One small monolingual dataset beats four giant multilingual ones
48 k culturally-aligned Arabic VDR samples + one 3-epoch finetune = 3× better NDCG@10 than the strongest off-the-shelf option, at 1/4 the parameters of the closest competitor (Qwen 8 B). The lever isn't scale, it isn't multilinguality, it's dedicated in-language data.
6.5 Absolute 0.12 is not "good"; it's just "best available"
Let's keep it honest. Tom's English VDR sits at 0.95. A retrieval system returning the right page in the top-10 only 20% of the time is not production-ready — it's a floor, not a ceiling. The encouraging part is how much headroom is obviously still there: our base is a model that gets 0.04 on Arabic, and every improvement from here is bottlenecked by that representation quality. A genuinely Arabic-grounded VLM would unlock the next tier.
7. Where we'd go next
In rough order of expected impact per hour of compute:
- Better base model. The ceiling constraint is the 0.04 starting point. If a more Arabic-capable VLM appears (or gets pretrained), finetuning on top of it could plausibly hit 0.3+. This is the single most valuable direction.
- Task prompts at training + inference. Qwen3-VL-Embedding was pretrained with instruction-following behavior; we trained with none. A simple `"Given the question, retrieve the document image that answers it."` prefix might be nearly free improvement (see the sketch after this list).
- Longer training with harder negatives, done right. Our v3 failure was about mining quality, not about the idea. With a stronger miner (the next-gen model, or CLIP-based mining across a much bigger pool like the full 135 k unique Pearl images), we'd expect positive gains.
- Scale data, not parameters. A 200 k Arabic VDR corpus (4× what we used) would likely give more than 4× scale would. The field is starved of Arabic multimodal data, not model weights.
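To make the prompt idea concrete, here is the cheapest possible version: prepend the instruction to every query string at encode time. Whether this actually helps is exactly what we'd want to measure; the example query is reused from the reproduce section below:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")

# Candidate instruction prefix, applied identically at training and inference.
prefix = "Given the question, retrieve the document image that answers it."

query = "ما هو غطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟"
q_emb = model.encode_query([f"{prefix} {query}"])
```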
8. Reproduce it
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# Load the pushed dataset
train = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "train", split="train")
dev = load_dataset("Omartificial-Intelligence-Space/Pearl-vdr-ar-train-preprocessed", "dev", split="train")

# Load the finetuned model
model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")

# Embed an Arabic query and two document pages, then score them
query = "ما هو غطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟"
# "What is the headwear that reflects identity and social status, as shown in the image?"
q_emb = model.encode_query([query])
d_emb = model.encode_document([dev[0]["image"], dev[1]["image"]])
print(model.similarity(q_emb, d_emb))
```
Acknowledgments
Thanks to UBC-NLP for the PEARL dataset, which made the whole project possible; to Tom Aarsen for the multimodal VDR blogpost and the Sentence Transformers library; and to the Qwen team for open-sourcing a strong multimodal embedding base.
If Arabic multimodal retrieval matters to your work, grab the dataset, try the model, and — ideally — send us a stronger Arabic-grounded base to train on.

