Qwen3-VL-Embedding-2B fine-tuned on Arabic-culture visual document retrieval (Pearl-vdr-ar), 3 epochs total

This is a sentence-transformers model trained on the pearl-vdr-ar-train-preprocessed dataset. It maps text and visual documents to a 2048-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Maximum Sequence Length: 262144 tokens
  • Output Dimensionality: 2048 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modalities: Text, Image, Video, Message
  • Training Dataset: pearl-vdr-ar-train-preprocessed
  • Language: ar
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'image': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'video': {'method': 'forward', 'method_output_name': 'last_hidden_state'}, 'message': {'method': 'forward', 'method_output_name': 'last_hidden_state', 'format': 'structured'}}, 'module_output_name': 'token_embeddings', 'processing_kwargs': {'chat_template': {'add_generation_prompt': True}}, 'unpad_inputs': False, 'architecture': 'Qwen3VLModel'})
  (1): Pooling({'embedding_dimension': 2048, 'pooling_mode': 'lasttoken', 'include_prompt': True})
  (2): Normalize({})
)
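The pipeline above extracts per-token hidden states, takes the last token's hidden state as the sequence embedding (pooling_mode 'lasttoken'), and L2-normalizes it so that dot products between embeddings equal cosine similarities. A minimal numpy sketch of the pooling and normalization steps, using toy shapes rather than the real 2048-dimensional model:

```python
import numpy as np

# Toy token embeddings: batch of 1 sequence, 4 tokens, 8 dims
# (the real model produces 2048-dim hidden states).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1, 4, 8))

# (1) Pooling with pooling_mode='lasttoken': keep only the final token's state.
pooled = token_embeddings[:, -1, :]

# (2) Normalize: scale each vector to unit L2 norm, so the dot product
# between two embeddings equals their cosine similarity.
embedding = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print(embedding.shape)                      # (1, 8)
print(np.linalg.norm(embedding, axis=1))    # [1.]
```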

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Qwen3-VL-Embedding-2B-Arabic-VDR")
# Run inference
queries = [
    # "What head covering reflects identity and social status, as shown in the image?"
    'ما هو الغطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟',
]
documents = [
    'Clothes',
    'Music',
    'Architecture',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.2472, 0.1607, 0.1453]])
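Because training used MatryoshkaLoss with dimensions down to 64, a full 2048-dimensional embedding can be truncated to a prefix and re-normalized with modest quality loss. A hedged sketch of that truncation step, with toy random vectors standing in for real outputs of model.encode_query / model.encode_document:

```python
import numpy as np

# Toy unit-norm "embeddings"; in practice these come from the model.
rng = np.random.default_rng(1)
full = rng.normal(size=(3, 2048))
full /= np.linalg.norm(full, axis=1, keepdims=True)

def truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize each row,
    as Matryoshka-trained embeddings allow."""
    prefix = emb[:, :dim]
    return prefix / np.linalg.norm(prefix, axis=1, keepdims=True)

small = truncate(full, 256)
print(small.shape)  # (3, 256)
```

In recent Sentence Transformers versions the same effect is available at load time via the truncate_dim argument, e.g. SentenceTransformer(model_id, truncate_dim=256).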

Evaluation

Metrics

Information Retrieval

Metric pearl-ar-dev-hard pearl-ar-dev-hard-final pearl-ar-test-hard-final
cosine_accuracy@1 0.0501 0.0501 0.042
cosine_accuracy@3 0.0961 0.0961 0.1061
cosine_accuracy@5 0.1441 0.1441 0.1622
cosine_accuracy@10 0.2332 0.2332 0.2553
cosine_precision@1 0.0501 0.0501 0.042
cosine_precision@3 0.032 0.032 0.0354
cosine_precision@5 0.0288 0.0288 0.0324
cosine_precision@10 0.0233 0.0233 0.0255
cosine_recall@1 0.0501 0.0501 0.042
cosine_recall@3 0.0961 0.0961 0.1061
cosine_recall@5 0.1441 0.1441 0.1622
cosine_recall@10 0.2332 0.2332 0.2553
cosine_ndcg@10 0.1238 0.1238 0.1309
cosine_mrr@10 0.0911 0.0911 0.0934
cosine_map@100 0.1049 0.1049 0.108
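Each query here has exactly one relevant document, which is why accuracy@k and recall@k coincide in the table above; in that setting MRR@10 and nDCG@10 reduce to simple functions of the relevant document's rank. A small illustration with hypothetical ranks (not taken from the evaluation data):

```python
import math

# Toy 1-based ranks of the single relevant document for three queries.
ranks = [1, 3, 12]

# MRR@10: reciprocal rank if the relevant doc is in the top 10, else 0.
mrr_at_10 = sum(1.0 / r if r <= 10 else 0.0 for r in ranks) / len(ranks)

# nDCG@10 with one relevant doc: 1/log2(rank+1) if in the top 10, else 0
# (the ideal DCG is 1, so no further normalization is needed).
ndcg_at_10 = sum(1.0 / math.log2(r + 1) if r <= 10 else 0.0 for r in ranks) / len(ranks)

print(round(mrr_at_10, 4), round(ndcg_at_10, 4))  # 0.4444 0.5
```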

Training Details

Training Dataset

pearl-vdr-ar-train-preprocessed

  • Dataset: pearl-vdr-ar-train-preprocessed at 494822e
  • Size: 48,002 training samples
  • Columns: query, image, and negative_0
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 31, mean 51.45, max 90 tokens
    • image (image): min 53x96 px, mean 639x540 px, max 800x798 px
    • negative_0 (image): min 101x100 px, mean 630x545 px, max 800x787 px
  • Samples:
    (the image and negative_0 columns contain images and are not rendered here)
    • query: ما هي التحديات التي تواجه الحرف التقليدية كما يظهر في الصورة، وما هي الحلول الممكنة لمواجهة هذه التحديات؟ ("What challenges do traditional crafts face, as shown in the image, and what solutions could address them?")
    • query: إذا شاركت في ورشة عمل لتعلم كيفية صنع الآلة التي يظهر في الصورة، ما هي الخطوات التي ستحتاج إلى اتباعها لصنعها بشكل صحيح؟ ("If you took part in a workshop to learn how to make the instrument shown in the image, what steps would you need to follow to build it correctly?")
    • query: كيف يختلف العزف على الآلة التي يظهر في الصورة عن العزف على الآلات الوترية الأخرى في المنطقة، وما هي الخصائص الفريدة لهذه الآلة؟ ("How does playing the instrument shown in the image differ from playing other stringed instruments in the region, and what are this instrument's unique characteristics?")
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            2048,
            1536,
            1024,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
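MatryoshkaLoss applies its inner loss to each prefix of the embedding (2048 down to 64 dimensions) and sums the results, weighted per dimension (all weights are 1 here). A numpy sketch of that weighting scheme, with a toy InfoNCE-style loss standing in for CachedMultipleNegativesRankingLoss:

```python
import numpy as np

dims = [2048, 1536, 1024, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1, 1, 1]

# Toy query/positive-document embeddings; real ones come from the model.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 2048))
d = rng.normal(size=(4, 2048))

def toy_infonce(q, d, scale=20.0):
    """InfoNCE with in-batch negatives: matching pairs sit on the diagonal."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    dn = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = scale * qn @ dn.T
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# MatryoshkaLoss: weighted sum of the base loss over embedding prefixes.
total = sum(w * toy_infonce(q[:, :k], d[:, :k]) for w, k in zip(weights, dims))
print(total > 0)  # True
```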
    

Evaluation Dataset

pearl-vdr-ar-train-preprocessed

  • Dataset: pearl-vdr-ar-train-preprocessed at 494822e
  • Size: 200 evaluation samples
  • Columns: query, category, country, image, negative_0, negative_1, negative_2, and negative_3
  • Approximate statistics based on the first 200 samples:
    • query (string): min 33, mean 52.18, max 82 tokens
    • category (string): min 21, mean 21.8, max 24 tokens
    • country (string): min 21, mean 22.21, max 24 tokens
    • image (image): min 200x135 px, mean 617x552 px, max 800x786 px
    • negative_0 (image): min 168x182 px, mean 606x545 px, max 800x788 px
    • negative_1 (image): min 141x139 px, mean 671x539 px, max 800x791 px
    • negative_2 (image): min 177x175 px, mean 649x555 px, max 800x788 px
    • negative_3 (image): min 200x150 px, mean 664x519 px, max 800x790 px
  • Samples:
    (the image and negative_0 through negative_3 columns contain images and are not rendered here)
    • query: ما هو الغطاء الرأس الذي يعكس الهوية والمكانة الاجتماعية كما يظهر في الصورة؟ ("What head covering reflects identity and social status, as shown in the image?"), category: Clothes, country: Syria
    • query: كيف تساهم المجوهرات في إبراز شخصية الفنانة كما يظهر في الصورة؟ ("How does the jewelry help highlight the artist's persona, as shown in the image?"), category: Music, country: Lebanon
    • query: ما هي الاختلافات بين الزخارف والكتابات التي تظهر في الصورة وبين تلك الموجودة في المعالم التاريخية الأخرى في نفس البلد، وما هي العناصر الفريدة التي تميز هذه الزخارف والكتابات؟ ("How do the ornaments and inscriptions in the image differ from those in other historical monuments in the same country, and which elements make them unique?"), category: Architecture, country: Tunisia
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CachedMultipleNegativesRankingLoss",
        "matryoshka_dims": [
            2048,
            1536,
            1024,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 2
  • learning_rate: 1e-05
  • warmup_steps: 0.03
  • bf16: True
  • per_device_eval_batch_size: 64
  • batch_sampler: no_duplicates

All Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 2
  • max_steps: -1
  • learning_rate: 1e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.03
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • per_device_eval_batch_size: 64
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}
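The non-default settings above map onto SentenceTransformerTrainingArguments roughly as follows. This is a sketch, not the exact training script: output_dir is a placeholder, and the card's fractional warmup_steps value of 0.03 is treated as a warmup ratio here.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",                       # placeholder, not from the card
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=2,
    learning_rate=1e-5,
    warmup_ratio=0.03,                         # card lists "warmup_steps: 0.03"
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```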

Training Logs

Epoch Step Training Loss Validation Loss pearl-ar-dev-hard_cosine_ndcg@10 pearl-ar-dev-hard-final_cosine_ndcg@10 pearl-ar-test-hard-final_cosine_ndcg@10
-1 -1 - - 0.1059 - -
0.1012 76 14.4746 - - - -
0.2011 151 - 29.3815 0.1083 - -
0.2024 152 13.2427 - - - -
0.3036 228 13.1588 - - - -
0.4021 302 - 27.7539 0.1048 - -
0.4048 304 13.5747 - - - -
0.5060 380 13.5705 - - - -
0.6032 453 - 24.9561 0.1052 - -
0.6072 456 13.7470 - - - -
0.7084 532 14.1068 - - - -
0.8043 604 - 26.0234 0.1124 - -
0.8096 608 14.1497 - - - -
0.9108 684 14.1645 - - - -
1.0053 755 - 24.7309 0.1186 - -
1.0120 760 14.1689 - - - -
1.1132 836 13.1368 - - - -
1.2064 906 - 28.6214 0.1204 - -
1.2144 912 12.6872 - - - -
1.3156 988 13.2517 - - - -
1.4075 1057 - 27.9877 0.1223 - -
1.4168 1064 12.9871 - - - -
1.5180 1140 12.9682 - - - -
1.6085 1208 - 28.5014 0.1252 - -
1.6192 1216 12.9447 - - - -
1.7204 1292 12.8688 - - - -
1.8096 1359 - 28.3317 0.1241 - -
1.8216 1368 13.2213 - - - -
1.9228 1444 13.1693 - - - -
2.0 1502 - 28.2879 0.1238 - -
-1 -1 - - - 0.1238 0.1309

Training Time

  • Training: 8.7 hours
  • Evaluation: 1.1 hours
  • Total: 9.8 hours

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 5.4.1
  • Transformers: 5.5.0
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.10.1
  • Datasets: 4.2.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}