Basira Omni 30B v0.1 — بصيرة

Basira (بصيرة): insight, deep vision, perception. The capacity not just to see, but to understand what is seen.

A first open-source attempt at an Arabic multimodal large language model: a QLoRA adapter on top of NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning, fine-tuned on the Pearl dataset for image-grounded Arabic question answering.

This is a preview release (v0.1) — trained on a sample of the Pearl dataset for one epoch. It is intended to demonstrate feasibility, not to be a production model. A larger run is planned.


Highlights

  • Native Arabic generation — fluent, grammatical, and contextually appropriate
  • 🖼️ Vision-grounded — accepts images and Arabic questions, produces Arabic answers
  • 🪶 Lightweight adapter — a 1.75 GB LoRA (rank 16) on top of the 30B-parameter base; the base model is loaded in 4-bit (NF4) for inference. A quick way to verify the adapter configuration is sketched below.
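
To confirm the LoRA hyperparameters without downloading the adapter weights, a minimal check with PEFT (the repository ID is the one from this card; the attribute names assume a standard LoraConfig):

```python
from peft import PeftConfig

# Load only the adapter configuration and print its LoRA settings.
cfg = PeftConfig.from_pretrained("Omartificial-Intelligence-Space/basira-omni-30b-v0.1")
print(cfg.r, cfg.lora_alpha, cfg.target_modules)  # this card reports rank r=16
```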

How to use

This repository contains a PEFT/LoRA adapter. You need the base model (nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning) plus this adapter on top.

```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

BASE = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
ADAPTER = "Omartificial-Intelligence-Space/basira-omni-30b-v0.1"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=[
        "vision_tower", "vision_model", "vision_embed_tokens",
        "audio_tower", "audio_model",
        "multi_modal_projector", "mm_projector",
        "mamba", "lm_head",
    ],
)

processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Inference
image = Image.open("your_image.jpg").convert("RGB")
question = "ما الذي يظهر في الصورة؟"  # "What appears in the image?"

conv = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Filter processor extras the base LM doesn't accept
keep = {"input_ids", "attention_mask", "pixel_values"}
gen_inputs = {k: v for k, v in inputs.items() if k in keep}

with torch.no_grad():
    out = model.generate(**gen_inputs, max_new_tokens=384,
                         do_sample=False, repetition_penalty=1.05)

answer = processor.tokenizer.decode(
    out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```

⚠️ The model's outputs are wrapped in a chat-template structure like [{'type': 'text', 'text': '...'}] that it learned during training. Strip that wrapper at inference time if you want clean text.
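
A minimal unwrapping sketch, assuming the wrapper is emitted as a Python-literal list of {'type': 'text', 'text': ...} dicts as shown above (adjust the parsing if your outputs differ):

```python
import ast

def unwrap_answer(raw: str) -> str:
    """Best-effort removal of the [{'type': 'text', 'text': '...'}] wrapper."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        try:
            parts = ast.literal_eval(raw)
            texts = [p["text"] for p in parts if isinstance(p, dict) and "text" in p]
            if texts:
                return "\n".join(texts).strip()
        except (ValueError, SyntaxError):
            pass  # not a clean Python literal; fall back to the raw string
    return raw

print(unwrap_answer(answer))
```

If the model returns plain text rather than the wrapped structure, the helper simply passes it through unchanged.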


Examples

Two sample image+question pairs from the held-out Pearl eval split, with model output (no ground truth shown — this is what the model produces).

1. Geography & economic development

Example 1

Q: كيف يمكن أن تؤثر الأراضي الخصبة والغيل الغزير الذي يظهر في الصورة على تطور وادي ظهر؟
(How might the fertile land and the plentiful spring water shown in the image affect the development of Wadi Dhahr?)

A: الأراضي الخصبة والغيل الغزير في وادي ظهور ساهما في تطوير الزراعة والتجارة في المنطقة. هذا أدى إلى ازدهار الاقتصاد المحلي وازدهار المدينة. كما أن هذه الموارد الطبيعية ساعدت في جذب السكان وتعزيز نمو المدينة.
(The fertile land and plentiful spring water in Wadi Dhahr contributed to the development of agriculture and trade in the area. This led to the flourishing of the local economy and the town. These natural resources also helped attract inhabitants and support the town's growth.)


2. Imagined agricultural lifestyle

Example 2

Q: تخيل أنك مزارع محلي في الجزء الجنوبي من القرية الذي يظهر في الصورة. كيف تستخدم الموارد الطبيعية المتاحة لك في زراعة المحاصيل وريها؟
(Imagine you are a local farmer in the southern part of the village shown in the image. How would you use the natural resources available to you to grow and irrigate crops?)

A: كمزارع محلي في الجزء الجنوبي من القرية، أستخدم الموارد الطبيعية المتاحة مثل مياه الري من القنوات التي تروي الحقول. أزرع الخضروات والفواكه الموسمية في الحقول الواسعة، وأهتم بترطيبها بشكل منتظم لضمان نموها الصحي. كما أستغل الأشجار والأشجار المثمرة في المنطقة لتوفير الظل والحماية للمحاصيل. هذا يساعدني على إنتاج منتجات طبيعية عالية الجودة ودعم الاقتصاد المحلي في القرية.
(As a local farmer in the southern part of the village, I use the available natural resources, such as irrigation water from the channels that feed the fields. I grow seasonal vegetables and fruit in the wide fields and water them regularly to keep them healthy. I also make use of the trees, including the fruit-bearing ones, for shade and protection of the crops. This helps me produce high-quality natural products and support the village's local economy.)


Limitations

This is a v0.1 preview. Known issues:

  • Specific entity recall is weak. The model may name a location, person, or object incorrectly even when the surrounding description is correct. More training data and epochs are expected to improve this.
  • Some answers are non-substantive — the model occasionally echoes the question framing rather than committing to an answer.
  • English questions may receive Arabic answers — the model was trained on Arabic VQA only.
  • Output formatting: generated text is wrapped in a [{'type': 'text', 'text': '...'}] chat-template structure. Post-process at inference time (see the unwrapping sketch in the "How to use" section).
  • Inherits base model limitations. This adapter does not change Nemotron-Omni's underlying knowledge cutoff or biases — it only steers the output toward Arabic VQA-style responses.

Intended use

  • ✅ Research on Arabic vision-language modeling
  • ✅ Building Arabic VQA prototypes / demos
  • ✅ Benchmarking against a reproducible Arabic VLM baseline
  • ❌ Fact-critical applications (medical, legal, news) — the entity-recall issue makes this unsuitable
  • ❌ Production deployment without further fine-tuning

License & attribution

This adapter is released under the same NVIDIA Open Model License as the base model. By using this adapter you agree to the NVIDIA Open Model License terms.

The Pearl training dataset is licensed separately by its authors — see the dataset card.


Citation

```bibtex
@misc{basira-omni-2026,
  title  = {Basira Omni 30B v0.1: An Arabic Vision-Language Model},
  author = {Omartificial-Intelligence-Space},
  year   = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Omartificial-Intelligence-Space/basira-omni-30b-v0.1}
}
```

Please also cite the base model and dataset:

```bibtex
@misc{nemotron-3-nano-omni-2025,
  title = {Nemotron-3-Nano-Omni-30B-A3B-Reasoning},
  author = {NVIDIA},
  year = {2025},
  url = {https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning}
}
```

```bibtex
@inproceedings{alwajih-etal-2025-pearl,
    title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
    author = "Alwajih, Fakhraddin  and
      Magdy, Samar M.  and
      El Mekki, Abdellah  and
      Nacar, Omer  and
      Nafea, Youssef  and
      Abdelfadil, Safaa Taher  and
      Yahya, Abdulfattah Mohammed  and
      Luqman, Hamzah  and
      Almarwani, Nada  and
      Aloufi, Samah  and
      Qawasmeh, Baraah  and
      Atou, Houdaifa  and
      Sibaee, Serry  and
      Alsayadi, Hamzah A.  and
      Al-Dhabyani, Walid  and
      Al-shaibani, Maged S.  and
      El aatar, Aya  and
      Qandos, Nour  and
      Alhamouri, Rahaf  and
      Ahmad, Samar  and
      AL-Ghrawi, Mohammed Anwar  and
      Yacoub, Aminetou  and
      AbuHweidi, Ruwa  and
      Lemin, Vatimetou Mohamed  and
      Abdel-Salam, Reem  and
      Bashiti, Ahlam  and
      Ammar, Adel  and
      Alansari, Aisha  and
      Ashraf, Ahmed  and
      Alturayeif, Nora  and
      Alcoba Inciarte, Alcides  and
      Elmadany, AbdelRahim A.  and
      Tourad, Mohamedou Cheikh  and
      Berrada, Ismail  and
      Jarrar, Mustafa  and
      Shehata, Shady  and
      Abdul-Mageed, Muhammad",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "[https://aclanthology.org/2025.findings-emnlp.1254/](https://aclanthology.org/2025.findings-emnlp.1254/)",
    pages = "23048--23079",
    ISBN = "979-8-89176-335-7"
}
```