Basira Omni 30B v0.1 — بصيرة
Basira (بصيرة) — insight, deep vision, perception. The capacity not just to see, but to understand what is seen.
A first open-source attempt at an Arabic multimodal large language model: a QLoRA adapter on top of NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning, fine-tuned on the Pearl dataset for image-grounded Arabic question answering.
This is a preview release (v0.1), trained on a sample of the Pearl dataset for one epoch. It is intended to demonstrate feasibility, not to be a production model. A larger run is planned.
Highlights
- Native Arabic generation — fluent, grammatical, and contextually appropriate
- 🖼️ Vision-grounded — accepts images and Arabic questions, produces Arabic answers
- 🪶 Lightweight adapter — 1.75 GB LoRA (rank 16) on top of the 30B-parameter base; full base model is 4-bit quantized for inference
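To confirm the adapter's rank and target modules locally, here is a quick sketch using peft's config loader; it reads only `adapter_config.json` from the Hub, not the weights:

```python
from peft import PeftConfig

# Loads the adapter config from the Hub and dispatches to the right
# config class (LoraConfig for this adapter).
cfg = PeftConfig.from_pretrained("Omartificial-Intelligence-Space/basira-omni-30b-v0.1")
print(cfg.r, cfg.lora_alpha, cfg.target_modules)  # expect r=16 per the card
```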
How to use
This repository contains a PEFT/LoRA adapter. You need the base model (nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning) plus this adapter on top. The snippet below loads the base in 4-bit and attaches the adapter; it assumes recent versions of transformers, peft, bitsandbytes, and accelerate are installed.
```python
import torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

BASE = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning"
ADAPTER = "Omartificial-Intelligence-Space/basira-omni-30b-v0.1"

# 4-bit NF4 quantization for the language backbone; the vision/audio towers,
# projectors, Mamba layers, and lm_head are skipped and kept in full precision.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=[
        "vision_tower", "vision_model", "vision_embed_tokens",
        "audio_tower", "audio_model",
        "multi_modal_projector", "mm_projector",
        "mamba", "lm_head",
    ],
)

processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Attach the Basira LoRA adapter on top of the quantized base.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Inference
image = Image.open("your_image.jpg").convert("RGB")
question = "ما الذي يظهر في الصورة؟"  # "What appears in the image?"

conv = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": question},
]}]
prompt = processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Filter processor extras the base LM doesn't accept
keep = {"input_ids", "attention_mask", "pixel_values"}
gen_inputs = {k: v for k, v in inputs.items() if k in keep}

with torch.no_grad():
    out = model.generate(**gen_inputs, max_new_tokens=384,
                         do_sample=False, repetition_penalty=1.05)

# Decode only the newly generated tokens (skip the prompt).
answer = processor.tokenizer.decode(
    out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```
⚠️ The model's outputs are wrapped in a chat-template structure like `[{'type': 'text', 'text': '...'}]` that it learned during training. Strip that wrapper at inference time if you want clean text.
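A minimal post-processing sketch: the `strip_wrapper` helper name is ours, and it assumes the wrapper arrives as a Python-style literal list of dicts, as in the example above; it falls back to the raw string otherwise:

```python
import ast

def strip_wrapper(raw: str) -> str:
    # Parse the [{'type': 'text', 'text': '...'}] wrapper, if present,
    # and join the text parts. Fall back to the raw string on failure.
    try:
        parsed = ast.literal_eval(raw.strip())
        if isinstance(parsed, list):
            return " ".join(
                part["text"] for part in parsed
                if isinstance(part, dict) and part.get("type") == "text"
            )
    except (ValueError, SyntaxError):
        pass
    return raw

print(strip_wrapper(answer))  # `answer` from the snippet above
```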
Examples
Two sample image+question pairs from the held-out Pearl eval split, with model output (no ground truth shown; this is what the model produces). English translations follow in italics.
1. Geography & economic development
Q: كيف يمكن أن تؤثر الأراضي الخصبة والغيل الغزير الذي يظهر في الصورة على تطور وادي ظهر؟
*("How might the fertile land and the abundant stream shown in the image affect the development of Wadi Dhahr?")*
A: الأراضي الخصبة والغيل الغزير في وادي ظهور ساهما في تطوير الزراعة والتجارة في المنطقة. هذا أدى إلى ازدهار الاقتصاد المحلي وازدهار المدينة. كما أن هذه الموارد الطبيعية ساعدت في جذب السكان وتعزيز نمو المدينة.
*("The fertile land and abundant stream in Wadi Dhuhur contributed to developing agriculture and trade in the region. This led to the flourishing of the local economy and the prosperity of the town. These natural resources also helped attract inhabitants and strengthen the town's growth.")* Note the model writes وادي ظهور where the question says وادي ظهر, an instance of the entity-recall issue listed under Limitations.
2. Imagined agricultural lifestyle
Q: تخيل أنك مزارع محلي في الجزء الجنوبي من القرية الذي يظهر في الصورة. كيف تستخدم الموارد الطبيعية المتاحة لك في زراعة المحاصيل وريها؟
*("Imagine you are a local farmer in the southern part of the village shown in the image. How do you use the natural resources available to you to grow and irrigate crops?")*
A: كمزارع محلي في الجزء الجنوبي من القرية، أستخدم الموارد الطبيعية المتاحة مثل مياه الري من القنوات التي تروي الحقول. أزرع الخضروات والفواكه الموسمية في الحقول الواسعة، وأهتم بترطيبها بشكل منتظم لضمان نموها الصحي. كما أستغل الأشجار والأشجار المثمرة في المنطقة لتوفير الظل والحماية للمحاصيل. هذا يساعدني على إنتاج منتجات طبيعية عالية الجودة ودعم الاقتصاد المحلي في القرية.
*("As a local farmer in the southern part of the village, I use the available natural resources, such as irrigation water from the canals that water the fields. I grow seasonal vegetables and fruits in the wide fields and keep them regularly watered to ensure healthy growth. I also make use of the trees and fruit-bearing trees in the area to provide shade and protection for the crops. This helps me produce high-quality natural products and support the village's local economy.")*
Limitations
This is a v0.1 preview. Known issues:
- Specific entity recall is weak. The model may name a location, person, or object incorrectly even when the surrounding description is correct. More training data and epochs are expected to improve this.
- Some answers are non-substantive — the model occasionally echoes the question framing rather than committing to an answer.
- English questions may receive Arabic answers — the model was trained on Arabic VQA only.
- Output formatting: generated text is wrapped in a `[{'type': 'text', 'text': '...'}]` chat-template structure. Post-process at inference time (see the helper above).
- Inherits base model limitations. This adapter does not change Nemotron-Omni's underlying knowledge cutoff or biases; it only steers the output toward Arabic VQA-style responses.
Intended use
- ✅ Research on Arabic vision-language modeling
- ✅ Building Arabic VQA prototypes / demos
- ✅ Benchmarking against a reproducible Arabic VLM baseline (see the evaluation sketch after this list)
- ❌ Fact-critical applications (medical, legal, news) — the entity-recall issue makes this unsuitable
- ❌ Production deployment without further fine-tuning
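For the benchmarking use case, here is a minimal evaluation-loop sketch. It reuses `model`, `processor`, and `strip_wrapper` from the snippets above; the Hub id `UBC-NLP/PEARL`, the split name, and the column names (`question`, `image`) are assumptions, so verify them against the Pearl dataset card before running:

```python
import torch
from datasets import load_dataset

# Hypothetical hub id, split, and columns -- check the Pearl dataset card.
eval_ds = load_dataset("UBC-NLP/PEARL", split="test")

keep = {"input_ids", "attention_mask", "pixel_values"}
for ex in eval_ds.select(range(5)):  # small smoke test
    conv = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": ex["question"]},
    ]}]
    prompt = processor.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=prompt, images=ex["image"], return_tensors="pt").to(model.device)
    gen_inputs = {k: v for k, v in inputs.items() if k in keep}
    with torch.no_grad():
        out = model.generate(**gen_inputs, max_new_tokens=256, do_sample=False)
    pred = processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
    print(strip_wrapper(pred))
```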
License & attribution
This adapter is released under the same NVIDIA Open Model License as the base model. By using this adapter you agree to its terms.
The Pearl training dataset is licensed separately by its authors — see the dataset card.
Citation
```bibtex
@misc{basira-omni-2026,
  title     = {Basira Omni 30B v0.1: An Arabic Vision-Language Model},
  author    = {Omartificial-Intelligence-Space},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Omartificial-Intelligence-Space/basira-omni-30b-v0.1}
}
```
Please also cite the base model and dataset:
```bibtex
@misc{nemotron-3-nano-omni-2025,
  title  = {Nemotron-3-Nano-Omni-30B-A3B-Reasoning},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning}
}
```
```bibtex
@inproceedings{alwajih-etal-2025-pearl,
title = "Pearl: A Multimodal Culturally-Aware {A}rabic Instruction Dataset",
author = "Alwajih, Fakhraddin and
Magdy, Samar M. and
El Mekki, Abdellah and
Nacar, Omer and
Nafea, Youssef and
Abdelfadil, Safaa Taher and
Yahya, Abdulfattah Mohammed and
Luqman, Hamzah and
Almarwani, Nada and
Aloufi, Samah and
Qawasmeh, Baraah and
Atou, Houdaifa and
Sibaee, Serry and
Alsayadi, Hamzah A. and
Al-Dhabyani, Walid and
Al-shaibani, Maged S. and
El aatar, Aya and
Qandos, Nour and
Alhamouri, Rahaf and
Ahmad, Samar and
AL-Ghrawi, Mohammed Anwar and
Yacoub, Aminetou and
AbuHweidi, Ruwa and
Lemin, Vatimetou Mohamed and
Abdel-Salam, Reem and
Bashiti, Ahlam and
Ammar, Adel and
Alansari, Aisha and
Ashraf, Ahmed and
Alturayeif, Nora and
Alcoba Inciarte, Alcides and
Elmadany, AbdelRahim A. and
Tourad, Mohamedou Cheikh and
Berrada, Ismail and
Jarrar, Mustafa and
Shehata, Shady and
Abdul-Mageed, Muhammad",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "[https://aclanthology.org/2025.findings-emnlp.1254/](https://aclanthology.org/2025.findings-emnlp.1254/)",
pages = "23048--23079",
ISBN = "979-8-89176-335-7"
}
```