# Cosmos-Reason2-2B-Retail-Grocery-EgoExo
A BF16 LoRA adapter fine-tuned on the PRISM dataset for embodied video understanding in retail grocery environments.
## Links
- GitHub: DreamVu/Cosmos-Reason2-2B-Retail-Grocery-EgoExo
- Dataset: DreamVu/PRISM-100K
- Paper: arXiv:2603.29281
## Model Description
This model is a LoRA adapter for NVIDIA Cosmos-Reason2-2B, fine-tuned on 270K video SFT samples spanning 20+ tasks across egocentric and exocentric camera views in real-world retail stores.
- Base Model: nvidia/Cosmos-Reason2-2B (Qwen2.5-VL architecture, 2.49B parameters)
- Adapter: LoRA (rank=32, alpha=64, 49.3M trainable parameters, 1.98% of base)
- Training: BF16 precision, 1 epoch on 270K samples, 4x NVIDIA RTX PRO 6000 Blackwell GPUs
- Training Time: ~35 hours (7,942 steps)
- Adapter Size: 67MB
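The trainable-parameter fraction quoted above follows directly from the stated sizes. A quick arithmetic check (all values taken from this card):

```python
# Sanity check on the adapter stats quoted above (values from this card).
base_params = 2.49e9        # Cosmos-Reason2-2B total parameters
trainable_params = 49.3e6   # LoRA trainable parameters (rank=32, alpha=64)

fraction = trainable_params / base_params * 100
print(f"Trainable fraction: {fraction:.2f}%")  # → Trainable fraction: 1.98%
```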
## Capabilities
The model is fine-tuned across four capability domains:
| Domain | Tasks | Description |
|---|---|---|
| Embodied Reasoning (ER) | 9 tasks | Next subtask prediction, task completion, action reasoning, hand interaction, multi-actor understanding |
| Common Sense (CS) | 6 tasks | Scene VQA, environment understanding, affordance reasoning, causality, spatial reasoning |
| Spatial Perception (SP) | 2 tasks | Relative depth reasoning, 360-degree spatial layout |
| Intuitive Physics (IP) | 3+ tasks | Arrow-of-time detection, physics reasoning (CoT), object permanence |
## Performance
Fine-tuning on PRISM yields an average improvement of +23.8 percentage points over the zero-shot baseline across all evaluated tasks, with an average per-task error-rate reduction of 66.6%.
| Domain | Baseline | PRISM | Delta |
|---|---|---|---|
| ER (9 tasks) | 54.5% | 90.9% | +36.4 |
| CS (6 tasks) | 80.9% | 91.4% | +10.5 |
| SP (2 tasks) | 57.4% | 74.5% | +17.1 |
| IP (3 tasks) | 51.7% | 69.3% | +17.6 |
| Overall | 62.8% | 86.6% | +23.8 |
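The per-domain deltas can be reproduced from the baseline and fine-tuned accuracies in the table. Note that the 66.6% error-rate-reduction figure is an average over individual tasks, so recomputing at the domain level (as this sketch does) gives different values:

```python
# Recompute the deltas and domain-level error-rate reductions from the table above.
results = {
    "ER":      (54.5, 90.9),
    "CS":      (80.9, 91.4),
    "SP":      (57.4, 74.5),
    "IP":      (51.7, 69.3),
    "Overall": (62.8, 86.6),
}
for domain, (baseline, prism) in results.items():
    delta = prism - baseline
    # Error-rate reduction: fraction of the baseline error eliminated.
    err_reduction = delta / (100 - baseline) * 100
    print(f"{domain}: +{delta:.1f} pp, error reduction {err_reduction:.1f}%")
```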
## Qualitative Examples
See the 17 side-by-side video comparisons between the zero-shot baseline and the PRISM fine-tuned model across counting, hand interaction, goal reasoning, scene understanding, domain knowledge, and spatial reasoning tasks.
## Usage
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

# Load base model
base_model = AutoModelForVision2Seq.from_pretrained(
    "nvidia/Cosmos-Reason2-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason2-2B")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model, "DreamVu/Cosmos-Reason2-2B-Retail-Grocery-EgoExo"
)

# Inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "video", "video": "path/to/clip.mp4", "fps": 4},
        {"type": "text", "text": "What is the person doing in this video?"},
    ]},
]
# Requires a recent transformers release that supports multimodal chat templates.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
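Training clips were encoded at 4 fps, H.264, 480p (see Training Details), so matching that format at inference time is a reasonable default. A minimal preprocessing sketch using ffmpeg, which must be installed separately; the file names here are placeholders:

```python
import os
import shutil
import subprocess

# Re-encode a raw clip to match the training format: 480p, 4 fps, H.264.
# "clip_raw.mp4" and "clip_480p_4fps.mp4" are placeholder paths.
cmd = [
    "ffmpeg", "-y",
    "-i", "clip_raw.mp4",
    "-vf", "scale=-2:480,fps=4",  # 480p height (even width), 4 frames/sec
    "-c:v", "libx264",
    "clip_480p_4fps.mp4",
]
if shutil.which("ffmpeg") and os.path.exists("clip_raw.mp4"):
    subprocess.run(cmd, check=True)
```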
## Training Details
| Parameter | Value |
|---|---|
| Base model | nvidia/Cosmos-Reason2-2B |
| Precision | BF16 (no quantization) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q, k, v, o, gate, up, down projections (language model only) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Batch size | 1 per GPU x 8 grad accumulation x 4 GPUs = 32 effective |
| Epochs | 1 |
| Training samples | 270K (ego + exo video) |
| Video encoding | 4 fps, H.264, 480p |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96GB each) |
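Two derived numbers in the table can be checked with simple arithmetic; the alpha/rank scaling factor shown is standard LoRA behavior, included here for reference:

```python
# Effective batch size, from the hyperparameters above.
per_gpu_batch, grad_accum_steps, num_gpus = 1, 8, 4
effective_batch = per_gpu_batch * grad_accum_steps * num_gpus
print(effective_batch)  # → 32

# LoRA update scaling: weight deltas are scaled by alpha / rank.
lora_rank, lora_alpha = 32, 64
scaling = lora_alpha / lora_rank
print(scaling)  # → 2.0
```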
## Dataset
Trained on PRISM — a multi-view retail video SFT dataset with 270K samples across 20+ capability probes from egocentric, exocentric, and 360-degree cameras in real-world grocery stores.
## License
This model is a Derivative Model of nvidia/Cosmos-Reason2-2B and is released under the NVIDIA Open Model License. It is commercially usable, and you are free to create and distribute Derivative Models.
## Citation
If you use this model, please cite:
```bibtex
@misc{dreamvu2026prism,
  title={PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models},
  author={DreamVu AI},
  year={2026},
  url={https://arxiv.org/abs/2603.29281}
}
```
## Contact
For questions or commercial licensing: sales@dreamvu.ai