
Cosmos-Reason2-2B-Retail-Grocery-EgoExo

A BF16 LoRA adapter fine-tuned on the PRISM dataset for embodied video understanding in retail grocery environments.

Model Description

This model is a LoRA adapter for NVIDIA Cosmos-Reason2-2B, fine-tuned on 270K video SFT samples spanning 20+ tasks across egocentric and exocentric camera views in real-world retail stores.

  • Base Model: nvidia/Cosmos-Reason2-2B (Qwen2.5-VL architecture, 2.49B parameters)
  • Adapter: LoRA (rank=32, alpha=64, 49.3M trainable parameters, 1.98% of base)
  • Training: BF16 precision, 1 epoch on 270K samples, 4x NVIDIA RTX PRO 6000 Blackwell GPUs
  • Training Time: ~35 hours (7,942 steps)
  • Adapter Size: 67MB
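The adapter hyperparameters above can be summarized in code. This is a hedged sketch reconstructed from the numbers in this card, not the actual training configuration; the projection-module names follow the usual Qwen2.5-VL naming convention and are assumptions:

```python
# Hedged reconstruction of the adapter setup from the card's numbers.
# Module names follow Qwen2.5-VL conventions and are assumptions.
lora_config = {
    "r": 32,            # LoRA rank
    "lora_alpha": 64,   # scaling: alpha / r = 2.0
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
}

# Sanity check: 49.3M trainable parameters out of 2.49B total
trainable, total = 49.3e6, 2.49e9
print(f"{100 * trainable / total:.2f}% of base parameters are trainable")  # 1.98%
```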

Capabilities

The model is fine-tuned across four capability domains:

| Domain | Tasks | Description |
| --- | --- | --- |
| Embodied Reasoning (ER) | 9 tasks | Next subtask prediction, task completion, action reasoning, hand interaction, multi-actor understanding |
| Common Sense (CS) | 6 tasks | Scene VQA, environment understanding, affordance reasoning, causality, spatial reasoning |
| Spatial Perception (SP) | 2 tasks | Relative depth reasoning, 360-degree spatial layout |
| Intuitive Physics (IP) | 3+ tasks | Arrow-of-time detection, physics reasoning (CoT), object permanence |

Performance

Fine-tuning on PRISM improves average accuracy by 23.8 percentage points over the zero-shot baseline across all evaluated tasks, corresponding to an average error-rate reduction of 66.6%.

| Domain | Baseline | PRISM | Delta |
| --- | --- | --- | --- |
| ER (9 tasks) | 54.5% | 90.9% | +36.4 |
| CS (6 tasks) | 80.9% | 91.4% | +10.5 |
| SP (2 tasks) | 57.4% | 74.5% | +17.1 |
| IP (3 tasks) | 51.7% | 69.3% | +17.6 |
| Overall | 62.8% | 86.6% | +23.8 |
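The error-rate reduction can be cross-checked from the accuracies in the table with `reduction = 1 - (1 - acc_finetuned) / (1 - acc_baseline)`. Recomputing from the macro-averaged Overall row gives roughly a 64% reduction; the reported 66.6% figure is presumably averaged per task rather than per domain. A small sketch of the arithmetic:

```python
# Error-rate reduction implied by the accuracy table:
# reduction = 1 - (1 - acc_finetuned) / (1 - acc_baseline)
rows = {
    "ER": (0.545, 0.909),
    "CS": (0.809, 0.914),
    "SP": (0.574, 0.745),
    "IP": (0.517, 0.693),
    "Overall": (0.628, 0.866),
}
for domain, (base, ft) in rows.items():
    reduction = 1 - (1 - ft) / (1 - base)
    print(f"{domain}: {100 * reduction:.1f}% fewer errors")
```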

Qualitative Examples

See 17 side-by-side video comparisons between the zero-shot baseline and the PRISM-fine-tuned model across counting, hand interaction, goal reasoning, scene understanding, domain knowledge, and spatial reasoning tasks:

View Demo Gallery

Usage

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Load base model in BF16
base_model = AutoModelForVision2Seq.from_pretrained(
    "nvidia/Cosmos-Reason2-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason2-2B")

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "DreamVu/Cosmos-Reason2-2B-Retail-Grocery-EgoExo")

# Build the chat prompt and extract video frames
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "video", "video": "path/to/clip.mp4", "fps": 4},
        {"type": "text", "text": "What is the person doing in this video?"}
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
_, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], videos=video_inputs, return_tensors="pt").to(model.device)

# Generate and decode only the newly generated tokens
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Training Details

| Parameter | Value |
| --- | --- |
| Base model | nvidia/Cosmos-Reason2-2B |
| Precision | BF16 (no quantization) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q, k, v, o, gate, up, down projections (language model only) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Batch size | 1 per GPU x 8 grad accumulation x 4 GPUs = 32 effective |
| Epochs | 1 |
| Training samples | 270K (ego + exo video) |
| Video encoding | 4 fps, H.264, 480p |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96GB each) |
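The batch-size and warmup rows can be cross-checked with simple arithmetic (a hedged sketch; it assumes the 5% warmup fraction is applied to the ~7,942 total optimizer steps reported above):

```python
# Effective batch size: per-GPU batch x grad accumulation steps x GPU count
per_gpu, grad_accum, n_gpus = 1, 8, 4
effective_batch = per_gpu * grad_accum * n_gpus
print(effective_batch)  # 32

# Warmup: 5% of the 7,942 total optimizer steps
total_steps = 7942
warmup_steps = int(0.05 * total_steps)
print(warmup_steps)  # 397
```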

Dataset

Trained on PRISM — a multi-view retail video SFT dataset with 270K samples across 20+ capability probes from egocentric, exocentric, and 360-degree cameras in real-world grocery stores.

License

This model is a Derivative Model of nvidia/Cosmos-Reason2-2B released under the NVIDIA Open Model License. The model is commercially usable. You are free to create and distribute Derivative Models.

Citation

If you use this model, please cite:

@misc{dreamvu2026prism,
    title={PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models},
    author={DreamVu AI},
    year={2026},
    url={https://arxiv.org/abs/2603.29281}
}

Contact

For questions or commercial licensing: sales@dreamvu.ai
