# Cosmos-Reason2-2B-Retail-Grocery-EgoExo
A BF16 LoRA adapter fine-tuned on the PRISM dataset for embodied video understanding in retail grocery environments.
## Links
- GitHub: DreamVu/Cosmos-Reason2-2B-Retail-Grocery-EgoExo
- Dataset: DreamVu/PRISM-100K
- Paper: arXiv:2603.29281
## Model Description
This model is a LoRA adapter for NVIDIA Cosmos-Reason2-2B, fine-tuned on 270K video SFT samples spanning 20+ tasks across egocentric and exocentric camera views in real-world retail stores.
- Base Model: nvidia/Cosmos-Reason2-2B (Qwen2.5-VL architecture, 2.49B parameters)
- Adapter: LoRA (rank=32, alpha=64, 49.3M trainable parameters, 1.98% of base)
- Training: BF16 precision, 1 epoch on 270K samples, 4x NVIDIA RTX PRO 6000 Blackwell GPUs
- Training Time: ~35 hours (7,942 steps)
- Adapter Size: 67MB
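The trainable-parameter fraction quoted above follows directly from the stated sizes. A quick arithmetic check (all values taken from this card):

```python
# Sanity check on the adapter stats quoted above (values from this card).
base_params = 2.49e9        # Cosmos-Reason2-2B total parameters
trainable_params = 49.3e6   # LoRA trainable parameters (rank=32, alpha=64)

fraction = trainable_params / base_params * 100
print(f"Trainable fraction: {fraction:.2f}%")  # → Trainable fraction: 1.98%
```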
## Capabilities
The model is fine-tuned across four capability domains:
| Domain | Tasks | Description |
|---|---|---|
| Embodied Reasoning (ER) | 9 tasks | Next subtask prediction, task completion, action reasoning, hand interaction, multi-actor understanding |
| Common Sense (CS) | 6 tasks | Scene VQA, environment understanding, affordance reasoning, causality, spatial reasoning |
| Spatial Perception (SP) | 2 tasks | Relative depth reasoning, 360-degree spatial layout |
| Intuitive Physics (IP) | 3+ tasks | Arrow-of-time detection, physics reasoning (CoT), object permanence |
## Performance
Fine-tuning on PRISM yields an average improvement of +23.8 percentage points over the zero-shot baseline across all evaluated tasks, with an average per-task error-rate reduction of 66.6%.
| Domain | Baseline | PRISM | Delta |
|---|---|---|---|
| ER (9 tasks) | 54.5% | 90.9% | +36.4 |
| CS (6 tasks) | 80.9% | 91.4% | +10.5 |
| SP (2 tasks) | 57.4% | 74.5% | +17.1 |
| IP (3 tasks) | 51.7% | 69.3% | +17.6 |
| Overall | 62.8% | 86.6% | +23.8 |
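The per-domain deltas can be reproduced from the baseline and fine-tuned accuracies in the table. Note that the 66.6% error-rate-reduction figure is an average over individual tasks, so recomputing at the domain level (as this sketch does) gives different values:

```python
# Recompute the deltas and domain-level error-rate reductions from the table above.
results = {
    "ER":      (54.5, 90.9),
    "CS":      (80.9, 91.4),
    "SP":      (57.4, 74.5),
    "IP":      (51.7, 69.3),
    "Overall": (62.8, 86.6),
}
for domain, (baseline, prism) in results.items():
    delta = prism - baseline
    # Error-rate reduction: fraction of the baseline error eliminated.
    err_reduction = delta / (100 - baseline) * 100
    print(f"{domain}: +{delta:.1f} pp, error reduction {err_reduction:.1f}%")
```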
## Qualitative Examples
See the 17 side-by-side video comparisons between the zero-shot baseline and the PRISM fine-tuned model across counting, hand interaction, goal reasoning, scene understanding, domain knowledge, and spatial reasoning tasks.
## Usage
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

# Load base model
base_model = AutoModelForVision2Seq.from_pretrained(
    "nvidia/Cosmos-Reason2-2B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("nvidia/Cosmos-Reason2-2B")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model, "DreamVu/Cosmos-Reason2-2B-Retail-Grocery-EgoExo"
)

# Inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "video", "video": "path/to/clip.mp4", "fps": 4},
        {"type": "text", "text": "What is the person doing in this video?"},
    ]},
]
# Requires a recent transformers release that supports multimodal chat templates.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```
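Training clips were encoded at 4 fps, H.264, 480p (see Training Details), so matching that format at inference time is a reasonable default. A minimal preprocessing sketch using ffmpeg, which must be installed separately; the file names here are placeholders:

```python
import os
import shutil
import subprocess

# Re-encode a raw clip to match the training format: 480p, 4 fps, H.264.
# "clip_raw.mp4" and "clip_480p_4fps.mp4" are placeholder paths.
cmd = [
    "ffmpeg", "-y",
    "-i", "clip_raw.mp4",
    "-vf", "scale=-2:480,fps=4",  # 480p height (even width), 4 frames/sec
    "-c:v", "libx264",
    "clip_480p_4fps.mp4",
]
if shutil.which("ffmpeg") and os.path.exists("clip_raw.mp4"):
    subprocess.run(cmd, check=True)
```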
## Training Details
| Parameter | Value |
|---|---|
| Base model | nvidia/Cosmos-Reason2-2B |
| Precision | BF16 (no quantization) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q, k, v, o, gate, up, down projections (language model only) |
| Learning rate | 1e-4 (cosine schedule, 5% warmup) |
| Batch size | 1 per GPU x 8 grad accumulation x 4 GPUs = 32 effective |
| Epochs | 1 |
| Training samples | 270K (ego + exo video) |
| Video encoding | 4 fps, H.264, 480p |
| Hardware | 4x NVIDIA RTX PRO 6000 Blackwell (96GB each) |
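Two derived numbers in the table can be checked with simple arithmetic; the alpha/rank scaling factor shown is standard LoRA behavior, included here for reference:

```python
# Effective batch size, from the hyperparameters above.
per_gpu_batch, grad_accum_steps, num_gpus = 1, 8, 4
effective_batch = per_gpu_batch * grad_accum_steps * num_gpus
print(effective_batch)  # → 32

# LoRA update scaling: weight deltas are scaled by alpha / rank.
lora_rank, lora_alpha = 32, 64
scaling = lora_alpha / lora_rank
print(scaling)  # → 2.0
```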
## Dataset
Trained on PRISM — a multi-view retail video SFT dataset with 270K samples across 20+ capability probes from egocentric, exocentric, and 360-degree cameras in real-world grocery stores.
## License
This model is a Derivative Model of nvidia/Cosmos-Reason2-2B and is released under the NVIDIA Open Model License. It is commercially usable, and you are free to create and distribute Derivative Models.
## Citation
If you use this model, please cite:
```bibtex
@misc{dreamvu2026prism,
  title={PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models},
  author={DreamVu AI},
  year={2026},
  url={https://arxiv.org/abs/2603.29281}
}
```
## Contact
For questions or commercial licensing: sales@dreamvu.ai