---
license: apache-2.0
library_name: torch
tags:
- robotics
- vjepa
- world-model
- computer-vision
- so100
datasets:
- rupesh386/so100-ball-cup
metrics:
- l1
---

# SO-100 JEPA-2 AC (Action-Conditioned) Model

This repository contains a V-JEPA 2-AC model trained on the SO-100 robotics dataset for the Ball-Cup task.

## Model Overview

- **Architecture**: V-JEPA 2 with Action-Conditioned Predictor
- **Vision Foundation**: ViT-Large (1024-dim embeddings)
- **Task**: Robotics control / world modeling (predicting future latents from context frames and actions)
- **Dataset**: SO-100 Ball-Cup (robotics interaction)

## Directory Structure

```text
jepa-model/
├── config.json          # Model architecture and data configuration
├── pytorch_model.bin    # Best predictor weights (state dict)
├── vision_encoder.pt    # ViT-Large vision encoder weights (~5GB)
├── README.md            # Model documentation
├── train.py             # Source code for model and training
├── training_log.csv     # Training history
├── ckpt_ep*.pt          # Training checkpoints with optimizer state
└── latest.pt            # Latest training checkpoint
```

## How to Use

### Prerequisites

```bash
pip install torch numpy torchvision
```

### Loading the Model

This model requires both the **Vision Encoder** and the **Action-Conditioned Predictor**.

```python
import json

import torch

from train import ActionConditionedVJepa, VisionEncoder  # classes defined in train.py

# 1. Load the vision encoder (ViT-L)
encoder = VisionEncoder(model_name="vit_large")
encoder.load_state_dict(torch.load("vision_encoder.pt", map_location="cpu"))
encoder.eval()

# 2. Load the action-conditioned predictor
with open("config.json", "r") as f:
    config = json.load(f)

predictor = ActionConditionedVJepa(config["architecture"]["predictor"])
predictor.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
predictor.eval()
```

For sketches of how the loaded modules fit together at inference and training time, see the appendix at the end of this card.

## Dataset Information

The model was trained on the **SO-100 Ball-Cup Robotics Dataset**.

- **Type**: Video-based robotics interaction demos.
- **Task**: Tracking and predicting the motion of a ball being caught in a cup by a robotic arm.
- **Observations**: Multi-view camera setup (primarily `observation.images.phone`).
- **Actions**: 6-DoF end-effector or joint-target commands (delta control).

## Training Progress (Loss Evolution)

Training was fully tracked and logged with **Weights & Biases (WandB)**. The live dashboard with detailed metrics is here:

**[WandB Training Dashboard: vjepa2-ac](https://wandb.ai/rupeshgarsondiya386/vjepa2-ac?nw=nwuserrupeshgarsondiya38)**

The evolution of the L1 loss in latent space over the course of training is shown below:

![Loss Evolution](loss_evolution.png)

*The plot shows a consistent decrease in both training and validation loss, indicating that the world-model dynamics were learned without significant overfitting.*

## Training Details

- **Loss**: L1 distance (mean absolute error) in latent space between predicted and target embeddings.
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Input**: 6 context frames + 6 context actions → 2 predicted future latents (a single step is sketched in the appendix below).
- **Resolution**: 256×256

## Citation

If you use this model, please cite the original V-JEPA 2 work and the SO-100 dataset.

```bibtex
@article{vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={...},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}
```
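
## Appendix: Illustrative Sketches

The exact call signatures of `VisionEncoder` and `ActionConditionedVJepa` are defined in `train.py` and are not reproduced on this card, so the sketches below illustrate the intended data flow rather than drop-in code. In this first sketch, the tensor shapes, `ACTION_DIM = 6`, and the way `encoder` and `predictor` are invoked are all assumptions; only the context length (6 frames, 6 actions), the prediction horizon (2 latents), the 256×256 resolution, and the 1024-dim embeddings come from the card above. `encoder` and `predictor` are the modules loaded in "Loading the Model".

```python
import torch

# Shapes follow the card: 6 context frames + 6 context actions
# -> 2 predicted future latents, at 256x256 input resolution.
B, T_CTX, T_PRED, ACTION_DIM = 1, 6, 2, 6  # ACTION_DIM=6 is an assumption (6-DoF)

frames = torch.randn(B, T_CTX, 3, 256, 256)   # context frames (dummy data)
actions = torch.randn(B, T_CTX, ACTION_DIM)   # context actions (dummy data)

with torch.no_grad():
    # Encode each context frame into a latent (hypothetical call signature).
    context_latents = encoder(frames.flatten(0, 1)).unflatten(0, (B, T_CTX))
    # Predict the next T_PRED latents from context + actions (hypothetical call signature).
    predicted_latents = predictor(context_latents, actions)

print(predicted_latents.shape)  # expected under these assumptions: (B, T_PRED, 1024)
```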
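
The training objective can be made concrete with a minimal single-step sketch. Continuing with the hypothetical call signatures above, and assuming the vision encoder is frozen while only the predictor is optimized (the card does not state this explicitly), the L1 latent-space loss, AdamW, and a cosine-with-warmup schedule might be wired together as follows; the learning rate, warmup length, and `T_max` are placeholders.

```python
import torch
import torch.nn.functional as F

# `encoder`/`predictor` as loaded above; `frames`/`actions` as in the previous sketch.
# `future_frames` stands in for the 2 ground-truth frames after the context window.
future_frames = torch.randn(1, 2, 3, 256, 256)

# AdamW, per Training Details; the lr value is a placeholder.
optimizer = torch.optim.AdamW(predictor.parameters(), lr=1e-4)

# Cosine schedule with linear warmup, one common plain-PyTorch construction.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500]
)

with torch.no_grad():
    # Assumption: prediction targets come from the frozen vision encoder.
    target_latents = encoder(future_frames.flatten(0, 1)).unflatten(0, (1, 2))
    context_latents = encoder(frames.flatten(0, 1)).unflatten(0, (1, 6))

predicted_latents = predictor(context_latents, actions)

# L1 (mean absolute error) between predicted and target embeddings.
loss = F.l1_loss(predicted_latents, target_latents)

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```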
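
Finally, the `ckpt_ep*.pt` and `latest.pt` files carry optimizer state for resuming training. The key names below (`"model"`, `"optimizer"`, `"epoch"`) are guesses about the checkpoint layout; the authoritative keys are whatever `train.py` saves.

```python
import torch

# Resume from the latest training checkpoint.
# Key names are assumptions -- check train.py for the actual layout.
ckpt = torch.load("latest.pt", map_location="cpu")

predictor.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt.get("epoch", 0) + 1
print(f"Resuming from epoch {start_epoch}")
```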