---
license: apache-2.0
library_name: torch
tags:
- robotics
- vjepa
- world-model
- computer-vision
- so100
datasets:
- rupesh386/so100-ball-cup
metrics:
- l1
---

# SO-100 JEPA-2 AC (Action-Conditioned) Model

This repository contains a V-JEPA 2-AC model trained on the SO-100 robotics dataset for the Ball-Cup task.

## Model Overview

- **Architecture**: V-JEPA 2 with Action-Conditioned Predictor
- **Vision Foundation**: ViT-Large (1024-dim embeddings)
- **Task**: Robotics control / world modeling (predicting future latents from context frames and actions)
- **Dataset**: SO-100 Ball-Cup (robotics interaction)

## Directory Structure

```text
jepa-model/
├── config.json          # Model architecture and data configuration
├── pytorch_model.bin    # Best predictor weights (state dict)
├── vision_encoder.pt    # ViT-Large vision encoder weights (~5GB)
├── README.md            # Model documentation
├── train.py             # Source code for model and training
├── training_log.csv     # Training history
├── ckpt_ep*.pt          # Training checkpoints with optimizer state
└── latest.pt            # Latest training checkpoint
```

## How to Use

### Prerequisites

```bash
pip install torch numpy torchvision
```

### Loading the Model

This model requires both the **Vision Encoder** and the **Action-Conditioned Predictor**.

```python
import json

import torch

from train import ActionConditionedVJepa, VisionEncoder  # classes defined in train.py

# 1. Load the vision encoder (ViT-L)
encoder = VisionEncoder(model_name="vit_large")
encoder.load_state_dict(torch.load("vision_encoder.pt", map_location="cpu"))
encoder.eval()

# 2. Load the action-conditioned predictor
with open("config.json", "r") as f:
    config = json.load(f)

predictor = ActionConditionedVJepa(config["architecture"]["predictor"])
predictor.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
predictor.eval()
```

For sketches of how the loaded modules fit together at inference and training time, see the appendix at the end of this card.

## Dataset Information

The model was trained on the **SO-100 Ball-Cup Robotics Dataset**.

- **Type**: Video-based robotics interaction demos.
- **Task**: Tracking and predicting the motion of a ball being caught in a cup by a robotic arm.
- **Observations**: Multi-view camera setup (primarily `observation.images.phone`).
- **Actions**: 6-DoF end-effector or joint-target commands (delta control).

## Training Progress (Loss Evolution)

Training was fully tracked and logged with **Weights & Biases (WandB)**. The live dashboard with detailed metrics is here:

**[WandB Training Dashboard: vjepa2-ac](https://wandb.ai/rupeshgarsondiya386/vjepa2-ac?nw=nwuserrupeshgarsondiya38)**

The evolution of the L1 loss in latent space over the course of training is shown below:

![Loss Evolution](loss_evolution.png)

*The plot shows a consistent decrease in both training and validation loss, indicating that the world-model dynamics were learned without significant overfitting.*

## Training Details

- **Loss**: L1 distance (mean absolute error) in latent space between predicted and target embeddings.
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Input**: 6 context frames + 6 context actions → 2 predicted future latents (a single step is sketched in the appendix below).
- **Resolution**: 256×256

## Citation

If you use this model, please cite the original V-JEPA 2 work and the SO-100 dataset.

```bibtex
@article{vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={...},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}
```
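
## Appendix: Illustrative Sketches

The exact call signatures of `VisionEncoder` and `ActionConditionedVJepa` are defined in `train.py` and are not reproduced on this card, so the sketches below illustrate the intended data flow rather than drop-in code. In this first sketch, the tensor shapes, `ACTION_DIM = 6`, and the way `encoder` and `predictor` are invoked are all assumptions; only the context length (6 frames, 6 actions), the prediction horizon (2 latents), the 256×256 resolution, and the 1024-dim embeddings come from the card above. `encoder` and `predictor` are the modules loaded in "Loading the Model".

```python
import torch

# Shapes follow the card: 6 context frames + 6 context actions
# -> 2 predicted future latents, at 256x256 input resolution.
B, T_CTX, T_PRED, ACTION_DIM = 1, 6, 2, 6  # ACTION_DIM=6 is an assumption (6-DoF)

frames = torch.randn(B, T_CTX, 3, 256, 256)   # context frames (dummy data)
actions = torch.randn(B, T_CTX, ACTION_DIM)   # context actions (dummy data)

with torch.no_grad():
    # Encode each context frame into a latent (hypothetical call signature).
    context_latents = encoder(frames.flatten(0, 1)).unflatten(0, (B, T_CTX))
    # Predict the next T_PRED latents from context + actions (hypothetical call signature).
    predicted_latents = predictor(context_latents, actions)

print(predicted_latents.shape)  # expected under these assumptions: (B, T_PRED, 1024)
```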
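
The training objective can be made concrete with a minimal single-step sketch. Continuing with the hypothetical call signatures above, and assuming the vision encoder is frozen while only the predictor is optimized (the card does not state this explicitly), the L1 latent-space loss, AdamW, and a cosine-with-warmup schedule might be wired together as follows; the learning rate, warmup length, and `T_max` are placeholders.

```python
import torch
import torch.nn.functional as F

# `encoder`/`predictor` as loaded above; `frames`/`actions` as in the previous sketch.
# `future_frames` stands in for the 2 ground-truth frames after the context window.
future_frames = torch.randn(1, 2, 3, 256, 256)

# AdamW, per Training Details; the lr value is a placeholder.
optimizer = torch.optim.AdamW(predictor.parameters(), lr=1e-4)

# Cosine schedule with linear warmup, one common plain-PyTorch construction.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500]
)

with torch.no_grad():
    # Assumption: prediction targets come from the frozen vision encoder.
    target_latents = encoder(future_frames.flatten(0, 1)).unflatten(0, (1, 2))
    context_latents = encoder(frames.flatten(0, 1)).unflatten(0, (1, 6))

predicted_latents = predictor(context_latents, actions)

# L1 (mean absolute error) between predicted and target embeddings.
loss = F.l1_loss(predicted_latents, target_latents)

optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```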
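
Finally, the `ckpt_ep*.pt` and `latest.pt` files carry optimizer state for resuming training. The key names below (`"model"`, `"optimizer"`, `"epoch"`) are guesses about the checkpoint layout; the authoritative keys are whatever `train.py` saves.

```python
import torch

# Resume from the latest training checkpoint.
# Key names are assumptions -- check train.py for the actual layout.
ckpt = torch.load("latest.pt", map_location="cpu")

predictor.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_epoch = ckpt.get("epoch", 0) + 1
print(f"Resuming from epoch {start_epoch}")
```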