# SO-100 JEPA-2 AC (Action-Conditioned) Model

This repository contains a V-JEPA 2-AC model trained on the SO-100 robotics dataset for the Ball-Cup task.

## Model Overview

- **Architecture**: V-JEPA 2 with Action-Conditioned Predictor
- **Vision Foundation**: ViT-Large (1024 embed dim)
- **Task**: Robotics control / world modeling (predicting future latents from current context and actions)
- **Dataset**: SO-100 Ball-Cup (robotics interaction)

## Directory Structure

```text
jepa-model/
├── config.json          # Model architecture and data configuration
├── pytorch_model.bin    # Best predictor weights (state dict)
├── vision_encoder.pt    # ViT-Large vision encoder weights (~5GB)
├── README.md            # Model documentation
├── train.py             # Source code for model and training
├── training_log.csv     # Training history
├── ckpt_ep*.pt          # Training checkpoints with optimizer state
└── latest.pt            # Latest training checkpoint
```

## How to Use

### Prerequisites

```bash
pip install torch numpy torchvision
```

### Loading the Model

This model requires both the **vision encoder** and the **action-conditioned predictor**.

```python
import json

import torch

from train import ActionConditionedVJepa, VisionEncoder  # classes defined in train.py

# 1. Load the vision encoder (ViT-L)
encoder = VisionEncoder(model_name="vit_large")
encoder.load_state_dict(torch.load("vision_encoder.pt", map_location="cpu"))
encoder.eval()

# 2. Load the predictor
with open("config.json", "r") as f:
    config = json.load(f)

predictor = ActionConditionedVJepa(config["architecture"]["predictor"])
predictor.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))
predictor.eval()
```

## Training Details

- **Loss**: L1 distance in latent space
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **Input**: 6 context frames + 2 predicted frames
- **Resolution**: 256x256

## Citation

If you use this model, please cite the original V-JEPA work and the SO-100 dataset.
```bibtex
@article{vjepa2,
  title   = {V-JEPA 2: Action-Conditioned World Models for Robotics},
  author  = {...},
  journal = {...},
  year    = {2024}
}
```
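## Inference Sketch (Illustrative Only)

The loading snippet above stops at `predictor.eval()`. The sketch below illustrates, with stand-in modules and dummy tensors, how an action-conditioned latent rollout and the L1 latent-space loss from the training details fit together. `TinyActionConditionedPredictor`, `ACTION_DIM`, and all tensor shapes here are assumptions for illustration only; the real classes and dimensions live in `train.py` and `config.json`.

```python
import torch
import torch.nn as nn

# Assumed dimensions -- the real values come from config.json.
EMBED_DIM = 1024         # ViT-Large latent size (from Model Overview)
ACTION_DIM = 7           # assumed action vector size (not specified in this README)
CONTEXT, HORIZON = 6, 2  # 6 context frames + 2 predicted frames (from Training Details)


class TinyActionConditionedPredictor(nn.Module):
    """Stand-in for the real predictor: maps (latent, action) -> next latent."""

    def __init__(self, embed_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + action_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, latent: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([latent, action], dim=-1))


predictor = TinyActionConditionedPredictor(EMBED_DIM, ACTION_DIM)

# Dummy tensors standing in for encoder outputs and robot actions.
context_latents = torch.randn(1, CONTEXT, EMBED_DIM)  # from the vision encoder
actions = torch.randn(1, HORIZON, ACTION_DIM)         # actions for the predicted steps
target_latents = torch.randn(1, HORIZON, EMBED_DIM)   # encoder latents of future frames

# Autoregressive rollout: start from the last context latent and
# feed each predicted latent back in with the next action.
latent = context_latents[:, -1]
predictions = []
for t in range(HORIZON):
    latent = predictor(latent, actions[:, t])
    predictions.append(latent)
predicted = torch.stack(predictions, dim=1)           # shape (1, HORIZON, EMBED_DIM)

# Training objective stated in this README: L1 distance in latent space.
loss = torch.nn.functional.l1_loss(predicted, target_latents)
print(predicted.shape, loss.item())
```

For planning-style use, the same rollout can be repeated for several candidate action sequences and the one whose predicted latents best match a goal latent selected; the sketch above covers only the single-sequence case.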