Habitat 3.0 Social Rearrangement: Go

Trained weights for the Pereason + Go multi-agent policy on the Social Rearrangement task from Habitat 3.0. Two embodied agents, a Boston Dynamics Spot robot and a humanoid, cooperate to rearrange objects across 37 HSSD scenes.

This model uses no inter-agent communication. Each agent independently perceives its environment through Pereason (vision) and selects skills through Go (transformer). The model serves as the no-communication baseline for measuring the impact of learned communication via Fabric.

This work is part of the thesis "Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics" by Benjamin Kubwimana.

What's in this repo

File
model.pth
The checkpoint contains the full model state dict for both agents (keys 0, 1) plus training config.
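
To check what the checkpoint actually holds before wiring it into a policy, a small helper can list its top-level layout. The sketch below is an illustration only: the function name `summarize_checkpoint` is hypothetical, and the nested layout in the example dictionary is a stand-in, since this card only states that the file holds per-agent state under keys 0 and 1 plus a training config.

```python
from collections.abc import Mapping
from typing import Any


def summarize_checkpoint(ckpt: Mapping[Any, Any], max_depth: int = 2) -> list[str]:
    """Return one line per entry, descending into nested mappings up to max_depth."""
    lines: list[str] = []

    def walk(node: Mapping[Any, Any], prefix: str, depth: int) -> None:
        for key, value in node.items():
            path = f"{prefix}/{key}" if prefix else str(key)
            if isinstance(value, Mapping) and depth < max_depth:
                lines.append(f"{path}: mapping with {len(value)} entries")
                walk(value, path, depth + 1)
            else:
                # Tensors and arrays expose .shape; anything else is shown by type name.
                shape = getattr(value, "shape", None)
                desc = f"shape={tuple(shape)}" if shape is not None else type(value).__name__
                lines.append(f"{path}: {desc}")

    walk(ckpt, "", 1)
    return lines


# Stand-in dictionary; the real layout of model.pth may differ.
example = {0: {"weights": [0.1, 0.2]}, 1: {"weights": [0.3, 0.4]}, "config": {"lr": 1e-4}}
for line in summarize_checkpoint(example):
    print(line)
```

With PyTorch installed, the same helper can be applied to the object returned by `torch.load("model.pth", map_location="cpu")`. Recent PyTorch releases default to `weights_only=True`, so a checkpoint that stores config objects may need `weights_only=False`, which should only be used on files you trust.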

Task overview

Each episode drops two agents into an HSSD home scene with objects that need to be moved to goal locations. The task is structured as a PDDL planning problem with four subgoal stages:

  • Stage 1.1: Agent 0 (Spot) picks up its target object
  • Stage 1.2: Agent 0 places the object at the goal
  • Stage 2.1: Agent 1 (Humanoid) picks up its target object
  • Stage 2.2: Agent 1 places the object at the goal

Full success (pddl_success) requires both agents to complete all subgoals within 750 timesteps.
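
The success criterion above can be written as a small check. This is a sketch of the rule as described in this card, not the Habitat implementation; the class and field names are hypothetical.

```python
from dataclasses import dataclass

MAX_EPISODE_STEPS = 750  # episode budget stated in this card


@dataclass
class StageStatus:
    """Completion flags for the four subgoal stages (hypothetical field names)."""
    stage_1_1: bool = False  # Agent 0 (Spot) picked its object
    stage_1_2: bool = False  # Agent 0 placed the object at the goal
    stage_2_1: bool = False  # Agent 1 (Humanoid) picked its object
    stage_2_2: bool = False  # Agent 1 placed the object at the goal


def pddl_success(status: StageStatus, steps_taken: int) -> bool:
    """Full success: every stage is complete and the step budget was not exceeded."""
    all_done = all((status.stage_1_1, status.stage_1_2, status.stage_2_1, status.stage_2_2))
    return all_done and steps_taken <= MAX_EPISODE_STEPS


partial = StageStatus(stage_1_1=True, stage_1_2=True, stage_2_1=True)
print(pddl_success(partial, steps_taken=400))  # one stage is still missing
```

Because each stage is tracked separately, the per-stage success rates reported below can exceed the full task success rate.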

Architecture

Both agents use a hierarchical RL policy with two main components:

Pereason (Perception + Reasoning)

  • VLM backbone: SmolVLM2-500M-Video-Instruct (frozen, 303M params), which processes RGB via a ViT encoder and generates semantic tokens through a truncated language decoder (16 of 32 layers)
  • Depth encoder: Depth Anything V2 Small (24.8M params, of which 22.1M are trainable), which encodes depth images into geometric tokens
  • PDDL task instructions are tokenized and fed to the VLM alongside the visual input

Go (Skill Selector)

  • 8-block transformer that attends over semantic + geometric tokens via cross-attention, then selects a high-level skill
  • Outputs a categorical distribution over available actions (nav_to_obj, pick, place, etc.)
  • 8 learnable query tokens, 128 context tokens
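
The core operation in Go, a small set of query tokens attending over a larger set of context tokens, can be illustrated with plain scaled dot-product cross-attention. The sketch below uses only the standard library and makes several simplifying assumptions: a single head, no learned projections, random values in place of learned queries and real tokens, and an arbitrary token width of 16. It shows the shape of the computation (8 queries over 128 context tokens), not the actual 8-block model.

```python
import math
import random


def cross_attention(queries: list[list[float]], context: list[list[float]]) -> list[list[float]]:
    """Single-head scaled dot-product attention: each query mixes the context vectors."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every context token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        # Numerically stable softmax over the context tokens.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the context vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return outputs


rng = random.Random(0)
DIM = 16  # illustrative width, not the model's real hidden size
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(8)]    # 8 query tokens
context = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(128)]  # 128 context tokens
attended = cross_attention(queries, context)
print(len(attended), len(attended[0]))
```

The output always has one vector per query, so the 128 context tokens are compressed into 8 vectors regardless of how the context is split between semantic and geometric tokens, a split this card does not specify.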

Trainable parameters (per agent)

| Module | Total params | Trainable params |
| --- | --- | --- |
| SmolVLM2 (VLM) | 303M | 0 (frozen) |
| Depth Anything V2 | 24.8M | 22.1M |
| Go (transformer) | 73.5M | 73.5M |
| Total | 401.3M | 95.6M |
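
The totals can be re-derived from the per-module figures, which also gives the share of parameters that receive gradients. The snippet below is plain arithmetic on the numbers listed in this section; the variable names are illustrative.

```python
# (total, trainable) parameter counts in millions, per agent, as listed in this section.
PARAMS_M = {
    "SmolVLM2 (VLM)": (303.0, 0.0),
    "Depth Anything V2": (24.8, 22.1),
    "Go (transformer)": (73.5, 73.5),
}

total = round(sum(t for t, _ in PARAMS_M.values()), 1)
trainable = round(sum(tr for _, tr in PARAMS_M.values()), 1)
trainable_share = trainable / total

print(f"total={total}M trainable={trainable}M share={trainable_share:.1%}")
```

Roughly a quarter of each agent's parameters are updated during training, since the VLM backbone stays frozen.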

Low-level skills

The low-level skills invoked by Go (pick, place, nav_to_obj, etc.) combine oracle navigation with learned manipulation and rely on privileged simulator information.

Observations per agent:

  • RGB camera image (arm cam for Spot, head cam for Humanoid)
  • Depth camera image
  • Binary is_holding flag
  • GPS+compass to object start/goal positions
  • Relative GPS to the other agent

Training details

| Parameter | Value |
| --- | --- |
| Total frames | ~63M |
| Learning rate | 1e-4 (Go), 1e-5 (Depth encoder) |
| PPO epochs | 2 |
| Mini-batches | 10 |
| Clip param | 0.2 |
| Discount (gamma) | 0.99 |
| GAE (tau) | 0.95 |
| Entropy coef | 0.001 |
| Max grad norm | 0.5 |
| Trainer | DD-PPO |
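
To make the roles of the discount and GAE parameters concrete, here is a minimal sketch of generalized advantage estimation using the values above (gamma = 0.99, tau = 0.95). It is a textbook formulation on plain Python lists, written for illustration, and is not the DD-PPO trainer code.

```python
def compute_gae(rewards, values, dones, gamma=0.99, tau=0.95):
    """Generalized advantage estimation over one rollout.

    rewards, dones: length T. values: length T + 1, where the last entry is the
    bootstrap value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the next value is masked at episode boundaries.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode boundaries.
        gae = delta + gamma * tau * not_done * gae
        advantages[t] = gae
    return advantages


# Two-step episode with a single terminal reward and zero value estimates.
adv = compute_gae(rewards=[0.0, 1.0], values=[0.0, 0.0, 0.0], dones=[False, True])
print(adv)
```

In this toy rollout the terminal reward is credited to the earlier step with weight gamma * tau, which is how tau trades bias against variance in the advantage estimates used by PPO.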

Best metrics

| Metric | Value |
| --- | --- |
| Reward | 23.6 |
| Task success | 69.6% |
| Stage 1.1 (Spot picks) | 91.1% |
| Stage 1.2 (Spot places) | 77.2% |
| Stage 2.1 (Humanoid picks) | 89.3% |
| Stage 2.2 (Humanoid places) | 80.8% |
| Collision rate | 15.4% |
| Cooperation reward | +2.77 |

Comparison with Fabric (learned communication)

| Metric | Go (no comm) | Fabric | Delta |
| --- | --- | --- | --- |
| Task success | 69.6% | 76.8% | +7.2pp |
| Reward | 23.6 | 25.5 | +1.9 |
| Collision rate | 15.4% | 10.7% | -4.7pp |
| Cooperation reward | +2.77 | +2.79 | +0.02 |
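
The deltas in this comparison are absolute differences (Fabric minus Go). The short snippet below recomputes them from the reported values and also derives the relative change, which can be easier to compare across metrics with different scales. The dictionary keys are illustrative names, not metric identifiers from Habitat.

```python
# (Go without communication, Fabric) values as reported in this section.
RESULTS = {
    "task_success_pct": (69.6, 76.8),
    "reward": (23.6, 25.5),
    "collision_rate_pct": (15.4, 10.7),
    "cooperation_reward": (2.77, 2.79),
}


def deltas(results):
    """Absolute change, Fabric minus Go, rounded to two decimals."""
    return {name: round(fabric - go, 2) for name, (go, fabric) in results.items()}


def relative_changes_pct(results):
    """Change relative to the Go baseline, in percent."""
    return {name: round(100.0 * (fabric - go) / go, 1) for name, (go, fabric) in results.items()}


absolute = deltas(RESULTS)
relative = relative_changes_pct(RESULTS)
for name in RESULTS:
    print(f"{name}: {absolute[name]:+} absolute, {relative[name]:+}% relative")
```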

Training curve

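| Frames | Reward | Task success | Collisions | Cooperation |
| --- | --- | --- | --- | --- |
| 0M | 0.9 | 0.0% | 100.0% | -0.41 |
| 3M | 7.8 | 2.0% | 31.2% | -0.06 |
| 10M | 9.5 | 6.2% | 24.5% | +0.14 |
| 25M | 11.7 | 11.3% | 19.9% | +0.32 |
| 40M | 14.6 | 26.8% | 23.5% | +0.81 |
| 55M | 22.8 | 67.3% | 19.6% | +2.62 |
| 63M | 23.6 | 69.6% | 15.4% | +2.77 |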

How to evaluate

Requires Habitat 3.0 (v0.3.3) with habitat-baselines, habitat-sim, and the Pereason+Go policy classes.

python -u -m habitat_baselines.run \
    --config-name=social_rearrange/pereason_go \
    habitat_baselines.evaluate=True \
    habitat_baselines.eval.should_load_ckpt=True \
    habitat_baselines.eval_ckpt_path_dir=model.pth \
    habitat_baselines.test_episode_count=50 \
    habitat_baselines.num_environments=1 \
    habitat.dataset.data_path=data/datasets/hab3_episodes/val/social_rearrange.json.gz \
    habitat.dataset.scenes_dir=data/scene_datasets \
    'habitat_baselines.eval.video_option=["disk"]'
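
For scripted evaluation runs, the same command can be assembled in Python so that the checkpoint path and episode count become parameters. This is a convenience sketch: the function name `build_eval_command` is hypothetical, and the overrides are copied from the command above rather than taken from the habitat-baselines documentation.

```python
import shlex


def build_eval_command(ckpt_path: str = "model.pth", episodes: int = 50) -> list[str]:
    """Return the evaluation command from this card as an argument list."""
    overrides = [
        "--config-name=social_rearrange/pereason_go",
        "habitat_baselines.evaluate=True",
        "habitat_baselines.eval.should_load_ckpt=True",
        f"habitat_baselines.eval_ckpt_path_dir={ckpt_path}",
        f"habitat_baselines.test_episode_count={episodes}",
        "habitat_baselines.num_environments=1",
        "habitat.dataset.data_path=data/datasets/hab3_episodes/val/social_rearrange.json.gz",
        "habitat.dataset.scenes_dir=data/scene_datasets",
        'habitat_baselines.eval.video_option=["disk"]',
    ]
    return ["python", "-u", "-m", "habitat_baselines.run", *overrides]


cmd = build_eval_command()
# Print a shell-safe version for inspection; to execute it, pass the list to
# subprocess.run(cmd, check=True) from a directory where the data/ paths resolve.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the `["disk"]` override.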

Dependencies

  • SmolVLM2-500M-Video-Instruct (downloaded automatically)
  • Depth-Anything-V2-Small-hf (downloaded automatically)
  • Habitat 3.0 with HSSD scenes and hab3_episodes dataset

Citation

@mastersthesis{kubwimana2026scalable,
  title  = {Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics},
  author = {Kubwimana, Benjamin},
  year   = {2026},
  school = {Georgia Institute of Technology},
  note   = {Model weights: \url{https://huggingface.co/edge-inference/hab3-social-rearrange-go}}
}

Built on the Habitat 3.0 platform:

@inproceedings{puig2023habitat3,
  title     = {Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots},
  author    = {Puig, Xavier and Undersander, Eric and Szot, Andrew and Cote, Mikael Dallaire and Batra, Dhruv and Berges, Vincent-Pierre and others},
  booktitle = {ICLR},
  year      = {2024}
}

License

MIT. The underlying Habitat platform and HSSD scenes have their own licenses; see the Habitat 3.0 repo for details.
