DA3-XVLA v17: task-31 (clean_boxing_gloves), full 50k-iter checkpoint

This is a substantially improved model over the previously-uploaded v4 ckpt-100000 (JackLiu0406/da3-xvla-b1k-task31-skill-ckpt100000). It bakes in five pipeline correctness fixes that the v4 model did not have, plus a 5× faster training pipeline that let us train fresh from X-VLA-Pt in ~14 hours.

⚠️ The image preprocessing has CHANGED from v4. This model expects 224×224 input, NOT 504×504. Updating the eval client is required; see EVAL_CONTRACT.json for the authoritative spec.

Task

Wash the two dusty boxing gloves from the countertop in the utility room
in the washer until they are no longer covered with dust.

Trained on 200 episodes of BEHAVIOR-1K task-index 31 (clean_boxing_gloves).

Architecture (config.json)

action_mode                = "auto"            # AutoActionSpace (NOT ee6d)
real_action_dim            = 23                # R1Pro 23-dim mixed delta/abs
max_action_dim             = 23                # native, no padding
dim_proprio                = 23                # 23-dim extracted state, NOT raw 256-dim
num_skills                 = 34                # Mark's skill_prompt_hub
skill_classifier_weight    = 0.5               # ↑ from v4's 0.1, stronger aux supervision
use_progress_head          = True
progress_head_weight       = 0.1
geometry_conditioning:
  da3_model_name           = "depth-anything/DA3-LARGE"  # ← v4 was DA3-BASE
  da3_input_dim            = 1024              # ← v4 was 768 (BASE)
  da3_process_res          = 224               # ← v4 was 504 (BIG CHANGE)
  num_geometry_tokens      = 240               # ← v4 was 32 (much richer geometry)
  use_posed_da3            = True              # extrinsics + intrinsics consumed
  freeze_da3               = True

What changed vs the v4 ckpt-100000

#  issue in v4 / fix in v17

1  Issue in v4: Image input at 504×504 (DA3 process_res=504). Florence-2 then downsampled internally to 224 (its native resolution); the extra 504 detail was wasted on the VLM and made DA3 5× more expensive.
   Fix in v17: 224×224 throughout. DA3 sees 224 natively (14×14 ViT patches per view instead of 32×32). Florence-2 sees 224 with no internal resize. 5× faster DA3 forward AND less wasted compute.

2  Issue in v4: Action-padding bug. Trajectory[1] = action AT current frame (already commanded), not action[idx+1] (what to do next). Caused an off-by-one in closed-loop control.
   Fix in v17: Padding removed. Trajectory[1] = next action to command.

3  Issue in v4: skill_prompt_hub in the wrong optimizer group. Got 20× the intended LR (2e-4 instead of 1e-5). Likely root cause of v4's weak skill conditioning.
   Fix in v17: Moved to soft_prompts group with lr × learning_coef = 1e-5.

4  Issue in v4: .detach() on vlm_pooled blocked VLM gradient from the skill classifier. VLM never learned skill-discriminative features.
   Fix in v17: .detach() removed. VLM specializes for both action AND skill prediction.

5  Issue in v4: progress_head was Sigmoid + MSE, a known gradient-vanishing combo. Saturated near constant 0.5 throughout training.
   Fix in v17: Replaced with bare Linear + BCEWithLogitsLoss. Progress prediction now actually trains.

6  Issue in v4: Extrinsics view misalignment. Parquet block order is [left_wrist, right_wrist, head]; the previous _poses_to_extrinsics mapped block_v → extrinsics[v] with no reorder, so DA3 got image[head] + pose[left_wrist], etc.
   Fix in v17: Explicit PARQUET_TO_VIEW = [1, 2, 0] remap. Image, intrinsic, and extrinsic now all align to [head, left_wrist, right_wrist].

7  Issue in v4: skill_classifier_weight=0.1.
   Fix in v17: Bumped to 0.5, a 5× stronger gradient.
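The view remap in row 6 can be sketched as below. This is a minimal illustration, not the repo's code: it assumes PARQUET_TO_VIEW[b] gives the model view index for parquet block b, which is consistent with the two orderings stated in that row. The helper name reorder_to_view_order and the string placeholders (standing in for per-camera 4×4 poses) are hypothetical.

```python
# Parquet stores per-camera blocks in this order (stated in row 6).
PARQUET_ORDER = ["left_wrist", "right_wrist", "head"]
# Order the model expects for images, intrinsics, and extrinsics.
VIEW_ORDER = ["head", "left_wrist", "right_wrist"]
# Assumed semantics: PARQUET_TO_VIEW[block_index] -> view_index.
PARQUET_TO_VIEW = [1, 2, 0]


def reorder_to_view_order(parquet_blocks):
    """Rearrange parquet-ordered blocks into model view order (hypothetical helper)."""
    out = [None] * len(parquet_blocks)
    for block_idx, view_idx in enumerate(PARQUET_TO_VIEW):
        out[view_idx] = parquet_blocks[block_idx]
    return out


# Placeholder payloads: each string stands in for that camera's 4x4 pose.
poses_in_parquet_order = [f"pose[{name}]" for name in PARQUET_ORDER]
poses_in_view_order = reorder_to_view_order(poses_in_parquet_order)
```

Without the remap, index 0 would pair the head image with the left-wrist pose, which is the misalignment described for v4.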

Critical eval-side changes from v4

⚠️ Check three things when moving a v4 eval client to v17. Items a and b WILL differ; item c is unchanged and is listed so you can confirm it.

a. Image preprocessing: 224, not 504

# v4 client (WRONG for v17):
image_input = build_image_input_504(views)  # [B, 3, 3, 504, 504]

# v17 client (correct):
image_input = build_image_input_224(views)  # [B, 3, 3, 224, 224]
#   - resize to 224×224, BICUBIC
#   - ImageNet-norm in RGB order

b. Intrinsics rescaled to 224

# v4 (504):                       v17 (224):
HEAD_K  = fx=fy=214.2, cx=cy=252  HEAD_K  = fx=fy=95.2,  cx=cy=112
WRIST_K = fx=fy=408.1, cx=cy=252  WRIST_K = fx=fy=181.4, cx=cy=112

Scale factor: 224/504 ≈ 0.444.
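The rescaling above can be reproduced with a short sketch. It assumes a uniform resize of the full 504×504 frame to 224×224 (no crop, no letterbox), so focal lengths and principal point scale by the same factor. The helper names scale_intrinsics and to_matrix are hypothetical; the input values are the v4 numbers listed above.

```python
def scale_intrinsics(fx, fy, cx, cy, src_res=504, dst_res=224):
    """Rescale pinhole intrinsics for a uniform resize from src_res to dst_res."""
    s = dst_res / src_res
    return fx * s, fy * s, cx * s, cy * s


def to_matrix(fx, fy, cx, cy):
    """Build the 3x3 K matrix as nested lists."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


# v4 (504) values from the table above, rescaled to 224.
head_k_224 = scale_intrinsics(214.2, 214.2, 252.0, 252.0)
wrist_k_224 = scale_intrinsics(408.1, 408.1, 252.0, 252.0)
HEAD_K = to_matrix(*head_k_224)
WRIST_K = to_matrix(*wrist_k_224)
```

EVAL_CONTRACT.json remains the authoritative source for the values; this only shows where the 224 numbers come from.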

c. Proprio and action decoding are unchanged from v4

The proprio is still 23-dim, and action decoding still uses the proprio tensor as its reference. idx_for_delta and the decoder convention are identical to v4. Only the image side (items a and b) differs.

R1Pro action layout (23-dim, mixed delta/abs)

Unchanged from v4. From make_bool_mask(-3, 3, -1, 7, -1, 7, -1):

dim     semantic             encoding
0-2     base xyz             absolute
3-5     base xyz delta       delta from chunk's first action
6       base yaw             absolute
7-13    left arm 7 joints    delta from chunk's first action
14      left gripper         absolute, [-1, +1]
15-21   right arm 7 joints   delta from chunk's first action
22      right gripper        absolute, [-1, +1]

Decode:

abs_action = pred.clone()
abs_action[:, :, idx_for_delta] += proprio[:, None, idx_for_delta]
# idx_for_abs dims are already absolute, no change
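A dependency-free sketch of the same decode, for checking a client against the layout above. It assumes that a positive argument to make_bool_mask marks that many delta dims and a negative one marks absolute dims; this is inferred from the layout table, not confirmed from the source code. make_delta_indices and decode_chunk are hypothetical stand-ins that use plain lists instead of tensors.

```python
def make_delta_indices(*segments):
    """Assumed convention: +n marks n delta dims, -n marks n absolute dims."""
    idx_for_delta, idx_for_abs, pos = [], [], 0
    for seg in segments:
        target = idx_for_delta if seg > 0 else idx_for_abs
        target.extend(range(pos, pos + abs(seg)))
        pos += abs(seg)
    return idx_for_delta, idx_for_abs


idx_for_delta, idx_for_abs = make_delta_indices(-3, 3, -1, 7, -1, 7, -1)


def decode_chunk(pred, proprio):
    """pred: list of T steps, each 23 floats; proprio: 23 floats. Returns absolute actions."""
    out = [list(step) for step in pred]  # copy so the prediction is not mutated
    for step in out:
        for i in idx_for_delta:
            step[i] += proprio[i]  # delta dims are offset by the current state
    return out


# One-step example: every predicted value is 0.5 and every proprio value is 1.0.
decoded = decode_chunk([[0.5] * 23], [1.0] * 23)
```

In this example the delta dims (3-5, 7-13, 15-21) come out as 1.5, while the absolute dims pass through as 0.5.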

Required eval-batch dict (v17)

batch = {
    "input_ids":   LongTensor [B, L],              # prompt → tokenizer
    "image_input": FloatTensor [B, 3, 3, 224, 224],# head, left, right (RGB, ImageNet-norm)
    "image_mask":  BoolTensor  [B, 3],
    "proprio":     FloatTensor [B, 23],            # 23-dim extracted state, NOT 256
    "domain_id":   LongTensor [B],                 # 19 (R1Pro); auto-overridden by skill_id internally
    "extrinsics":  FloatTensor [B, 3, 4, 4],       # SE(3), camera-to-base, ORDER [head, left, right]
    "intrinsics":  FloatTensor [B, 3, 3, 3],       # K at 224 resolution
    # Optional:
    # "skill_id":       LongTensor [B] | None  → auto-predicts via skill_classifier
    # "skill_progress": FloatTensor [B] | None
}
pred_actions = model.generate_actions(**batch)   # [B, 30, 23]
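A shape check run before generate_actions can catch a stale v4 client early. The sketch below is a hypothetical helper, not part of da3_xvla: it only reads a .shape attribute, so it works on torch tensors as well as the SimpleNamespace stand-ins used here to keep the example dependency-free. The expected shapes are copied from the batch spec above.

```python
from types import SimpleNamespace

# Expected shapes from the v17 batch spec; "B" is taken from input_ids.
EXPECTED_SHAPES = {
    "image_input": ("B", 3, 3, 224, 224),
    "image_mask": ("B", 3),
    "proprio": ("B", 23),
    "domain_id": ("B",),
    "extrinsics": ("B", 3, 4, 4),
    "intrinsics": ("B", 3, 3, 3),
}


def check_batch_shapes(batch):
    """Return a list of problems; an empty list means the shapes match the spec."""
    if "input_ids" not in batch:
        return ["missing key: input_ids"]
    batch_size = tuple(batch["input_ids"].shape)[0]
    problems = []
    for key, spec in EXPECTED_SHAPES.items():
        if key not in batch:
            problems.append(f"missing key: {key}")
            continue
        want = tuple(batch_size if d == "B" else d for d in spec)
        got = tuple(batch[key].shape)
        if got != want:
            problems.append(f"{key}: expected {want}, got {got}")
    return problems


def fake(*shape):
    """Stand-in for a tensor: only carries a shape."""
    return SimpleNamespace(shape=shape)


v17_batch = {
    "input_ids": fake(2, 16), "image_input": fake(2, 3, 3, 224, 224),
    "image_mask": fake(2, 3), "proprio": fake(2, 23), "domain_id": fake(2),
    "extrinsics": fake(2, 3, 4, 4), "intrinsics": fake(2, 3, 3, 3),
}
v4_batch = dict(v17_batch, image_input=fake(2, 3, 3, 504, 504))
problems_v17 = check_batch_shapes(v17_batch)
problems_v4 = check_batch_shapes(v4_batch)
```

Shapes alone do not verify view order, normalization, or intrinsics values, so this complements EVAL_CONTRACT.json and does not replace it.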

Training context

Hardware:         8× H200, DDP via Accelerate
Effective batch:  48 × 8 = 384
Total iters:      50000
Wall-clock:       ~14 hours
LR schedule:
  learning_rate   = 2e-4   (peak; cosine decay to 0.3 floor)
  learning_coef   = 0.05   → vlm/soft_prompt_hub/skill_prompt_hub LR = 1e-5
  freeze_steps    = 1000   (VLM frozen at lr=0)
  warmup_steps    = 3000   (linear)
  use_cosine_decay= True   (core groups)
  no_cosine_vlm   = True   (VLM holds peak after warmup)
  min_lr_ratio    = 0.3
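One plausible reading of the schedule above, written out as plain functions. Several details are assumptions, not taken from the training code: the core warmup starts at step 0, the cosine runs from the end of warmup to the final iteration, and the VLM group ramps linearly from the end of its freeze to the end of warmup before holding its peak. Function names are hypothetical.

```python
import math

PEAK_LR = 2e-4
LEARNING_COEF = 0.05
FREEZE_STEPS = 1000
WARMUP_STEPS = 3000
TOTAL_ITERS = 50000
MIN_LR_RATIO = 0.3


def core_lr(step):
    """Core groups: linear warmup, then cosine decay to MIN_LR_RATIO * peak."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_ITERS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


def vlm_lr(step):
    """VLM-side groups: frozen at 0, ramped to the scaled peak, then held (no cosine)."""
    peak = PEAK_LR * LEARNING_COEF  # 2e-4 * 0.05 = 1e-5
    if step < FREEZE_STEPS:
        return 0.0
    if step < WARMUP_STEPS:
        # Assumption: the ramp spans the interval between freeze end and warmup end.
        return peak * (step - FREEZE_STEPS) / (WARMUP_STEPS - FREEZE_STEPS)
    return peak
```

Under these assumptions the core groups end at 0.3 × 2e-4 = 6e-5, while the VLM-side groups stay at 1e-5 from step 3000 onward.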

Final losses (steps 49900-49950):
  joints_loss (jl)     = 0.064-0.233 (raw MSE × JOINTS_SCALE=100)
  skill_cls_loss (sl)  = 0.000 (rounds to zero; the classifier fully fits the training distribution)
  progress_loss (pl)   = 0.025-0.068 (BCE, post-fix #5)

Code snapshot

The exact da3_xvla/ package used to train this checkpoint lives in code/da3_xvla/ in this repo. Point eval tooling at it (--code-dir code/da3_xvla) to guarantee parity:

  • datasets/domain_handler/b1k.py: port of Mark's reference + posed-DA3 + skill conditioning + chunked decode + 224 intrinsics
  • datasets/utils.py: read_video_chunks chunked-decode helper (100-frame chunks)
  • models/modeling_xvla.py: XVLA model; generate_actions auto-predicts skill_id when None; progress_head is Linear (not Sequential(Linear, Sigmoid))
  • models/transformer.py: skill_prompt_hub selection when skill_id is provided
  • models/da3_inline.py: DA3 inline encoder (process_res=224)
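The progress_head change listed for models/modeling_xvla.py can be motivated with the gradients of the two losses with respect to the logit. The sketch below is a generic illustration in plain Python, not code from the repo, and it does not reproduce the v4 training dynamics: with Sigmoid + MSE the gradient carries an extra sigmoid'(z) factor that shrinks toward zero when the output saturates, while BCE on logits reduces to sigmoid(z) - y.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_sigmoid_mse(z, y):
    """d/dz of (sigmoid(z) - y)^2 = 2 * (p - y) * p * (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


def grad_bce_with_logits(z, y):
    """d/dz of BCE(sigmoid(z), y), which simplifies to sigmoid(z) - y."""
    return sigmoid(z) - y


# A confidently wrong prediction: logit -6 while the target is 1.0.
z, y = -6.0, 1.0
g_mse = grad_sigmoid_mse(z, y)
g_bce = grad_bce_with_logits(z, y)
```

For this input the BCE gradient magnitude is close to 1, while the Sigmoid + MSE gradient is smaller by roughly two orders of magnitude, so the badly wrong prediction is corrected far more slowly.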

What's NOT in this checkpoint

  • No latent segmenter (WIP, disabled).
  • No Grounding-DINO + SAM. Don't pass object_masks.
  • DA3 was frozen the entire run β€” its weights are the original DA3-LARGE.
  • Action heads were initialized from scratch (Xavier, via --scratch_action_expert); no semantic carryover from X-VLA-Pt's EE6D layout.

How to load

from huggingface_hub import snapshot_download
from da3_xvla.models.modeling_xvla import XVLA
from da3_xvla.models.configuration_xvla import XVLAConfig

local = snapshot_download("JackLiu0406/da3-xvla-b1k-task31-v17-ckpt50000")
cfg = XVLAConfig.from_pretrained(local)
# da3_input_dim is already 1024 in config; no eager-build patch needed
model = XVLA.from_pretrained(local, config=cfg)