DA3-XVLA v17 — task-31 (`clean_boxing_gloves`) — full 50k-iter checkpoint

This is a substantially improved model over the previously-uploaded v4 ckpt-100000 (JackLiu0406/da3-xvla-b1k-task31-skill-ckpt100000). It bakes in five pipeline correctness fixes that the v4 model did not have, plus a 5× faster training pipeline that let us train fresh from X-VLA-Pt in ~14 hours.

⚠️ The image preprocessing has CHANGED from v4. This model expects 224×224 input, NOT 504×504. Updating the eval client is required — see EVAL_CONTRACT.json for the authoritative spec.

Task

Wash the two dusty boxing gloves from the countertop in the utility room
in the washer until they are no longer covered with dust.

Trained on 200 episodes of BEHAVIOR-1K task-index 31 (clean_boxing_gloves).

Architecture (config.json)

action_mode                = "auto"            # AutoActionSpace (NOT ee6d)
real_action_dim            = 23                # R1Pro 23-dim mixed delta/abs
max_action_dim             = 23                # native, no padding
dim_proprio                = 23                # 23-dim extracted state, NOT raw 256-dim
num_skills                 = 34                # Mark's skill_prompt_hub
skill_classifier_weight    = 0.5               # ↑ from v4's 0.1 — stronger aux supervision
use_progress_head          = True
progress_head_weight       = 0.1
geometry_conditioning:
  da3_model_name           = "depth-anything/DA3-LARGE"  # ← v4 was DA3-BASE
  da3_input_dim            = 1024              # ← v4 was 768 (BASE)
  da3_process_res          = 224               # ← v4 was 504 (BIG CHANGE)
  num_geometry_tokens      = 240               # ← v4 was 32 (much richer geometry)
  use_posed_da3            = True              # extrinsics + intrinsics consumed
  freeze_da3               = True
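
The cost difference between `da3_process_res = 504` and `224` can be estimated from the ViT token count per view. The sketch below assumes a patch size of 14 pixels (typical of DINOv2-style backbones; this value is an assumption and is not read from the DA3 config), and `patches_per_view` is a hypothetical helper.

```python
PATCH = 14  # ASSUMED ViT patch size; not taken from the DA3 config


def patches_per_view(process_res, patch=PATCH):
    """Number of ViT patch tokens for one square view of side process_res."""
    side = process_res // patch
    return side * side


v4_tokens = patches_per_view(504)   # 36 * 36 grid
v17_tokens = patches_per_view(224)  # 16 * 16 grid
ratio = v4_tokens / v17_tokens      # equals (504 / 224) ** 2

print(v4_tokens, v17_tokens, ratio)
```

Under that assumption the token ratio is about 5, which is in line with the roughly 5× DA3 speedup quoted for v17. Actual wall-clock gains also depend on attention cost and the data pipeline.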

What changed vs the v4 ckpt-100000

#	issue in v4	fix in v17
1	Image input at 504×504 (DA3 process_res=504). Florence-2 then downsampled internally to 224 (its native resolution), so the extra detail was wasted on the VLM while making DA3 about 5× more expensive.	224×224 throughout. DA3 sees 224 natively (with 14-pixel ViT patches, a 16×16 grid per view instead of 36×36). Florence-2 sees 224 with no internal resize. Roughly 5× faster DA3 forward and less wasted compute.
2	Action-padding bug. Trajectory[1] = action AT current frame (already commanded), not action[idx+1] (what to do next). Caused off-by-one in closed-loop control.	Padding removed. Trajectory[1] = next action to command.
3	`skill_prompt_hub` in the wrong optimizer group. Got 20× the intended LR (2e-4 instead of 1e-5). Likely root cause of v4's weak skill conditioning.	Moved to `soft_prompts` group with `lr × learning_coef` = 1e-5.
4	`.detach()` on vlm_pooled blocked VLM gradient from the skill classifier. VLM never learned skill-discriminative features.	`.detach()` removed. VLM specializes for both action AND skill prediction.
5	`progress_head` was Sigmoid + MSE — known gradient-vanishing combo. Saturated near constant 0.5 throughout training.	Replaced with bare `Linear` + `BCEWithLogitsLoss`. Progress prediction now actually trains.
6	Extrinsics view misalignment. Parquet block order is `[left_wrist, right_wrist, head]`; the previous `_poses_to_extrinsics` mapped `block_v → extrinsics[v]` with no reorder, so DA3 got `image[head] + pose[left_wrist]`, etc.	Explicit `PARQUET_TO_VIEW = [1, 2, 0]` remap. Image, intrinsic, and extrinsic now all align to `[head, left_wrist, right_wrist]`.
7	`skill_classifier_weight=0.1`	Bumped to 0.5 — 5× stronger gradient.
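
The view remap in row 6 can be sketched as follows. `PARQUET_TO_VIEW = [1, 2, 0]` is read here as a scatter map (parquet block index to model view index), which is the only reading that turns `[left_wrist, right_wrist, head]` into `[head, left_wrist, right_wrist]`. `reorder_blocks` is a hypothetical helper, not the function in `b1k.py`.

```python
PARQUET_TO_VIEW = [1, 2, 0]  # parquet block index -> model view index
PARQUET_ORDER = ["left_wrist", "right_wrist", "head"]


def reorder_blocks(blocks):
    """Scatter per-camera blocks from parquet order into model view order."""
    out = [None] * len(blocks)
    for block_idx, view_idx in enumerate(PARQUET_TO_VIEW):
        out[view_idx] = blocks[block_idx]
    return out


# The same remap must be applied to images, intrinsics, and extrinsics.
print(reorder_blocks(PARQUET_ORDER))  # ['head', 'left_wrist', 'right_wrist']
```

Applying the list as a gather (`blocks[i] for i in PARQUET_TO_VIEW`) would produce a different, wrong order, so the direction of the map matters.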

Critical eval-side changes from v4

⚠️ Two things WILL differ between your v4 eval client and a correct v17 eval client (a and b). Item c is unchanged and is listed so you can confirm it:

a. Image preprocessing — 224, not 504

# v4 client (WRONG for v17):
image_input = build_image_input_504(views)  # [B, 3, 3, 504, 504]

# v17 client (correct):
image_input = build_image_input_224(views)  # [B, 3, 3, 224, 224]
#   - resize to 224×224, BICUBIC
#   - ImageNet-norm in RGB order
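
A minimal sketch of what a `build_image_input_224`-style function might do, using NumPy and Pillow. The function names here are hypothetical. Scaling pixels to [0, 1] before normalization and the standard ImageNet mean/std values are assumptions; EVAL_CONTRACT.json remains the authoritative spec.

```python
import numpy as np
from PIL import Image

# Standard ImageNet statistics, RGB order (ASSUMED to be the ones used).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_view(rgb_uint8):
    """HxWx3 uint8 RGB array -> [3, 224, 224] float32, ImageNet-normalized."""
    img = Image.fromarray(rgb_uint8).resize((224, 224), resample=Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 255.0  # ASSUMED [0, 1] scaling
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    return x.transpose(2, 0, 1)  # HWC -> CHW


def build_image_input(views):
    """views: [head, left, right] uint8 arrays -> [1, 3, 3, 224, 224]."""
    stacked = np.stack([preprocess_view(v) for v in views], axis=0)
    return stacked[None]  # add batch dimension


frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
image_input = build_image_input(frames)
print(image_input.shape, image_input.dtype)
```

Convert the result to a torch tensor before passing it to the model.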

b. Intrinsics rescaled to 224

# v4 (504):                       v17 (224):
HEAD_K  = fx=fy=214.2, cx=cy=252  HEAD_K  = fx=fy=95.2,  cx=cy=112
WRIST_K = fx=fy=408.1, cx=cy=252  WRIST_K = fx=fy=181.4, cx=cy=112

Scale factor: 224/504 ≈ 0.444.
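
The rescaling above can be reproduced with a few lines of plain Python. This sketch scales focal lengths and principal point by the same factor, which matches the numbers listed; `scale_intrinsics` is a hypothetical helper and ignores half-pixel center conventions.

```python
def scale_intrinsics(fx, fy, cx, cy, src_res=504, dst_res=224):
    """Return a 3x3 K matrix (nested lists) rescaled from src_res to dst_res."""
    s = dst_res / src_res
    return [
        [fx * s, 0.0, cx * s],
        [0.0, fy * s, cy * s],
        [0.0, 0.0, 1.0],
    ]


HEAD_K_224 = scale_intrinsics(214.2, 214.2, 252.0, 252.0)
WRIST_K_224 = scale_intrinsics(408.1, 408.1, 252.0, 252.0)

print(round(HEAD_K_224[0][0], 1), round(WRIST_K_224[0][0], 1), HEAD_K_224[0][2])
```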

c. Proprio and action decode are unchanged

The proprio is still 23-dim (same as v4), and the action-decode reference is still the proprio tensor. idx_for_delta and the decoder convention are identical to v4; only the image side (a and b) differs.

R1Pro action layout (23-dim, mixed delta/abs)

Unchanged from v4. From make_bool_mask(-3, 3, -1, 7, -1, 7, -1):

dim	semantic	encoding
0-2	base xyz	absolute
3-5	base xyz delta	delta from chunk's first action
6	base yaw	absolute
7-13	left arm 7 joints	delta from chunk's first action
14	left gripper	absolute, [-1, +1]
15-21	right arm 7 joints	delta from chunk's first action
22	right gripper	absolute, [-1, +1]

Decode:

abs_action = pred.clone()
abs_action[:, :, idx_for_delta] += proprio[:, None, idx_for_delta]
# idx_for_abs dims are already absolute, no change
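
For reference, here is a self-contained NumPy version of the decode. `make_bool_mask` below is a hypothetical re-implementation inferred from the layout table (positive `n` marks `n` delta dims, negative `n` marks `|n|` absolute dims); the real helper in the code snapshot may differ.

```python
import numpy as np


def make_bool_mask(*spans):
    """ASSUMED semantics: +n -> n delta dims (True), -n -> n absolute dims (False)."""
    mask = []
    for n in spans:
        mask.extend([n > 0] * abs(n))
    return np.array(mask, dtype=bool)


DELTA_MASK = make_bool_mask(-3, 3, -1, 7, -1, 7, -1)
idx_for_delta = np.flatnonzero(DELTA_MASK)   # dims 3-5, 7-13, 15-21
idx_for_abs = np.flatnonzero(~DELTA_MASK)    # dims 0-2, 6, 14, 22


def decode_actions(pred, proprio):
    """pred: [B, T, 23], proprio: [B, 23] -> absolute actions [B, T, 23]."""
    abs_action = pred.copy()
    # Delta dims are offsets from the current proprio; broadcast over T.
    abs_action[:, :, idx_for_delta] += proprio[:, None, idx_for_delta]
    return abs_action


pred = np.zeros((2, 30, 23))
proprio = np.tile(np.arange(23.0), (2, 1))
decoded = decode_actions(pred, proprio)
print(idx_for_abs.tolist(), decoded.shape)
```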

Required eval-batch dict (v17)

batch = {
    "input_ids":   LongTensor [B, L],              # prompt → tokenizer
    "image_input": FloatTensor [B, 3, 3, 224, 224],# head, left, right (RGB, ImageNet-norm)
    "image_mask":  BoolTensor  [B, 3],
    "proprio":     FloatTensor [B, 23],            # 23-dim extracted state, NOT 256
    "domain_id":   LongTensor [B],                 # 19 (R1Pro); auto-overridden by skill_id internally
    "extrinsics":  FloatTensor [B, 3, 4, 4],       # SE(3), camera-to-base, ORDER [head, left, right]
    "intrinsics":  FloatTensor [B, 3, 3, 3],       # K at 224 resolution
    # Optional:
    # "skill_id":       LongTensor [B] | None  → auto-predicts via skill_classifier
    # "skill_progress": FloatTensor [B] | None
}
pred_actions = model.generate_actions(**batch)   # [B, 30, 23]
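
A pre-flight shape check can catch a stale v4 client before the model is called. The sketch below is a hypothetical helper, not part of `da3_xvla`. It only inspects `.shape`, so it works with torch tensors or NumPy arrays; the example uses a tiny stand-in object to stay dependency-free.

```python
from types import SimpleNamespace

# Shapes from the batch spec above; "B" and "L" are bound on first use.
EXPECTED_SHAPES = {
    "input_ids": ("B", "L"),
    "image_input": ("B", 3, 3, 224, 224),
    "image_mask": ("B", 3),
    "proprio": ("B", 23),
    "domain_id": ("B",),
    "extrinsics": ("B", 3, 4, 4),
    "intrinsics": ("B", 3, 3, 3),
}


def check_batch(batch):
    """Return a list of shape problems (empty list if the batch looks right)."""
    problems, symbols = [], {}
    for key, spec in EXPECTED_SHAPES.items():
        if key not in batch:
            problems.append(f"missing key: {key}")
            continue
        shape = tuple(batch[key].shape)
        if len(shape) != len(spec):
            problems.append(f"{key}: rank {len(shape)}, expected {len(spec)}")
            continue
        for got, want in zip(shape, spec):
            if isinstance(want, str):
                want = symbols.setdefault(want, got)
            if got != want:
                problems.append(f"{key}: shape {shape} does not match {spec}")
                break
    return problems


def fake(*shape):
    """Stand-in for a tensor: anything with a .shape attribute works."""
    return SimpleNamespace(shape=shape)


good = {
    "input_ids": fake(2, 16), "image_input": fake(2, 3, 3, 224, 224),
    "image_mask": fake(2, 3), "proprio": fake(2, 23), "domain_id": fake(2),
    "extrinsics": fake(2, 3, 4, 4), "intrinsics": fake(2, 3, 3, 3),
}
v4_style = dict(good, image_input=fake(2, 3, 3, 504, 504))
print(check_batch(good))
print(check_batch(v4_style))
```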

Training context

Hardware:         8× H200, DDP via Accelerate
Effective batch:  48 × 8 = 384
Total iters:      50000
Wall-clock:       ~14 hours
LR schedule:
  learning_rate   = 2e-4   (peak; cosine decay to a floor of 0.3 × peak, per min_lr_ratio)
  learning_coef   = 0.05   → vlm/soft_prompt_hub/skill_prompt_hub LR = 1e-5
  freeze_steps    = 1000   (VLM frozen at lr=0)
  warmup_steps    = 3000   (linear)
  use_cosine_decay= True   (core groups)
  no_cosine_vlm   = True   (VLM holds peak after warmup)
  min_lr_ratio    = 0.3

Final losses (steps 49900-49950):
  joints_loss (jl)     = 0.064-0.233 (raw MSE × JOINTS_SCALE=100)
  skill_cls_loss (sl)  = 0.000 (rounds to zero; the classifier fully fits the training distribution)
  progress_loss (pl)   = 0.025-0.068 (BCE, post-fix #5)
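
The LR schedule listed above can be sketched in plain Python. How the freeze and warmup phases interact is not spelled out here, so this sketch assumes the core groups warm up from step 0 and the VLM warmup starts once the freeze ends; both functions are hypothetical and not the training code.

```python
import math

PEAK_LR = 2e-4
LEARNING_COEF = 0.05
FREEZE_STEPS = 1000
WARMUP_STEPS = 3000
TOTAL_STEPS = 50000
MIN_LR_RATIO = 0.3


def core_lr(step):
    """Core groups: linear warmup, then cosine decay to MIN_LR_RATIO * peak."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    t = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


def vlm_lr(step):
    """VLM group: lr=0 while frozen, linear warmup, then hold peak (no cosine)."""
    peak = PEAK_LR * LEARNING_COEF  # 1e-5
    if step < FREEZE_STEPS:
        return 0.0
    # ASSUMPTION: warmup is counted from the end of the freeze window.
    return peak * min(1.0, (step - FREEZE_STEPS) / WARMUP_STEPS)


for s in (0, 500, 3000, 25000, 50000):
    print(s, core_lr(s), vlm_lr(s))
```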

Code snapshot

The exact da3_xvla/ package used to train this checkpoint lives in code/da3_xvla/ in this repo. Point eval tooling at it (--code-dir code/da3_xvla) to guarantee parity:

datasets/domain_handler/b1k.py — port of Mark's reference + posed-DA3 + skill conditioning + chunked decode + 224 intrinsics
datasets/utils.py — read_video_chunks chunked-decode helper (100-frame chunks)
models/modeling_xvla.py — XVLA model; generate_actions auto-predicts skill_id when None; progress_head is Linear (not Sequential(Linear, Sigmoid))
models/transformer.py — skill_prompt_hub selection when skill_id is provided
models/da3_inline.py — DA3 inline encoder (process_res=224)
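
Why the `progress_head` change from Sigmoid + MSE to a bare Linear with `BCEWithLogitsLoss` matters can be seen from the gradient with respect to the logit. The sketch below uses the closed-form derivatives in plain Python. It illustrates the general property and is not a diagnosis of the v4 run.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_sigmoid_mse(z, y):
    """d/dz of (sigmoid(z) - y)**2; carries an extra p * (1 - p) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


def grad_bce_with_logits(z, y):
    """d/dz of binary cross-entropy on logit z; simplifies to sigmoid(z) - y."""
    return sigmoid(z) - y


# Target y = 1 with increasingly wrong logits: the MSE gradient shrinks
# toward zero while the BCE gradient approaches -1.
for z in (0.0, -4.0, -8.0):
    print(z, grad_sigmoid_mse(z, 1.0), grad_bce_with_logits(z, 1.0))
```

The `p * (1 - p)` factor is what shrinks the Sigmoid + MSE gradient when the logit is far from zero; the BCE-with-logits gradient has no such factor.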

What's NOT in this checkpoint

No latent segmenter (WIP, disabled).
No Grounding-DINO + SAM. Don't pass object_masks.
DA3 was frozen the entire run — its weights are the original DA3-LARGE.
Action heads were initialized from scratch (Xavier, via --scratch_action_expert); no semantic carryover from X-VLA-Pt's EE6D layout.

How to load

from huggingface_hub import snapshot_download
from da3_xvla.models.modeling_xvla import XVLA
from da3_xvla.models.configuration_xvla import XVLAConfig

local = snapshot_download("JackLiu0406/da3-xvla-b1k-task31-v17-ckpt50000")
cfg = XVLAConfig.from_pretrained(local)
# da3_input_dim is already 1024 in config; no eager-build patch needed
model = XVLA.from_pretrained(local, config=cfg)
