DA3-XVLA v17: task-31 (clean_boxing_gloves), full 50k-iter checkpoint
This is a substantially improved model over the previously uploaded v4
ckpt-100000 (JackLiu0406/da3-xvla-b1k-task31-skill-ckpt100000). It includes
five pipeline correctness fixes that the v4 model lacked, plus a 5× faster
training pipeline that let us train fresh from X-VLA-Pt in ~14 hours.
⚠️ The image preprocessing has CHANGED from v4. This model expects 224×224 input, NOT 504×504. Updating the eval client is required; see
EVAL_CONTRACT.json for the authoritative spec.
Task
Wash the two dusty boxing gloves from the countertop in the utility room
in the washer until they are no longer covered with dust.
Trained on 200 episodes of BEHAVIOR-1K task-index 31 (clean_boxing_gloves).
Architecture (config.json)
action_mode = "auto" # AutoActionSpace (NOT ee6d)
real_action_dim = 23 # R1Pro 23-dim mixed delta/abs
max_action_dim = 23 # native, no padding
dim_proprio = 23 # 23-dim extracted state, NOT raw 256-dim
num_skills = 34 # Mark's skill_prompt_hub
skill_classifier_weight = 0.5 # ↑ from v4's 0.1: stronger aux supervision
use_progress_head = True
progress_head_weight = 0.1
geometry_conditioning:
  da3_model_name = "depth-anything/DA3-LARGE" # ← v4 was DA3-BASE
  da3_input_dim = 1024 # ← v4 was 768 (BASE)
  da3_process_res = 224 # ← v4 was 504 (BIG CHANGE)
  num_geometry_tokens = 240 # ← v4 was 32 (much richer geometry)
  use_posed_da3 = True # extrinsics + intrinsics consumed
  freeze_da3 = True
What changed vs the v4 ckpt-100000
| # | issue in v4 | fix in v17 |
|---|---|---|
| 1 | Image input at 504×504 → DA3 process_res=504. Florence-2 then downsampled internally to 224 (its native resolution); the extra 504 detail was wasted on the VLM and made DA3 5× more expensive. | 224×224 throughout. DA3 sees 224 natively (14×14 ViT patches per view instead of 32×32). Florence-2 sees 224 with no internal resize. 5× faster DA3 forward AND less wasted compute. |
| 2 | Action-padding bug. Trajectory[1] = action AT current frame (already commanded), not action[idx+1] (what to do next). Caused off-by-one in closed-loop control. | Padding removed. Trajectory[1] = next action to command. |
| 3 | skill_prompt_hub in the wrong optimizer group. Got 20× the intended LR (2e-4 instead of 1e-5). Likely root cause of v4's weak skill conditioning. | Moved to soft_prompts group with lr × learning_coef = 1e-5. |
| 4 | .detach() on vlm_pooled blocked VLM gradient from the skill classifier. VLM never learned skill-discriminative features. | .detach() removed. VLM specializes for both action AND skill prediction. |
| 5 | progress_head was Sigmoid + MSE, a known gradient-vanishing combo. Saturated near constant 0.5 throughout training. | Replaced with bare Linear + BCEWithLogitsLoss. Progress prediction now actually trains. |
| 6 | Extrinsics view misalignment. Parquet block order is [left_wrist, right_wrist, head]; the previous _poses_to_extrinsics mapped block_v → extrinsics[v] with no reorder, so DA3 got image[head] + pose[left_wrist], etc. | Explicit PARQUET_TO_VIEW = [1, 2, 0] remap. Image, intrinsic, and extrinsic now all align to [head, left_wrist, right_wrist]. |
| 7 | skill_classifier_weight=0.1 | Bumped to 0.5, giving a 5× stronger gradient. |
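The view remap in fix 6 is easy to apply backwards, so here is a minimal sketch of the index convention. It assumes PARQUET_TO_VIEW[b] is the model-side view slot that parquet block b lands in, which is consistent with the orders stated in the table. The helper name reorder_parquet_blocks is hypothetical and is not part of da3_xvla.

```python
# Parquet block order, per the table above: [left_wrist, right_wrist, head].
# Model view order: [head, left_wrist, right_wrist].
# Assumption: PARQUET_TO_VIEW[b] is the view slot that parquet block b lands in.
PARQUET_TO_VIEW = [1, 2, 0]


def reorder_parquet_blocks(blocks):
    """Return per-view items given per-parquet-block items (hypothetical helper)."""
    if len(blocks) != len(PARQUET_TO_VIEW):
        raise ValueError("expected one item per camera block")
    views = [None] * len(blocks)
    for block_idx, view_idx in enumerate(PARQUET_TO_VIEW):
        views[view_idx] = blocks[block_idx]
    return views


# Dummy pose labels stand in for the 4x4 extrinsic matrices.
parquet_poses = ["pose_left_wrist", "pose_right_wrist", "pose_head"]
view_poses = reorder_parquet_blocks(parquet_poses)
print(view_poses)
```

The same reorder would apply to anything else stored per parquet block, so images, intrinsics, and extrinsics end up indexed by the same view.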
Critical eval-side changes from v4
⚠️ Check three things when porting a v4 eval client to v17. Two of them change (a, b); the third (c) stays the same:
a. Image preprocessing: 224, not 504
# v4 client (WRONG for v17):
image_input = build_image_input_504(views) # [B, 3, 3, 504, 504]
# v17 client (correct):
image_input = build_image_input_224(views) # [B, 3, 3, 224, 224]
# - resize to 224×224, BICUBIC
# - ImageNet-norm in RGB order
b. Intrinsics rescaled to 224
# v4 (504):                        v17 (224):
HEAD_K  = fx=fy=214.2, cx=cy=252   HEAD_K  = fx=fy=95.2,  cx=cy=112
WRIST_K = fx=fy=408.1, cx=cy=252   WRIST_K = fx=fy=181.4, cx=cy=112
Scale factor: 224/504 ≈ 0.444.
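The rescaling can be reproduced with a few lines of arithmetic. This is a sketch, assuming a square resize in which focal lengths and principal point scale by the same factor (no half-pixel offset correction); scale_intrinsics is a hypothetical helper, and EVAL_CONTRACT.json remains the authoritative source for the values.

```python
def scale_intrinsics(fx, fy, cx, cy, src_res=504, dst_res=224):
    """Rescale pinhole intrinsics for a square resize from src_res to dst_res."""

    def scaled(value):
        return value * dst_res / src_res

    # Row-major 3x3 camera matrix K.
    return [
        [scaled(fx), 0.0, scaled(cx)],
        [0.0, scaled(fy), scaled(cy)],
        [0.0, 0.0, 1.0],
    ]


# v4 (504-resolution) values from the comparison above.
HEAD_K_224 = scale_intrinsics(214.2, 214.2, 252.0, 252.0)
WRIST_K_224 = scale_intrinsics(408.1, 408.1, 252.0, 252.0)

print(round(HEAD_K_224[0][0], 1), round(HEAD_K_224[0][2], 1))
print(round(WRIST_K_224[0][0], 1), round(WRIST_K_224[0][2], 1))
```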
c. Proprio and action decoding are unchanged
The proprio is still 23-dim (same as v4), and action decoding still uses the proprio tensor as its reference. idx_for_delta and the decoder convention are identical to v4; only the image side differs.
R1Pro action layout (23-dim, mixed delta/abs)
Unchanged from v4. From make_bool_mask(-3, 3, -1, 7, -1, 7, -1):
| dim | semantic | encoding |
|---|---|---|
| 0-2 | base xyz | absolute |
| 3-5 | base xyz delta | delta from chunk's first action |
| 6 | base yaw | absolute |
| 7-13 | left arm 7 joints | delta from chunk's first action |
| 14 | left gripper | absolute, [-1, +1] |
| 15-21 | right arm 7 joints | delta from chunk's first action |
| 22 | right gripper | absolute, [-1, +1] |
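To make the layout concrete, the sketch below expands the run lengths into per-dimension indices. The sign convention (positive run = delta dims, negative run = absolute dims) is inferred from the table, not taken from the da3_xvla source, so the function is deliberately named make_bool_mask_sketch to mark it as a stand-in for the real make_bool_mask.

```python
def make_bool_mask_sketch(*runs):
    """Expand signed run lengths into a per-dimension mask.

    Assumed convention (inferred from the layout table): a positive run
    marks delta-encoded dims (True), a negative run marks absolute dims (False).
    """
    mask = []
    for run in runs:
        if run == 0:
            raise ValueError("run lengths must be non-zero")
        mask.extend([run > 0] * abs(run))
    return mask


mask = make_bool_mask_sketch(-3, 3, -1, 7, -1, 7, -1)
idx_for_delta = [i for i, is_delta in enumerate(mask) if is_delta]
idx_for_abs = [i for i, is_delta in enumerate(mask) if not is_delta]

print(len(mask))
print(idx_for_delta)
print(idx_for_abs)
```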
Decode:
abs_action = pred.clone()
abs_action[:, :, idx_for_delta] += proprio[:, None, idx_for_delta]
# idx_for_abs dims are already absolute, no change
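For readers who want to check the decode without PyTorch, here is a plain-Python sketch of the same rule for a single batch element. It hard-codes the delta dims from the layout table; decode_chunk and the dummy inputs are illustrative and not part of the released code.

```python
# Delta-encoded dims from the layout table: 3-5, 7-13, 15-21.
IDX_FOR_DELTA = list(range(3, 6)) + list(range(7, 14)) + list(range(15, 22))


def decode_chunk(pred, proprio, idx_for_delta=IDX_FOR_DELTA):
    """Turn one predicted chunk [T][23] into absolute actions.

    Mirrors the tensor expression above for one batch element: delta dims
    get the current proprio value added, all other dims pass through.
    Returns a new list; the inputs are not modified.
    """
    delta = set(idx_for_delta)
    return [
        [a + proprio[d] if d in delta else a for d, a in enumerate(step)]
        for step in pred
    ]


proprio = [float(d) for d in range(23)]   # dummy state: proprio[d] == d
pred = [[0.5] * 23 for _ in range(30)]    # dummy chunk of 30 steps
abs_action = decode_chunk(pred, proprio)

print(abs_action[0][3], abs_action[0][22])
```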
Required eval-batch dict (v17)
batch = {
    "input_ids": LongTensor [B, L],                  # prompt → tokenizer
    "image_input": FloatTensor [B, 3, 3, 224, 224],  # head, left, right (RGB, ImageNet-norm)
    "image_mask": BoolTensor [B, 3],
    "proprio": FloatTensor [B, 23],                  # 23-dim extracted state, NOT 256
    "domain_id": LongTensor [B],                     # 19 (R1Pro); auto-overridden by skill_id internally
    "extrinsics": FloatTensor [B, 3, 4, 4],          # SE(3), camera-to-base, ORDER [head, left, right]
    "intrinsics": FloatTensor [B, 3, 3, 3],          # K at 224 resolution
    # Optional:
    # "skill_id": LongTensor [B] | None → auto-predicts via skill_classifier
    # "skill_progress": FloatTensor [B] | None
}
pred_actions = model.generate_actions(**batch) # [B, 30, 23]
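A stale v4 client will most likely fail on shapes first, so a cheap pre-flight check can help. The sketch below is a hypothetical helper (check_batch_shapes is not part of da3_xvla) that compares shape tuples against the contract above. It works on plain tuples, for example tuple(t.shape) for each tensor, and skips input_ids because its length L varies with the prompt.

```python
# Expected shapes from the batch contract above; "B" is the batch dimension.
EXPECTED_SHAPES = {
    "image_input": ("B", 3, 3, 224, 224),
    "image_mask": ("B", 3),
    "proprio": ("B", 23),
    "domain_id": ("B",),
    "extrinsics": ("B", 3, 4, 4),
    "intrinsics": ("B", 3, 3, 3),
}


def check_batch_shapes(shapes):
    """Return a list of problems; an empty list means the shapes match."""
    problems = []
    batch_size = None
    for key, expected in EXPECTED_SHAPES.items():
        if key not in shapes:
            problems.append(f"missing key: {key}")
            continue
        actual = tuple(shapes[key])
        if len(actual) != len(expected):
            problems.append(f"{key}: rank {len(actual)}, expected {len(expected)}")
            continue
        if batch_size is None:
            batch_size = actual[0]
        elif actual[0] != batch_size:
            problems.append(f"{key}: batch dim {actual[0]} != {batch_size}")
        if actual[1:] != expected[1:]:
            problems.append(f"{key}: got {actual}, expected {expected}")
    return problems


v17_shapes = {
    "image_input": (2, 3, 3, 224, 224),
    "image_mask": (2, 3),
    "proprio": (2, 23),
    "domain_id": (2,),
    "extrinsics": (2, 3, 4, 4),
    "intrinsics": (2, 3, 3, 3),
}
v4_shapes = dict(v17_shapes, image_input=(2, 3, 3, 504, 504))

print(check_batch_shapes(v17_shapes))
print(check_batch_shapes(v4_shapes))
```

Shape checks do not catch the intrinsics scale or the view order, so they complement the contract file and do not replace it.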
Training context
Hardware: 8× H200, DDP via Accelerate
Effective batch: 48 × 8 = 384
Total iters: 50000
Wall-clock: ~14 hours
LR schedule:
learning_rate = 2e-4 (peak; cosine decay to a floor of 0.3 × peak)
learning_coef = 0.05 → vlm/soft_prompt_hub/skill_prompt_hub LR = 2e-4 × 0.05 = 1e-5
freeze_steps = 1000 (VLM frozen at lr=0)
warmup_steps = 3000 (linear)
use_cosine_decay= True (core groups)
no_cosine_vlm = True (VLM holds peak after warmup)
min_lr_ratio = 0.3
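The schedule can be sketched as two small functions, one for the core groups and one for the VLM-side groups. Several details are assumptions, since the listing above does not pin them down: that the core warmup starts at step 0, that the cosine horizon is the full 50000 iterations, and that the VLM warmup ramp begins when the freeze ends. The function names are illustrative only.

```python
import math

PEAK_LR = 2e-4
LEARNING_COEF = 0.05
FREEZE_STEPS = 1000
WARMUP_STEPS = 3000
TOTAL_STEPS = 50000
MIN_LR_RATIO = 0.3


def core_lr(step):
    """Core groups: linear warmup, then cosine decay to MIN_LR_RATIO * peak."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


def vlm_lr(step):
    """VLM-side groups: lr=0 while frozen, linear warmup, then held at peak."""
    peak = PEAK_LR * LEARNING_COEF
    if step < FREEZE_STEPS:
        return 0.0
    # Assumption: the warmup ramp starts when the freeze ends.
    return peak * min(1.0, (step - FREEZE_STEPS) / WARMUP_STEPS)


for step in (0, 500, 3000, 25000, 50000):
    print(step, core_lr(step), vlm_lr(step))
```

Under these assumptions the core groups end at 0.3 × 2e-4 = 6e-5, while the VLM-side groups stay at 1e-5 once warmed up.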
Final losses (steps 49900-49950):
joints_loss (jl) = 0.064-0.233 (raw MSE × JOINTS_SCALE=100)
skill_cls_loss (sl) = 0.000 (rounds to zero on the training distribution)
progress_loss (pl) = 0.025-0.068 (BCE, post-fix #5)
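Why the switch in fix #5 matters can be seen from the gradients with respect to the logit z. With Sigmoid + MSE the gradient is 2(σ(z) − y)·σ(z)(1 − σ(z)), which shrinks whenever the sigmoid saturates; with BCE on logits it is simply σ(z) − y. The sketch below evaluates both by hand for one illustrative point; it does not reproduce the training code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_sigmoid_mse(z, y):
    """d/dz of (sigmoid(z) - y)**2: carries an extra sigmoid'(z) factor."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


def grad_bce_with_logits(z, y):
    """d/dz of binary cross-entropy on the logit z: sigmoid(z) - y."""
    return sigmoid(z) - y


z, y = -6.0, 1.0  # a confidently wrong logit for a target of 1
g_mse = grad_sigmoid_mse(z, y)
g_bce = grad_bce_with_logits(z, y)

print(g_mse, g_bce)
```

At this point the BCE gradient is close to −1 while the Sigmoid + MSE gradient is more than two orders of magnitude smaller, which is the vanishing-gradient behaviour the fix avoids.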
Code snapshot
The exact da3_xvla/ package used to train this checkpoint lives in
code/da3_xvla/ in this repo. Point eval tooling at it
(--code-dir code/da3_xvla) to guarantee parity:
- datasets/domain_handler/b1k.py: port of Mark's reference + posed-DA3 + skill conditioning + chunked decode + 224 intrinsics
- datasets/utils.py: read_video_chunks chunked-decode helper (100-frame chunks)
- models/modeling_xvla.py: XVLA model; generate_actions auto-predicts skill_id when None; progress_head is Linear (not Sequential(Linear, Sigmoid))
- models/transformer.py: skill_prompt_hub selection when skill_id is provided
- models/da3_inline.py: DA3 inline encoder (process_res=224)
What's NOT in this checkpoint
- No latent segmenter (WIP, disabled).
- No Grounding-DINO + SAM. Don't pass object_masks.
- DA3 was frozen the entire run; its weights are the original DA3-LARGE.
- Action heads were initialized from scratch (Xavier, via
--scratch_action_expert); no semantic carryover from X-VLA-Pt's EE6D layout.
How to load
from huggingface_hub import snapshot_download
from da3_xvla.models.modeling_xvla import XVLA
from da3_xvla.models.configuration_xvla import XVLAConfig
local = snapshot_download("JackLiu0406/da3-xvla-b1k-task31-v17-ckpt50000")
cfg = XVLAConfig.from_pretrained(local)
# da3_input_dim is already 1024 in config; no eager-build patch needed
model = XVLA.from_pretrained(local, config=cfg)