Qwen3VL-PI_v3-Bridge-RT-1

A Vision-Language-Action (VLA) model from the StarVLA project, combining a Qwen3-VL-4B-Instruct backbone with a layer-wise cross-attention flow-matching action head (QwenPI_v3). The model is co-trained on the Bridge V2 and RT-1 / Fractal slices of the Open X-Embodiment (OXE) collection, and is evaluated on the SimplerEnv WidowX benchmark.

QwenPI_v3 is StarVLA's open-weight realisation of the π₀.₅ recipe:

  1. Layer-wise cross-DiT flow-matching action head: every VLM layer's hidden state participates in cross-attention with the action DiT, rather than the action head consuming only the last-layer feature.
  2. Compressed Action DiT: per-layer LayerNorm + Linear projectors compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent, shrinking the action-head footprint by ~6× while preserving the layer-wise interaction.
  3. Discretised-state language injection: proprioceptive state is quantised into 256 bins and appended to the instruction as plain tokens ([STATE] <bins> [ACTION]), so the VLM can attend to robot state with no additional encoder. (A sketch of (2) and (3) follows this list.)
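
For concreteness, here is a minimal sketch of ingredients (2) and (3), assuming PyTorch and illustrative module/shape/function names rather than the actual StarVLA implementation:

# Illustrative sketch only, not the StarVLA code.
import numpy as np
import torch
import torch.nn as nn

class LayerwiseProjectors(nn.Module):
    """(2) One LayerNorm + Linear per VLM layer: 2560-d hidden states -> 1024-d DiT latent."""

    def __init__(self, num_layers=36, vlm_dim=2560, dit_dim=1024):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(vlm_dim), nn.Linear(vlm_dim, dit_dim))
             for _ in range(num_layers)]
        )

    def forward(self, hidden_states):
        # hidden_states: list of per-layer [B, T, 2560] tensors from the VLM
        return [blk(h) for blk, h in zip(self.blocks, hidden_states)]

def inject_state(instruction, state, state_min, state_max, num_bins=256):
    """(3) Quantise the proprioceptive state into 256 bins and splice it into the prompt."""
    norm = (np.asarray(state) - state_min) / (state_max - state_min + 1e-8)
    bins = np.clip((norm * num_bins).astype(int), 0, num_bins - 1)
    return f"{instruction} [STATE] {' '.join(map(str, bins))} [ACTION]"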

Model Summary

| | |
|---|---|
| Architecture | QwenPI_v3 (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| VLM backbone | Qwen3-VL-4B-Instruct |
| Action head | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| Action chunk | 16 steps |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| Total parameters | ≈ 5.07 B |
| License | MIT |
| Codebase | starVLA/starVLA |
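
For reference, a minimal sketch of what a 4-step flow-matching sampler looks like under one common Euler-integration convention (plain PyTorch, illustrative names, not the StarVLA API), consistent with the 16-step, 7-d action chunk above:

import torch

@torch.no_grad()
def sample_action_chunk(velocity_fn, chunk_len=16, action_dim=7, num_steps=4):
    # velocity_fn(x, t) stands in for the layer-wise cross-DiT head and is assumed
    # to return the predicted velocity for the noisy chunk x at time t.
    x = torch.randn(1, chunk_len, action_dim)    # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_fn(x, t)           # Euler step along the learned flow
    return x                                     # (1, 16, 7) normalised action chunk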

Parameter breakdown

| Module | Parameters | Share |
|---|---|---|
| qwen_vl_interface (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| action_model (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| project_layers (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| Total | 5,071,087,137 | 100 % |

Training Data

Co-training mixture bridge_rt_1 (1 : 1 sampling):

| Dataset | Embodiment | Source |
|---|---|---|
| bridge_orig_1.0.0_lerobot | WidowX | IPEC-COMMUNITY/bridge_orig_lerobot |
| fractal20220817_data_0.1.0_lerobot (RT-1) | Google Robot | IPEC-COMMUNITY/fractal20220817_data_lerobot |
  • Action representation: delta end-effector (7-d, gripper included)
  • Image observation: single primary RGB view, resized to 224 × 224
  • Per-dataset normalisation statistics are stored in dataset_statistics.json.
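
Such per-dataset statistics are typically used to map actions into a fixed range for training and to undo the mapping at inference. A minimal sketch, assuming a hypothetical JSON layout and quantile keys (the actual schema of dataset_statistics.json may differ):

import json
import numpy as np

def load_action_stats(path, dataset_name):
    # Hypothetical layout: {"<dataset>": {"action": {"q01": [...], "q99": [...]}}}
    with open(path) as f:
        stats = json.load(f)[dataset_name]["action"]
    return np.array(stats["q01"]), np.array(stats["q99"])

def normalize_action(a, lo, hi):
    return 2.0 * (np.asarray(a) - lo) / (hi - lo + 1e-8) - 1.0   # map to [-1, 1]

def denormalize_action(a, lo, hi):
    return 0.5 * (np.asarray(a) + 1.0) * (hi - lo) + lo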

Training Recipe

| | |
|---|---|
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed-precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min lr 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | β-distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |

The exact training config is preserved in config.yaml / config.full.yaml, and the launch script in run_oxe_train.sh.
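
The split learning rates are the main non-default optimiser setting. A minimal sketch of how the parameter groups could be assembled, assuming the submodule names from the parameter breakdown above (whether the projectors share the action-head LR is an assumption):

import torch

def build_optimizer(model):
    groups = [
        {"params": model.qwen_vl_interface.parameters(), "lr": 1e-5},  # VLM backbone
        {"params": model.action_model.parameters(), "lr": 1e-4},       # flow-matching DiT head
        {"params": model.project_layers.parameters(), "lr": 1e-4},     # per-layer projectors (assumed)
    ]
    return torch.optim.AdamW(groups, betas=(0.9, 0.95), eps=1e-8, weight_decay=1e-8)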


Evaluation β€” SimplerEnv WidowX

Evaluation follows the standard SimplerEnv WidowX protocol on four pick-and-place tasks (24 episodes per task per run). Numbers are success rates (↑).

| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | Average |
|---|---|---|---|---|---|
| 40k | 0.688 | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k | 0.625 | 1.000 | 0.792 | 0.375 | 0.698 |
| 60k | 0.667 | 1.000 | 0.750 | 0.167 | 0.646 |

Best average: 69.8 % at the 50k checkpoint (steps_50000_pytorch_model.pt), which we ship as the recommended checkpoint.

For comparison with other StarVLA frameworks on the same bridge_rt_1 mixture and protocol see the StarVLA Model Zoo.


Repository Layout

.
├── README.md                 # this model card
├── config.yaml               # minimal training config
├── config.full.yaml          # fully resolved training config
├── run_oxe_train.sh          # launch script used for this run
├── dataset_statistics.json   # per-dataset action/state normalisation stats
├── summary.jsonl             # training step summary
├── success_summary/          # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs

How to Use

This checkpoint is consumed directly by the StarVLA training / evaluation stack. Clone StarVLA and load the checkpoint with the framework name QwenPI_v3:

git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.

from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")

policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)
# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
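
A hedged single-call sketch with dummy inputs (the predict_action call follows the comment above; the exact signature and return type in the released API may differ):

import numpy as np

image = np.zeros((224, 224, 3), dtype=np.uint8)   # single third-person RGB view
state = np.zeros(7, dtype=np.float32)             # 7-d proprioceptive state
actions = policy.predict_action([image], "put the carrot on the plate", state)
# Expected: a chunk of 16 delta end-effector actions, i.e. shape (16, 7).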

For end-to-end SimplerEnv evaluation see examples/SimplerEnv.


Intended Use & Limitations

Intended use. Research on vision-language-action models, manipulation policy learning, and as a baseline for π-style flow-matching action heads on top of open-weight VLMs.

Out-of-scope / limitations.

  • Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE action space; generalisation to other embodiments / action spaces is not guaranteed.
  • Single 224 × 224 third-person view; no wrist camera, no depth.
  • Evaluated only in SimplerEnv WidowX simulation; behaviour on real robots has not been validated for the released checkpoint.
  • Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
  • Not safety-tuned. Do not deploy on physical robots without an external safety layer.

Citation

If you use this checkpoint, please cite StarVLA:

@article{starvla2026,
  title   = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
  author  = {StarVLA Community},
  journal = {arXiv preprint arXiv:2604.05014},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05014}
}

And the underlying VLM backbone:

@misc{qwen3vl,
  title  = {Qwen3-VL},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}

Acknowledgements
