# Qwen3VL-PI_v3-Bridge-RT-1
A Vision-Language-Action (VLA) model from the StarVLA
project, combining a Qwen3-VL-4B-Instruct backbone with a layer-wise
cross-attention flow-matching action head (QwenPI_v3). The model is
co-trained on the Bridge V2
and RT-1 / Fractal
slices of the Open X-Embodiment (OXE) collection, and is evaluated on the
SimplerEnv WidowX benchmark.
QwenPI_v3 is StarVLA's open-weight realisation of the π₀.₅ recipe:
- Layer-wise cross-DiT flow-matching action head: every VLM layer's hidden state participates in cross-attention with the action DiT, instead of consuming only the last-layer feature.
- Compressed Action DiT: per-layer LayerNorm + Linear projectors compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent, shrinking the action-head footprint by ~6× while preserving the layer-wise interaction.
- Discretised-state language injection: proprioceptive state is quantised into 256 bins and appended to the instruction as plain tokens (`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with no additional encoder.
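To make the discretised-state injection concrete, the sketch below bins a 7-d state into 256 levels and splices it into the prompt. The bin boundaries, helper names, and exact token spelling are illustrative assumptions, not the StarVLA implementation:

```python
import numpy as np

def discretise_state(state, low, high, num_bins=256):
    """Quantise a 7-d proprioceptive state into integer bins (illustrative)."""
    # Normalise each dimension to [0, 1] with per-dataset bounds, then bin.
    norm = (np.asarray(state, dtype=np.float64) - low) / (np.asarray(high) - low + 1e-8)
    return np.clip((norm * num_bins).astype(int), 0, num_bins - 1)

def build_prompt(instruction, state_bins):
    # Append the binned state to the instruction as plain text tokens.
    return f"{instruction} [STATE] {' '.join(map(str, state_bins))} [ACTION]"
```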
## Model Summary

| | |
|---|---|
| Architecture | QwenPI_v3 (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| VLM backbone | Qwen3-VL-4B-Instruct |
| Action head | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| Action chunk | 16 steps |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| Total parameters | ≈ 5.07 B |
| License | MIT |
| Codebase | starVLA/starVLA |
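The 4 inference timesteps correspond to integrating the learned velocity field from Gaussian noise to an action chunk. A generic Euler-integration sketch of flow-matching sampling; the velocity network `v_theta` and its call signature are placeholders, and the actual StarVLA sampler and timestep schedule may differ:

```python
import torch

@torch.no_grad()
def sample_action_chunk(v_theta, obs_features, chunk_len=16, action_dim=7, num_steps=4):
    """Generic flow-matching sampling via Euler integration (illustrative)."""
    x = torch.randn(1, chunk_len, action_dim)   # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)            # current flow time in [0, 1)
        v = v_theta(x, t, obs_features)         # predicted velocity at (x, t)
        x = x + dt * v                          # Euler step toward the data end
    return x                                    # (1, 16, 7) action chunk
```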
### Parameter breakdown

| Module | Parameters | Share |
|---|---|---|
| qwen_vl_interface (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| action_model (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| project_layers (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| Total | 5,071,087,137 | 100 % |
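As a sanity check, the projector count is consistent with 36 per-layer blocks of LayerNorm(2560) + Linear(2560 → 1024), assuming an elementwise-affine LayerNorm and a Linear bias: 36 × (2 × 2560 + 2560 × 1024 + 1024) = 36 × 2,627,584 = 94,593,024.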
## Training Data

Co-training mixture `bridge_rt_1` (1 : 1 sampling):

| Dataset | Embodiment | Source |
|---|---|---|
| bridge_orig_1.0.0_lerobot | WidowX | IPEC-COMMUNITY/bridge_orig_lerobot |
| fractal20220817_data_0.1.0_lerobot (RT-1) | Google Robot | IPEC-COMMUNITY/fractal20220817_data_lerobot |
- Action representation: delta end-effector (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in `dataset_statistics.json`.
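A minimal sketch of applying these statistics at train / inference time; the JSON schema below (per-dataset `action` entries with `mean` / `std`) is an assumption and should be checked against the shipped `dataset_statistics.json`:

```python
import json
import numpy as np

# Hypothetical schema: {"<dataset>": {"action": {"mean": [...7...], "std": [...7...]}}}
with open("dataset_statistics.json") as f:
    stats = json.load(f)

def normalise_action(action, dataset="bridge_orig_1.0.0_lerobot"):
    s = stats[dataset]["action"]
    return (np.asarray(action) - s["mean"]) / (np.asarray(s["std"]) + 1e-8)

def unnormalise_action(action, dataset="bridge_orig_1.0.0_lerobot"):
    s = stats[dataset]["action"]
    return np.asarray(action) * s["std"] + s["mean"]
```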
## Training Recipe

| | |
|---|---|
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |

The exact training config is preserved in `config.yaml` / `config.full.yaml`, and the launch script in `run_oxe_train.sh`.
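The two learning rates imply separate parameter groups for the VLM backbone and for the action head / projectors. A minimal grouping sketch in plain PyTorch, using the module names from the parameter breakdown above; the actual StarVLA grouping and scheduler wiring are assumptions here:

```python
import torch

def build_optimizer(model):
    vlm_params, head_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # VLM backbone at 1e-5; action head + projectors at 1e-4.
        (vlm_params if name.startswith("qwen_vl_interface") else head_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": vlm_params, "lr": 1e-5},
            {"params": head_params, "lr": 1e-4},
        ],
        betas=(0.9, 0.95),
        eps=1e-8,
        weight_decay=1e-8,
    )
```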
## Evaluation: SimplerEnv WidowX

Evaluation follows the standard SimplerEnv WidowX protocol on four pick-and-place tasks (24 episodes per task per run). Numbers are success rates (↑).
| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | Average |
|---|---|---|---|---|---|
| 40k | 0.688 | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k | 0.625 | 1.000 | 0.792 | 0.375 | 0.698 |
| 60k | 0.667 | 1.000 | 0.750 | 0.167 | 0.646 |
Best average: 69.8 % at the 50k checkpoint (`steps_50000_pytorch_model.pt`), which we ship as the recommended checkpoint.

For comparison with other StarVLA frameworks on the same `bridge_rt_1` mixture and protocol, see the StarVLA Model Zoo.
## Repository Layout

```
.
├── README.md                       # this model card
├── config.yaml                     # minimal training config
├── config.full.yaml                # fully resolved training config
├── run_oxe_train.sh                # launch script used for this run
├── dataset_statistics.json         # per-dataset action/state normalisation stats
├── summary.jsonl                   # training step summary
├── success_summary/                # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs
```
## How to Use

This checkpoint is consumed directly by the StarVLA training / evaluation stack. Clone StarVLA and load the checkpoint with the framework name `QwenPI_v3`:
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")

policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)

# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
```
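Continuing from the snippet above, a hedged usage sketch with dummy inputs; the exact argument types and return shape of `predict_action` are assumptions inferred from the comment, not a documented StarVLA signature:

```python
import numpy as np

# Dummy single third-person view, instruction, and 7-d proprioceptive state.
image = np.zeros((224, 224, 3), dtype=np.uint8)
instruction = "put the carrot on the plate"
state = np.zeros(7, dtype=np.float32)

# Expected to return a 16-step chunk of 7-d delta end-effector actions.
actions = policy.predict_action([image], instruction, state)
print(np.asarray(actions).shape)  # e.g. (16, 7)
```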
For end-to-end SimplerEnv evaluation, see `examples/SimplerEnv`.
## Intended Use & Limitations

Intended use. Research on vision-language-action models, manipulation policy learning, and as a baseline for π-style flow-matching action heads on top of open-weight VLMs.

Out-of-scope / limitations.

- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE action space; generalisation to other embodiments / action spaces is not guaranteed.
- Single 224 × 224 third-person view; no wrist camera, no depth.
- Evaluated only in SimplerEnv WidowX simulation; behaviour on real robots has not been validated with the released checkpoint.
- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
- Not safety-tuned. Do not deploy on physical robots without an external safety layer.
## Citation
If you use this checkpoint, please cite StarVLA:
```bibtex
@article{starvla2026,
  title   = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
  author  = {StarVLA Community},
  journal = {arXiv preprint arXiv:2604.05014},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05014}
}
```
And the underlying VLM backbone:
```bibtex
@misc{qwen3vl,
  title  = {Qwen3-VL},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}
```
## Acknowledgements
- Qwen Team for the Qwen3-VL backbone.
- Physical Intelligence for the π₀ / π₀.₅ flow-matching action-head recipe that inspired QwenPI_v3.
- Open X-Embodiment and IPEC-COMMUNITY for the LeRobot conversions of Bridge V2 and RT-1.
- SimplerEnv for the evaluation protocol.