# Qwen3VL-PI_v3-Bridge-RT-1
A Vision-Language-Action (VLA) model from the StarVLA
project, combining a Qwen3-VL-4B-Instruct backbone with a layer-wise
cross-attention flow-matching action head (QwenPI_v3). The model is
co-trained on the Bridge V2
and RT-1 / Fractal
slices of the Open X-Embodiment (OXE) collection, and is evaluated on the
SimplerEnv WidowX benchmark.
QwenPI_v3 is StarVLA's open-weight realisation of the π₀.₅ recipe:
- Layer-wise cross-DiT flow-matching action head: every VLM layer's hidden state participates in cross-attention with the action DiT, instead of consuming only the last-layer feature.
- Compressed Action DiT: per-layer LayerNorm + Linear projectors compress the 2560-d Qwen3-VL hidden states down to a 1024-d DiT latent, shrinking the action-head footprint by ~6× while preserving the layer-wise interaction.
- Discretised-state language injection: proprioceptive state is quantised into 256 bins and appended to the instruction as plain tokens (`[STATE] <bins> [ACTION]`), so the VLM can attend to robot state with no additional encoder.
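To make the discretised-state injection concrete, the sketch below bins a 7-d state into 256 levels and splices it into the prompt. The bin boundaries, helper names, and exact token spelling are illustrative assumptions, not the StarVLA implementation:

```python
import numpy as np

def discretise_state(state, low, high, num_bins=256):
    """Quantise a 7-d proprioceptive state into integer bins (illustrative)."""
    # Normalise each dimension to [0, 1] with per-dataset bounds, then bin.
    norm = (np.asarray(state, dtype=np.float64) - low) / (np.asarray(high) - low + 1e-8)
    return np.clip((norm * num_bins).astype(int), 0, num_bins - 1)

def build_prompt(instruction, state_bins):
    # Append the binned state to the instruction as plain text tokens.
    return f"{instruction} [STATE] {' '.join(map(str, state_bins))} [ACTION]"
```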
## Model Summary

| | |
|---|---|
| Architecture | QwenPI_v3 (Qwen3-VL + layer-wise cross-DiT flow-matching head) |
| VLM backbone | Qwen3-VL-4B-Instruct |
| Action head | Layer-wise Flow-Matching DiT (36 layers, 1024 hidden, 16 heads) |
| Action chunk | 16 steps |
| Action / state dim | 7 / 7 (delta end-effector) |
| Image resolution | 224 × 224, single 3rd-person view |
| Inference timesteps | 4 (flow matching) |
| Total parameters | ≈ 5.07 B |
| License | MIT |
| Codebase | starVLA/starVLA |
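The 4 inference timesteps correspond to integrating the learned velocity field from Gaussian noise to an action chunk. A generic Euler-integration sketch of flow-matching sampling; the velocity network `v_theta` and its call signature are placeholders, and the actual StarVLA sampler and timestep schedule may differ:

```python
import torch

@torch.no_grad()
def sample_action_chunk(v_theta, obs_features, chunk_len=16, action_dim=7, num_steps=4):
    """Generic flow-matching sampling via Euler integration (illustrative)."""
    x = torch.randn(1, chunk_len, action_dim)   # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)            # current flow time in [0, 1)
        v = v_theta(x, t, obs_features)         # predicted velocity at (x, t)
        x = x + dt * v                          # Euler step toward the data end
    return x                                    # (1, 16, 7) action chunk
```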
### Parameter breakdown

| Module | Parameters | Share |
|---|---|---|
| qwen_vl_interface (Qwen3-VL-4B) | 4,437,815,808 | 87.5 % |
| action_model (layer-wise FM DiT, hidden 1024) | 538,678,305 | 10.6 % |
| project_layers (per-layer 2560 → 1024 projectors) | 94,593,024 | 1.9 % |
| Total | 5,071,087,137 | 100 % |
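As a sanity check, the projector count is consistent with 36 per-layer blocks of LayerNorm(2560) + Linear(2560 → 1024), assuming an elementwise-affine LayerNorm and a Linear bias: 36 × (2 × 2560 + 2560 × 1024 + 1024) = 36 × 2,627,584 = 94,593,024.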
## Training Data

Co-training mixture `bridge_rt_1` (1 : 1 sampling):

| Dataset | Embodiment | Source |
|---|---|---|
| bridge_orig_1.0.0_lerobot | WidowX | IPEC-COMMUNITY/bridge_orig_lerobot |
| fractal20220817_data_0.1.0_lerobot (RT-1) | Google Robot | IPEC-COMMUNITY/fractal20220817_data_lerobot |
- Action representation: delta end-effector (7-d, gripper included)
- Image observation: single primary RGB view, resized to 224 × 224
- Per-dataset normalisation statistics are stored in `dataset_statistics.json`.
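A minimal sketch of applying these statistics at train / inference time; the JSON schema below (per-dataset `action` entries with `mean` / `std`) is an assumption and should be checked against the shipped `dataset_statistics.json`:

```python
import json
import numpy as np

# Hypothetical schema: {"<dataset>": {"action": {"mean": [...7...], "std": [...7...]}}}
with open("dataset_statistics.json") as f:
    stats = json.load(f)

def normalise_action(action, dataset="bridge_orig_1.0.0_lerobot"):
    s = stats[dataset]["action"]
    return (np.asarray(action) - s["mean"]) / (np.asarray(s["std"]) + 1e-8)

def unnormalise_action(action, dataset="bridge_orig_1.0.0_lerobot"):
    s = stats[dataset]["action"]
    return np.asarray(action) * s["std"] + s["mean"]
```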
## Training Recipe

| | |
|---|---|
| Total steps | 100,000 (released checkpoints up to 60k) |
| Warm-up steps | 5,000 |
| Per-device batch size | 24 |
| Hardware | 8 × NVIDIA H100 / A100 (DeepSpeed ZeRO-2) |
| Precision | bf16 mixed precision + gradient checkpointing |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-8, wd = 1e-8) |
| LR (base / VLM) | 1e-5 |
| LR (action head) | 1e-4 |
| LR scheduler | cosine_with_min_lr (min LR 5e-7) |
| Gradient clipping | 1.0 |
| Flow-matching noise | Beta distribution (α = 1.5, β = 1.0), s = 0.999 |
| Repeated diffusion steps | 8 |
| Frozen modules | none (full fine-tuning) |
| Attention impl. | FlashAttention-2 |

The exact training config is preserved in `config.yaml` / `config.full.yaml`, and the launch script in `run_oxe_train.sh`.
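The two learning rates imply separate parameter groups for the VLM backbone and for the action head / projectors. A minimal grouping sketch in plain PyTorch, using the module names from the parameter breakdown above; the actual StarVLA grouping and scheduler wiring are assumptions here:

```python
import torch

def build_optimizer(model):
    vlm_params, head_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # VLM backbone at 1e-5; action head + projectors at 1e-4.
        (vlm_params if name.startswith("qwen_vl_interface") else head_params).append(p)
    return torch.optim.AdamW(
        [
            {"params": vlm_params, "lr": 1e-5},
            {"params": head_params, "lr": 1e-4},
        ],
        betas=(0.9, 0.95),
        eps=1e-8,
        weight_decay=1e-8,
    )
```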
## Evaluation: SimplerEnv WidowX

Evaluation follows the standard SimplerEnv WidowX protocol on four pick-and-place tasks (24 episodes per task per run). Numbers are success rates (↑).
| Step | PutCarrotOnPlate | PutEggplantInBasket | PutSpoonOnTableCloth | StackGreenCubeOnYellowCube | Average |
|---|---|---|---|---|---|
| 40k | 0.688 | 0.917 | 0.750 | 0.333 | 0.672 |
| 50k | 0.625 | 1.000 | 0.792 | 0.375 | 0.698 |
| 60k | 0.667 | 1.000 | 0.750 | 0.167 | 0.646 |
Best average: 69.8 % at the 50k checkpoint (`steps_50000_pytorch_model.pt`), which we ship as the recommended checkpoint.

For comparison with other StarVLA frameworks on the same `bridge_rt_1` mixture and protocol, see the StarVLA Model Zoo.
## Repository Layout

```
.
├── README.md                       # this model card
├── config.yaml                     # minimal training config
├── config.full.yaml                # fully resolved training config
├── run_oxe_train.sh                # launch script used for this run
├── dataset_statistics.json         # per-dataset action/state normalisation stats
├── summary.jsonl                   # training step summary
├── success_summary/                # SimplerEnv evaluation logs and plots
│   ├── success_summary.csv
│   ├── raw_success.txt
│   └── success_plot.png
└── checkpoints/
    ├── steps_50000_pytorch_model.pt   # ← recommended checkpoint
    └── ...                            # per-step evaluation logs
```
## How to Use

This checkpoint is consumed directly by the StarVLA training / evaluation stack. Clone StarVLA and load the checkpoint with the framework name `QwenPI_v3`:
```bash
git clone https://github.com/starVLA/starVLA.git
cd starVLA
# Follow installation instructions in the StarVLA README.
```
```python
from huggingface_hub import snapshot_download
from starVLA.model.framework.tools import load_framework_from_checkpoint

ckpt_dir = snapshot_download("StarVLA/Qwen3VL-PI_v3-Bridge-RT-1")

policy = load_framework_from_checkpoint(
    framework_name="QwenPI_v3",
    config_path=f"{ckpt_dir}/config.full.yaml",
    checkpoint_path=f"{ckpt_dir}/checkpoints/steps_50000_pytorch_model.pt",
)

# policy.predict_action(images, instruction, state) -> action chunk (16 × 7)
```
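Continuing from the snippet above, a hedged usage sketch with dummy inputs; the exact argument types and return shape of `predict_action` are assumptions inferred from the comment, not a documented StarVLA signature:

```python
import numpy as np

# Dummy single third-person view, instruction, and 7-d proprioceptive state.
image = np.zeros((224, 224, 3), dtype=np.uint8)
instruction = "put the carrot on the plate"
state = np.zeros(7, dtype=np.float32)

# Expected to return a 16-step chunk of 7-d delta end-effector actions.
actions = policy.predict_action([image], instruction, state)
print(np.asarray(actions).shape)  # e.g. (16, 7)
```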
For end-to-end SimplerEnv evaluation, see `examples/SimplerEnv`.
## Intended Use & Limitations

Intended use. Research on vision-language-action models, manipulation policy learning, and as a baseline for π-style flow-matching action heads on top of open-weight VLMs.

Out-of-scope / limitations.

- Trained only on Bridge (WidowX) + RT-1 (Google Robot) with a 7-d delta-EE action space; generalisation to other embodiments / action spaces is not guaranteed.
- Single 224 × 224 third-person view; no wrist camera, no depth.
- Evaluated only in SimplerEnv WidowX simulation; behaviour on real robots has not been validated with the released checkpoint.
- Inherits any biases / failure modes of the underlying Qwen3-VL-4B model.
- Not safety-tuned. Do not deploy on physical robots without an external safety layer.
## Citation
If you use this checkpoint, please cite StarVLA:
```bibtex
@article{starvla2026,
  title   = {StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing},
  author  = {StarVLA Community},
  journal = {arXiv preprint arXiv:2604.05014},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.05014}
}
```
And the underlying VLM backbone:
```bibtex
@misc{qwen3vl,
  title  = {Qwen3-VL},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct}
}
```
## Acknowledgements
- Qwen Team for the Qwen3-VL backbone.
- Physical Intelligence for the π₀ / π₀.₅ flow-matching action-head recipe that inspired QwenPI_v3.
- Open X-Embodiment and IPEC-COMMUNITY for the LeRobot conversions of Bridge V2 and RT-1.
- SimplerEnv for the evaluation protocol.