Foundry-VLM-1.3B-200M

A 1.3B parameter vision-language model trained on 200M image-caption samples, part of the VLA Foundry collection.

Model Description

  • Architecture: ViT encoder (12 layers, 768 hidden dim, patch size 14, pixel-shuffle 2x) + Transformer decoder (24 layers, 2048 hidden dim, 16 heads)
  • Parameters: 1.3B (non-embedding)
  • Processor: SmolVLM2
  • Training data: 200M image-caption pairs from DataComp-DR-1B
  • LR schedule: Warmup + constant for 165M samples, then 35M samples of cosine decay
  • LLM backbone: Initialized from Foundry-LLM-1.2B-800B
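
As a rough sanity check, the architecture above is consistent with the ~1.3B non-embedding parameter count: a standard transformer block carries roughly 12·d² parameters (about 4·d² for attention projections plus 8·d² for a 4x MLP). This is a back-of-envelope estimate, not the exact count:

```python
def transformer_params(layers: int, hidden: int) -> int:
    # Rough per-block count for a standard transformer:
    # ~4*d^2 for attention (Q, K, V, output) + ~8*d^2 for a 4x-expansion MLP.
    return layers * 12 * hidden ** 2

vit = transformer_params(12, 768)        # ViT encoder blocks
decoder = transformer_params(24, 2048)   # language decoder blocks
total = vit + decoder                    # ~1.29e9, close to the stated 1.3B
```

The estimate omits layer norms, biases, and the vision-language projector, so it slightly undercounts; it is only meant to show the stated sizes are mutually consistent.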

A continuation of Foundry-VLM-1.3B-165M, trained for an additional 35M samples under cosine learning-rate decay. This checkpoint serves as the vision-language backbone for the Foundry-VLA-1.7B action models.
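
The warmup-then-constant-then-cosine schedule described above can be sketched as follows. The base learning rate and warmup length here are illustrative assumptions, not values from this card; only the 165M/200M sample boundaries come from the training description:

```python
import math

def lr_at(sample_idx: int,
          base_lr: float = 1e-4,       # assumed, not from the card
          warmup: int = 2_000_000,     # assumed, not from the card
          constant_until: int = 165_000_000,
          total: int = 200_000_000) -> float:
    """Warmup + constant LR for 165M samples, then cosine decay to 0 at 200M."""
    if sample_idx < warmup:
        return base_lr * sample_idx / warmup
    if sample_idx < constant_until:
        return base_lr
    # Cosine decay over the final 35M samples (165M -> 200M).
    progress = (sample_idx - constant_until) / (total - constant_until)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

For example, the rate is at its full base value mid-training and reaches zero at the 200M-sample mark.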

Evaluation Results

COCO-val captioning:

BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE-L   CIDEr
 58.64    38.62    24.49    15.57     38.17    55.14

Usage

Install from source:

git clone https://github.com/TRI-ML/vla_foundry.git
cd vla_foundry
pip install -e .

Then load the checkpoint in Python:

from vla_foundry.models.base_model import BaseModel

model = BaseModel.from_pretrained("TRI-ML/Foundry-VLM-1.3B-200M")

Links

  • Collection including TRI-ML/Foundry-VLM-1.3B-200M
  • Paper for TRI-ML/Foundry-VLM-1.3B-200M