SAM 2.1 Hiera-Tiny — full video tracking pipeline (ONNX)

ONNX export of facebook/sam2.1-hiera-tiny that includes the memory modules (memory encoder, memory attention, and object pointers) that make SAM2 a real video tracker. Unlike image-only exports, it needs no application-level propagation heuristics.

Exported from transformers (Sam2VideoModel, v5.11) and numerically validated against propagate_in_video_iterator: the worst per-frame mask IoU against the PyTorch reference is 0.9967 on a synthetic motion clip, including the early frames where the fixed-shape memory bank is padded.

Graphs (fp32, float I/O)

| file | inputs | outputs |
|---|---|---|
| `onnx/vision_encoder.onnx` | `pixel_values [1,3,1024,1024]` | `feats0 [1,32,256,256]`, `feats1 [1,64,128,128]`, `feats2 [1,256,64,64]` (raw), `feats2_no_mem` (conditioning-frame variant), `vision_pos_embed [1,256,64,64]` |
| `onnx/mask_decoder.onnx` | `feats0`, `feats1`, `feats2_cond`, `input_points [1,1,N,2]`, `input_labels [1,1,N] int32` | `low_res_mask [1,1,256,256]`, `high_res_mask [1,1,1024,1024]`, `iou [1,1]`, `object_score_logits [1,1,1]`, `object_pointer [1,1,256]` (occlusion-gated, in-graph) |
| `onnx/memory_encoder.onnx` | `feats2`, `high_res_mask`, `object_score_logits [1,1]`, `binarize` (scalar: 1 for point-prompted frames) | `memory_tokens [4096,1,64]`, `memory_pos [4096,1,64]` |
| `onnx/memory_attention.onnx` | `current_vision_features [4096,1,256]`, `current_vision_position_embeddings [4096,1,256]`, `memory [28736,1,64]`, `memory_pos [28736,1,64]` | `conditioned_feats [1,256,64,64]` |
| `onnx/pointer_tpos.onnx` | `normalized_diffs [P]` | `pointer_pos [P,64]` |

constants.json carries the memory temporal positional encoding table (7×64), normalization constants, and shape parameters.
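
The fixed shapes in the table fit together arithmetically. The sketch below checks that the `memory [28736,1,64]` input matches the bank layout described in the tracking loop (7 memory blocks plus 16 object pointers split into 64-wide tokens). The constant names are illustrative only and are not the keys used in constants.json.

```python
# Shape bookkeeping for the fixed-size memory bank.
# All numbers come from the graph table and the tracking-loop description;
# the variable names are hypothetical, not keys from constants.json.
FEAT_HW = 64             # feats2 is [1, 256, 64, 64]
MEM_DIM = 64             # memory_tokens are [4096, 1, 64]
POINTER_DIM = 256        # object_pointer is [1, 1, 256]
NUM_MEMORY_BLOCKS = 7    # 1 conditioning memory + up to 6 recent frames
NUM_POINTERS = 16        # object-pointer slots

# One frame of memory is one token per feats2 spatial location.
tokens_per_frame = FEAT_HW * FEAT_HW

# A 256-wide pointer is split into 64-wide tokens to match MEM_DIM.
tokens_per_pointer = POINTER_DIM // MEM_DIM

memory_len = (NUM_MEMORY_BLOCKS * tokens_per_frame
              + NUM_POINTERS * tokens_per_pointer)

print(tokens_per_frame, tokens_per_pointer, memory_len)
```

Because the length is fixed, early frames with fewer stored memories still have to fill every slot, which is why the tracking loop pads by duplication.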

Tracking loop

Seed frame (user clicks): vision_encoder → decoder on feats2_no_mem with points → mask, object pointer → memory_encoder (binarize=1) → conditioning memory.
Every later frame: vision_encoder → assemble memory bank → memory_attention → decoder with a single padding point (label −1) → mask, pointer → memory_encoder (binarize=0) → push memory. The memory bank has two parts. The first is 7 memory blocks: the conditioning memory uses temporal-PE row 6, up to 6 recent frame memories use rows offset-1, and the bank is padded to 7 blocks by duplicating the most recent. The second is 16 object pointers, each split into 4×64 tokens, with pointer_tpos positional encoding (offsets normalized by min(total_frames,16)-1; padded by duplication).
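
The bank assembly for later frames can be sketched as index bookkeeping, without touching any tensors. This is a minimal sketch under stated assumptions, not the exporter's code: `offset` is taken to mean the number of frames back from the current frame, padded blocks are assumed to reuse the duplicated block's temporal-PE row, pointer padding is assumed to repeat the last entry, and the block order is arbitrary. The function names are hypothetical.

```python
from typing import List, Tuple, Union

NUM_BLOCKS = 7      # fixed number of memory blocks in the bank
COND_ROW = 6        # temporal-PE row used by the conditioning memory
MAX_POINTERS = 16   # fixed number of object-pointer slots

Block = Tuple[Union[str, int], int]  # (memory source, temporal-PE row)


def plan_memory_blocks(current_idx: int, recent_idxs: List[int]) -> List[Block]:
    """Pick which stored memory fills each of the 7 blocks, and its PE row.

    recent_idxs holds the frame indices of stored non-conditioning memories.
    """
    blocks: List[Block] = [("cond", COND_ROW)]
    # Keep at most the 6 newest frame memories, newest first.
    newest_first = sorted(recent_idxs, reverse=True)[: NUM_BLOCKS - 1]
    for idx in newest_first:
        offset = current_idx - idx        # assumption: 1 means previous frame
        blocks.append((idx, offset - 1))  # "rows offset-1"
    # Pad by duplicating the most recent memory (assumption: the conditioning
    # memory is duplicated when no other memory exists yet).
    filler = blocks[1] if len(blocks) > 1 else blocks[0]
    blocks.extend([filler] * (NUM_BLOCKS - len(blocks)))
    return blocks


def normalized_pointer_diffs(offsets: List[int], total_frames: int) -> List[float]:
    """Normalize pointer offsets by min(total_frames, 16) - 1 and pad to 16."""
    if not offsets:
        raise ValueError("at least one object pointer is required")
    # Guard against a zero denominator for single-frame clips (assumption).
    denom = max(min(total_frames, MAX_POINTERS) - 1, 1)
    diffs = [o / denom for o in offsets[:MAX_POINTERS]]
    # Pad by duplication (assumption: repeat the last entry).
    diffs.extend([diffs[-1]] * (MAX_POINTERS - len(diffs)))
    return diffs


# Frame 3 of a clip whose seed frame was frame 0:
print(plan_memory_blocks(3, [1, 2]))
print(normalized_pointer_diffs([1, 2, 3], total_frames=4)[:4])
```

The resulting plan tells the caller which `memory_tokens` and `memory_pos` tensors to concatenate and which row of the 7×64 table to add to each block, and the normalized diffs are what `pointer_tpos.onnx` takes as `normalized_diffs`.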

Occlusion is handled by the model: when object_score_logits ≤ 0 the mask is suppressed in-graph and the object pointer falls back to the learned no-object pointer; tracking recovers automatically when the object reappears.
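
Since the gating happens inside the graph, application code only needs the sign of the logit if it wants to report visibility, for example to hide an overlay while the object is occluded. A minimal sketch, with hypothetical helper names:

```python
from typing import List


def object_is_visible(object_score_logit: float) -> bool:
    """Mirror the in-graph rule: a logit of 0 or below means no object."""
    return object_score_logit > 0.0


def occluded_frames(logits_per_frame: List[float]) -> List[int]:
    """Return the indices of frames where the object was judged absent."""
    return [i for i, logit in enumerate(logits_per_frame)
            if not object_is_visible(logit)]


# Illustrative logits, not real model output.
print(occluded_frames([3.1, 0.0, -2.0, 1.5]))
```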

Used by FuzzPuppy's Video Object Tracker, which runs this pipeline fully in-browser on WebGPU.

