SAM 2.1 Hiera-Tiny: full video tracking pipeline (ONNX)

ONNX export of facebook/sam2.1-hiera-tiny including the memory modules (memory encoder + memory attention + object pointers) that make SAM2 a real video tracker. Unlike image-only exports, no application-level propagation heuristics are needed.

Exported from transformers (Sam2VideoModel, v5.11) and numerically validated against propagate_in_video_iterator: the worst per-frame mask IoU against the PyTorch reference is 0.9967 on a synthetic motion clip, including early frames where the fixed-shape memory bank is padded.
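The validation metric above can be illustrated with a minimal sketch. This is not the script used for the export. It assumes masks are already binarized and flattened to equal-length sequences, and the function names `mask_iou` and `worst_frame_iou` are hypothetical.

```python
from typing import Sequence


def mask_iou(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Intersection over union of two flattened binary masks."""
    if len(a) != len(b):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def worst_frame_iou(
    onnx_masks: Sequence[Sequence[bool]],
    ref_masks: Sequence[Sequence[bool]],
) -> float:
    """Minimum per-frame IoU between the ONNX masks and the reference masks."""
    if len(onnx_masks) != len(ref_masks):
        raise ValueError("both clips must have the same number of frames")
    if not onnx_masks:
        raise ValueError("need at least one frame")
    return min(mask_iou(a, b) for a, b in zip(onnx_masks, ref_masks))
```

Taking the minimum rather than the mean means a single badly tracked frame, for example one of the early padded-memory frames, would show up in the reported number.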

Graphs (fp32, float I/O)

| file | inputs | outputs |
| --- | --- | --- |
| onnx/vision_encoder.onnx | pixel_values [1,3,1024,1024] | feats0 [1,32,256,256], feats1 [1,64,128,128], feats2 [1,256,64,64] (raw), feats2_no_mem (conditioning-frame variant), vision_pos_embed [1,256,64,64] |
| onnx/mask_decoder.onnx | feats0, feats1, feats2_cond, input_points [1,1,N,2], input_labels [1,1,N] int32 | low_res_mask [1,1,256,256], high_res_mask [1,1,1024,1024], iou [1,1], object_score_logits [1,1,1], object_pointer [1,1,256] (occlusion-gated, in-graph) |
| onnx/memory_encoder.onnx | feats2, high_res_mask, object_score_logits [1,1], binarize (scalar: 1 for point-prompted frames) | memory_tokens [4096,1,64], memory_pos [4096,1,64] |
| onnx/memory_attention.onnx | current_vision_features [4096,1,256], current_vision_position_embeddings [4096,1,256], memory [28736,1,64], memory_pos [28736,1,64] | conditioned_feats [1,256,64,64] |
| onnx/pointer_tpos.onnx | normalized_diffs [P] | pointer_pos [P,64] |

constants.json carries the memory temporal positional encoding table (7×64), normalization constants, and shape parameters.
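The fixed sequence lengths in the graph table follow from the 64×64 feature grid and from the memory bank layout described in the tracking loop (7 memory blocks plus 16 object pointers, each split into 4 tokens of width 64). The sketch below only re-derives those numbers; the constant names are illustrative and are not necessarily the keys used in constants.json.

```python
# Values read off the graph table and the tracking-loop description.
FEATURE_GRID = 64        # feats2 is [1,256,64,64]
MEMORY_BLOCKS = 7        # 1 conditioning memory + up to 6 recent memories
NUM_POINTERS = 16        # object pointers fed to memory attention
POINTER_DIM = 256        # object_pointer is [1,1,256]
MEMORY_DIM = 64          # memory_tokens is [4096,1,64]

# One memory block holds one token per feature-grid cell.
TOKENS_PER_FRAME = FEATURE_GRID * FEATURE_GRID

# A 256-wide pointer is split into 64-wide tokens.
TOKENS_PER_POINTER = POINTER_DIM // MEMORY_DIM


def memory_sequence_length() -> int:
    """Length of the `memory` input of memory_attention.onnx."""
    return MEMORY_BLOCKS * TOKENS_PER_FRAME + NUM_POINTERS * TOKENS_PER_POINTER


assert TOKENS_PER_FRAME == 4096
assert TOKENS_PER_POINTER == 4
assert memory_sequence_length() == 28736
```

Because the graphs are exported with these fixed shapes, the application has to pad the bank to exactly 28736 tokens on every frame, including the first few frames where fewer memories exist.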

Tracking loop

  1. Seed frame (user clicks): vision_encoder → decoder on feats2_no_mem with points → mask, object pointer → memory_encoder (binarize=1) → conditioning memory.
  2. Every later frame: vision_encoder → assemble memory bank (conditioning memory uses temporal-PE row 6; up to 6 recent frame memories use rows offset-1; pad to 7 blocks by duplicating the most recent) + 16 object pointers split into 4×64 tokens with pointer_tpos positional encoding (offsets normalized by min(total_frames,16)-1; pad by duplication) → memory_attention → decoder with a single padding point (label −1) → mask, pointer → memory_encoder (binarize=0) → push memory.
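The bookkeeping in step 2 can be sketched in plain Python. The code below only plans which memory goes into which block and which temporal-PE row it gets; it does not touch tensors. Several details are assumptions not stated above: the block order, the row reused for padded duplicates, which pointer entry is duplicated when padding, and the helper names `plan_memory_blocks` and `pointer_normalized_diffs`.

```python
from typing import List, Tuple

COND_TPOS_ROW = 6    # temporal-PE row for the conditioning memory
MAX_RECENT = 6       # at most 6 recent frame memories
MEMORY_BLOCKS = 7    # fixed number of blocks expected by memory_attention
NUM_POINTERS = 16    # fixed number of object pointers


def plan_memory_blocks(num_recent: int) -> List[Tuple[str, int]]:
    """Return (source, temporal_pe_row) for each of the 7 memory blocks.

    source is "cond" or "recent:<offset>", where offset 1 is the most
    recent tracked frame. Recent memories use row offset-1.
    """
    if num_recent < 0:
        raise ValueError("num_recent must be non-negative")
    used = min(num_recent, MAX_RECENT)
    blocks = [("cond", COND_TPOS_ROW)]
    for offset in range(1, used + 1):
        blocks.append((f"recent:{offset}", offset - 1))
    # Assumption: padding repeats the most recent entry together with its
    # row; with no recent memories yet, the conditioning memory is repeated.
    pad_with = blocks[1] if used > 0 else blocks[0]
    while len(blocks) < MEMORY_BLOCKS:
        blocks.append(pad_with)
    return blocks


def pointer_normalized_diffs(offsets: List[int], total_frames: int) -> List[float]:
    """Frame offsets divided by min(total_frames, 16) - 1, padded to 16."""
    denom = min(total_frames, NUM_POINTERS) - 1
    if denom <= 0:
        raise ValueError("need at least two frames to normalize offsets")
    if not offsets:
        raise ValueError("need at least one object pointer")
    diffs = [offset / denom for offset in offsets[:NUM_POINTERS]]
    # Assumption: padding repeats the last listed pointer's value.
    while len(diffs) < NUM_POINTERS:
        diffs.append(diffs[-1])
    return diffs
```

For example, `plan_memory_blocks(2)` yields the conditioning block, two recent blocks with rows 0 and 1, and four padded copies of the most recent block. The list returned by `pointer_normalized_diffs` would be the `normalized_diffs` input of pointer_tpos.onnx with P = 16.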

Occlusion is handled by the model: when object_score_logits ≤ 0 the mask is suppressed in-graph and the object pointer falls back to the learned no-object pointer; tracking recovers automatically when the object reappears.
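Since the gating happens inside the graph, an application only needs to read the score if it wants to show whether the object is currently visible. A minimal sketch, assuming the scalar has already been extracted from the object_score_logits output as a Python float (the helper names are hypothetical):

```python
from typing import List, Sequence


def is_occluded(object_score_logit: float) -> bool:
    """Mirror the in-graph rule: a logit at or below zero means no object."""
    return object_score_logit <= 0.0


def visible_frames(scores: Sequence[float]) -> List[int]:
    """Indices of frames in which the tracked object is reported present."""
    return [i for i, score in enumerate(scores) if not is_occluded(score)]
```

A UI could use this to hide the overlay during occluded frames without resetting the tracker, since the memory bank keeps being updated either way.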

Used by FuzzPuppy's Video Object Tracker, which runs this pipeline fully in-browser on WebGPU.
