SAM 2.1 Hiera-Tiny: full video tracking pipeline (ONNX)
ONNX export of facebook/sam2.1-hiera-tiny including the memory modules (memory encoder + memory attention + object pointers) that make SAM2 a real video tracker. Unlike image-only exports, no application-level propagation heuristics are needed.
Exported from transformers (Sam2VideoModel, v5.11) and numerically
validated against propagate_in_video_iterator: the worst per-frame mask IoU
against the PyTorch reference is 0.9967 on a synthetic motion clip, including
early frames where the fixed-shape memory bank is padded.
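The "worst per-frame mask IoU" figure can be reproduced in spirit with a few lines of NumPy. This is a minimal sketch, not the validation script used for this export: the helper names `mask_iou` and `worst_frame_iou` are hypothetical, and it assumes masks are obtained by thresholding logits at 0.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-union of two boolean masks (1.0 if both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def worst_frame_iou(onnx_logits, ref_logits, threshold=0.0):
    """Minimum per-frame IoU between two sequences of mask logits.

    Assumption: a pixel belongs to the mask when its logit exceeds `threshold`.
    """
    return min(
        mask_iou(o > threshold, r > threshold)
        for o, r in zip(onnx_logits, ref_logits)
    )
```

Taking the minimum rather than the mean makes a single badly tracked frame visible, which matters for the early frames where the memory bank is mostly padding.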
Graphs (fp32, float I/O)
| file | inputs | outputs |
|---|---|---|
| onnx/vision_encoder.onnx | pixel_values [1,3,1024,1024] | feats0 [1,32,256,256], feats1 [1,64,128,128], feats2 [1,256,64,64] (raw), feats2_no_mem (conditioning-frame variant), vision_pos_embed [1,256,64,64] |
| onnx/mask_decoder.onnx | feats0, feats1, feats2_cond, input_points [1,1,N,2], input_labels [1,1,N] int32 | low_res_mask [1,1,256,256], high_res_mask [1,1,1024,1024], iou [1,1], object_score_logits [1,1,1], object_pointer [1,1,256] (occlusion-gated, in-graph) |
| onnx/memory_encoder.onnx | feats2, high_res_mask, object_score_logits [1,1], binarize (scalar: 1 for point-prompted frames) | memory_tokens [4096,1,64], memory_pos [4096,1,64] |
| onnx/memory_attention.onnx | current_vision_features [4096,1,256], current_vision_position_embeddings [4096,1,256], memory [28736,1,64], memory_pos [28736,1,64] | conditioned_feats [1,256,64,64] |
| onnx/pointer_tpos.onnx | normalized_diffs [P] | pointer_pos [P,64] |
constants.json carries the memory temporal positional encoding table
(7×64), normalization constants, and shape parameters.
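The mask_decoder prompt inputs have fixed ranks (input_points [1,1,N,2], input_labels [1,1,N] int32), so user clicks need to be packed into those shapes. The sketch below shows one way to do that with NumPy. The helper `make_point_inputs` is hypothetical, and several details are assumptions rather than facts from this card: coordinates are taken to be already in the 1024×1024 input space, labels follow the usual SAM convention (1 = foreground, 0 = background), and the padding point used on frames without clicks is placed at (0, 0).

```python
import numpy as np


def make_point_inputs(clicks):
    """Pack (x, y, label) clicks into mask_decoder prompt tensors.

    Returns input_points [1,1,N,2] float32 and input_labels [1,1,N] int32.
    With no clicks, a single padding point with label -1 is emitted.
    """
    if not clicks:
        # Assumption: the padding point's coordinates are (0, 0).
        clicks = [(0.0, 0.0, -1)]
    points = np.array([[x, y] for x, y, _ in clicks], dtype=np.float32)
    labels = np.array([label for _, _, label in clicks], dtype=np.int32)
    # Add the two leading singleton axes expected by the graph.
    return points[None, None], labels[None, None]
```

Calling `make_point_inputs([])` yields the single-padding-point prompt, and `make_point_inputs([(512, 384, 1)])` yields a one-click foreground prompt.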
Tracking loop
- Seed frame (user clicks): vision_encoder → decoder on feats2_no_mem with points → mask, object pointer → memory_encoder (binarize=1) → conditioning memory.
- Every later frame: vision_encoder → assemble memory bank (conditioning memory uses temporal-PE row 6; up to 6 recent frame memories use rows offset-1; pad to 7 blocks by duplicating the most recent) + 16 object pointers split into 4×64 tokens with pointer_tpos positional encoding (offsets normalized by min(total_frames,16)-1; pad by duplication) → memory_attention → decoder with a single padding point (label −1) → mask, pointer → memory_encoder (binarize=0) → push memory.
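The memory-bank assembly for later frames is the least obvious step, so here is a minimal NumPy sketch of it. It reproduces the fixed memory length of 28736 = 7 × 4096 + 16 × 4 tokens. The function names are hypothetical, and several choices are assumptions not stated in this card: the temporal-PE row is added to memory_pos, blocks are ordered conditioning first, the conditioning block is duplicated when no recent memory exists yet, pointers are padded by repeating the last one, and the divisor is clamped to avoid dividing by zero.

```python
import numpy as np

TOKENS, MEM_DIM = 4096, 64       # tokens per memory block, channels per token
NUM_BLOCKS = 7                   # fixed bank: 1 conditioning + 6 recent slots
NUM_POINTERS, PTR_DIM = 16, 256  # pointer slots, channels per pointer


def normalized_diffs(offsets, total_frames):
    """Frame offsets divided by min(total_frames, 16) - 1 (clamped to >= 1)."""
    denom = max(min(total_frames, NUM_POINTERS) - 1, 1)
    return np.asarray(offsets, dtype=np.float32) / denom


def assemble_memory(cond, recent, tpos_table, pointers, pointer_pos):
    """Build the `memory` and `memory_pos` inputs of memory_attention.

    cond        -- (tokens, pos) of the conditioning frame, each [4096,1,64]
    recent      -- list of (tokens, pos, offset), most recent first, offset 1..6
    tpos_table  -- [7,64] temporal positional encoding table
    pointers    -- non-empty list of [256] object pointers, at most 16
    pointer_pos -- [P,64] pointer_tpos output, one row per pointer
    """
    # Conditioning memory uses row 6; a recent memory uses row offset-1.
    blocks = [(cond[0], cond[1] + tpos_table[6])]
    blocks += [(t, p + tpos_table[off - 1]) for t, p, off in recent[:6]]
    # Pad to 7 blocks by duplicating the most recent block.
    most_recent = blocks[1] if len(blocks) > 1 else blocks[0]
    blocks += [most_recent] * (NUM_BLOCKS - len(blocks))

    # Pad pointers to 16 by duplication, then split each into 4 tokens of 64.
    n = len(pointers)
    idx = list(range(n)) + [n - 1] * (NUM_POINTERS - n)
    ptr_tokens = np.stack([pointers[i] for i in idx]).reshape(-1, 1, MEM_DIM)
    ptr_pos = np.repeat(np.asarray(pointer_pos)[idx], PTR_DIM // MEM_DIM, axis=0)
    ptr_pos = ptr_pos[:, None, :]

    memory = np.concatenate([b[0] for b in blocks] + [ptr_tokens], axis=0)
    memory_pos = np.concatenate([b[1] for b in blocks] + [ptr_pos], axis=0)
    return memory, memory_pos
```

Keeping the bank at a fixed shape is what lets the exported graph use static dimensions; the padding only duplicates entries that are already present.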
Occlusion is handled by the model: when object_score_logits ≤ 0 the mask is
suppressed in-graph and the object pointer falls back to the learned
no-object pointer; tracking recovers automatically when the object reappears.
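Because the suppression happens in-graph, an application only needs the score logit if it wants to show the occlusion state, for example to hide an overlay. A minimal sketch, assuming the logit arrives as a NumPy array of shape [1,1,1] or [1,1]; the helper name `is_object_present` is hypothetical.

```python
import numpy as np


def is_object_present(object_score_logits):
    """True when the score logit is > 0, i.e. the mask was not suppressed."""
    return bool(np.asarray(object_score_logits).reshape(-1)[0] > 0)
```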
Used by FuzzPuppy's Video Object Tracker, which runs this pipeline fully in-browser on WebGPU.