Support Transformers 5.6, dynamic-resolution images, temporal video embedder, vLLM parity

#2 opened by cuichenx (NVIDIA org)

Summary

  • Port modeling_nemotron_h.py to Transformers 5.6 by mirroring the upstream transformers/models/nemotron_h implementation (new Cache API, GradientCheckpointingLayer, eager/SDPA/FA2 dispatch, NemotronHMoE).
  • image_processing.py: replace the fixed-tile/thumbnail tiler with a dynamic-resolution path (per-image patch budget bounded by min_num_patches / max_num_patches / max_model_len, antialiased bicubic resize); update preprocessor_config.json to match.
  • modeling.py + processing.py: add a 3D temporal video_embedder hot-swapped into RADIO's patch_generator, expand <video> into one tubelet per T frames, and route pixel_values_videos through extract_video_feature.
  • Match vLLM inference numerically: align RADIO CPE interpolation to align_corners=False, expose llm_config as text_config, accept OpenAI-style list content in chat_template.jinja, and emit timestamps with vLLM's int(idx)*int(1000/fps)/1000 formula.
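The dynamic-resolution path in the second bullet can be sketched as a patch-budget computation: keep the image's aspect ratio while clamping its patch grid between `min_num_patches` and `max_num_patches`. This is a hypothetical illustration, not the actual `image_processing.py` code; the helper name and the patch size of 16 are assumptions.

```python
import math

def choose_resize_shape(width: int, height: int, patch_size: int = 16,
                        min_num_patches: int = 1,
                        max_num_patches: int = 256) -> tuple[int, int]:
    """Hypothetical sketch of a per-image patch budget.

    Clamps the number of patches to [min_num_patches, max_num_patches]
    while preserving aspect ratio, and returns a (width, height) that is
    a whole multiple of patch_size, suitable for an antialiased bicubic
    resize.
    """
    ideal = (width / patch_size) * (height / patch_size)
    budget = max(min_num_patches, min(max_num_patches, ideal))
    # Uniform scale that maps the native patch grid onto the budget.
    scale = math.sqrt(budget / ideal)
    grid_w = max(1, round(width * scale / patch_size))
    grid_h = max(1, round(height * scale / patch_size))
    return grid_w * patch_size, grid_h * patch_size
```

A 1920x1080 image, for example, would be shrunk until its patch grid fits the 256-patch budget, while a 64x64 image already inside the budget is left at its native size.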
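The third bullet's placeholder expansion ("one tubelet per T frames") can be illustrated with a minimal sketch; the function name and signature are assumptions, not the actual `processing.py` API.

```python
def expand_video_placeholder(prompt: str, num_frames: int,
                             frames_per_tubelet: int,
                             token: str = "<video>") -> str:
    """Hypothetical sketch: replace a single <video> token with one
    token per tubelet, where each tubelet covers frames_per_tubelet
    consecutive frames."""
    num_tubelets = num_frames // frames_per_tubelet
    return prompt.replace(token, token * num_tubelets, 1)
```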
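The timestamp formula in the last bullet is worth spelling out, because the inner `int(1000 / fps)` truncation (not rounding) is what makes the output bit-match vLLM:

```python
def frame_timestamp(idx: int, fps: float) -> float:
    # vLLM's formula as quoted in the summary: truncate the per-frame
    # millisecond step to an int first, then convert back to seconds.
    return int(idx) * int(1000 / fps) / 1000

# At fps=3 the step is int(1000/3) = 333 ms, not 333.33 ms, so frame 10
# lands at 3.33 s rather than the "exact" 3.333 s.
```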
cuichenx changed pull request status to open
DanialMT changed pull request status to merged