Support Transformers 5.6, dynamic-resolution images, temporal video embedder, vLLM parity

#2 opened by cuichenx (NVIDIA org)

Summary

  • Port modeling_nemotron_h.py to Transformers 5.6 by mirroring the upstream transformers/models/nemotron_h implementation (new Cache API, GradientCheckpointingLayer, eager/SDPA/FA2 dispatch, NemotronHMoE).
  • image_processing.py: replace the fixed-tile/thumbnail tiler with a dynamic-resolution path (per-image patch budget bounded by min_num_patches / max_num_patches / max_model_len, antialiased bicubic resize); update preprocessor_config.json to match.
  • modeling.py + processing.py: add a 3D temporal video_embedder hot-swapped into RADIO's patch_generator, expand <video> into one tubelet per T frames, and route pixel_values_videos through extract_video_feature.
  • Match vLLM inference numerically: align RADIO CPE interpolation to align_corners=False, expose llm_config as text_config, accept OpenAI-style list content in chat_template.jinja, and emit timestamps with vLLM's int(idx)*int(1000/fps)/1000 formula.
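The dynamic-resolution path in the second bullet can be sketched as a patch-budget computation: keep the image's aspect ratio while clamping its patch grid between `min_num_patches` and `max_num_patches`. This is a hypothetical illustration, not the actual `image_processing.py` code; the helper name and the patch size of 16 are assumptions.

```python
import math

def choose_resize_shape(width: int, height: int, patch_size: int = 16,
                        min_num_patches: int = 1,
                        max_num_patches: int = 256) -> tuple[int, int]:
    """Hypothetical sketch of a per-image patch budget.

    Clamps the number of patches to [min_num_patches, max_num_patches]
    while preserving aspect ratio, and returns a (width, height) that is
    a whole multiple of patch_size, suitable for an antialiased bicubic
    resize.
    """
    ideal = (width / patch_size) * (height / patch_size)
    budget = max(min_num_patches, min(max_num_patches, ideal))
    # Uniform scale that maps the native patch grid onto the budget.
    scale = math.sqrt(budget / ideal)
    grid_w = max(1, round(width * scale / patch_size))
    grid_h = max(1, round(height * scale / patch_size))
    return grid_w * patch_size, grid_h * patch_size
```

A 1920x1080 image, for example, would be shrunk until its patch grid fits the 256-patch budget, while a 64x64 image already inside the budget is left at its native size.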
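The third bullet's placeholder expansion ("one tubelet per T frames") can be illustrated with a minimal sketch; the function name and signature are assumptions, not the actual `processing.py` API.

```python
def expand_video_placeholder(prompt: str, num_frames: int,
                             frames_per_tubelet: int,
                             token: str = "<video>") -> str:
    """Hypothetical sketch: replace a single <video> token with one
    token per tubelet, where each tubelet covers frames_per_tubelet
    consecutive frames."""
    num_tubelets = num_frames // frames_per_tubelet
    return prompt.replace(token, token * num_tubelets, 1)
```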
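The timestamp formula in the last bullet is worth spelling out, because the inner `int(1000 / fps)` truncation (not rounding) is what makes the output bit-match vLLM:

```python
def frame_timestamp(idx: int, fps: float) -> float:
    # vLLM's formula as quoted in the summary: truncate the per-frame
    # millisecond step to an int first, then convert back to seconds.
    return int(idx) * int(1000 / fps) / 1000

# At fps=3 the step is int(1000/3) = 333 ms, not 333.33 ms, so frame 10
# lands at 3.33 s rather than the "exact" 3.333 s.
```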
cuichenx changed pull request status to open
DanialMT changed pull request status to merged