Support Transformers 5.6, dynamic-resolution images, temporal video embedder, vLLM parity
#2
by cuichenx - opened
Summary
- Port `modeling_nemotron_h.py` to Transformers 5.6 by mirroring the upstream `transformers/models/nemotron_h` implementation (new Cache API, `GradientCheckpointingLayer`, eager/SDPA/FA2 dispatch, `NemotronHMoE`).
- `image_processing.py`: replace the fixed-tile/thumbnail tiler with a dynamic-resolution path (per-image patch budget bounded by `min_num_patches` / `max_num_patches` / `max_model_len`, antialiased bicubic resize); update `preprocessor_config.json` to match.
- `modeling.py` + `processing.py`: add a 3D temporal `video_embedder` hot-swapped into RADIO's `patch_generator`, expand `<video>` into one tubelet per `T` frames, and route `pixel_values_videos` through `extract_video_feature`.
- Match vLLM inference numerically: align RADIO CPE interpolation to `align_corners=False`, expose `llm_config` as `text_config`, accept OpenAI-style list content in `chat_template.jinja`, and emit timestamps with vLLM's `int(idx)*int(1000/fps)/1000` formula.
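
For reviewers checking the timestamp parity item, here is a minimal sketch of the `int(idx)*int(1000/fps)/1000` expression. Only the expression itself comes from the summary; the function name `frame_timestamp` and the sample inputs are hypothetical, not code from this PR. It shows that the per-frame interval is truncated to whole milliseconds before scaling, so timestamps can differ slightly from a plain `idx / fps`.

```python
def frame_timestamp(idx, fps):
    """Timestamp in seconds for a frame index.

    Hypothetical helper: wraps the expression quoted in the PR summary.
    The per-frame interval is truncated to whole milliseconds first.
    """
    return int(idx) * int(1000 / fps) / 1000


if __name__ == "__main__":
    # At 30 fps the interval truncates from 33.33 ms to 33 ms,
    # so frame 30 lands at 0.99 s rather than 1.0 s.
    print(frame_timestamp(30, 30))
    # At 2 fps the interval is exactly 500 ms, so nothing is lost.
    print(frame_timestamp(3, 2))
```

Because of this truncation, a processor that computes `idx / fps` directly would presumably produce different timestamp strings for frame rates that do not divide 1000 evenly, which is why matching the formula matters for parity.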
cuichenx changed pull request status to open
DanialMT changed pull request status to merged