MTP-layer weights?

#1
by CosmicRaisins - opened

Any plans on publishing an NVFP4 quant with MTP-layer weights?

StepFun org

Yes, this is in plan, we will update a version with NVFP4 + MTP

For my understanding, MTP layers are available in the GUFF variants?

For my understanding, MTP layers are available in the GUFF variants?

Yeah, but NVFP4 + MTP will be faster on NVIDIA hardware than a comparably sized GGUF + MTP (and potentially more accurate as well).

StepFun org

Update: the HF checkpoint has now been updated, so stepfun-ai/Step-3.7-Flash-NVFP4 should work with vLLM MTP speculative decoding directly.

NVFP4 + MTP

The Step-3.7-Flash-NVFP4 checkpoint has been updated with MTP draft layers and now supports vLLM speculative decoding with:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

On GPQA Diamond avg@16, NVFP4 + MTP matches quality within statistical noise compared with the same NVFP4 checkpoint without MTP: 77.81% vs. 78.41% item accuracy over 3168 records.

On a GB200 TP=4 vLLM setup with GPQA-style long-reasoning streaming prompts (~250 token prompt, ~1.6K token completion), NVFP4 + MTP improves aggregate decode throughput:

Concurrency NVFP4 + MTP NVFP4 no-MTP Speedup
8 1309 tok/s 1155 tok/s 1.13x
32 4391 tok/s 3480 tok/s 1.26x
64 8229 tok/s 5667 tok/s 1.45x

This makes the NVFP4 checkpoint a practical option for high-throughput long-reasoning workloads while keeping the original NVFP4 model weights unchanged.

can you share a vllm working config please i tried everything here, from your modelcard . The latest nightly with b12x isnt starting and your own docker image complains and fallsback on Step3VLProcessor error.
im on 2x 6000 rtx pro cards 😀

can you share a vllm working config please i tried everything here, from your modelcard . The latest nightly with b12x isnt starting and your own docker image complains and fallsback on Step3VLProcessor error.
im on 2x 6000 rtx pro cards 😀

Same ask here, we want to benchmarking on 2x RTX pro 6000

StepFun org

I will take a look at running on rtx pro 6k.

StepFun org

can you share a vllm working config please i tried everything here, from your modelcard . The latest nightly with b12x isnt starting and your own docker image complains and fallsback on Step3VLProcessor error.
im on 2x 6000 rtx pro cards 😀

We validated the latest stepfun-ai/Step-3.7-Flash-NVFP4 checkpoint with vLLM on an RTX 6000D / SM120 system. It starts and generates with the vllm/vllm-openai:stepfun37 container using multi-GPU tensor parallelism — we tested both TP=2 and TP=4 with MTP enabled.

The serve command we validated:

export MODEL_DIR=/path/to/Step-3.7-Flash-NVFP4
export FLASHINFER_WORKSPACE_BASE=/tmp/flashinfer-step37
mkdir -p "$FLASHINFER_WORKSPACE_BASE"

python3 -m vllm.entrypoints.openai.api_server
--host 0.0.0.0
--port 8765
--model "$MODEL_DIR"
--served-model-name step3p7
--quantization modelopt
--kv-cache-dtype fp8
--tensor-parallel-size 2
--max-model-len 8192
--gpu-memory-utilization 0.9
--enable-expert-parallel
--disable-cascade-attn
--reasoning-parser step3p5
--enable-auto-tool-choice
--tool-call-parser step3p5
--trust-remote-code
--async-scheduling
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Change --tensor-parallel-size to match your GPU count. For a no-MTP run, remove the --speculative-config line.

Caveat: SM120/SM121 falls back to the Marlin weight-only FP4 kernel rather than GB200 native FP4 compute, so it is functional but performance can differ from GB200.
If you still hit the Step3VLProcessor fallback specifically, please share the full startup log plus the exact vLLM version / container and we can dig in.

can you share a vllm working config please i tried everything here, from your modelcard . The latest nightly with b12x isnt starting and your own docker image complains and fallsback on Step3VLProcessor error.
im on 2x 6000 rtx pro cards 😀

We validated the latest stepfun-ai/Step-3.7-Flash-NVFP4 checkpoint with vLLM on an RTX 6000D / SM120 system. It starts and generates with the vllm/vllm-openai:stepfun37 container using multi-GPU tensor parallelism — we tested both TP=2 and TP=4 with MTP enabled.

The serve command we validated:

export MODEL_DIR=/path/to/Step-3.7-Flash-NVFP4
export FLASHINFER_WORKSPACE_BASE=/tmp/flashinfer-step37
mkdir -p "$FLASHINFER_WORKSPACE_BASE"

python3 -m vllm.entrypoints.openai.api_server
--host 0.0.0.0
--port 8765
--model "$MODEL_DIR"
--served-model-name step3p7
--quantization modelopt
--kv-cache-dtype fp8
--tensor-parallel-size 2
--max-model-len 8192
--gpu-memory-utilization 0.9
--enable-expert-parallel
--disable-cascade-attn
--reasoning-parser step3p5
--enable-auto-tool-choice
--tool-call-parser step3p5
--trust-remote-code
--async-scheduling
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Change --tensor-parallel-size to match your GPU count. For a no-MTP run, remove the --speculative-config line.

Caveat: SM120/SM121 falls back to the Marlin weight-only FP4 kernel rather than GB200 native FP4 compute, so it is functional but performance can differ from GB200.
If you still hit the Step3VLProcessor fallback specifically, please share the full startup log plus the exact vLLM version / container and we can dig in.

Thank you so much ! ill retry with your container and nightly a i saw the PR was merged!

validated it is working and running on marlin without previous error ! thank you and your team for making sure the whole of blackwell family can run your foundation models !

@huangyu-nv From testing some workloads for a while my conclusion is this is a great model!

It seems intent on insisting its claude from anthopic but it runs great after the changes below.

My running optimised config on 2x 6000 RTX PRO:

The b12x fallback (llm nightly) doesnt work atm until a fix for SWIGLUSTEP support to B12X is implemented so your stuck on marlin .

Comments on args:
max-num-batched-tokens : this value makes the stupid 20x60000=~7GB vision encoder startup to pass its "safety" check. someone decided 20 images worst case test for vision encoder was a good idea. why ??? test 8 images maybe not 20 . and no you cant oom and make backend crash cause you sent more then 20 images and how is that relevant as a startup safety check on boot... someone didnt cybersecurity cook here

why not 256k context length? the users never exceeds this number. unless its a 200+ page document ingested. (you shouldnt be doing that, and teach your clients the better way) we gain concurrency on kvcache aswell which is better. and 131k is well enough for agent harnesses i run 6 profiles that spin up sub agents just fine for advanced tasks.

mm limit per prompt : limit users to 3 images per prompt. its great for context to send images but i rarely send more than 3 . the width and height is just limit thats a high ress image enough to read by agents.

MTP 2:
Dropping num_spec 3 → 2:

num_spec=3: avg draft acceptance ~50–73%, position-3 at 0.24–0.52
num_spec=2: avg draft acceptance ~74–94%, with good windows hitting 92.9%, 94.2%

This means Throughput is better at mtp 2 over 3.

num_spec=3: ~70–95 tok/s single stream
num_spec=2: ~105–166 tok/s single stream, ~150–172 tok/s at 2 concurrent

That 3rd speculative token gets thrown away half the time (50%) and the 2x 120a GPU's compute it only to have it thrown away (waste).

So num_spec=2 is faster and higher-acceptance for 2tp sm120 cards at current date and fix on your image.

Mean acceptance length barely moved (≈2.87 → ≈2.62, ~9% fewer tokens/step), but throughput jumped 30–70%. That means the 3rd speculative token was adding almost nothing in accepted length while costing a full extra draft+verify per step, net negative.

speculative token counts should be increased only when the acceptance length is high; otherwise performance may be negatively affected. num_spec=3 was sitting in that penalty zone for this model on sm120 and the same goes for qwen 120b model from my testing.

args:

  • "/data/hf/models/models--stepfun-ai--Step-3.7-Flash-NVFP4/snapshots/4275532ffd9a9496ff36b7a2dc4a9db1048da438"
  • "--served-model-name=primary"
  • "--host=0.0.0.0"
  • "--port=8000"
  • "--quantization=modelopt"
  • "--kv-cache-dtype=fp8"
  • "--tensor-parallel-size=2"
  • "--max-model-len=131072"
  • "--max-num-batched-tokens=60000"
  • "--max-num-seqs=50"
  • "--enable-prefix-caching"
  • "--gpu-memory-utilization=0.9"
  • "--limit-mm-per-prompt"
  • '{"image": {"count": 3, "width": 1024, "height": 1024}}'
  • "--enable-expert-parallel"
  • "--disable-cascade-attn"
  • "--reasoning-parser=step3p5"
  • "--enable-auto-tool-choice"
  • "--tool-call-parser=step3p5"
  • "--trust-remote-code"
  • "--async-scheduling"
  • "--speculative-config"
  • '{"method":"mtp","num_speculative_tokens":2}'
  • "--override-generation-config"
  • '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'

Sign up or log in to comment