Broken MTP in newer version

#1
by DColt - opened

Hello.

I used this model yesterday (up to commit 976e620d91cbe5d154a7bd9ee7d9d8b1b4f858bb) and MTP ran perfectly fine.

Starting this morning, with the newest commit (e347d86b2a6cba4b54ea6f87ca247f60439eed07), it's no longer working.

Running on a DGX Spark with the following recipe:

# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm model in native NVFP4 prism format


recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm

solo_only: true

# Container image to use
container: vllm-node-tf5

# Mods
mods:
  - mods/fix-qwen3.6-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 32768

# Environment variables
env: []

# The vLLM serve command template
command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --trust-remote-code \
    --quantization compressed-tensors \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --load-format fastsafetensors \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"enable_thinking": true, "preserve_thinking": true}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    --enable-prefix-caching

And launched with:

./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

[image: screenshot of the error output]

Will fix ASAP.

Can you try updating again? Also, refresh your vllm/flashinfer. This version runs locally for me with eugr’s latest vllm spark runner.

I am running the latest vllm/flashinfer

Commit hash matches (9e3d8b9d) — wheels are up to date.
All flashinfer wheels are up to date — skipping download.
FlashInfer wheels ready.
Commit hash matches (9a6a66f3b) — wheels are up to date.
All vllm wheels are up to date — skipping download.

It runs fine if I exclude --speculative-config, but with it I still get the same error.
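
For reference, the only difference between the working and failing runs is that one flag; everything else in the recipe is unchanged:

# works: no speculative decoding
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10

# fails on startup: MTP enabled
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'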


What recipe do you use, in case it's something in my setup?

Hi friend,

I just validated this end-to-end with a fresh download and Docker image.

Here's the full stack from the vLLM image that just validated the HF model end-to-end with MTP:

┌────────────────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Component │ Version │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ vLLM │ 0.19.2rc1.dev86+g9a6a66f3b.d20260421.cu132 (commit 9a6a66f3b, built 2026-04-21) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ PyTorch │ 2.11.0+cu130 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Transformers │ 5.5.4 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Tokenizers │ 0.22.2 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ FlashInfer │ 0.6.8 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ compressed-tensors │ 0.15.0.1 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Triton │ 3.6.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ fastsafetensors │ 0.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ safetensors │ 0.7.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ huggingface-hub │ 1.11.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ accelerate │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ xformers │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Ray │ 2.55.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Python │ 3.12.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (nvcc) │ 13.2, V13.2.51 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (torch links) │ 13.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ cuDNN │ 91900 (9.19.0) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ GPU │ NVIDIA GB10 (DGX Spark, Grace-Blackwell) │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘
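
If you want to compare your stack against this table, a rough check along these lines is usually enough (the grep terms are just a convenience, adjust as needed):

pip list 2>/dev/null | grep -Ei 'vllm|torch|transformers|tokenizers|flashinfer|compressed-tensors|triton|fastsafetensors|safetensors|huggingface|ray'
python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"
nvcc --version | tail -n 1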

Runtime env set by the recipe:

  • VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • TRITON_CACHE_DIR=/tmp/triton_cache

vLLM serve flags (critical ones):

  • --attention-backend flashinfer
  • --moe-backend flashinfer_cutlass
  • --kv-cache-dtype fp8
  • --load-format fastsafetensors
  • --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Environment

export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TRITON_CACHE_DIR="/tmp/triton_cache"

Serve

vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend flashinfer_cutlass \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template unsloth.jinja \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
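
Once the server is up, a loose grep of the metrics endpoint confirms speculative decoding is actually active (exact metric names vary between vLLM versions, so I wouldn't hard-code them):

curl -s http://localhost:8000/metrics | grep -i spec

If MTP is running you should see spec-decode counters ticking up as requests come in.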

Notes for whoever's reproducing it:

  • --chat-template unsloth.jinja refers to the file dropped into $WORKSPACE_DIR by the mods/fix-qwen3.6-chat-template mod (cp chat_template.jinja $WORKSPACE_DIR/unsloth.jinja). If you don't apply that mod, either pass a path to your own jinja or drop the flag.
  • --trust-remote-code is required because the architecture is Qwen3_5MoeForConditionalGeneration (multimodal wrapper).
  • --moe-backend flashinfer_cutlass is important — Marlin on NVFP4-MoE has path gaps that can misfire on MTP.
  • --speculative-config JSON must be single-quoted at the shell so the {"..."} reaches vLLM intact.
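
On that last point, printf is a quick way to see what the shell actually hands over after quoting (it just echoes each argument on its own line):

printf '%s\n' --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# --speculative-config
# {"method":"mtp","num_speculative_tokens":3}

With single quotes the JSON arrives intact; double-quoting or dropping the quotes lets the shell mangle the braces and inner quotes before vLLM ever sees them.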

Good luck sir,

With your flags and env config, it works!

Now I just have to revert to my original recipe and work out which flag was the breaking one.

Huge thank you for both the quant and the time taken to help here 💝

From tweaking my original recipe: either export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass" or --moe-backend flashinfer_cutlass is needed to avoid the error on startup.
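
Concretely, these are the two launch variants that avoid the startup error (Option A assumes the exported variable actually reaches the vLLM process inside the container, e.g. via the recipe's env: list; Option B assumes run-recipe.sh forwards the flag to vllm serve the same way it forwards --speculative-config):

# Option A: set the NVFP4 GEMM backend via environment variable
export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

# Option B: select the MoE backend explicitly on the command line
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --moe-backend flashinfer_cutlass --speculative-config '{"method":"mtp","num_speculative_tokens":3}'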
