Broken MTP in newer version
Hello.
I used this model yesterday (up to commit 976e620d91cbe5d154a7bd9ee7d9d8b1b4f858bb) with MTP working perfectly fine.
As of this morning, with the newest commit (e347d86b2a6cba4b54ea6f87ca247f60439eed07), it's no longer working.
I'm running on a DGX Spark with the following recipe:
# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm model in native NVFP4 prism format
recipe_version: "1"
name: Qwen3.6-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm
solo_only: true

# Container image to use
container: vllm-node-tf5

# Mods
mods:
  - mods/fix-qwen3.6-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 32768

# Environment variables
env: []

# The vLLM serve command template
command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --trust-remote-code \
    --quantization compressed-tensors \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --load-format fastsafetensors \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"enable_thinking": true, "preserve_thinking": true}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    --enable-prefix-caching
And launched with:
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Will fix asap
Can you try updating again? Also, refresh your vllm/flashinfer. This version runs locally for me with eugr’s latest vllm spark runner.
I am running the latest vllm/flashinfer
Commit hash matches (9e3d8b9d) — wheels are up to date.
All flashinfer wheels are up to date — skipping download.
FlashInfer wheels ready.
Commit hash matches (9a6a66f3b) — wheels are up to date.
All vllm wheels are up to date — skipping download.
It runs fine if I exclude --speculative-config, but with it I still get the same error.
What recipe do you use? In case it's something in my setup.
Hi friend,
I just validated end-to-end with a fresh download and docker image.
● Here's the full stack from the vllm image that just validated the HF model end-to-end with MTP:
┌────────────────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Component │ Version │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ vLLM │ 0.19.2rc1.dev86+g9a6a66f3b.d20260421.cu132 (commit 9a6a66f3b, built 2026-04-21) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ PyTorch │ 2.11.0+cu130 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Transformers │ 5.5.4 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Tokenizers │ 0.22.2 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ FlashInfer │ 0.6.8 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ compressed-tensors │ 0.15.0.1 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Triton │ 3.6.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ fastsafetensors │ 0.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ safetensors │ 0.7.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ huggingface-hub │ 1.11.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ accelerate │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ xformers │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Ray │ 2.55.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Python │ 3.12.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (nvcc) │ 13.2, V13.2.51 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (torch links) │ 13.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ cuDNN │ 91900 (9.19.0) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ GPU │ NVIDIA GB10 (DGX Spark, Grace-Blackwell) │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘
Runtime env set by the recipe:
- VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- TRITON_CACHE_DIR=/tmp/triton_cache
vLLM serve flags (critical ones):
- --attention-backend flashinfer
- --moe-backend flashinfer_cutlass
- --kv-cache-dtype fp8
- --load-format fastsafetensors
- --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Environment
export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TRITON_CACHE_DIR="/tmp/triton_cache"
Serve
vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend flashinfer_cutlass \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template unsloth.jinja \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
Notes for whoever's reproducing it:
- --chat-template unsloth.jinja refers to the file dropped into $WORKSPACE_DIR by the mods/fix-qwen3.6-chat-template mod (cp chat_template.jinja $WORKSPACE_DIR/unsloth.jinja). If you don't apply that mod, either pass a path to your own jinja or drop the flag.
- --trust-remote-code is required because the architecture is Qwen3_5MoeForConditionalGeneration (multimodal wrapper).
- --moe-backend flashinfer_cutlass is important — Marlin on NVFP4-MoE has path gaps that can misfire on MTP.
- --speculative-config JSON must be single-quoted at the shell so the {"..."} reaches vLLM intact.
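The quoting point is easy to check without starting a server — a minimal sketch (python3 is only used here as a JSON checker, vLLM is not involved) showing that single quotes deliver the braces intact:

```shell
# Single quotes stop the shell from interpreting the braces and inner double
# quotes, so the consumer receives the speculative-config JSON exactly as written.
cfg='{"method":"mtp","num_speculative_tokens":3}'
echo "$cfg"
# Round-trip through a JSON parser to confirm nothing was mangled.
echo "$cfg" | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["method"], d["num_speculative_tokens"])'
# prints: mtp 3
```

With double quotes (or none) the shell may eat the inner quotes before vLLM ever sees them, which typically surfaces as a JSON parse error at startup.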
Good luck sir,
Using your flags and env config, it does work!
Now I just have to revert to my original recipe and work out which flag was the breaking one.
Huge thank you for both the quant and the time spent helping here 💝
From tweaking my original recipe:
Either export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass" or --moe-backend flashinfer_cutlass is needed to avoid the error on startup.
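For anyone hitting the same startup error, the two equivalent fixes as a sketch (the variable and flag names are taken from this thread, not verified against vLLM docs):

```shell
# Fix A: select the NVFP4 GEMM backend via the environment before serving.
export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
echo "$VLLM_NVFP4_GEMM_BACKEND"
# prints: flashinfer-cutlass

# Fix B: alternatively, add the MoE backend flag to the serve command:
#   vllm serve ... --moe-backend flashinfer_cutlass
```

In the recipe format above, Fix A would go into the `env:` list instead of leaving it empty.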
