Broken MTP in newer version

#1
by DColt - opened

Hello.

I used this model yesterday (up to commit 976e620d91cbe5d154a7bd9ee7d9d8b1b4f858bb) and MTP ran perfectly fine.

Starting this morning, with the newest commit (e347d86b2a6cba4b54ea6f87ca247f60439eed07), it's no longer working.

Running on a DGX Spark with the following recipe:

# Recipe: Qwen/Qwen3.6-35B-A3B-FP8
# rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm model in native NVFP4 prism format


recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm

solo_only: true

# Container image to use
container: vllm-node-tf5

# Mods
mods:
  - mods/fix-qwen3.6-chat-template

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.7
  max_model_len: 262144
  max_num_batched_tokens: 32768

# Environment variables
env: []

# The vLLM serve command template
command: |
  vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --trust-remote-code \
    --quantization compressed-tensors \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --load-format fastsafetensors \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --default-chat-template-kwargs '{{"enable_thinking": true, "preserve_thinking": true}}' \
    --generation-config auto \
    --override-generation-config '{{"temperature": 0.7,  "top_p": 0.8, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}}' \
    --enable-prefix-caching

And launched with:

./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

[image: screenshot of the error output]

Will fix ASAP.

Can you try updating again? Also, refresh your vllm/flashinfer. This version runs locally for me with eugr’s latest vllm spark runner.

I am running the latest vllm/flashinfer

Commit hash matches (9e3d8b9d) — wheels are up to date.
All flashinfer wheels are up to date — skipping download.
FlashInfer wheels ready.
Commit hash matches (9a6a66f3b) — wheels are up to date.
All vllm wheels are up to date — skipping download.

It runs fine if I exclude --speculative-config, but with it I still get the same error.
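
For reference, the only difference between the working and failing runs is that one flag; everything else in the recipe is unchanged:

# works: no speculative decoding
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10

# fails on startup: MTP enabled
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'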


What recipe do you use, in case it's something in my setup?

Hi friend,

I just validated this end-to-end with a fresh download and Docker image.

Here's the full stack from the vLLM image that just validated the HF model end-to-end with MTP:

┌────────────────────┬─────────────────────────────────────────────────────────────────────────────────┐
│ Component │ Version │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ vLLM │ 0.19.2rc1.dev86+g9a6a66f3b.d20260421.cu132 (commit 9a6a66f3b, built 2026-04-21) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ PyTorch │ 2.11.0+cu130 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Transformers │ 5.5.4 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Tokenizers │ 0.22.2 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ FlashInfer │ 0.6.8 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ compressed-tensors │ 0.15.0.1 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Triton │ 3.6.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ fastsafetensors │ 0.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ safetensors │ 0.7.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ huggingface-hub │ 1.11.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ accelerate │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ xformers │ (not installed) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Ray │ 2.55.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ Python │ 3.12.3 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (nvcc) │ 13.2, V13.2.51 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ CUDA (torch links) │ 13.0 │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ cuDNN │ 91900 (9.19.0) │
├────────────────────┼─────────────────────────────────────────────────────────────────────────────────┤
│ GPU │ NVIDIA GB10 (DGX Spark, Grace-Blackwell) │
└────────────────────┴─────────────────────────────────────────────────────────────────────────────────┘
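
If you want to compare your stack against this table, a rough check along these lines is usually enough (the grep terms are just a convenience, adjust as needed):

pip list 2>/dev/null | grep -Ei 'vllm|torch|transformers|tokenizers|flashinfer|compressed-tensors|triton|fastsafetensors|safetensors|huggingface|ray'
python -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"
nvcc --version | tail -n 1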

Runtime env set by the recipe:

  • VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • TRITON_CACHE_DIR=/tmp/triton_cache

vLLM serve flags (critical ones):

  • --attention-backend flashinfer
  • --moe-backend flashinfer_cutlass
  • --kv-cache-dtype fp8
  • --load-format fastsafetensors
  • --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Environment

export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export TRITON_CACHE_DIR="/tmp/triton_cache"

Serve

vllm serve rdtand/Qwen3.6-35B-A3B-PrismQuant-4.75bit-vllm \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 1 \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --moe-backend flashinfer_cutlass \
  --enable-prefix-caching \
  --load-format fastsafetensors \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --chat-template unsloth.jinja \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
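
Once the server is up, a loose grep of the metrics endpoint confirms speculative decoding is actually active (exact metric names vary between vLLM versions, so I wouldn't hard-code them):

curl -s http://localhost:8000/metrics | grep -i spec

If MTP is running you should see spec-decode counters ticking up as requests come in.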

Notes for whoever's reproducing it:

  • --chat-template unsloth.jinja refers to the file dropped into $WORKSPACE_DIR by the mods/fix-qwen3.6-chat-template mod (cp chat_template.jinja $WORKSPACE_DIR/unsloth.jinja). If you don't apply that mod, either pass a path to your own jinja or drop the flag.
  • --trust-remote-code is required because the architecture is Qwen3_5MoeForConditionalGeneration (multimodal wrapper).
  • --moe-backend flashinfer_cutlass is important — Marlin on NVFP4-MoE has path gaps that can misfire on MTP.
  • --speculative-config JSON must be single-quoted at the shell so the {"..."} reaches vLLM intact.
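
On that last point, printf is a quick way to see what the shell actually hands over after quoting (it just echoes each argument on its own line):

printf '%s\n' --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# --speculative-config
# {"method":"mtp","num_speculative_tokens":3}

With single quotes the JSON arrives intact; double-quoting or dropping the quotes lets the shell mangle the braces and inner quotes before vLLM ever sees them.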

Good luck sir,

With your flags and env config, it works!

Now I just have to revert to my original recipe and work out which flag was the breaking one.

Huge thank you for both the quant and the time taken to help here 💝

From tweaking my original recipe: either export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass" or --moe-backend flashinfer_cutlass is needed to avoid the error on startup.
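
Concretely, these are the two launch variants that avoid the startup error (Option A assumes the exported variable actually reaches the vLLM process inside the container, e.g. via the recipe's env: list; Option B assumes run-recipe.sh forwards the flag to vllm serve the same way it forwards --speculative-config):

# Option A: set the NVFP4 GEMM backend via environment variable
export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

# Option B: select the MoE backend explicitly on the command line
./run-recipe.sh qwen3.6-35b-a3b-nvfp4 --port 8081 --solo --served-model Qwen3.6 --max_num_seqs 10 --moe-backend flashinfer_cutlass --speculative-config '{"method":"mtp","num_speculative_tokens":3}'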
