Thanks for a quant that keeps attention at full precision!
Many quants just quantize everything uniformly, so thank you for taking the time to preserve the attention layers!
I got the model running with vllm using the following command.
vllm serve \
/models/cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \
--served-model-name Qwen3.5-122B-A10B \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--host 127.0.0.1 \
--port 8090
However, adding --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' causes vllm to crash.
Are you able to run vllm with MTP? If so, could you share your command line?
Here's what's currently working for me:
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
uv run vllm serve \
--model ~/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit \
--served-model-name Qwen3.5-122B-A10B \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--swap-space 16 \
--max-num-seqs 32 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--language-model-only \
--enable-expert-parallel \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen35_coder \
--host 127.0.0.1 \
--port 15647
I used the vllm/transformers installation as described here.
Also, I had a lot of issues with tool calling, so I applied this patch: https://github.com/vllm-project/vllm/pull/35347
No, it's working. I'm getting around 120 t/sec, and tool calls are working fine in OpenCode.
Thanks for linking to the qwen35_coder tool-call-parser. I had no luck getting speculative-config to run on ROCm; I'll have to fiddle more with the vllm installation to get it working.
I was debugging MTP issues and noticed that the MTP attention layers were not excluded from quantization, so they were quantized to INT4:
- mtp.layers.0.self_attn.q_proj
- mtp.layers.0.self_attn.k_proj
- mtp.layers.0.self_attn.v_proj
- mtp.layers.0.self_attn.o_proj
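For anyone hitting the same thing, one way to catch this before re-quantizing is to match your intended exclusion patterns against the checkpoint's layer names. This is just an illustrative sketch: the pattern strings and the should_quantize helper are hypothetical and not taken from the actual quant config of this repo.

```python
import fnmatch

# Hypothetical exclusion patterns: keep the MTP attention projections
# (and the regular attention layers) out of INT4 quantization.
EXCLUDE_PATTERNS = [
    "mtp.layers.*.self_attn.*",
    "model.layers.*.self_attn.*",
]

def should_quantize(layer_name: str) -> bool:
    """Return False for layers that must stay at BF16."""
    return not any(fnmatch.fnmatch(layer_name, p) for p in EXCLUDE_PATTERNS)

# The four MTP projections listed above should all be excluded.
mtp_layers = [
    "mtp.layers.0.self_attn.q_proj",
    "mtp.layers.0.self_attn.k_proj",
    "mtp.layers.0.self_attn.v_proj",
    "mtp.layers.0.self_attn.o_proj",
]
for name in mtp_layers:
    assert not should_quantize(name)

# An expert MLP projection, by contrast, is still quantized.
assert should_quantize("model.layers.0.mlp.experts.0.gate_proj")
```

Running a check like this over every tensor name in the safetensors index would have flagged the four mtp.* projections immediately.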
I noticed that you have updated three files on Hugging Face. Could you please let me know whether the update will improve the model's intelligence or its speed? Thank you.
Yes, the MTP attention layers were initially quantized, which was a mistake.
I updated the MTP attention layers to BF16 so that speculative decoding works, and to slightly increase the draft acceptance rate.
Thank you! Still no luck with MTP on ROCm, looks like it's just not compatible at this time.