Thanks for a quant that keeps attention at full precision!
Many quants just quantize everything uniformly, so thank you for taking the time to preserve the attention layers!
I got the model running with vllm using the following command.
vllm serve \
/models/cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \
--served-model-name Qwen3.5-122B-A10B \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--host 127.0.0.1 \
--port 8090
However, adding --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' causes vllm to crash.
Are you able to run vllm with MTP? If so, could you share your command line?
Here's what's currently working for me:
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
uv run vllm serve \
--model ~/models/cyankiwi-Qwen3.5-122B-A10B-AWQ-4bit \
--served-model-name Qwen3.5-122B-A10B \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--gpu-memory-utilization 0.95 \
--swap-space 16 \
--max-num-seqs 32 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--language-model-only \
--enable-expert-parallel \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen35_coder \
--host 127.0.0.1 \
--port 15647
I used the vllm/transformers installation as described here.
Also, I had a lot of issues with tool calling, so I applied this patch: https://github.com/vllm-project/vllm/pull/35347
No, it's working. I'm getting around 120 t/sec, and tool calls are working fine in OpenCode.
Thanks for linking to the qwen35_coder tool-call-parser. I had no luck getting speculative-config to run on ROCm; I'll have to fiddle more with the vllm installation to get it working.
I was debugging MTP issues and noticed that the MTP attention layers were not excluded from quantization, so they were quantized to INT4:
- mtp.layers.0.self_attn.q_proj
- mtp.layers.0.self_attn.k_proj
- mtp.layers.0.self_attn.v_proj
- mtp.layers.0.self_attn.o_proj
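For anyone hitting the same thing, one way to catch this before re-quantizing is to match your intended exclusion patterns against the checkpoint's layer names. This is just an illustrative sketch: the pattern strings and the should_quantize helper are hypothetical and not taken from the actual quant config of this repo.

```python
import fnmatch

# Hypothetical exclusion patterns: keep the MTP attention projections
# (and the regular attention layers) out of INT4 quantization.
EXCLUDE_PATTERNS = [
    "mtp.layers.*.self_attn.*",
    "model.layers.*.self_attn.*",
]

def should_quantize(layer_name: str) -> bool:
    """Return False for layers that must stay at BF16."""
    return not any(fnmatch.fnmatch(layer_name, p) for p in EXCLUDE_PATTERNS)

# The four MTP projections listed above should all be excluded.
mtp_layers = [
    "mtp.layers.0.self_attn.q_proj",
    "mtp.layers.0.self_attn.k_proj",
    "mtp.layers.0.self_attn.v_proj",
    "mtp.layers.0.self_attn.o_proj",
]
for name in mtp_layers:
    assert not should_quantize(name)

# An expert MLP projection, by contrast, is still quantized.
assert should_quantize("model.layers.0.mlp.experts.0.gate_proj")
```

Running a check like this over every tensor name in the safetensors index would have flagged the four mtp.* projections immediately.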
I noticed that you have updated three files on Hugging Face. Could you please let me know whether the update will improve the model's intelligence or its speed? Thank you.
Yes, the MTP attention layers were initially quantized, which was a mistake.
I updated the MTP attention layers to BF16 so that speculative decoding works, and to slightly increase the draft acceptance rate.
Thank you! Still no luck with MTP on ROCm, looks like it's just not compatible at this time.