Quant HAS issues + results with vLLM on 8x 3090

#1
by dehnhaide - opened

Excellent quant, thank you cyankiwi!

Got it working on 8x RTX 3090 (I know, overkill for this small dense model) with the command listed below.
Average speed: 55-60 tok/s

===================================
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=6
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=SYS

vllm serve cyankiwi/gemma-4-31B-it-AWQ-8bit --served-model-name "cyankiwi/gemma-4-31B-it-AWQ-8bit" \
  --tensor-parallel-size 8 \
  --max-model-len 192768 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 2048 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --host 0.0.0.0 --port 5005 \
  --disable-uvicorn-access-log \
  --limit-mm-per-prompt '{"image":4}' \
  --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":64}' \
  --trust-remote-code \
  --enable-prefix-caching \
  --disable-custom-all-reduce
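In case it helps anyone reproduce, here's a minimal sketch of hitting the endpoint above with an OpenAI-style chat request (port 5005 and the served model name come from the command; the sampling values mirror --override-generation-config; the actual send is commented out so nothing here assumes a running server):

```python
import json

# Values taken from the vllm serve command above.
BASE_URL = "http://localhost:5005/v1/chat/completions"
MODEL = "cyankiwi/gemma-4-31B-it-AWQ-8bit"

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": "Create a simple flask application with a simple "
                       "HTML, CSS and JS frontend. It should manage todos.",
        }
    ],
    # Mirrors --override-generation-config; vLLM's OpenAI-compatible
    # server also accepts top_k as an extra sampling parameter.
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 512,
}

body = json.dumps(payload).encode("utf-8")
print(f"POST {BASE_URL} ({len(body)} bytes)")

# Uncomment to actually send the request once the server is up:
# from urllib import request
# req = request.Request(BASE_URL, data=body,
#                       headers={"Content-Type": "application/json"})
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```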

However, something is really off with the model's output quality (tested with OpenCode). I asked it to "Create a simple flask application with a simple HTML, CSS and JS frontend. It should manage todos." and it loops indefinitely, duplicating HTML tags like crazy (tag names plus "<<"). Not sure if it's the quant or the model itself.

Screenshot from 2026-04-03 12-47-49

dehnhaide changed discussion title from Excellent quant + results with vLLM on 8x 3090 to Good quant BUT with issues + results with vLLM on 8x 3090
dehnhaide changed discussion title from Good quant BUT with issues + results with vLLM on 8x 3090 to Quant HAS issues + results with vLLM on 8x 3090

Update: I tried the same prompt with llama.cpp and "unsloth/gemma-4-31B-it-UD_Q6_K_XL" and it works flawlessly.
Now I suspect something may be wrong with the quant. Can you please check on your end to see whether I'm just rambling?

image

Which version of OpenCode do you have? In my case the model doesn't work in Build mode. I suppose it's because OpenCode can't parse its chat template.

cyankiwi org

Thank you for raising this with me. Google updated the chat template 2 days ago, and I've just pulled the change into this model. Please try again and let me know if the issue still occurs.
