Quant HAS issues + results with vLLM on 8x 3090
Excellent quant, thank you cyankiwi!
Got it working on 8x RTX 3090 (I know, overkill for this small dense model) with the command listed below.
Average speed: 55-60 tok/s
===================================
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=6
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=SYS

vllm serve cyankiwi/gemma-4-31B-it-AWQ-8bit \
    --served-model-name "cyankiwi/gemma-4-31B-it-AWQ-8bit" \
    --tensor-parallel-size 8 \
    --max-model-len 192768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 2048 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --host 0.0.0.0 --port 5005 \
    --disable-uvicorn-access-log \
    --limit-mm-per-prompt '{"image":4}' \
    --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":64}' \
    --trust-remote-code \
    --enable-prefix-caching \
    --disable-custom-all-reduce
===================================
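For a quick sanity check of the endpoint outside Opencode, this minimal sketch builds an OpenAI-compatible chat payload that mirrors the sampling params from `--override-generation-config` above (the prompt and `max_tokens` value are just my example; it assumes the server above is listening on localhost:5005):

```python
import json

# Payload for POST http://localhost:5005/v1/chat/completions
# (send it with curl or requests). temperature/top_p/top_k mirror
# the --override-generation-config values from the serve command;
# vLLM's OpenAI-compatible server accepts top_k as an extra field.
payload = {
    "model": "cyankiwi/gemma-4-31B-it-AWQ-8bit",
    "messages": [
        {
            "role": "user",
            "content": (
                "Create a simple flask application with a simple "
                "HTML, CSS and JS frontend. It should manage todos."
            ),
        }
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 512,  # example cap, not from the serve command
}
body = json.dumps(payload)
```

If the raw completion from this request loops the same way, the problem is in the model/quant/template rather than in Opencode's handling.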
However, the output quality seems really off (with Opencode). I asked it to "Create a simple flask application with a simple HTML, CSS and JS frontend. It should manage todos." and it loops indefinitely, doubling up on HTML tags (tag names plus "<<"). Not sure if it's the quant or... the model...
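The looping is easy to flag programmatically. A minimal sketch (my own helper, not part of vLLM or Opencode) that detects the kind of degenerate repetition described above by checking whether the tail of the generated text is the same short chunk repeated:

```python
def looks_degenerate(text: str, chunk: int = 8, repeats: int = 6) -> bool:
    """Return True if the last `chunk * repeats` characters of `text`
    are one `chunk`-length piece repeated `repeats` times in a row.
    chunk/repeats are arbitrary defaults; tune them for your outputs."""
    tail = text[-chunk * repeats:]
    if len(tail) < chunk * repeats:
        return False  # too short to judge
    piece = tail[:chunk]
    return tail == piece * repeats

# The doubled-tag loop pattern trips it; normal short text does not.
print(looks_degenerate("<html><<" * 10))  # prints True
print(looks_degenerate("hello world"))    # prints False
```

A check like this in a streaming callback lets you abort a runaway generation early instead of waiting for `max_tokens`.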
Which version of Opencode do you have? In my case the model doesn't work in Build mode; I suspect Opencode can't parse its chat template.
Thank you for raising this with me. Google updated the chat template two days ago, and I've just pulled the change into this model. Please try again and let me know if it still occurs.

