Quant HAS issues + results with vLLM on 8x 3090
Excellent quant, thank you cyankiwi!
Got it working on 8x RTX 3090 (I know, overkill for this small dense model) with the command listed below.
Average speed: 55-60 tok/s
===================================
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=6
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=SYS

vllm serve cyankiwi/gemma-4-31B-it-AWQ-8bit \
    --served-model-name "cyankiwi/gemma-4-31B-it-AWQ-8bit" \
    --tensor-parallel-size 8 \
    --max-model-len 192768 \
    --gpu-memory-utilization 0.85 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 2048 \
    --tool-call-parser gemma4 \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --host 0.0.0.0 --port 5005 \
    --disable-uvicorn-access-log \
    --limit-mm-per-prompt '{"image":4}' \
    --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":64}' \
    --trust-remote-code \
    --enable-prefix-caching \
    --disable-custom-all-reduce
===================================
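For a quick sanity check of the endpoint outside Opencode, this minimal sketch builds an OpenAI-compatible chat payload that mirrors the sampling params from `--override-generation-config` above (the prompt and `max_tokens` value are just my example; it assumes the server above is listening on localhost:5005):

```python
import json

# Payload for POST http://localhost:5005/v1/chat/completions
# (send it with curl or requests). temperature/top_p/top_k mirror
# the --override-generation-config values from the serve command;
# vLLM's OpenAI-compatible server accepts top_k as an extra field.
payload = {
    "model": "cyankiwi/gemma-4-31B-it-AWQ-8bit",
    "messages": [
        {
            "role": "user",
            "content": (
                "Create a simple flask application with a simple "
                "HTML, CSS and JS frontend. It should manage todos."
            ),
        }
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_tokens": 512,  # example cap, not from the serve command
}
body = json.dumps(payload)
```

If the raw completion from this request loops the same way, the problem is in the model/quant/template rather than in Opencode's handling.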
However, the output quality seems really off (with Opencode). I asked it to "Create a simple flask application with a simple HTML, CSS and JS frontend. It should manage todos." and it loops indefinitely, doubling up on HTML tags (tag names plus "<<"). Not sure if it's the quant or... the model...
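The looping is easy to flag programmatically. A minimal sketch (my own helper, not part of vLLM or Opencode) that detects the kind of degenerate repetition described above by checking whether the tail of the generated text is the same short chunk repeated:

```python
def looks_degenerate(text: str, chunk: int = 8, repeats: int = 6) -> bool:
    """Return True if the last `chunk * repeats` characters of `text`
    are one `chunk`-length piece repeated `repeats` times in a row.
    chunk/repeats are arbitrary defaults; tune them for your outputs."""
    tail = text[-chunk * repeats:]
    if len(tail) < chunk * repeats:
        return False  # too short to judge
    piece = tail[:chunk]
    return tail == piece * repeats

# The doubled-tag loop pattern trips it; normal short text does not.
print(looks_degenerate("<html><<" * 10))  # prints True
print(looks_degenerate("hello world"))    # prints False
```

A check like this in a streaming callback lets you abort a runaway generation early instead of waiting for `max_tokens`.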
Which version of Opencode do you have? In my case the model doesn't work in Build mode; I suspect Opencode can't parse its chat template.
Thank you for raising this with me. Google updated the chat template two days ago, and I've just pulled the change into this model. Please try again and let me know if it still occurs.

