Use in OpenCode via vLLM: Endless Tool Calling Loops & undesirable behavior
I'm using this together with OpenCode. I just tested it quickly, and it unfortunately keeps calling the same command (stuck in an endless loop).
On a second try, after giving it clear instructions for what to implement, it sort of hallucinates todo steps and just checks them off. So it's really not behaving the way I'd hope.
By the way, this may be comparing apples to oranges since it's not a MoE and has a different architecture, but I haven't seen this with your qwen3.5-27b-AWQ-BF16-INT4 model.
I'm using `vllm/vllm-openai:gemma4-cu130`, and here's my vLLM YAML config:
```yaml
name: gemma-4-26B-A4B-it-AWQ-4bit
description: >
  Gemma 4 26B MoE (3.8B active) AWQ INT4. Apache 2.0 license. Native function
  calling, thinking mode, vision. 128 experts, top-8 routing. Frontier-level
  reasoning and coding per size class.
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
container_image: vllm/vllm-openai:gemma4-cu130
environment:
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '8'
  SAFETENSORS_FAST_GPU: '1'
  CUDA_MODULE_LOADING: LAZY
  VLLM_LOGGING_LEVEL: INFO
  TORCH_COMPILE_THREADS: '4'
  TORCHINDUCTOR_COMPILE_THREADS: '4'
  MAX_JOBS: '4'
serve_args:
  host: 0.0.0.0
  port: 8000
  trust-remote-code: true
  dtype: auto
  gpu-memory-utilization: 0.9
  tensor-parallel-size: 1
  kv-cache-dtype: auto
  max-model-len: 120000
  max-num-seqs: 4
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  performance-mode: throughput
  compilation-config: '{"cudagraph_mode": "piecewise", "cudagraph_capture_sizes": [1, 2, 3, 4], "inductor_compile_config": {"combo_kernels": false, "benchmark_combo_kernel": false}}'
  reasoning-parser: gemma4
  enable-auto-tool-choice: true
  tool-call-parser: gemma4
  served-model-name: gemma-4-26B-A4B-it-AWQ-4bit
```
I'm seeing looping and failing tool calls as well. I've been back and forth with Gemini and Copilot trying to get a Jinja template that will make this work, but no luck yet.
I should note, the tools I'm using are OpenClaw and Anthropic's Claude Code.
Edit: I had a Jinja template here that I thought fixed it, but it doesn't.
Not sure if it's this model's issue or vLLM's gemma4 parsers. Either way, it's sadly unusable and I'm back to LM Studio and qwen3-coder-next.
Should have RTFM'd, I guess.
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch
This page mentions a special Jinja template to grab from vLLM's GitHub.
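For anyone wiring that template into a YAML setup like the one above, it would look something like this (the local path and filename are assumptions; use the actual file the recipe page links to):

```yaml
serve_args:
  # ... flags as above ...
  # Point vLLM at the Gemma 4 template from the recipe docs instead of the
  # one bundled in the model repo (local path is hypothetical):
  chat-template: /models/templates/gemma4_chat_template.jinja
```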
Very initial testing is MUCH better.
Maybe it has something to do with this commit in the official repo that was made yesterday: https://huggingface.co/google/gemma-4-26B-A4B-it/commit/1db3cff1840c2ae59759d8e842ff37831cf8cb63
I've tried the vLLM Jinja (modified 2 days ago) and the official Google Jinja (modified 1 day ago), and the results are not exactly good. Better, but not good.
Maybe it's OpenClaw's fault, but either way the combination can't handle it.
gemma-4-31b, on the other hand, CAN handle anything I ask of it through OpenClaw, but I can't fit that in my GPU, so I'm running it from Google's free-tier API. That's probably running at full precision AND it's a smarter model, so it's not exactly a fair comparison.
Thank you for letting me know. Could you please try again with the recent chat_template.jinja and tokenizer_config.json?
I will look into this and potentially requantize using the new chat_template.jinja and tokenizer_config.json.
Thanks cpatonn. This seems to be an issue with llama.cpp as well (or at least it was), so I don't think it's just a YOU thing.
I have cleared my ~/.cache/huggingface dir for this model so that it pulls down the latest changes you mentioned, and I removed my custom flag for the other Jinja template.
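For anyone else doing the same: the hub cache keys each repo under a `models--<org>--<name>` directory, so you can purge just this model instead of the whole cache. A minimal sketch, using the repo id from the config above (the `rm` is left commented out so nothing is deleted by accident):

```shell
# Hugging Face hub cache layout: "/" in the repo id becomes "--",
# and the directory is prefixed with "models--".
REPO="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit"
CACHE_DIR="$HOME/.cache/huggingface/hub/models--${REPO//\//--}"
echo "$CACHE_DIR"
# rm -rf "$CACHE_DIR"   # uncomment to force a fresh download on next load
```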
I will test it this evening and tomorrow and let you know how it works out.
Sadly it seems to not have helped.
Tool calls are being made incorrectly still.
Even simple ones through OpenClaw, like "cat the contents of file x to the chat".
I don't have much luck with this model in GGUF format in LM Studio either... Maybe it's just not smart enough for OpenClaw, OR there's some big Jinja issue Google hasn't fixed.
Who knows.