Use in OpenCode via vLLM: Endless Tool Calling Loops & undesirable behavior
I'm using this together with OpenCode. I just tested it quickly, and it unfortunately keeps calling the same command (stuck in an endless loop).
On a second try, after giving it clear instructions for what to implement, it sort of hallucinates todo steps and just checks them off. So it's really not behaving the way I'd hope.
By the way, this may be comparing apples to oranges since it's not a MoE and has a different architecture, but I haven't seen this with your qwen3.5-27b-AWQ-BF16-INT4 model.
I'm using `vllm/vllm-openai:gemma4-cu130`, and here's my vLLM YAML config:
```yaml
name: gemma-4-26B-A4B-it-AWQ-4bit
description: >
  Gemma 4 26B MoE (3.8B active) AWQ INT4. Apache 2.0 license. Native function
  calling, thinking mode, vision. 128 experts, top-8 routing. Frontier-level
  reasoning and coding per size class.
model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
container_image: vllm/vllm-openai:gemma4-cu130
environment:
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  OMP_NUM_THREADS: '8'
  SAFETENSORS_FAST_GPU: '1'
  CUDA_MODULE_LOADING: LAZY
  VLLM_LOGGING_LEVEL: INFO
  TORCH_COMPILE_THREADS: '4'
  TORCHINDUCTOR_COMPILE_THREADS: '4'
  MAX_JOBS: '4'
serve_args:
  host: 0.0.0.0
  port: 8000
  trust-remote-code: true
  dtype: auto
  gpu-memory-utilization: 0.9
  tensor-parallel-size: 1
  kv-cache-dtype: auto
  max-model-len: 120000
  max-num-seqs: 4
  max-num-batched-tokens: 16384
  enable-prefix-caching: true
  performance-mode: throughput
  compilation-config: '{"cudagraph_mode": "piecewise", "cudagraph_capture_sizes": [1, 2, 3, 4], "inductor_compile_config": {"combo_kernels": false, "benchmark_combo_kernel": false}}'
  reasoning-parser: gemma4
  enable-auto-tool-choice: true
  tool-call-parser: gemma4
  served-model-name: gemma-4-26B-A4B-it-AWQ-4bit
```
I'm seeing looping and failing tool calls as well. I've been back and forth with Gemini and Copilot trying to get a Jinja template that will make this work, but no luck yet.
I should note, the tools I'm using are OpenClaw and Anthropic's Claude Code.
Edit: I had a Jinja template here that I thought fixed it, but it doesn't.
Not sure if it's this model's issue or vLLM's gemma4 parsers. Either way, it's sadly unusable and I'm back to LM Studio and qwen3-coder-next.
Should have RTFM'd, I guess.
https://docs.vllm.ai/projects/recipes/en/latest/Google/Gemma4.html#full-featured-server-launch
This page mentions a special Jinja template to grab from vLLM's GitHub.
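For anyone wiring that template into a YAML setup like the one above, it would look something like this (the local path and filename are assumptions; use the actual file the recipe page links to):

```yaml
serve_args:
  # ... flags as above ...
  # Point vLLM at the Gemma 4 template from the recipe docs instead of the
  # one bundled in the model repo (local path is hypothetical):
  chat-template: /models/templates/gemma4_chat_template.jinja
```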
Very initial testing is MUCH better.
Maybe it has something to do with this commit in the official repo that was made yesterday: https://huggingface.co/google/gemma-4-26B-A4B-it/commit/1db3cff1840c2ae59759d8e842ff37831cf8cb63
I've tried the vLLM Jinja (modified 2 days ago) and the official Google Jinja (modified 1 day ago), and the results are not exactly good. Better, but not good.
Maybe it's OpenClaw's fault, but either way the combination can't handle it.
gemma-4-31b, on the other hand, CAN handle anything I ask of it through OpenClaw, but I can't fit that in my GPU, so I'm running it from Google's free-tier API. That's probably running at full precision AND it's a smarter model, so it's not exactly a fair comparison.
Thank you for letting me know. Could you please try again with the recent chat_template.jinja and tokenizer_config.json?
I will look into this and potentially requantize using the new chat_template.jinja and tokenizer_config.json.
Thanks cpatonn. This seems to be an issue with llama.cpp as well (or at least it was), so I don't think it's just a YOU thing.
I have cleared my ~/.cache/huggingface dir for this model so that it pulls down the latest changes you mentioned, and I removed my custom flag for the other Jinja template.
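For anyone else doing the same: the hub cache keys each repo under a `models--<org>--<name>` directory, so you can purge just this model instead of the whole cache. A minimal sketch, using the repo id from the config above (the `rm` is left commented out so nothing is deleted by accident):

```shell
# Hugging Face hub cache layout: "/" in the repo id becomes "--",
# and the directory is prefixed with "models--".
REPO="cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit"
CACHE_DIR="$HOME/.cache/huggingface/hub/models--${REPO//\//--}"
echo "$CACHE_DIR"
# rm -rf "$CACHE_DIR"   # uncomment to force a fresh download on next load
```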
I will test it this evening and tomorrow and let you know how it works out.
Sadly it seems to not have helped.
Tool calls are being made incorrectly still.
Even simple ones through OpenClaw, like "cat the contents of file x to the chat".
I don't have much luck with this model in GGUF format in LM Studio either... Maybe it's just not smart enough for OpenClaw, OR there's some big Jinja issue Google hasn't fixed.
Who knows.