| # The Critical Context Length Fix |
|
|
**This is the #1 issue people hit when deploying tool calling with vLLM.**
|
|
| ## The Problem |
|
|
vLLM deployments often default to context windows of 16K-32K tokens. This seems fine for normal chat, but tool calling needs significantly more context:
|
|
| ``` |
| System prompt: 3,000 - 5,000 tokens |
| Tool definitions (per tool): 500 - 2,000 tokens |
× 5-10 tools: 2,500 - 20,000 tokens
| User message: 100 - 1,000 tokens |
| Previous conversation: 1,000 - 10,000 tokens |
| Tool responses: 2,000 - 20,000 tokens |
| Safety margin: 5,000 tokens |
─────────────────────────────────────────────────
| Total needed: 13,600 - 61,000 tokens |
| ``` |
|
|
With a 16K context window, the model runs out of space mid-generation. The tool call gets **silently truncated**: you see incomplete JSON, missing arguments, or the model suddenly stops generating.
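You can sanity-check a request against your budget before sending it with a rough character-count estimate. This is a minimal sketch assuming ~4 characters per token, a common rule of thumb; for anything load-bearing, count with the model's actual tokenizer.

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary


def rough_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN


def request_tokens(system_prompt: str, tools: list[dict],
                   messages: list[dict], safety_margin: int = 5_000) -> int:
    """Back-of-envelope token budget for one tool-calling request."""
    total = rough_tokens(system_prompt)
    total += sum(rough_tokens(json.dumps(t)) for t in tools)  # tool schemas
    total += sum(rough_tokens(m.get("content") or "") for m in messages)
    return total + safety_margin
```

If the estimate comes anywhere near your `--max-model-len`, expect truncation.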
|
|
| ## Symptoms |
|
|
| - Tool calls end mid-JSON: `{"name": "get_weather", "arguments": {"loc` |
| - Model stops generating after the first tool call in a multi-step workflow |
| - Second or third tool call in a conversation is always malformed |
| - Model "forgets" tool definitions and responds with plain text |
| - Works fine with 1-2 tools but fails with 5+ |
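If you suspect truncation, you can check for it programmatically. A minimal sketch, assuming you call the server through the `openai` Python client pointed at the vLLM endpoint:

```python
import json


def is_truncated_tool_call(response) -> bool:
    """Detect the truncation symptoms above in a chat completion."""
    choice = response.choices[0]
    # finish_reason "length" means generation hit the token limit
    # instead of stopping naturally.
    if choice.finish_reason == "length":
        return True
    # A tool call cut off mid-JSON fails to parse.
    for call in choice.message.tool_calls or []:
        try:
            json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return True
    return False
```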
|
|
| ## The Fix |
|
|
```bash
# BEFORE (broken): default context is too small for tool calling
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --max-model-len 16384

# AFTER (working): 128K (the model's full supported context),
# with concurrency reduced so the KV cache fits
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --max-model-len 131072 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 132000 \
    --gpu-memory-utilization 0.90
```
|
|
| ## Memory Math |
|
|
| The tradeoff is between context length and concurrent requests: |
|
|
| ### 70B FP8 Model on 96GB GPU |
|
|
| ``` |
| Model weights (FP8): ~40 GB |
| Available for KV cache: ~46 GB (at 0.90 utilization) |
| |
| KV cache per token per request: |
| 70B model β 0.5 KB/token |
| |
| Cost per concurrent request at 128K context: |
| 128,000 Γ 0.5 KB = 64 MB per layer Γ 80 layers β 5 GB |
| |
| Max concurrent requests: |
| 46 GB Γ· ~11.5 GB/request β 4 requests |
| |
| β Use --max-num-seqs 4 |
| ``` |
|
|
| ### 12B FP8 Model on 96GB GPU |
|
|
| ``` |
| Model weights (FP8): ~15 GB |
| Available for KV cache: ~71 GB |
| |
| β Use --max-num-seqs 8 (or more) |
| ``` |
|
|
| ## Concurrency vs Context Tradeoff |
|
|
| | Context Length | Max Seqs (70B) | Max Seqs (12B) | Tool Calling | |
| |---------------|---------------|---------------|-------------| |
| | 16K | 16 | 32+ | Broken for multi-tool | |
| | 32K | 8 | 16 | Marginal | |
| | 64K | 6 | 12 | Good for simple workflows | |
| | **128K** | **4** | **8** | **Reliable for complex workflows** | |
|
|
| **Recommendation:** Always use 128K. The reduced concurrency is worth it. If you need more throughput, use a smaller model rather than reducing context. |
|
|
| ## Why This Isn't Obvious |
|
|
1. vLLM doesn't warn you when context is too small; it just generates truncated output
| 2. The default `max-model-len` varies by model and may not match the model's actual capability |
| 3. Simple tests with 1-2 tools often pass even at 16K, so the issue only appears in production |
| 4. The truncation looks like a model quality issue, not a configuration issue |
|
|
| ## Verification |
|
|
| After increasing context, verify with: |
|
|
| ```bash |
| curl http://localhost:8000/v1/models | python -m json.tool |
| ``` |
|
|
| Check the `max_model_len` field in the response to confirm it's set to 131072. |
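For a scriptable version of the same check (assumes the server from "The Fix" is listening on localhost:8000):

```python
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as r:
    models = json.load(r)

# vLLM includes max_model_len in each served model card.
served = models["data"][0]
assert served["max_model_len"] == 131072, served["max_model_len"]
print(f"{served['id']}: max_model_len={served['max_model_len']}")
```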
|
|