| # The Critical Context Length Fix |
|
|
**This is the #1 issue people hit when deploying tool calling with vLLM.**
|
|
| ## The Problem |
|
|
vLLM deployments often default to context windows of 16K-32K tokens. This seems fine for normal chat, but tool calling needs significantly more context:
|
|
| ``` |
| System prompt: 3,000 - 5,000 tokens |
| Tool definitions (per tool): 500 - 2,000 tokens |
× 5-10 tools: 2,500 - 20,000 tokens
| User message: 100 - 1,000 tokens |
| Previous conversation: 1,000 - 10,000 tokens |
| Tool responses: 2,000 - 20,000 tokens |
| Safety margin: 5,000 tokens |
─────────────────────────────────────────────────
| Total needed: 13,600 - 61,000 tokens |
| ``` |
|
|
With a 16K context window, the model runs out of space mid-generation. The tool call gets **silently truncated**: you see incomplete JSON, missing arguments, or the model suddenly stops generating.
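You can sanity-check a request against your budget before sending it with a rough character-count estimate. This is a minimal sketch assuming ~4 characters per token, a common rule of thumb; for anything load-bearing, count with the model's actual tokenizer.

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary


def rough_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN


def request_tokens(system_prompt: str, tools: list[dict],
                   messages: list[dict], safety_margin: int = 5_000) -> int:
    """Back-of-envelope token budget for one tool-calling request."""
    total = rough_tokens(system_prompt)
    total += sum(rough_tokens(json.dumps(t)) for t in tools)  # tool schemas
    total += sum(rough_tokens(m.get("content") or "") for m in messages)
    return total + safety_margin
```

If the estimate comes anywhere near your `--max-model-len`, expect truncation.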
|
|
| ## Symptoms |
|
|
| - Tool calls end mid-JSON: `{"name": "get_weather", "arguments": {"loc` |
| - Model stops generating after the first tool call in a multi-step workflow |
| - Second or third tool call in a conversation is always malformed |
| - Model "forgets" tool definitions and responds with plain text |
| - Works fine with 1-2 tools but fails with 5+ |
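If you suspect truncation, you can check for it programmatically. A minimal sketch, assuming you call the server through the `openai` Python client pointed at the vLLM endpoint:

```python
import json


def is_truncated_tool_call(response) -> bool:
    """Detect the truncation symptoms above in a chat completion."""
    choice = response.choices[0]
    # finish_reason "length" means generation hit the token limit
    # instead of stopping naturally.
    if choice.finish_reason == "length":
        return True
    # A tool call cut off mid-JSON fails to parse.
    for call in choice.message.tool_calls or []:
        try:
            json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return True
    return False
```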
|
|
| ## The Fix |
|
|
```bash
# BEFORE (broken): default context is too small for tool calling
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --max-model-len 16384

# AFTER (working): 128K (the model's full supported context),
# with concurrency reduced so the KV cache fits
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --max-model-len 131072 \
    --max-num-seqs 4 \
    --max-num-batched-tokens 132000 \
    --gpu-memory-utilization 0.90
```
|
|
| ## Memory Math |
|
|
| The tradeoff is between context length and concurrent requests: |
|
|
| ### 70B FP8 Model on 96GB GPU |
|
|
| ``` |
| Model weights (FP8): ~40 GB |
| Available for KV cache: ~46 GB (at 0.90 utilization) |
| |
| KV cache per token per request: |
| 70B model β 0.5 KB/token |
| |
| Cost per concurrent request at 128K context: |
| 128,000 Γ 0.5 KB = 64 MB per layer Γ 80 layers β 5 GB |
| |
| Max concurrent requests: |
| 46 GB Γ· ~11.5 GB/request β 4 requests |
| |
| β Use --max-num-seqs 4 |
| ``` |
|
|
| ### 12B FP8 Model on 96GB GPU |
|
|
| ``` |
| Model weights (FP8): ~15 GB |
| Available for KV cache: ~71 GB |
| |
| β Use --max-num-seqs 8 (or more) |
| ``` |
|
|
| ## Concurrency vs Context Tradeoff |
|
|
| | Context Length | Max Seqs (70B) | Max Seqs (12B) | Tool Calling | |
| |---------------|---------------|---------------|-------------| |
| | 16K | 16 | 32+ | Broken for multi-tool | |
| | 32K | 8 | 16 | Marginal | |
| | 64K | 6 | 12 | Good for simple workflows | |
| | **128K** | **4** | **8** | **Reliable for complex workflows** | |
|
|
| **Recommendation:** Always use 128K. The reduced concurrency is worth it. If you need more throughput, use a smaller model rather than reducing context. |
|
|
| ## Why This Isn't Obvious |
|
|
1. vLLM doesn't warn you when context is too small; it just generates truncated output
| 2. The default `max-model-len` varies by model and may not match the model's actual capability |
| 3. Simple tests with 1-2 tools often pass even at 16K, so the issue only appears in production |
| 4. The truncation looks like a model quality issue, not a configuration issue |
|
|
| ## Verification |
|
|
| After increasing context, verify with: |
|
|
| ```bash |
| curl http://localhost:8000/v1/models | python -m json.tool |
| ``` |
|
|
| Check the `max_model_len` field in the response to confirm it's set to 131072. |
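For a scriptable version of the same check (assumes the server from "The Fix" is listening on localhost:8000):

```python
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as r:
    models = json.load(r)

# vLLM includes max_model_len in each served model card.
served = models["data"][0]
assert served["max_model_len"] == 131072, served["max_model_len"]
print(f"{served['id']}: max_model_len={served['max_model_len']}")
```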
|
|