# The Critical Context Length Fix

**This is the #1 issue people hit when deploying tool calling with VLLM.**

## The Problem

VLLM often defaults to context windows of 16K-32K tokens. This seems fine for normal chat, but tool calling needs significantly more context:

```
System prompt:                  3,000 - 5,000 tokens
Tool definitions (per tool):    500 - 2,000 tokens
  × 5-10 tools:                 2,500 - 20,000 tokens
User message:                   100 - 1,000 tokens
Previous conversation:          1,000 - 10,000 tokens
Tool responses:                 2,000 - 20,000 tokens
Safety margin:                  5,000 tokens
─────────────────────────────────────────────────
Total needed:                   13,600 - 61,000 tokens
```

With a 16K context window, the model runs out of space mid-generation. The tool call gets **silently truncated**: you see incomplete JSON, missing arguments, or the model suddenly stops generating.
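
These ranges are estimates; you can measure the real prompt size for your own workload before sending it. A minimal sketch, assuming Hugging Face `transformers` with a chat template that accepts a `tools` argument (the model ID, messages, and tool schema below are placeholders):

```python
# Rough prompt-size check before hitting the server.
# Assumes the tokenizer's chat template supports the `tools` argument;
# swap in your own model ID, system prompt, history, and tool schemas.
from transformers import AutoTokenizer

MODEL_ID = "NousResearch/Hermes-3-Llama-3.1-70B-FP8"  # example model from this guide
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def prompt_tokens(messages, tools):
    """Count the tokens the serialized prompt (system + tools + history) will use."""
    ids = tokenizer.apply_chat_template(
        messages, tools=tools, add_generation_prompt=True, tokenize=True
    )
    return len(ids)

# Placeholder inputs for illustration only.
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What's the weather in Berlin?"}]
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

used = prompt_tokens(messages, tools)
print(f"Prompt uses {used} tokens; leave room for tool responses and generation.")
```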

## Symptoms

- Tool calls end mid-JSON: `{"name": "get_weather", "arguments": {"loc`
- Model stops generating after the first tool call in a multi-step workflow
- Second or third tool call in a conversation is always malformed
- Model "forgets" tool definitions and responds with plain text
- Works fine with 1-2 tools but fails with 5+
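
The first two symptoms can be caught programmatically on the client side. A minimal sketch, assuming the deployment from this guide is serving the OpenAI-compatible API on localhost:8000 with vLLM's tool-call parsing enabled; the model name, prompt, and tool schema are placeholders:

```python
# Detect silently truncated tool calls in a response.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{"type": "function", "function": {
        "name": "get_weather",
        "parameters": {"type": "object",
                       "properties": {"location": {"type": "string"}}}}}],
)

choice = resp.choices[0]
# finish_reason == "length" means generation hit the token limit mid-output.
if choice.finish_reason == "length":
    print("Truncated: the request ran out of context/output tokens")
for call in (choice.message.tool_calls or []):
    try:
        json.loads(call.function.arguments)
    except json.JSONDecodeError:
        # Arguments that do not parse are the classic truncated-JSON symptom.
        print(f"Malformed arguments for {call.function.name!r}")
```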

## The Fix

```bash
# BEFORE (broken)
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
  --max-model-len 16384        # Default: too small

# AFTER (working)
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 132000 \
  --gpu-memory-utilization 0.90
# --max-model-len 131072:  128K, the model's full supported context
# --max-num-seqs 4:        reduce concurrency so the KV cache fits
```

## Memory Math

The tradeoff is between context length and concurrent requests:

### 70B FP8 Model on 96GB GPU

```
Model weights (FP8):           ~40 GB
Available for KV cache:        ~46 GB (at 0.90 utilization)

KV cache per token per request:
  70B model ≈ 0.5 KB per token per layer

Cost per concurrent request at 128K context:
  128,000 × 0.5 KB ≈ 64 MB per layer × 80 layers ≈ 5 GB

Max concurrent requests at full 128K context:
  46 GB ÷ ~5 GB/request ≈ 9 at the absolute limit; leave roughly
  half as headroom for prefill spikes and scheduler overhead

→ Use --max-num-seqs 4
```

### 12B FP8 Model on 96GB GPU

```
Model weights (FP8):           ~15 GB
Available for KV cache:        ~71 GB

→ Use --max-num-seqs 8 (or more)
```
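
To redo this arithmetic for your own model and GPU, a small calculator helps. A minimal sketch that mirrors the rough per-token figure used above; the defaults (80 layers, 0.5 KB per token per layer) are assumptions you should replace with your model's actual layer count and measured KV cost:

```python
# Back-of-the-envelope concurrency estimate for a given GPU and context length.

def max_concurrent_requests(
    gpu_mem_gb: float,                     # total GPU memory, e.g. 96
    weights_gb: float,                     # model weight footprint, e.g. ~40 in the 70B example
    context_len: int,                      # --max-model-len, e.g. 131072
    num_layers: int = 80,                  # 80 layers in the 70B example above
    kv_kb_per_token_layer: float = 0.5,    # rough estimate; measure for your model
    gpu_mem_utilization: float = 0.90,     # matches --gpu-memory-utilization
) -> int:
    budget_gb = gpu_mem_gb * gpu_mem_utilization - weights_gb
    per_request_gb = context_len * kv_kb_per_token_layer * num_layers / (1024 ** 2)
    return max(1, int(budget_gb // per_request_gb))

# 70B FP8 example from this section: ~9 at full 128K context, so set
# --max-num-seqs lower than this to keep headroom.
print(max_concurrent_requests(96, 40, 131072))
```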

## Concurrency vs Context Tradeoff

| Context Length | Max Seqs (70B) | Max Seqs (12B) | Tool Calling |
|---------------|---------------|---------------|-------------|
| 16K | 16 | 32+ | Broken for multi-tool |
| 32K | 8 | 16 | Marginal |
| 64K | 6 | 12 | Good for simple workflows |
| **128K** | **4** | **8** | **Reliable for complex workflows** |

**Recommendation:** Always use 128K. The reduced concurrency is worth it. If you need more throughput, use a smaller model rather than reducing context.

## Why This Isn't Obvious

1. VLLM doesn't warn you when context is too small; it just generates truncated output
2. The default `max-model-len` varies by model and may not match the model's actual capability
3. Simple tests with 1-2 tools often pass even at 16K, so the issue only appears in production (see the reproduction sketch after this list)
4. The truncation looks like a model quality issue, not a configuration issue
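
The third point is the easiest to reproduce. A minimal sketch, again assuming the example deployment above; it pads the request with ten deliberately verbose placeholder tool schemas so the prompt approaches a 16K window, where the symptoms listed earlier typically appear:

```python
# Reproduce the failure mode: a request that works with 1-2 tools but
# pushes past a 16K window once many tool schemas are attached.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def dummy_tool(i: int) -> dict:
    """A deliberately verbose placeholder tool schema (~1-2K tokens serialized)."""
    return {"type": "function", "function": {
        "name": f"tool_{i}",
        "description": "Placeholder description. " * 100,
        "parameters": {"type": "object", "properties": {
            f"param_{j}": {"type": "string",
                           "description": "Placeholder parameter. " * 20}
            for j in range(10)}},
    }}

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[dummy_tool(i) for i in range(10)],   # 10 tools instead of 1-2
)
# Apply the finish_reason / JSON-parse check from the Symptoms section:
# at 16K this is where truncation typically shows up; at 128K it should not.
print(resp.choices[0].finish_reason)
```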

## Verification

After increasing context, verify with:

```bash
curl http://localhost:8000/v1/models | python -m json.tool
```

Check the `max_model_len` field in the response to confirm it's set to 131072.