# Model Comparison for Tool Calling

Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB).

## Full Comparison Table

| | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
|---|---|---|---|---|
| **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
| **Size** | 70B | 70B | 72B | 12B |
| **Quantization** | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 |
| **vLLM Parser** | `hermes` | `llama3_json` | `hermes` | `mistral` |
| **Context Window** | 128K | 128K | 128K | 128K |
| **Speed** | 25-35 tok/s | 60-90 tok/s | 60-90 tok/s | 100-150 tok/s |
| **VRAM Usage** | ~40GB | ~40GB | ~45GB | ~15GB |
| **Tool Call Quality** | Excellent | Excellent | Very Good | Good |
| **Multi-Tool** | Excellent | Good | Good | Fair |
| **JSON Compliance** | Very High | High | High | Medium |
| **Open WebUI** | No | Yes | Yes | Yes |
| **Multilingual** | Good | Good | Excellent | Good |

## Detailed Notes

### Hermes-3-Llama-3.1-70B-FP8

**Best for: Tool calling quality and reliability**

- Purpose-built for function calling by NousResearch
- Uses ChatML format with XML `<tool_call>` tags, the most reliable format for structured output
- Slowest of the 70B models because its `compressed-tensors` quantization doesn't use Blackwell's native FP8
- Does NOT work with Open WebUI for tool calling (format incompatibility)
- Best at handling complex multi-step workflows with many tools
- Lowest hallucination rate for tool names and parameters

### Llama-3.3-70B-Instruct-FP8

**Best for: Open WebUI and general use**

- Official NVIDIA FP8 quantization; the fastest 70B model on Blackwell
- Works out of the box with Open WebUI, no custom configuration required
- Native FP8 (`fp8_e4m3`) leverages Blackwell's hardware acceleration
- Tool calling quality is nearly as good as Hermes-3 for most tasks
- Better at general conversation alongside tool use

### Qwen2-72B-Instruct-FP8

**Best for: Multilingual tool calling**

- Strongest multilingual support (Chinese, Japanese, Korean, European languages)
- Good reasoning capabilities alongside tool calling
- Uses the `hermes` parser despite not being a Hermes model (its ChatML output is compatible)
- FP8 KV-cache support saves VRAM
- Slightly larger memory footprint than the Llama models

### Mistral-Nemo-Instruct-2407-FP8

**Best for: Fast iteration and development**

- Extremely fast: 100-150 tok/s (3-5x faster than the 70B models)
- Very low memory: ~15GB leaves room for other processes
- Good enough for simple tool calling (1-3 tools)
- Struggles with complex multi-step workflows
- Great for testing and prototyping before deploying a 70B model

## Recommendations by Use Case

| Use Case | Recommended Model | Why |
|----------|------------------|-----|
| Production tool calling | Hermes-3 70B | Best reliability and accuracy |
| Open WebUI deployment | Llama-3.3 70B | Works out of the box |
| Multilingual applications | Qwen2 72B | Best language coverage |
| Development/testing | Mistral-Nemo 12B | Fastest iteration speed |
| Multi-step workflows | Hermes-3 70B | Best at complex orchestration |
| Simple single-tool calls | Any | All models handle basic tools well |
| Memory-constrained | Mistral-Nemo 12B | Only ~15GB VRAM |

## Memory Budget (96GB GPU)

```
Hermes-3 70B FP8:
  Model weights:    ~40GB
  KV cache (128K):  ~45GB (4 concurrent requests)
  Overhead:          ~5GB
  Total:            ~90GB  ← Fits on 96GB

Mistral-Nemo 12B FP8:
  Model weights:    ~15GB
  KV cache (128K):  ~20GB (8 concurrent requests)
  Overhead:          ~3GB
  Total:            ~38GB  ← Leaves 58GB free
```
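To make the Hermes-style output format concrete, here is a minimal client-side sketch of extracting tool calls from the XML-tagged ChatML text that Hermes-compatible models emit. This is an illustration, not vLLM's implementation: when served with vLLM's `hermes` tool-call parser (enabled via `--enable-auto-tool-choice --tool-call-parser hermes`), this extraction happens server-side and the API returns structured tool calls. The `get_weather` tool and the sample completion string below are hypothetical.

```python
import json
import re

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool-call payloads from Hermes-style <tool_call> tags."""
    pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
    return [json.loads(body) for body in re.findall(pattern, text, re.DOTALL)]

# Hypothetical raw completion from a Hermes-style model
sample = (
    "Let me look that up.\n"
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    '</tool_call>'
)

calls = parse_hermes_tool_calls(sample)
# calls == [{"name": "get_weather", "arguments": {"city": "Berlin"}}]
```

Because the payload inside the tags is plain JSON with a fixed schema (`name` plus `arguments`), malformed output is easy to detect, which is part of why this format scores highest on JSON compliance in the table above.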