| # Model Comparison for Tool Calling |
|
|
| Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB). |
|
|
| ## Full Comparison Table |
|
|
| | | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B | |
| |---|---|---|---|---| |
| | **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` | |
| | **Size** | 70B | 70B | 72B | 12B | |
| | **Quantization** | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 | |
| | **vLLM Parser** | `hermes` | `llama3_json` | `hermes` | `mistral` | |
| | **Context Window** | 128K | 128K | 128K | 128K | |
| | **Speed** | 25-35 tok/s | 60-90 tok/s | 60-90 tok/s | 100-150 tok/s | |
| | **VRAM Usage** | ~40GB | ~40GB | ~45GB | ~15GB | |
| | **Tool Call Quality** | Excellent | Excellent | Very Good | Good | |
| | **Multi-Tool** | Excellent | Good | Good | Fair | |
| | **JSON Compliance** | Very High | High | High | Medium | |
| | **Open WebUI Tool Calling** | No | Yes | Yes | Yes | |
| | **Multilingual** | Good | Good | Excellent | Good | |
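
The parsers in the table only affect how vLLM decodes each model's raw output back into structured `tool_calls`; the client-side request looks identical for all four models, because vLLM serves an OpenAI-compatible API. A minimal sketch of a tool-calling request body (the `get_weather` tool and its schema are hypothetical examples, not part of any model):

```python
import json

# Hypothetical request body; swap in any model ID from the table above.
payload = {
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

# The serialized body is what gets POSTed to /v1/chat/completions.
body = json.dumps(payload)
print(json.loads(body)["tool_choice"])  # -> auto
```

POST this to the server's `/v1/chat/completions` endpoint; the parser chosen at serve time determines how the model's reply is turned back into the `tool_calls` field of the response.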
|
|
| ## Detailed Notes |
|
|
| ### Hermes-3-Llama-3.1-70B-FP8 |
|
|
| **Best for: Tool calling quality and reliability** |
|
|
| - Purpose-built for function calling by NousResearch |
| - Uses ChatML format with XML `<tool_call>` tags — the most reliable format for structured output |
| - Slowest of the 70B models due to `compressed-tensors` quantization (doesn't use native Blackwell FP8) |
| - Does NOT work with Open WebUI for tool calling (format incompatibility) |
| - Best at handling complex multi-step workflows with many tools |
| - Lowest hallucination rate for tool names and parameters |
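
The reliability of the `<tool_call>` convention is easy to see in a small sketch: the JSON payload sits between unambiguous XML tags, so extraction is a regex plus `json.loads`. vLLM's `hermes` parser does this server-side; the raw completion and `get_weather` call below are illustrative, not actual model output:

```python
import json
import re

# Illustrative raw completion in Hermes-3's ChatML tool-call format.
raw = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)

# Pull every JSON payload out of its <tool_call>...</tool_call> wrapper.
calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
]
print(calls[0]["name"])  # -> get_weather
```

Because the tags can't be confused with ordinary prose, a malformed call is immediately detectable rather than silently misparsed, which is a large part of why this format scores highest on JSON compliance above.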
|
|
| ### Llama-3.3-70B-Instruct-FP8 |
|
|
| **Best for: Open WebUI and general use** |
|
|
| - Official NVIDIA FP8 quantization — fastest 70B model on Blackwell |
| - Works out of the box with Open WebUI, no custom configuration |
| - Native FP8 (`fp8_e4m3`) leverages Blackwell's hardware acceleration |
| - Tool calling quality is nearly as good as Hermes-3 for most tasks |
| - Better at general conversation alongside tool use |
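
For contrast with Hermes-3's XML-wrapped format, Llama-3.x emits the call as bare JSON in the assistant message, which is what the `llama3_json` parser extracts. A minimal illustration (the `get_weather` call is a made-up example; note the key is `parameters` here rather than Hermes' `arguments`):

```python
import json

# Illustrative raw Llama-3.x tool-call output: plain JSON, no wrapper tags.
raw = '{"name": "get_weather", "parameters": {"city": "Paris"}}'

# vLLM's llama3_json parser normalizes this into the OpenAI-style
# tool_calls field of the response.
call = json.loads(raw)
print(call["name"], call["parameters"]["city"])  # -> get_weather Paris
```

The lack of delimiting tags is why this format is slightly more fragile than Hermes' when the model mixes prose and tool calls in one reply.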
|
|
| ### Qwen2-72B-Instruct-FP8 |
|
|
| **Best for: Multilingual tool calling** |
|
|
| - Strongest multilingual support (Chinese, Japanese, Korean, European languages) |
| - Good reasoning capabilities alongside tool calling |
| - Uses `hermes` parser despite not being a Hermes model (ChatML-compatible) |
| - FP8 KV cache support saves VRAM |
| - Slightly larger memory footprint than Llama models |
|
|
| ### Mistral-Nemo-Instruct-2407-FP8 |
|
|
| **Best for: Fast iteration and development** |
|
|
- Extremely fast: 100-150 tok/s (roughly 4x faster than Hermes-3, ~2x faster than the other 70B models)
| - Very low memory: ~15GB leaves room for other processes |
| - Good enough for simple tool calling (1-3 tools) |
| - Struggles with complex multi-step workflows |
| - Great for testing and prototyping before deploying 70B models |
|
|
| ## Recommendations by Use Case |
|
|
| | Use Case | Recommended Model | Why | |
| |----------|------------------|-----| |
| | Production tool calling | Hermes-3 70B | Best reliability and accuracy | |
| | Open WebUI deployment | Llama-3.3 70B | Works out of the box | |
| | Multilingual applications | Qwen2 72B | Best language coverage | |
| | Development/testing | Mistral-Nemo 12B | Fastest iteration speed | |
| | Multi-step workflows | Hermes-3 70B | Best at complex orchestration | |
| | Simple single-tool calls | Any | All models handle basic tools well | |
| | Memory-constrained | Mistral-Nemo 12B | Only 15GB VRAM | |
|
|
| ## Memory Budget (96GB GPU) |
|
|
| ``` |
| Hermes-3 70B FP8: |
| Model weights: ~40GB |
| KV cache (128K): ~45GB (4 concurrent requests) |
| Overhead: ~5GB |
| Total: ~90GB ← Fits on 96GB |
| |
| Mistral-Nemo 12B FP8: |
| Model weights: ~15GB |
| KV cache (128K): ~20GB (8 concurrent requests) |
| Overhead: ~3GB |
| Total: ~38GB ← Leaves 58GB free |
| ``` |
|
|