# Model Comparison for Tool Calling

Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB).

## Full Comparison Table

| | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
|---|---|---|---|---|
| **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
| **Size** | 70B | 70B | 72B | 12B |
| **Quantization** | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 |
| **vLLM Parser** | `hermes` | `llama3_json` | `hermes` | `mistral` |
| **Context Window** | 128K | 128K | 128K | 128K |
| **Speed** | 25-35 tok/s | 60-90 tok/s | 60-90 tok/s | 100-150 tok/s |
| **VRAM Usage** | ~40GB | ~40GB | ~45GB | ~15GB |
| **Tool Call Quality** | Excellent | Excellent | Very Good | Good |
| **Multi-Tool** | Excellent | Good | Good | Fair |
| **JSON Compliance** | Very High | High | High | Medium |
| **Open WebUI** | No | Yes | Yes | Yes |
| **Multilingual** | Good | Good | Excellent | Good |

## Detailed Notes

### Hermes-3-Llama-3.1-70B-FP8

**Best for: Tool calling quality and reliability**

- Purpose-built for function calling by NousResearch
- Uses ChatML format with XML `<tool_call>` tags, the most reliable format for structured output
- Slowest of the 70B models because its `compressed-tensors` quantization doesn't use Blackwell's native FP8
- Does NOT work with Open WebUI for tool calling (format incompatibility)
- Best at handling complex multi-step workflows with many tools
- Lowest hallucination rate for tool names and parameters

### Llama-3.3-70B-Instruct-FP8

**Best for: Open WebUI and general use**

- Official NVIDIA FP8 quantization; the fastest 70B model on Blackwell
- Works out of the box with Open WebUI, no custom configuration required
- Native FP8 (`fp8_e4m3`) leverages Blackwell's hardware acceleration
- Tool calling quality is nearly as good as Hermes-3 for most tasks
- Better at general conversation alongside tool use

### Qwen2-72B-Instruct-FP8

**Best for: Multilingual tool calling**

- Strongest multilingual support (Chinese, Japanese, Korean, European languages)
- Good reasoning capabilities alongside tool calling
- Uses the `hermes` parser despite not being a Hermes model (its ChatML output is compatible)
- FP8 KV-cache support saves VRAM
- Slightly larger memory footprint than the Llama models

### Mistral-Nemo-Instruct-2407-FP8

**Best for: Fast iteration and development**

- Extremely fast: 100-150 tok/s (3-5x faster than the 70B models)
- Very low memory: ~15GB leaves room for other processes
- Good enough for simple tool calling (1-3 tools)
- Struggles with complex multi-step workflows
- Great for testing and prototyping before deploying a 70B model

## Recommendations by Use Case

| Use Case | Recommended Model | Why |
|----------|------------------|-----|
| Production tool calling | Hermes-3 70B | Best reliability and accuracy |
| Open WebUI deployment | Llama-3.3 70B | Works out of the box |
| Multilingual applications | Qwen2 72B | Best language coverage |
| Development/testing | Mistral-Nemo 12B | Fastest iteration speed |
| Multi-step workflows | Hermes-3 70B | Best at complex orchestration |
| Simple single-tool calls | Any | All models handle basic tools well |
| Memory-constrained | Mistral-Nemo 12B | Only ~15GB VRAM |

## Memory Budget (96GB GPU)

```
Hermes-3 70B FP8:
  Model weights:    ~40GB
  KV cache (128K):  ~45GB (4 concurrent requests)
  Overhead:          ~5GB
  Total:            ~90GB  ← Fits on 96GB

Mistral-Nemo 12B FP8:
  Model weights:    ~15GB
  KV cache (128K):  ~20GB (8 concurrent requests)
  Overhead:          ~3GB
  Total:            ~38GB  ← Leaves 58GB free
```
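To make the Hermes-style output format concrete, here is a minimal client-side sketch of extracting tool calls from the XML-tagged ChatML text that Hermes-compatible models emit. This is an illustration, not vLLM's implementation: when served with vLLM's `hermes` tool-call parser (enabled via `--enable-auto-tool-choice --tool-call-parser hermes`), this extraction happens server-side and the API returns structured tool calls. The `get_weather` tool and the sample completion string below are hypothetical.

```python
import json
import re

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool-call payloads from Hermes-style <tool_call> tags."""
    pattern = r"<tool_call>\s*(.*?)\s*</tool_call>"
    return [json.loads(body) for body in re.findall(pattern, text, re.DOTALL)]

# Hypothetical raw completion from a Hermes-style model
sample = (
    "Let me look that up.\n"
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
    '</tool_call>'
)

calls = parse_hermes_tool_calls(sample)
# calls == [{"name": "get_weather", "arguments": {"city": "Berlin"}}]
```

Because the payload inside the tags is plain JSON with a fixed schema (`name` plus `arguments`), malformed output is easy to detect, which is part of why this format scores highest on JSON compliance in the table above.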