| # Model Comparison for Tool Calling |
|
|
| Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB). |
|
|
| ## Full Comparison Table |
|
|
| | | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B | |
| |---|---|---|---|---| |
| | **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` | |
| | **Size** | 70B | 70B | 72B | 12B | |
| | **Quantization** | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 | |
| | **vLLM Parser** | `hermes` | `llama3_json` | `hermes` | `mistral` | |
| | **Context Window** | 128K | 128K | 128K | 128K | |
| | **Speed** | 25-35 tok/s | 60-90 tok/s | 60-90 tok/s | 100-150 tok/s | |
| | **VRAM Usage** | ~40GB | ~40GB | ~45GB | ~15GB | |
| | **Tool Call Quality** | Excellent | Excellent | Very Good | Good | |
| | **Multi-Tool** | Excellent | Good | Good | Fair | |
| | **JSON Compliance** | Very High | High | High | Medium | |
| | **Open WebUI Tool Calling** | No | Yes | Yes | Yes | |
| | **Multilingual** | Good | Good | Excellent | Good | |
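
The parsers in the table only affect how vLLM decodes each model's raw output back into structured `tool_calls`; the client-side request looks identical for all four models, because vLLM serves an OpenAI-compatible API. A minimal sketch of a tool-calling request body (the `get_weather` tool and its schema are hypothetical examples, not part of any model):

```python
import json

# Hypothetical request body; swap in any model ID from the table above.
payload = {
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool for illustration
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

# The serialized body is what gets POSTed to /v1/chat/completions.
body = json.dumps(payload)
print(json.loads(body)["tool_choice"])  # -> auto
```

POST this to the server's `/v1/chat/completions` endpoint; the parser chosen at serve time determines how the model's reply is turned back into the `tool_calls` field of the response.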
|
|
| ## Detailed Notes |
|
|
| ### Hermes-3-Llama-3.1-70B-FP8 |
|
|
| **Best for: Tool calling quality and reliability** |
|
|
| - Purpose-built for function calling by NousResearch |
| - Uses ChatML format with XML `<tool_call>` tags — the most reliable format for structured output |
| - Slowest of the 70B models due to `compressed-tensors` quantization (doesn't use native Blackwell FP8) |
| - Does NOT work with Open WebUI for tool calling (format incompatibility) |
| - Best at handling complex multi-step workflows with many tools |
| - Lowest hallucination rate for tool names and parameters |
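
The reliability of the `<tool_call>` convention is easy to see in a small sketch: the JSON payload sits between unambiguous XML tags, so extraction is a regex plus `json.loads`. vLLM's `hermes` parser does this server-side; the raw completion and `get_weather` call below are illustrative, not actual model output:

```python
import json
import re

# Illustrative raw completion in Hermes-3's ChatML tool-call format.
raw = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)

# Pull every JSON payload out of its <tool_call>...</tool_call> wrapper.
calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
]
print(calls[0]["name"])  # -> get_weather
```

Because the tags can't be confused with ordinary prose, a malformed call is immediately detectable rather than silently misparsed, which is a large part of why this format scores highest on JSON compliance above.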
|
|
| ### Llama-3.3-70B-Instruct-FP8 |
|
|
| **Best for: Open WebUI and general use** |
|
|
| - Official NVIDIA FP8 quantization — fastest 70B model on Blackwell |
| - Works out of the box with Open WebUI, no custom configuration |
| - Native FP8 (`fp8_e4m3`) leverages Blackwell's hardware acceleration |
| - Tool calling quality is nearly as good as Hermes-3 for most tasks |
| - Better at general conversation alongside tool use |
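
For contrast with Hermes-3's XML-wrapped format, Llama-3.x emits the call as bare JSON in the assistant message, which is what the `llama3_json` parser extracts. A minimal illustration (the `get_weather` call is a made-up example; note the key is `parameters` here rather than Hermes' `arguments`):

```python
import json

# Illustrative raw Llama-3.x tool-call output: plain JSON, no wrapper tags.
raw = '{"name": "get_weather", "parameters": {"city": "Paris"}}'

# vLLM's llama3_json parser normalizes this into the OpenAI-style
# tool_calls field of the response.
call = json.loads(raw)
print(call["name"], call["parameters"]["city"])  # -> get_weather Paris
```

The lack of delimiting tags is why this format is slightly more fragile than Hermes' when the model mixes prose and tool calls in one reply.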
|
|
| ### Qwen2-72B-Instruct-FP8 |
|
|
| **Best for: Multilingual tool calling** |
|
|
| - Strongest multilingual support (Chinese, Japanese, Korean, European languages) |
| - Good reasoning capabilities alongside tool calling |
| - Uses `hermes` parser despite not being a Hermes model (ChatML-compatible) |
| - FP8 KV cache support saves VRAM |
| - Slightly larger memory footprint than Llama models |
|
|
| ### Mistral-Nemo-Instruct-2407-FP8 |
|
|
| **Best for: Fast iteration and development** |
|
|
- Extremely fast: 100-150 tok/s (roughly 4x faster than Hermes-3, ~2x faster than the other 70B models)
| - Very low memory: ~15GB leaves room for other processes |
| - Good enough for simple tool calling (1-3 tools) |
| - Struggles with complex multi-step workflows |
| - Great for testing and prototyping before deploying 70B models |
|
|
| ## Recommendations by Use Case |
|
|
| | Use Case | Recommended Model | Why | |
| |----------|------------------|-----| |
| | Production tool calling | Hermes-3 70B | Best reliability and accuracy | |
| | Open WebUI deployment | Llama-3.3 70B | Works out of the box | |
| | Multilingual applications | Qwen2 72B | Best language coverage | |
| | Development/testing | Mistral-Nemo 12B | Fastest iteration speed | |
| | Multi-step workflows | Hermes-3 70B | Best at complex orchestration | |
| | Simple single-tool calls | Any | All models handle basic tools well | |
| | Memory-constrained | Mistral-Nemo 12B | Only 15GB VRAM | |
|
|
| ## Memory Budget (96GB GPU) |
|
|
| ``` |
| Hermes-3 70B FP8: |
| Model weights: ~40GB |
| KV cache (128K): ~45GB (4 concurrent requests) |
| Overhead: ~5GB |
| Total: ~90GB ← Fits on 96GB |
| |
| Mistral-Nemo 12B FP8: |
| Model weights: ~15GB |
| KV cache (128K): ~20GB (8 concurrent requests) |
| Overhead: ~3GB |
| Total: ~38GB ← Leaves 58GB free |
| ``` |
|
|