Instructions for using SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with libraries, inference servers, and local apps.

Libraries

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

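# Load model directly (GGUF loading needs the `gguf` package and an explicit file name)
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF"
gguf_file = "Hermes-2-Pro-Llama-3-8B_Q4_k_m.gguf"
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file, dtype="auto")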

llama-cpp-python

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF",
	filename="Hermes-2-Pro-Llama-3-8B_Q4_k_m.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
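`create_chat_completion` returns an OpenAI-style dictionary rather than a plain string. The following is a minimal sketch of pulling the assistant text out of that dictionary. The helper name `reply_text` and the hand-written `sample_completion` are illustrative assumptions, not part of llama-cpp-python.

```python
from typing import Any, Mapping


def reply_text(completion: Mapping[str, Any]) -> str:
    """Return the assistant message text from a chat-completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    # Each choice carries a message with "role" and "content" keys.
    return choices[0]["message"]["content"] or ""


# Hand-written stand-in for what llm.create_chat_completion(...) returns.
sample_completion = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}

print(reply_text(sample_completion))
```

In real use, the result of the `llm.create_chat_completion(...)` call above would be passed in place of `sample_completion`.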


llama.cpp

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M


vLLM

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
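The same request can be sent from Python with only the standard library. This is a sketch under the assumption that a server is reachable at the base URL you pass in. The function names `build_chat_request` and `chat` are hypothetical helpers, and the response is assumed to follow the OpenAI chat-completions shape.

```python
import json
import urllib.request

MODEL_ID = "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF"


def build_chat_request(
    base_url: str, user_message: str, model: str = MODEL_ID
) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, user_message: str, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant reply text."""
    request = build_chat_request(base_url, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

With the server from the commands above running, `chat("http://localhost:8000", "What is the capital of France?")` should return the reply text. Building the request in a separate function keeps the payload easy to inspect without any network access.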

Use Docker

docker model run hf.co/SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

SGLang

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with Ollama:
```
ollama run hf.co/SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
```

Unsloth Studio

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF to start chatting

Docker Model Runner
How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with Docker Model Runner:
```
docker model run hf.co/SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M
```

Lemonade

How to use SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Hermes-2-Pro-Llama-3-8B-GGUF-Q4_K_M

List all available models

lemonade list

Quantized Hermes 2 Pro Models

This repository provides quantized GGUF versions of the Hermes 2 Pro model. Hermes 2 Pro is an upgraded version of Nous Hermes 2, trained on a cleaned OpenHermes 2.5 dataset plus a new in-house Function Calling and JSON Mode dataset. These 4-bit and 5-bit quantized variants aim to retain the original model’s strengths: it excels at general tasks, structured JSON outputs, and reliable function calling (90% accuracy in Fireworks.AI evals). With a special system prompt, a multi-turn function calling structure, and new single-token tags such as `<tool_call>` and `<tool_response>`, it is optimized for agentic parsing and streaming.

Model Overview

Original Model: Hermes-2-Pro-Llama-3-8B
Quantized Versions:
- Q4_K_M (4-bit quantization)
- Q5_K_M (5-bit quantization)
Architecture: Decoder-only transformer
Base Model: Meta-Llama-3-8B
Modalities: Text only
Developer: Nous Research
License: Llama 3 Community License Agreement
Language: English

Quantization Details

Q4_K_M Version

Approx. 75% size reduction
Lower memory footprint (~4.58 GB)
Best suited for deployment on edge devices or low-resource GPUs
Slight performance degradation in complex reasoning scenarios

Q5_K_M Version

Approx. 71% size reduction
Higher fidelity (~5.38 GB)
Better performance retention, recommended when quality is a priority
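The trade-off between the two variants can be sketched as a small selection helper that picks the higher-fidelity file whenever it fits in memory. The file sizes come from the figures above. The helper name `choose_quant` and the 1 GB `headroom_gb` default are assumptions for illustration, since real memory use also depends on context length and runtime.

```python
from typing import Optional

# File sizes in GB as listed in this card.
QUANT_SIZES_GB = {"Q5_K_M": 5.38, "Q4_K_M": 4.58}


def choose_quant(available_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-fidelity variant that fits, or None if neither does.

    headroom_gb is an assumed allowance for the KV cache and runtime
    buffers; actual usage grows with context length.
    """
    # Sorting by size (largest first) tries the higher-fidelity file first.
    by_size = sorted(QUANT_SIZES_GB.items(), key=lambda kv: kv[1], reverse=True)
    for name, size_gb in by_size:
        if size_gb + headroom_gb <= available_gb:
            return name
    return None


for memory_gb in (8.0, 6.0, 5.0):
    print(memory_gb, choose_quant(memory_gb))
```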

Key Features

Retrained on a cleaned OpenHermes-2.5 dataset with added Function-Calling & JSON-Mode data.
Strong Function Calling performance (≈90% in partnered evaluation) and structured JSON output accuracy (≈84%).
Uses ChatML prompt format and a special tool_use chat template to produce multi-turn, machine-parsable tool calls.
Adds single-token markers to help streaming/agent parsing: `<tools>`, `<tool_call>`, `<tool_response>` (and their closing tags).
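To make the ChatML prompt format concrete, here is a minimal sketch that renders a message list into a ChatML string. It assumes the standard `<|im_start|>` and `<|im_end|>` delimiters. The function name `to_chatml` is hypothetical, and the chat template shipped inside the model file remains authoritative (it may, for example, add a beginning-of-text token).

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
])
print(prompt)
```

Chat-style APIs such as `create_chat_completion` normally apply the template for you, so a helper like this is mainly useful when sending raw prompts.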

Usage

Hermes 2 Pro — Llama-3 8B is ideal for building agents that require reliable function calling, structured JSON outputs, and strong reasoning. Its 8B size balances capability with efficiency, making it suitable for research, prototyping, and real-world applications.

llama.cpp (text-only)

./llama-cli -hf SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M -p "Write a Python function for a data-cleaning library"

Model Data

Training Overview

Hermes-2-Pro-Llama-3-8B was fine-tuned from Meta-Llama-3-8B on a refined version of the OpenHermes-2.5 dataset, combined with a custom Function Calling and JSON Mode corpus developed in-house by Nous Research. The data mix includes high-quality web content, code, reasoning tasks, STEM material, and multilingual samples. This targeted training enables the model to excel not only at general conversation but also at structured output generation and reliable tool use.

Recommended Use Cases

Function Calling & Tool Use
Powering agentic workflows where the model selects and invokes external tools or APIs using reliable JSON-based calls.
Structured JSON Outputs
Generating machine-readable responses that conform to a schema, useful for automation, integration with services, and structured data extraction.
Resource-conscious Deployment
The 8B parameter size makes it suitable for smaller GPUs and cloud environments, balancing performance with accessibility.
Low-resource Deployment
The Q4_K_M and Q5_K_M quantizations let the model run on limited hardware such as CPUs, edge devices, or small GPUs.
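For the function-calling use case, an application has to pull the tool calls out of the model output and route them to real functions. The sketch below assumes each call is emitted as a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags, as described in the upstream Hermes 2 Pro card. The `get_weather` tool, the sample output, and the helper names are hypothetical.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Captures everything between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole turn
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls


def dispatch(call: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> Any:
    """Invoke the registered tool named in the call with its arguments."""
    return tools[call["name"]](**call.get("arguments", {}))


def get_weather(city: str) -> str:
    """Hypothetical tool used only for this example."""
    return f"Sunny in {city}"


# Hand-written stand-in for a model response containing one tool call.
sample_output = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)

calls = extract_tool_calls(sample_output)
results = [dispatch(call, {"get_weather": get_weather}) for call in calls]
print(results)
```

Skipping malformed JSON rather than raising is a design choice for this sketch. A production agent would more likely report the parse error back to the model so it can retry.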

Acknowledgments

These quantized models are based on the original work by the NousResearch development team.

Special thanks to:

The NousResearch team for developing and releasing the Hermes-2-Pro-Llama-3-8B model.
Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.

Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our website.


Model tree for SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF

Base model: NousResearch/Meta-Llama-3-8B
Finetuned: NousResearch/Hermes-2-Pro-Llama-3-8B
Quantized (57): this model (SandLogicTechnologies/Hermes-2-Pro-Llama-3-8B-GGUF)