Mellum2-12B-A2.5B-Reasoning-Distill (GGUF)
🧑‍💻 A fast little coding brain: local AI for everyone
12B total params, only 2.5B active per token. This is a Mixture-of-Experts model, so each token costs about as much compute as a ~2.5B dense model while drawing on the capacity of all 12B parameters. Built on JetBrains' Mellum 2 (a from-scratch software-engineering model) and tuned on Claude Opus 4.6 / 4.7 / 4.8 reasoning traces, it reasons step-by-step in `<think>` blocks, then answers. 🧠💻 All local, all yours, no API, no cloud. And it's seriously fast.
⚡ Blazing fast: measured, not marketing 🏎️💨
~440 tokens/sec on a single RTX 5090 at Q4_K_M (`--n-gpu-layers 99 -fa on`), and generation quality
holds up: correct, coherent code and clean step-by-step reasoning.
You get big-model answers at small-model speed. Why so quick? It's a Mixture-of-Experts (only 2.5B of the 12B params fire per token) with a compact 98K vocab, so it generates several times faster than a dense model its size, with no draft / speculative model needed.
| Hardware | Quant | Generation speed (measured) |
|---|---|---|
| RTX 5090 (32 GB) | Q4_K_M | ~440 tok/s ⚡ |
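To make "only some params fire per token" concrete, here is a minimal sketch of generic top-k expert routing, using the 64-expert / 8-active figures this card quotes. The function name `route_token`, the random stand-in scores, and the softmax-over-selected-experts weighting are illustrative assumptions; the card does not describe Mellum 2's actual router.

```python
import math
import random

NUM_EXPERTS = 64  # expert count quoted in this card
TOP_K = 8         # experts active per token, quoted in this card

def route_token(router_logits, top_k=TOP_K):
    """Return {expert_index: weight} for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This is a generic scheme, not necessarily the one Mellum 2 uses.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
weights = route_token(logits)
print(f"{len(weights)} of {NUM_EXPERTS} experts selected")
```

Only the selected experts' feed-forward weights are used for that token, which is where the compute saving comes from; all 64 still have to be resident in memory.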
📦 Pick your size (GGUF quants)
| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 5.0 GB | tiniest, runs almost anywhere |
| 🔵 Q4_K_M | 8.1 GB | the sweet spot (recommended) |
| 🟣 Q6_K | 10.9 GB | near-lossless |
| ⚪ Q8_0 | 12.9 GB | basically full quality |
💡 It's a Mixture-of-Experts: all 64 experts live on disk/VRAM (so size is for the whole 12B), but only 8 fire per token, which is why it's so quick.
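As a rough sanity check on the table, the file sizes can be turned into an average number of bits stored per parameter. This sketch assumes the sizes are decimal gigabytes and uses the 12.15B total parameter count quoted elsewhere in this card; the helper name `bits_per_weight` is made up for illustration, and the result is only an average, since a GGUF file also carries metadata and may keep some tensors at higher precision.

```python
TOTAL_PARAMS = 12.15e9  # total parameter count quoted in this card

# File sizes from the quant table, read as decimal gigabytes (an assumption).
QUANT_SIZES_GB = {"Q2_K": 5.0, "Q4_K_M": 8.1, "Q6_K": 10.9, "Q8_0": 12.9}

def bits_per_weight(size_gb, total_params=TOTAL_PARAMS):
    """Average bits stored per parameter for a file of size_gb gigabytes."""
    return size_gb * 1e9 * 8 / total_params

for name, size_gb in QUANT_SIZES_GB.items():
    print(f"{name}: ~{bits_per_weight(size_gb):.1f} bits/weight")
```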
🧮 "Will it fit?" A rough VRAM guide
Mellum2 has a tiny KV cache (GQA with just 4 KV heads, and sliding-window attention on 3 of every 4 layers), so context is rarely the limiter. Pick the quant that fits your VRAM and you'll have plenty of room for long context (max is 131K). Rough numbers (weights + ~2 GB overhead):
| Your VRAM / unified mem | Best quant that fits | Context headroom |
|---|---|---|
| 8 GB | 🟢 Q2_K | comfy (long ctx still fits) |
| 12 GB | 🔵 Q4_K_M | lots |
| 16 GB | 🟣 Q6_K / ⚪ Q8_0 | lots |
| 24 GB+ | ⚪ Q8_0 | up to 131K |
💡 Apple Silicon / iGPUs with unified memory count too: same idea, just slower than a dGPU.
💡 Tight on room? Drop a quant or use a `q4_0` KV cache for even more context.
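The rule of thumb above (weights plus about 2 GB of overhead) can be written down as a small helper. This is a sketch of that arithmetic only: the name `best_quant` is hypothetical, the sizes come from the quant table, and real memory use also depends on context length and backend.

```python
# Quant sizes from the table above, ordered smallest to largest.
QUANT_SIZES_GB = [("Q2_K", 5.0), ("Q4_K_M", 8.1), ("Q6_K", 10.9), ("Q8_0", 12.9)]
OVERHEAD_GB = 2.0  # the card's rough allowance on top of the weights

def best_quant(vram_gb, overhead_gb=OVERHEAD_GB):
    """Return the largest quant whose weights plus overhead fit, or None."""
    fitting = [name for name, size in QUANT_SIZES_GB
               if size + overhead_gb <= vram_gb]
    return fitting[-1] if fitting else None

for vram in (8, 12, 16, 24):
    print(f"{vram} GB -> {best_quant(vram)}")
```

At 16 GB the helper returns Q8_0 with only about 1 GB to spare, which is why the table also lists Q6_K there as the roomier choice.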
How to run it (super easy)
Option A: llama.cpp (recommended) 🦙
- Grab a quant above (e.g. `…-Q4_K_M.gguf`) and `llama-server` from llama.cpp.
  ⚠️ Needs a recent llama.cpp that supports the `mellum2` architecture (a mid-2026 build or newer). Older builds fail with `unknown architecture: 'mellum2'`.
- Run a server (Windows `.bat` shown; tweak `--port`, `--ctx-size` to taste):
```bat
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\mellum2-claude-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap ^
  -fa on ^
  --jinja --reasoning-format deepseek ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --host 0.0.0.0 --port 18080
pause
```
- Open `http://localhost:18080` and chat. (Tip: bump `--ctx-size`; the KV cache is small, so go big.)
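Besides the web UI, the server can be called from code through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server from the script above is listening on port 18080. The helper names are hypothetical, and the `reasoning_content` field is what llama.cpp is expected to return when started with `--reasoning-format deepseek`, so the code falls back to an empty string if it is missing.

```python
import json
import urllib.request

BASE_URL = "http://localhost:18080"  # matches --port 18080 in the script above

def build_request(prompt, base_url=BASE_URL):
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def read_reply(response_json):
    """Return (reasoning, answer) from a chat-completion response dict."""
    message = response_json["choices"][0]["message"]
    # The reasoning field is optional; treat a missing or null value as empty.
    return message.get("reasoning_content") or "", message.get("content") or ""

def ask(prompt):
    """Send one prompt to the running server and return (reasoning, answer)."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return read_reply(json.load(resp))
```

With the server running, `reasoning, answer = ask("Write a Python function to check if a number is prime.")` returns the two parts separately.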
Option B: one-click apps 🖱️
Works in LM Studio, Jan, Ollama, etc. Just import the GGUF, pick your quant, go.
(Make sure the app ships a recent llama.cpp that knows `mellum2`.)
🧠 Thinking mode
This model thinks natively in `<think> … </think>` blocks. The chat template handles it automatically;
the `--reasoning-format deepseek` flag tells llama.cpp to split the reasoning out of the final answer, so clients receive it separately instead of inline.
Recommended sampling: temp 0.6, top_p 0.95, top_k 20 (JetBrains' official settings for the Thinking model).
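If you get the raw text instead (for example from a backend that does not separate the reasoning), the `<think>` block can be split off with a regular expression. This is a minimal sketch with a hypothetical helper name; it assumes both the opening and closing tags appear in the output, and does not handle templates that open the block in the prompt so that only `</think>` shows up.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Split raw model output into (reasoning, answer).

    If no complete <think> block is present, reasoning is empty and the
    whole text is returned as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Remove the first think block and keep everything around it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

raw = "<think>2 is the only even prime.</think>\nYes, 2 is prime."
reasoning, answer = split_thinking(raw)
print(reasoning)
print(answer)
```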
Prefer raw transformers?
This repo ships GGUF (for llama.cpp). To run in raw transformers you need non-GGUF weights: point
`mid` at the original JetBrains checkpoint (or your own merged fp16). Needs `transformers >= 5.8` (the `mellum` architecture is built in; no `trust_remote_code`). It's a plain text CausalLM (not multimodal).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "JetBrains/Mellum2-12B-A2.5B-Thinking"  # GGUF won't load here; use non-GGUF weights
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16,
                                             device_map="auto", attn_implementation="sdpa")

msgs = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# return_dict=True yields input_ids plus attention_mask, which generate() expects.
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True,
                                 return_dict=True, enable_thinking=True)
inputs = inputs.to(model.device)
# do_sample=True is needed for temperature / top_p / top_k to take effect.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=20)
# Decode only the newly generated tokens; keep special tokens so <think> stays visible.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=False))
```
⚡ Why no MTP / draft model?
JetBrains' Mellum 2 uses Multi-Token Prediction as a training-time objective only; it isn't exported to the released weights, so there's no draft to ship. You don't need one: with just 2.5B active params and a compact vocab, it already generates at ~440 tok/s on a 5090 at Q4_K_M.
🧩 What is this, exactly?
- Base: `JetBrains/Mellum2-12B-A2.5B-Thinking`, a from-scratch, software-engineering-focused Mixture-of-Experts model (12.15B total / 2.5B active, 28 layers, 64 experts with 8 active, 131K context). JetBrains already did SFT + RLVR on it; this is a light extra LoRA distillation pass on top.
- This fine-tune: a low-intensity QLoRA-style pass over the attention projections, distilling Claude Opus reasoning style into the model. It keeps Mellum 2's coding/agent strengths while nudging the reasoning voice toward Opus.
⚠️ Good to know
- Coding-first: Mellum 2 is built for code generation, editing, debugging, tool calls and agents. Great at programming + structured reasoning; it's not a general-knowledge encyclopedia.
- Reduced refusals: the distillation data omits safety hedging, so it refuses less than a typical aligned chat model. It is not safety-aligned; add your own guardrails for production. Use responsibly.
- The reasoning is stylistic synthetic CoT: great for structure, but double-check facts and numbers.
- English-centric (handles other languages, but English is strongest).
Data & License
- Base model: `JetBrains/Mellum2-12B-A2.5B-Thinking`, released under Apache-2.0.
- Training data: built on the public, Apache-2.0 dataset `angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k`, augmented with additional Opus 4.8-generated reasoning samples I curated and mixed in.
- Personal/hobby project, shared as-is, no warranty. Have fun!