⚙️ Mellum2-12B-A2.5B-Reasoning-Distill (GGUF) ⚙️

🧑‍💻 A fast little coding brain: local AI for everyone

12B total params, only 2.5B active per token. This is a Mixture-of-Experts model, so it runs like a ~2.5B model but thinks like a 12B one. 🚀 Built on JetBrains' Mellum 2 (a from-scratch software-engineering model) and tuned on Claude Opus 4.6 / 4.7 / 4.8 reasoning traces, it reasons step-by-step in <think> blocks, then answers. 🧠💻 All local, all yours, no API, no cloud. And it's seriously fast.


⚡ Blazing fast: measured, not marketing 🏎️💨

~440 tokens/sec on a single RTX 5090 at Q4_K_M (--n-gpu-layers 99 -fa on), and generation quality holds up: correct, coherent code and clean step-by-step reasoning. 🎯

You get big-model answers at small-model speed. Why so quick? It's a Mixture-of-Experts (only 2.5B of the 12B params fire per token) with a compact 98K vocab, so it generates several times faster than a dense model its size, with no draft / speculative model needed. 💚

| Hardware | Quant | Generation speed (measured) |
|---|---|---|
| RTX 5090 (32 GB) | Q4_K_M | ~440 tok/s ⚡ |

📦 Pick your size (GGUF quants)

| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 5.0 GB | tiniest, runs almost anywhere |
| 🔵 Q4_K_M | 8.1 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 10.9 GB | near-lossless |
| ⚪ Q8_0 | 12.9 GB | basically full quality |

💡 It's a Mixture-of-Experts: all 64 experts live on disk/VRAM (so size is for the whole 12B), but only 8 fire per token, which is why it's so quick.
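
To make the "8 of 64 experts" idea concrete, here is a minimal, generic sketch of top-k expert routing in plain Python. It is an illustration only and not Mellum 2's actual router: the helper name route_top_k, the random router scores, and the choice to softmax over only the selected experts are all assumptions.

```python
import math
import random


def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalise their weights.

    Generic MoE illustration (hypothetical helper, not Mellum 2's real router).
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # Indices of the k largest logits, highest first
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax over the selected experts only
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# One token's router scores over 64 experts (random stand-ins for a learned router)
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(64)]
chosen = route_top_k(logits, k=8)
print(f"{len(chosen)} of {len(logits)} experts run for this token")
```

Only the chosen experts' feed-forward weights are used for that token, which is where the compute saving comes from; all 64 still have to be resident in memory.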


🧮 "Will it fit?" A rough VRAM guide

Mellum2 has a tiny KV cache (GQA with just 4 KV heads, and sliding-window attention on 3 of every 4 layers), so context is rarely the limiter. Pick the quant that fits your VRAM and you'll have plenty of room for long context (max is 131K). Rough numbers 🤓 (weights + ~2 GB overhead):

| Your VRAM / unified mem | Best quant that fits | Context headroom |
|---|---|---|
| 8 GB | 🟢 Q2_K | comfy (long ctx still fits) |
| 12 GB | 🔵 Q4_K_M | lots |
| 16 GB | 🟣 Q6_K / ⚪ Q8_0 | lots |
| 24 GB+ | ⚪ Q8_0 | up to 131K 🎉 |
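
The "weights + ~2 GB overhead" rule of thumb can be sketched as a tiny helper. The quant sizes are copied from the quant list on this card; the function name pick_quant and the fixed 2.0 GB overhead constant are assumptions for illustration, and real memory use will vary with context length and backend.

```python
# Quant file sizes in GB, copied from the GGUF quant list on this card
QUANT_SIZES_GB = {"Q2_K": 5.0, "Q4_K_M": 8.1, "Q6_K": 10.9, "Q8_0": 12.9}
OVERHEAD_GB = 2.0  # the "~2 GB overhead" rule of thumb; real usage varies


def pick_quant(vram_gb):
    """Return the largest quant whose weights + overhead fit, or None."""
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size + OVERHEAD_GB <= vram_gb]
    return max(fitting)[1] if fitting else None


for vram in (8, 12, 16, 24):
    print(vram, "GB ->", pick_quant(vram))
```

At 16 GB this sketch picks Q8_0 because it just fits; choosing Q6_K instead leaves more headroom for context.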

💡 Apple Silicon / iGPUs with unified memory count too: same idea, just slower than a dGPU.

💡 Tight on room? Drop a quant or use a q4_0 KV cache for even more context.
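
For a feel of why the KV cache stays small, here is a rough back-of-the-envelope estimate. The 28 layers, 4 KV heads, and "3 of every 4 layers use sliding-window attention" come from this card; head_dim=128, window=4096, and treating 131K as 131072 tokens are placeholder guesses, so read the result as an order of magnitude, not a measurement.

```python
def kv_cache_bytes(ctx_tokens, n_layers, n_kv_heads, head_dim, window,
                   bytes_per_elem=2.0):
    """Rough KV cache size when 3 of every 4 layers use sliding-window attention."""
    n_full = n_layers // 4          # assumed: 1 in 4 layers attends over the full context
    n_sliding = n_layers - n_full   # the rest keep at most `window` tokens
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    cached_tokens = n_full * ctx_tokens + n_sliding * min(ctx_tokens, window)
    return cached_tokens * per_token_per_layer


# head_dim and window are hypothetical values, not taken from the model config
size = kv_cache_bytes(ctx_tokens=131072, n_layers=28, n_kv_heads=4,
                      head_dim=128, window=4096)
print(f"~{size / 2**30:.2f} GiB at fp16")
```

Lowering bytes_per_elem models a quantized KV cache; the exact per-element cost of a given cache type depends on its block format.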


🚀 How to run it (super easy)

Option A: llama.cpp (recommended) 🦙

  1. Grab a quant above (e.g. …-Q4_K_M.gguf) and llama-server from llama.cpp.

    ⚠️ Needs a recent llama.cpp that supports the mellum2 architecture (a mid-2026 build or newer). Older builds fail with unknown architecture: 'mellum2'.

  2. Run a server (Windows .bat shown; tweak --port, --ctx-size to taste):
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
  -m C:\models\mellum2-claude-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap ^
  -fa on ^
  --jinja --reasoning-format deepseek ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --host 0.0.0.0 --port 18080
pause
  3. Open http://localhost:18080 and chat. 🎉 (Tip: bump --ctx-size; the KV cache is small, so go big.)
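
If you would rather script against the server than use the web UI, a minimal client sketch follows. It assumes the server from step 2 is listening on port 18080 and that your llama-server build exposes the OpenAI-style /v1/chat/completions route, accepts top_k in the request body, and returns a reasoning_content field when started with --reasoning-format deepseek. The helper names build_payload and chat are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:18080"  # matches --port 18080 in the .bat above


def build_payload(prompt, max_tokens=1024):
    """Build an OpenAI-style chat request body with the recommended sampling."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=300):
    """POST to /v1/chat/completions and return (reasoning, answer)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # reasoning_content is assumed to be present only with --reasoning-format deepseek
    return message.get("reasoning_content") or "", message.get("content") or ""


# Safe to run without a server: just show the request body that would be sent
print(json.dumps(build_payload("Write a Python function to check if a number is prime."), indent=2))
```

With the server running, chat("...") returns the reasoning and the final answer as two separate strings.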

Option B: one-click apps 🖱️

Works in LM Studio, Jan, Ollama, etc.: just import the GGUF, pick your quant, go. 🐾 (Make sure the app ships a recent llama.cpp that knows mellum2.)

🧠 Thinking mode

This model thinks natively in <think> … </think> blocks. The chat template handles it automatically; the --reasoning-format deepseek flag tells llama.cpp to surface the reasoning cleanly. Recommended sampling: temp 0.6, top_p 0.95, top_k 20 (JetBrains' official settings for the Thinking model).
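
When you get raw text back with the <think> tags still inline (for example from a frontend that does not separate reasoning for you), a small helper can split the two parts. This is a minimal sketch: split_thinking is a hypothetical name, and it assumes the tags appear literally and are properly closed.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from raw model output containing <think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 is prime; check odd divisors up to sqrt(n).</think>\nUse trial division."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

If generation is cut off before </think> arrives, nothing matches and the whole text is returned as the answer, so raise the token limit if you see an unclosed tag.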

🐍 Prefer raw transformers?

This repo ships GGUF (for llama.cpp). To run in raw transformers you need non-GGUF weights: point the mid variable at the original JetBrains checkpoint (or your own merged fp16). Needs transformers >= 5.8 (the mellum architecture is built in, so no trust_remote_code is needed). It's a plain text CausalLM (not multimodal).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "JetBrains/Mellum2-12B-A2.5B-Thinking"   # GGUF won't load here; use non-GGUF weights
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, dtype=torch.bfloat16,
                                             device_map="auto", attn_implementation="sdpa")

msgs = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True,
                                 return_dict=True, enable_thinking=True)
inputs = inputs.to(model.device)
# do_sample=True is needed for temperature / top_p / top_k to take effect
out = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=20)
# Decode only the newly generated tokens, keeping any special tokens in the text
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=False))

⚡ Why no MTP / draft model?

JetBrains' Mellum 2 uses Multi-Token Prediction as a training-time objective only; it isn't exported to the released weights, so there's no draft to ship. You don't need one: with just 2.5B active params and a compact vocab, it already generates extremely fast (~440 tok/s on a 5090 at Q4_K_M). 🏎️


🧩 What is this, exactly?

  • Base: JetBrains/Mellum2-12B-A2.5B-Thinking, a from-scratch, software-engineering-focused Mixture-of-Experts model (12.15B total / 2.5B active, 28 layers, 64 experts with 8 active, 131K context). JetBrains already did SFT + RLVR on it; this is a light extra LoRA distillation pass on top.
  • This fine-tune: a low-intensity QLoRA-style pass over the attention projections, distilling Claude Opus reasoning style into the model. It keeps Mellum 2's coding/agent strengths while nudging the reasoning voice toward Opus. 💡

⚠️ Good to know

  • Coding-first: Mellum 2 is built for code generation, editing, debugging, tool-calls and agents. Great at programming + structured reasoning; it's not a general-knowledge encyclopedia.
  • Reduced refusals: the distillation data omits safety hedging, so it refuses less than a typical aligned chat model. It is not safety-aligned, so add your own guardrails for production. Use responsibly. 🙏
  • The reasoning is stylistic synthetic CoT: great for structure, but double-check facts and numbers.
  • English-centric (handles other languages, but English is strongest).

📚 Data & License
