Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q8_0

A GGUF Q8_0 weight-quantized variant of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16, optimised for TurboQuant KV cache compression via a dedicated llama.cpp fork.

Important: TurboQuant KV cache types (planar3, iso3) are not available in upstream llama.cpp, standard Ollama, or LM Studio. They require a specific llama.cpp fork. The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).

Overview

This model combines two independent compression techniques:

| Technique | What it does | Requirement |
|---|---|---|
| GGUF Q8_0 weight quantization | Reduces model size from ~240 GB (BF16) to ~120 GB | Any llama.cpp-compatible runtime |
| TurboQuant KV cache compression (`--cache-type-k planar3 --cache-type-v planar3`) | Random rotation + Lloyd-Max scalar quantization of the compressed KV cache | llama-cpp-turboquant fork only |

Quickstart

Option A – With TurboQuant KV cache (fork required)

You must build from the TurboQuant-enabled llama.cpp fork:

```bash
# Clone and build the fork
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache

# CUDA (Windows/Linux)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Run with TurboQuant KV cache
./build/bin/llama-cli -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q8_0.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa \
  -p "Explain quantum computing"

# Or run as a server
./build/bin/llama-server -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q8_0.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa --jinja
```
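Once llama-server is up, it exposes an OpenAI-compatible HTTP API. A minimal Python client sketch follows; the host and port are assumptions for llama-server's defaults, so adjust them to your launch flags:

```python
# Minimal client for a running llama-server instance via its
# OpenAI-compatible /v1/chat/completions endpoint.
# BASE_URL is a placeholder; match it to your server's host/port.
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"

def build_chat_request(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
#   print(chat("Explain quantum computing"))
```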

Option B – With standard llama.cpp / LM Studio / Ollama

The GGUF works as a normal quantised model. You won't get TurboQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.

llama.cpp (upstream)

```bash
llama-cli -m Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q8_0.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 99 -fa \
  -p "Explain quantum computing"
```

LM Studio

  1. Download the GGUF file and load in LM Studio.
  2. Enable Developer Mode (Settings β†’ Developer).
  3. In the model loader's advanced settings, set Flash Attention to ON.
  4. Set K Cache Quantization and V Cache Quantization to q8_0 (or q4_0 for more aggressive VRAM savings).
  5. Note: LM Studio does not currently support TurboQuant's planar3 cache types; a feature request to add them is open.

Ollama

```bash
# Standard Ollama does not support TurboQuant cache types.
# Use the default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE.
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/Nemotron-3-Super-120B-A12B-TurboQuant-GGUF-Q8_0
```

Specifications

| Property | Value |
|---|---|
| Base Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Architecture | Mamba-2 + Transformer hybrid sparse MoE (`nemotron_h_moe`) |
| Parameters | 120B total, 12B active per token |
| Context Length | 1M tokens |
| Weight Quantization | GGUF Q8_0 (near-lossless 8-bit, reference quality) |
| Original Size (BF16) | ~240 GB |
| Quantized File Size | ~120 GB |
| KV Cache (TurboQuant) | 3-bit via `--cache-type-k planar3 --cache-type-v planar3` (fork only) |
| KV Cache (standard) | q8_0, q4_0, f16, etc. (any llama.cpp runtime) |
| License | other |
| Modalities | Text only |
| Compatible Runtimes | llama.cpp, LM Studio, Ollama, koboldcpp |

What is TurboQuant?

TurboQuant (ICLR 2026) is a KV cache compression method that applies a random orthogonal rotation followed by optimal (Lloyd-Max) scalar quantization. On tested models it reports bit-identical prefill logits at 4-bit, with up to 4-8× memory savings for long sequences.
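The two stages can be illustrated with a toy example: rotate a vector with an orthogonal transform, quantize each coordinate to a small grid, then undo the rotation. The real method uses high-dimensional random rotations and a Lloyd-Max codebook; this 2-D sketch with a uniform 3-bit grid and a fixed angle is purely illustrative:

```python
# Toy sketch of TurboQuant's two steps: orthogonal rotation, then
# per-coordinate scalar quantization. The 3-bit uniform grid and the
# fixed angle are illustrative stand-ins, not the actual TurboQuant
# rotation or Lloyd-Max codebook.
import math

def rotate(v, theta):
    """Apply a 2-D orthogonal rotation (preserves vector norms)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def quantize(x, bits=3, lo=-1.0, hi=1.0):
    """Round x to the nearest of 2**bits uniform levels on [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    idx = round((min(max(x, lo), hi) - lo) / step)
    return lo + idx * step

theta = 0.7           # a fixed "random" rotation angle for the demo
v = [0.9, -0.3]
rotated = rotate(v, theta)
deq = [quantize(x) for x in rotated]
recovered = rotate(deq, -theta)   # undo the rotation after dequantizing
err = math.dist(v, recovered)     # round-trip error from quantization
print(recovered, err)
```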

Benchmarks from the TurboQuant repository (Llama 3.1 8B, RTX 5090 – results will vary by model and hardware):

| Metric | TurboQuant (4-bit) | Standard q4_0 |
|---|---|---|
| Quality | Bit-identical prefill | Lossy |
| KV Compression | ~4× vs FP16 | ~4× vs FP16 |
| Speedup (Apple Silicon) | 1.4–1.7× | – |

Note: These benchmarks are from the TurboQuant repository using Llama 3.1 8B on an RTX 5090. Performance on Nemotron-3-Super-120B-A12B will differ. Independent benchmarks for this specific model are welcome – please open a discussion if you have results to share.

Current Status of TurboQuant in the Ecosystem

| Runtime | TurboQuant Support | Standard KV Quant |
|---|---|---|
| llama.cpp (upstream) | ❌ Not merged | ✅ q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |
| llama-cpp-turboquant fork | ✅ planar3 | ✅ All standard types |
| LM Studio | ❌ Requested | ✅ Via advanced settings |
| Ollama | ❌ Not supported | ✅ Via OLLAMA_KV_CACHE_TYPE |
| koboldcpp | ❌ Not supported | ✅ Standard types |

Recommended Settings

For VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. Flash Attention should always be enabled: it is required for V cache quantization and improves memory efficiency regardless.

| VRAM | Suggested Configuration |
|---|---|
| 24 GB (RTX 4090) | Q8_0 + q8_0 KV cache + Flash Attention, 8K–16K context |
| 16 GB | Q8_0 + q4_0 KV cache + Flash Attention, 4K–8K context |
| 48+ GB | Q8_0 + f16 KV cache, full 32K+ context |
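As a back-of-envelope illustration of why the cache type matters, the sketch below estimates KV cache size for an attention stack. The layer/head counts are placeholders, not the real Nemotron-3-Super-120B-A12B configuration; q8_0's ~8.5 effective bits per element accounts for its per-block fp16 scale, and planar3 is approximated as a flat 3 bits:

```python
# Back-of-envelope KV cache sizing for a model's attention layers.
# n_layers / n_kv_heads / head_dim below are ILLUSTRATIVE placeholders;
# substitute the actual model configuration.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bits_per_elt):
    # 2x covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bits_per_elt / 8

cfg = dict(n_layers=40, n_kv_heads=8, head_dim=128, ctx=16384)
for name, bits in [("f16", 16), ("q8_0", 8.5), ("planar3", 3)]:
    gib = kv_cache_bytes(**cfg, bits_per_elt=bits) / 2**30
    print(f"{name:8s} {gib:6.2f} GiB")
# For these toy numbers: f16 = 2.50 GiB, q8_0 = 1.33 GiB, planar3 = 0.47 GiB.
```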
