Nemotron-Nano-9B-v2 W4A16

W4A16-quantized Nemotron-Nano-9B-v2 — optimized for consumer GPUs (RTX 3090, 24 GB).

This is a community quantization of nvidia/NVIDIA-Nemotron-Nano-9B-v2. It is served with a custom Triton GEMV kernel that targets the SSM (Mamba-2) layers, which dominate decode time. Weights are 4-bit integers; activations remain bfloat16.

Quantization code and serving instructions: github.com/redhelix/nanoquant

Performance

Throughput (RTX 3090, 24 GB GDDR6X)

Metric	BF16 baseline	W4A16 (this model)	Speedup
Single decode token/s (p50)	44.8 t/s	51.0 t/s	1.14×

Single decode: measured through the OpenAI-compatible API of a vLLM server (20 runs, 128 output tokens each). Batched-decode throughput under real concurrency has not yet been measured end-to-end.
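
A measurement of this kind can be sketched with the standard library alone. This is an illustrative harness, not necessarily the script behind the table: the URL assumes the serve command's default host and port, the prompt is arbitrary, and the helper names (time_one_request, p50_tokens_per_second, benchmark) are hypothetical. The timing covers the whole request, so prefill and HTTP overhead are included in each sample.

```python
import json
import statistics
import time
import urllib.request

# Assumed values: adjust the URL and model name to match your running server.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "redhelix/Nemotron-Nano-9B-v2-W4A16"


def time_one_request(prompt: str, max_tokens: int = 128) -> tuple[int, float]:
    """Send one non-streaming completion and return (output tokens, seconds)."""
    payload = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0.0}
    ).encode()
    request = urllib.request.Request(
        BASE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    return body["usage"]["completion_tokens"], elapsed


def p50_tokens_per_second(samples: list[tuple[int, float]]) -> float:
    """Median of per-run tokens/s, where each sample is (tokens, seconds)."""
    return statistics.median(tokens / seconds for tokens, seconds in samples)


def benchmark(runs: int = 20) -> float:
    """Run sequential single-stream requests and report the p50 rate."""
    prompt = "Explain the Mamba SSM architecture."
    samples = [time_one_request(prompt) for _ in range(runs)]
    return p50_tokens_per_second(samples)
```

Calling benchmark() against a warmed-up server returns one p50 figure per configuration; running it once for BF16 and once for W4A16 gives the two numbers needed for a speedup ratio.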

VRAM

	BF16	W4A16	Saved
SSM layer weights	16.6 GB	8.5 GB	−8.1 GB

The VRAM reduction allows a 32k context window on a single 24 GB GPU. BF16 requires either a larger GPU or a reduced context.

Accuracy

BF16 numbers are reproduced using NVIDIA's NeMo-Skills eval framework. All results use Reasoning-On mode (thinking enabled), except RULER 128K, which uses Reasoning-Off.

Benchmark	BF16 (NeMo-Skills)	W4A16 (this model)	NVFP4 (NVIDIA)
AIME 2025	72.1%	—	71.5%
MATH-500	97.0%	—	97.2%
GPQA Diamond	66.1%	—	62.7%
LiveCodeBench	67.4%	—	67.8%
BFCL v3	67.0%	—	65.9%
IFEval (Strict)	90.1%	—	89.3%
RULER 128K	79.1%	—	75.0%

BF16 results: pass@1 avg-of-8 (GPQA: majority@8), reproduced via NeMo-Skills. W4A16 results pending — contributions welcome via the nanoquant repo.

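NVIDIA's NVFP4 checkpoint uses Quantization-Aware Distillation (QAD) via ModelOpt, which typically recovers most of the accuracy lost to quantization. This W4A16 checkpoint uses simple min-max quantization without calibration data, so its accuracy may differ from the BF16 and NVFP4 figures above. The size of that difference is unknown until the W4A16 column is measured.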

Model Architecture

Nemotron-Nano-9B-v2 is a Nemotron-H hybrid SSM-transformer:

Layer type	Count	Notes
MambaMixer2 (SSM)	27	Quantized by NanoQuant
MLP	25	BF16, unchanged
Attention	4	BF16, unchanged
Total	56

The SSM layers account for ~75% of decode wall-clock time due to large, memory-bound GEMVs:

in_proj: [22,656 × 4,480], 203.0 MB per call in BF16 (50.7 MB at 4 bits)
out_proj: [4,480 × 10,240], 91.8 MB per call in BF16 (22.9 MB at 4 bits)

W4A16 reduces weight reads by ~4× per call, translating directly to throughput gains on memory-bandwidth-limited hardware.
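
The arithmetic behind these figures can be checked directly from the quoted shapes. The sketch below is a back-of-envelope calculation only: it ignores the per-group scale and zero-point storage, and the Amdahl-style ceiling assumes the ~75% SSM share is purely weight-bandwidth bound and speeds up by the full 4×, which a real kernel will not reach. The function names are illustrative.

```python
# Shapes quoted above for one SSM layer's two projections.
IN_PROJ = (22_656, 4_480)
OUT_PROJ = (4_480, 10_240)


def weight_megabytes(shape: tuple[int, int], bits_per_weight: int) -> float:
    """Bytes read per GEMV call for a weight matrix, in decimal megabytes."""
    rows, cols = shape
    return rows * cols * bits_per_weight / 8 / 1e6


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only a fraction of the runtime gets `factor` faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


for name, shape in (("in_proj", IN_PROJ), ("out_proj", OUT_PROJ)):
    bf16 = weight_megabytes(shape, 16)
    int4 = weight_megabytes(shape, 4)
    print(f"{name}: {bf16:.1f} MB in BF16, {int4:.1f} MB in int4")

# Idealized ceiling under the assumptions stated in the lead-in.
print(f"ideal decode speedup: {amdahl_speedup(0.75, 4.0):.2f}x")
```

Under these idealized assumptions the ceiling is about 2.29×. The measured 1.14× sits well below it, which is unsurprising given dequantization work, atomic reductions, and the parts of the SSM layers that are not weight reads.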

Quantization Method

Format: Asymmetric int4, group_size=128, column-major packed storage

Per group (128 input channels):

scale = (w_max - w_min) / 15
zero  = w_min / scale
w_q   = round(w / scale - zero).clamp(0, 15)   # stored as uint4
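
The three formulas above can be exercised in NumPy as follows. This is a minimal sketch, not the NanoQuant implementation: it assumes the weight is laid out as [K, N] with K the input-channel axis, and the small floor on scale (to avoid dividing by zero for a constant group) is an addition that the formulas do not mention. The function names are illustrative.

```python
import numpy as np

GROUP_SIZE = 128  # input channels per quantization group, as stated above


def quantize_groups(w: np.ndarray):
    """Asymmetric min-max int4 quantization of w [K, N], grouped along K."""
    k, n = w.shape
    assert k % GROUP_SIZE == 0, "K must be a multiple of the group size"
    groups = w.astype(np.float32).reshape(k // GROUP_SIZE, GROUP_SIZE, n)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    # Floor on scale is an added guard for constant groups (w_max == w_min).
    scale = np.maximum((w_max - w_min) / 15.0, 1e-8)
    zero = w_min / scale
    w_q = np.clip(np.round(groups / scale - zero), 0, 15).astype(np.uint8)
    return w_q.reshape(k, n), scale[:, 0, :], zero[:, 0, :]


def dequantize_groups(w_q: np.ndarray, scale: np.ndarray, zero: np.ndarray):
    """Invert the mapping: w is approximately (w_q + zero) * scale."""
    k, n = w_q.shape
    groups = w_q.reshape(k // GROUP_SIZE, GROUP_SIZE, n).astype(np.float32)
    return ((groups + zero[:, None, :]) * scale[:, None, :]).reshape(k, n)
```

With this mapping the group minimum lands on code 0 and the group maximum on code 15, so the round-trip error per weight is bounded by roughly half of that group's scale.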

Storage layout: [K//2, N] column-major (2 weights packed per byte). Column-major coalesces adjacent-thread reads in the Triton kernel — the key layout decision over naive row-major packing.
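
A packing routine for this layout might look like the sketch below. The nibble order is an assumption (even K index in the low nibble, odd K index in the high nibble); the text does not specify it, and the actual NanoQuant checkpoint may differ. pack_int4 and unpack_int4 are illustrative names.

```python
import numpy as np


def pack_int4(w_q: np.ndarray) -> np.ndarray:
    """Pack uint4 codes [K, N] into bytes [K//2, N], stored column-major."""
    k, n = w_q.shape
    assert k % 2 == 0, "K must be even to pack two codes per byte"
    assert int(w_q.max()) <= 15, "codes must fit in 4 bits"
    codes = w_q.astype(np.uint8)
    low = codes[0::2, :]   # even K indices -> low nibble (assumed convention)
    high = codes[1::2, :]  # odd K indices  -> high nibble (assumed convention)
    packed = (low | (high << np.uint8(4))).astype(np.uint8)
    # Fortran (column-major) order: for a fixed column, consecutive packed
    # K bytes are contiguous in memory.
    return np.asfortranarray(packed)


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the uint4 codes [K, N] from packed bytes [K//2, N]."""
    k_half, n = packed.shape
    w_q = np.empty((k_half * 2, n), dtype=np.uint8)
    w_q[0::2, :] = packed & 0x0F
    w_q[1::2, :] = packed >> 4
    return w_q
```

Packing halves the byte count relative to one code per byte, and the round trip through unpack_int4 is lossless.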

Kernel: K-parallel 2D Triton GEMV. Grid (ceil(N/BLOCK_N), n_groups) — each CTA handles one quantization group (64 inner iterations). Groups reduce via atomicAdd into a float32 accumulator, then cast to bf16. BLOCK_N autotuned across {32, 64, 128, 256}.
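
The grid decomposition can be emulated on the CPU to make the data flow concrete. The NumPy reference below is a sketch of the scheme described above, not the Triton kernel itself: the two outer loops stand in for the 2D grid, the += on y stands in for atomicAdd, the nibble order (even K index in the low nibble) is an assumption, and the final cast to bf16 is omitted because NumPy has no bfloat16 type.

```python
import numpy as np

GROUP_SIZE = 128
BYTES_PER_GROUP = GROUP_SIZE // 2  # 64 packed bytes, hence 64 inner iterations


def gemv_w4a16_reference(packed, scale, zero, x, block_n=32):
    """Emulate the (N-block, group) grid for y = x @ dequant(W).

    packed: uint8 [K//2, N], low nibble = even K index (assumed convention)
    scale, zero: float32 [K // GROUP_SIZE, N]
    x: activation vector [K]
    """
    k_half, n = packed.shape
    n_groups = (2 * k_half) // GROUP_SIZE
    x = x.astype(np.float32)
    y = np.zeros(n, dtype=np.float32)  # float32 accumulator shared by all groups
    n_blocks = -(-n // block_n)        # ceil(N / BLOCK_N)
    for pid_n in range(n_blocks):      # grid axis 0: a block of output columns
        cols = slice(pid_n * block_n, min((pid_n + 1) * block_n, n))
        for pid_g in range(n_groups):  # grid axis 1: one quantization group
            s = scale[pid_g, cols]
            z = zero[pid_g, cols]
            partial = np.zeros(cols.stop - cols.start, dtype=np.float32)
            for i in range(BYTES_PER_GROUP):
                row = pid_g * BYTES_PER_GROUP + i
                byte = packed[row, cols]
                lo = (byte & 0x0F).astype(np.float32)
                hi = (byte >> 4).astype(np.float32)
                # Dequantize both weights in the byte and accumulate.
                partial += (lo + z) * s * x[2 * row]
                partial += (hi + z) * s * x[2 * row + 1]
            y[cols] += partial         # stands in for atomicAdd across groups
    return y
```

Because every group adds into the same accumulator, the summation order on a GPU is not fixed, so results can differ in the last float32 bits from run to run; the sequential emulation here is deterministic.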

vLLM integration: torch.library.custom_op makes the GEMV opaque to aot_compile_fullgraph — the body runs eagerly during each CUDA graph capture so per-batch-size kernel launches are correctly recorded without graph breaks.

Comparison: W4A16 vs NVFP4

	NanoQuant W4A16	NVIDIA NVFP4
Hardware target	RTX 3090 (Ampere, SM86)	A10G / H100 / Jetson AGX Thor
Quantization format	Asymmetric int4 (group_size=128)	NVFP4 (4-bit float)
Calibration	Min-max (no data needed)	QAD (Quantization-Aware Distillation)
Kernel	Custom Triton GEMV	cuDNN / CUTLASS GEMM
Layers quantized	SSM in_proj / out_proj (27 layers)	All layers except first/last 2 + attention
vLLM support	Runtime patch via custom_op	Native (ModelOpt)
VRAM (model only)	8.5 GB (SSM) + ~8 GB (BF16 rest)	~5 GB full model

NVFP4 targets NVIDIA's professional fleet and uses hardware-native float4 GEMMs on H100/A10G. NanoQuant targets consumer Ampere and Ada GPUs (RTX 3080/3090/4090), where NVFP4 kernels are unavailable.

Usage

With vLLM (recommended)

pip install nanoquant vllm

VLLM_USE_BYTECODE_HOOK=0 python -m nanoquant.serve \
    --model redhelix/Nemotron-Nano-9B-v2-W4A16 \
    --host 0.0.0.0 --port 8000 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Direct inference

# Apply runtime patch to stock base model (no download of this checkpoint needed)
from nanoquant import patch_nemotron_h
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    trust_remote_code=True,
    max_model_len=32768,
    gpu_memory_utilization=0.92,
)
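# NOTE: this attribute path depends on the vLLM version and executor; adjust it if it does not resolve.
patch_nemotron_h(llm.llm_engine.model_executor.driver_worker.model_runner.model)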

output = llm.generate("Explain the Mamba SSM architecture.", SamplingParams(max_tokens=512))
print(output[0].outputs[0].text)

Tool calling

Tool calling works via the hermes parser (Nemotron emits a <TOOLCALL> format, and hermes is the closest available parser). For applications that require reliable structured tool output, use hintonator-cascade-2-30b or a frontier model.
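
When the parser misses a call, the raw completion text can be post-processed on the client side. The sketch below assumes the payload between the tags is a JSON array of objects with "name" and "arguments" keys; that shape is an assumption for illustration, and extract_tool_calls is a hypothetical helper, not part of NanoQuant or vLLM.

```python
import json
import re

# Matches the <TOOLCALL>[...]</TOOLCALL> wrapper; the lazy match stops at the
# first "]" that is immediately followed by the closing tag.
TOOLCALL_RE = re.compile(r"<TOOLCALL>(\[.*?\])</TOOLCALL>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed call in text; skip blocks that fail to parse."""
    calls = []
    for match in TOOLCALL_RE.finditer(text):
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed JSON: ignore this block rather than raise
        # Keep only objects that name a tool (assumed payload shape).
        calls.extend(
            item for item in parsed if isinstance(item, dict) and "name" in item
        )
    return calls
```

Blocks that fail to parse are dropped silently here; an application that needs strict guarantees would instead surface the failure and retry the request.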

Important Notes

VLLM_USE_BYTECODE_HOOK=0 is required — the bytecode hook interferes with the W4A16 custom_op patching
--compilation-config '{"max_cudagraph_capture_size":8}' is injected automatically by the serve script — limits CUDA graph captures to 4 sizes (1,2,4,8) instead of 51, preventing private pool OOM
Tool call accuracy is approximate — hermes parser handles tool_choice: auto but Nemotron's <TOOLCALL>[...]</TOOLCALL> format differs from standard hermes
Attention layers and MLP layers remain in BF16 — only the 27 SSM projection layers are quantized

License

The quantization code (NanoQuant) is Apache 2.0.

The base model weights are subject to the NVIDIA Open Model License. This quantized checkpoint is a derivative work and inherits the same license terms. Commercial use is permitted subject to NVIDIA's terms.

Citation

@misc{nanoquant2026,
  title  = {NanoQuant: W4A16 Triton GEMV for Nemotron-Nano-9B-v2 SSM Decode},
  author = {Amin Lalji},
  year   = {2026},
  url    = {https://github.com/redhelix/nanoquant},
}

Base model:

@misc{nvidia2025nemotron,
  title  = {Nemotron-Nano-9B: A Reasoning Model for Efficient Deployment},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2},
}
