---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags:
  - nemotron
  - multimodal
  - turboquant
  - kv-cache
  - gguf
  - combo-card
  - llama.cpp
  - runtime-modifier
  - matched-stack
library_name: gguf
pipeline_tag: image-text-to-text
language:
  - en
datasets:
  - nvidia/Nemotron-Image-Training-v3
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF Q2_K + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack of Nemotron-3-Nano-Omni-30B-A3B-Reasoning at GGUF Q2_K.

**No new weights are published here.** This card describes a runtime configuration: load the weights from `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K` and apply the KV-cache modifier documented in `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`.

## Quickstart

This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both this card and the KV-cache modifier card are documentation-only; the actual `.gguf` binaries live in the parent weight repo.

```bash
# 1. Download the GGUF (from the parent weight repo) + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K Q2_K.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj

# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512

# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/Q2_K.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256

# Disable extended reasoning (default is on):
#   add `--chat-template-kwargs '{"enable_thinking": false}'`
```
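
The quickstart above loads only the weights; the KV-cache half of the matched stack is defined in the `...-TurboQuant` modifier card. Below is an illustrative sketch only, assuming the modifier maps onto llama.cpp's standard `--cache-type-k` / `--cache-type-v` flags and using `q8_0` purely as a placeholder, not as the modifier's prescribed setting:

```bash
# Illustrative matched-stack run: same Q2_K weights, with llama.cpp's built-in
# KV-cache quantization flag standing in for the TurboQuant KV-cache modifier.
# q8_0 is a placeholder; the prescribed cache types are documented in the
# ...-TurboQuant card, not here.
llama-cli \
  -m ./model/Q2_K.gguf \
  -p "Explain mixture-of-experts routing in two sentences." \
  --temp 0.6 --top-p 0.95 -n 256 \
  --cache-type-k q8_0
# The V cache can be quantized the same way (--cache-type-v q8_0), but llama.cpp
# requires flash attention to be enabled for a quantized V cache.
```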

⚠️ Do NOT use a llama.cpp build compiled against CUDA 13.2; it produces gibberish output. Pin CUDA 12.x or use Metal/CPU.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid, sparse MoE) | Per the variant suffix (Q2_K here) |
| Image | CRADIO v4-H | BF16 (kept full-precision in every non-GGUF variant; GGUF uses the mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | BF16 (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | BF16 (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.
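
To verify this on the files you downloaded, the `gguf` Python package (installed with `pip install gguf`) ships a `gguf-dump` tool that lists every tensor with its dtype. A quick check, using the paths from the quickstart above:

```bash
# Tooling that ships alongside llama.cpp
pip install gguf

# Projector tensors should report F16/F32, i.e. not quantized:
gguf-dump ./mmproj/mmproj-F16.gguf | grep -iE "f16|f32" | head -n 20

# The main model file shows Q2_K blocks for the LLM backbone:
gguf-dump ./model/Q2_K.gguf | head -n 40
```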

## Runtime quirks

### llama.cpp

Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf` (see `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`).
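
For a persistent endpoint instead of one-shot CLI runs, recent llama.cpp builds expose the same multimodal path through `llama-server`. A minimal sketch; whether the server accepts `--mmproj` depends on your llama.cpp version:

```bash
# OpenAI-compatible HTTP server; --mmproj enables image input on builds with
# multimodal server support.
llama-server \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --host 127.0.0.1 --port 8080 \
  -c 8192
```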

Do NOT use a build compiled against CUDA 13.2; it produces gibberish output. Pin CUDA 12.x or use the Metal/CPU paths.
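
A minimal sketch of pinning the build to a CUDA 12.x toolkit when compiling llama.cpp from source (the toolkit path is an example; adjust to your install, and note that older checkouts use different CMake flag names):

```bash
# Build llama.cpp against a specific CUDA 12.x toolkit rather than whatever nvcc
# is first on PATH. /usr/local/cuda-12.4 is an example path.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j
```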

### Ollama

Text-only; multimodal is blocked because Ollama doesn't yet support the mmproj split-file pattern.
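
A minimal text-only import sketch. The model name `nemotron3-nano-omni-q2k` is just an example, and whether Ollama's bundled llama.cpp can run this architecture depends on your Ollama version:

```bash
# Register the local GGUF with Ollama for text-only use.
cat > Modelfile <<'EOF'
FROM ./model/Q2_K.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF

ollama create nemotron3-nano-omni-q2k -f Modelfile
ollama run nemotron3-nano-omni-q2k "What is the capital of France?"
```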

## Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning (e.g., for latency-sensitive cases), pass `enable_thinking=False` to the chat template / generate call. No separate "no-think" variant card exists; this is a runtime flag, not a model variant.
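
If you serve the model with `llama-server` (see the llama.cpp section above), recent builds also forward a per-request `chat_template_kwargs` field. A sketch, assuming that support is present in your build:

```bash
# Request a completion with extended reasoning disabled via the chat template.
# chat_template_kwargs forwarding requires a recent llama.cpp server build.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "chat_template_kwargs": {"enable_thinking": false},
        "temperature": 0.6,
        "max_tokens": 256
      }'
```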

## Variants in this family

(Showing 56 sibling variants under `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-*`. The current variant, **TurboQuant-GGUF-Q2_K-TQ-KV**, is bolded.)

| Variant | Runtime | Approx. size | Use case |
|---|---|---|---|
| mmproj-F16 | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| RotorQuant-GGUF-IQ4_XS-RQ-KV | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| RotorQuant-GGUF-MXFP4_MOE-RQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| RotorQuant-GGUF-Q2_K-RQ-KV | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| RotorQuant-GGUF-Q3_K_M-RQ-KV | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q4_K_M-RQ-KV | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q5_K_M-RQ-KV | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q8_0-RQ-KV | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| RotorQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-2bit-RQ-KV | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| RotorQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| RotorQuant-MLX-3bit-RQ-KV | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-4bit-RQ-KV | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| RotorQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-5bit-RQ-KV | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| RotorQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-6bit-RQ-KV | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| RotorQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| RotorQuant-MLX-8bit-RQ-KV | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| RotorQuant-MLX-MXFP4 | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| TurboQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| TurboQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| TurboQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| TurboQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| TurboQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| TurboQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| TurboQuant-GGUF-IQ4_XS-TQ-KV | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| TurboQuant-GGUF-MXFP4_MOE-TQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| **TurboQuant-GGUF-Q2_K-TQ-KV** | llama.cpp | ~18 GB | Q2_K + TurboQuant KV (this card) |
| TurboQuant-GGUF-Q3_K_M-TQ-KV | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q4_K_M-TQ-KV | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q5_K_M-TQ-KV | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q8_0-TQ-KV | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| TurboQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-2bit-TQ-KV | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| TurboQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| TurboQuant-MLX-3bit-TQ-KV | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-4bit-TQ-KV | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| TurboQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-5bit-TQ-KV | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| TurboQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-6bit-TQ-KV | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| TurboQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| TurboQuant-MLX-8bit-TQ-KV | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |