---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier, matched-stack]
library_name: gguf
pipeline_tag: image-text-to-text
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---
# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF IQ4_XS + TurboQuant KV-Cache (matched stack)
Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at GGUF IQ4_XS.
**No new weights are published here.** This card describes a runtime configuration:
load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS)
and apply the KV-cache modifier
documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
## Quickstart
This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (the matched stack). This combo card and the KV-cache modifier card are documentation-only; the actual `.gguf` binaries live in the parent weight repo, which the command below pulls from.
```bash
# 1. Download the GGUF + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS IQ4_XS.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj
# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
  -m ./model/IQ4_XS.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512
# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/IQ4_XS.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256
# Disable extended reasoning (default is on):
# add `--chat-template-kwargs '{"enable_thinking": false}'`
```
> ⚠️ Do NOT use llama.cpp built against CUDA 13.2: it produces gibberish. Pin CUDA 12.x or use the Metal/CPU backends.
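For serving instead of one-shot CLI runs, a minimal `llama-server` sketch follows. The port, context size, and request body are illustrative assumptions, not values prescribed by this card.

```bash
# Sketch: serve the same GGUF over llama.cpp's OpenAI-compatible API.
# Port and context size (-c) are illustrative choices.
llama-server \
  -m ./model/IQ4_XS.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --port 8080 -c 8192

# Text-only request, mirroring the Quickstart sampling settings.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is the capital of France?"}],
       "temperature": 0.6, "top_p": 0.95, "max_tokens": 256}'
```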
## Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | IQ4_XS (per the variant suffix) |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship.
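To see this split on disk, you can dump tensor dtypes from each file. A minimal sketch, assuming the `gguf` pip package (which ships the `gguf-dump` utility):

```bash
# Sketch: confirm the encoder/projector vs backbone precision split.
# Assumes `pip install gguf`, which provides gguf-dump.
gguf-dump ./mmproj/mmproj-F16.gguf | head -n 40   # encoder/projector tensors (F16)
gguf-dump ./model/IQ4_XS.gguf | head -n 40        # backbone tensors (IQ4_XS)
```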
## Runtime quirks
### llama.cpp
Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf`
(see `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`).
**Do NOT use CUDA 13.2**: it produces gibberish. Pin CUDA 12.x or
use the Metal/CPU paths (one way to pin at build time is sketched below).
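The CUDA 12.4 install path in this sketch is an assumption; adjust it to your system.

```bash
# Sketch: build llama.cpp against a pinned CUDA 12.x toolkit.
# /usr/local/cuda-12.4 is an assumed install location.
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j
```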
### Ollama
Text-only; multimodal is blocked because Ollama doesn't yet support
the mmproj split-file pattern.
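Text-only use still works by importing the GGUF directly. A minimal sketch; the model name `nemotron3-nano-omni-iq4xs` is an illustrative choice, not an official tag.

```bash
# Sketch: text-only Ollama import of the downloaded GGUF.
cat > Modelfile <<'EOF'
FROM ./model/IQ4_XS.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF
ollama create nemotron3-nano-omni-iq4xs -f Modelfile
ollama run nemotron3-nano-omni-iq4xs "What is the capital of France?"
```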
### Reasoning mode
`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
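With llama.cpp this maps to the `--chat-template-kwargs` flag already shown in the Quickstart; the prompt below is illustrative.

```bash
# Disable extended reasoning at the CLI (same flag as in the Quickstart).
llama-cli \
  -m ./model/IQ4_XS.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -p "Summarize the GGUF format in one sentence." \
  --temp 0.6 --top-p 0.95 -n 128
```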
## Variants in this family
(All 56 variants under `majentik/nemotron3-nano-omni-30b-*` are listed below; the current variant, `TurboQuant-GGUF-IQ4_XS-TQ-KV`, is **bolded**.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [mmproj-F16](https://huggingface.co/majentik/nemotron3-nano-omni-30b-mmproj-f16) | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| [RotorQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| [RotorQuant-GGUF-IQ4_XS-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-iq4_xs-rq-kv) | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| [RotorQuant-GGUF-MXFP4_MOE-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-mxfp4_moe-rq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| [RotorQuant-GGUF-Q2_K-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q2_k-rq-kv) | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| [RotorQuant-GGUF-Q3_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q3_k_m-rq-kv) | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q4_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q4_k_m-rq-kv) | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q5_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q5_k_m-rq-kv) | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q8_0-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q8_0-rq-kv) | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-2bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit-rq-kv) | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [RotorQuant-MLX-3bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit-rq-kv) | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [RotorQuant-MLX-4bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit-rq-kv) | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-5bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit-rq-kv) | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| [RotorQuant-MLX-6bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-6bit-rq-kv) | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [RotorQuant-MLX-8bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit-rq-kv) | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| [RotorQuant-MLX-MXFP4](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-mxfp4) | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| [TurboQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [TurboQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [TurboQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| [TurboQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q3_K_M) | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [TurboQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [TurboQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [TurboQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| **TurboQuant-GGUF-IQ4_XS-TQ-KV** | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| [TurboQuant-GGUF-MXFP4_MOE-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-mxfp4_moe-tq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| [TurboQuant-GGUF-Q2_K-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q2_k-tq-kv) | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| [TurboQuant-GGUF-Q3_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q3_k_m-tq-kv) | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q4_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q4_k_m-tq-kv) | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q5_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q5_k_m-tq-kv) | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q8_0-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q8_0-tq-kv) | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-2bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit-tq-kv) | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [TurboQuant-MLX-3bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit-tq-kv) | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [TurboQuant-MLX-4bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit-tq-kv) | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-5bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit-tq-kv) | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| [TurboQuant-MLX-6bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit-tq-kv) | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [TurboQuant-MLX-8bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit-tq-kv) | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |