---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags:
- nemotron
- multimodal
- turboquant
- kv-cache
- gguf
- combo-card
- llama.cpp
- runtime-modifier
- matched-stack
library_name: gguf
pipeline_tag: image-text-to-text
language:
- en
datasets:
- nvidia/Nemotron-Image-Training-v3
inference: false
---
# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF Q2_K + TurboQuant KV-Cache (matched stack)
Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
of Nemotron-3-Nano-Omni-30B-A3B-Reasoning at GGUF Q2_K.
No new weights are published here. This card describes a runtime configuration:
load the weights from majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K
and apply the KV-cache modifier
documented in majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant.
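As a rough illustration of what the matched stack means operationally (not the modifier's actual recipe), the sketch below layers llama.cpp's generic quantized-KV flags on top of the Q2_K weights. It assumes the weights were already downloaded as in the Quickstart below, and the `q8_0` cache type is a placeholder:

```bash
# Illustrative only: the q8_0 K-cache type is an assumption, not the TurboQuant recipe;
# see the modifier card for the actual settings.
llama-cli \
  -m ./model/Q2_K.gguf \
  --cache-type-k q8_0 \
  -p "Summarize the difference between weight and KV-cache quantization." \
  --temp 0.6 --top-p 0.95 -n 256
# To also quantize the V cache, add --cache-type-v (requires flash attention, -fa,
# in builds where it is not enabled by default).
```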
## Quickstart
This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (the matched stack). Both this combo card and the modifier card are documentation-only; download the parent weight repo for the actual .gguf binaries.
```bash
# 1. Download the GGUF weights (from the parent weight repo) + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K Q2_K.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj

# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512

# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/Q2_K.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256

# Disable extended reasoning (default is on):
# add `--chat-template-kwargs '{"enable_thinking": false}'`
```
> ⚠️ Do **not** use llama.cpp built against CUDA 13.2: it produces gibberish output. Pin CUDA 12.x or use the Metal/CPU backends.
## Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (hybrid Mamba-2 + Transformer sparse MoE) | Q2_K (per the variant suffix) |
| Image | CRADIO v4-H | Full precision: BF16 in non-GGUF variants; GGUF variants ship it as the separate mmproj-F16 file |
| Audio | Parakeet-TDT-0.6B-v2 | Full precision (same rationale) |
| Video | CRADIO v4-H + frame sampler (sampled frames), Parakeet-TDT-0.6B-v2 (audio track) | Full precision (clips ≤ 2 min, up to 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.
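For non-text inputs, the quantized backbone is always paired with the separate mmproj file. A hedged sketch for an audio prompt, assuming your `llama-mtmd-cli` build includes audio support and the `--audio` flag (`clip.wav` is a placeholder):

```bash
# Sketch only: clip.wav is a placeholder; --audio requires an mtmd build with audio support.
llama-mtmd-cli \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --audio clip.wav \
  -p "Transcribe this clip and summarize it in one sentence." \
  --temp 0.6 --top-p 0.95 -n 256
```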
## Runtime quirks

### llama.cpp
- Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf`
  (see majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16).
- Do **not** use llama.cpp built against CUDA 13.2: it produces gibberish output. Pin CUDA 12.x or use the Metal/CPU backends.
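A hedged build sketch for pinning the toolkit; the CUDA install path and the 12.4 version below are illustrative, not requirements from this card:

```bash
# Illustrative: adjust the path to your local CUDA 12.x installation.
/usr/local/cuda-12.4/bin/nvcc --version    # confirm a 12.x toolkit, not 13.2

# Standard upstream llama.cpp CMake build against that toolkit.
cmake -B build -DGGML_CUDA=ON \
      -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j
```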
### Ollama
Text-only; multimodal is blocked because Ollama doesn't yet support the mmproj split-file pattern.
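A hedged text-only sketch using Ollama's Modelfile import; the local tag name is arbitrary and the GGUF path matches the Quickstart download:

```bash
# Sketch: the tag "nemotron3-nano-omni-q2k" is an arbitrary local name.
cat > Modelfile <<'EOF'
FROM ./model/Q2_K.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF

ollama create nemotron3-nano-omni-q2k -f Modelfile
ollama run nemotron3-nano-omni-q2k "What is the capital of France?"
```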
## Reasoning mode
`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
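A minimal text-only invocation with extended reasoning disabled, reusing the flag already shown in the Quickstart:

```bash
# Same Q2_K weights; thinking disabled through the chat-template kwargs.
llama-cli \
  -m ./model/Q2_K.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 128
```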
## Variants in this family
(Showing 56 sibling variants under majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-*. The current variant, **TurboQuant-GGUF-Q2_K-TQ-KV**, is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| mmproj-F16 | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| RotorQuant-GGUF-IQ4_XS-RQ-KV | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| RotorQuant-GGUF-MXFP4_MOE-RQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| RotorQuant-GGUF-Q2_K-RQ-KV | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| RotorQuant-GGUF-Q3_K_M-RQ-KV | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q4_K_M-RQ-KV | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q5_K_M-RQ-KV | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q8_0-RQ-KV | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| RotorQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-2bit-RQ-KV | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| RotorQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| RotorQuant-MLX-3bit-RQ-KV | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-4bit-RQ-KV | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| RotorQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-5bit-RQ-KV | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| RotorQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-6bit-RQ-KV | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| RotorQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| RotorQuant-MLX-8bit-RQ-KV | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| RotorQuant-MLX-MXFP4 | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| TurboQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| TurboQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| TurboQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| TurboQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| TurboQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| TurboQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| TurboQuant-GGUF-IQ4_XS-TQ-KV | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| TurboQuant-GGUF-MXFP4_MOE-TQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| **TurboQuant-GGUF-Q2_K-TQ-KV** | llama.cpp | ~18 GB | Q2_K + TurboQuant KV (this card) |
| TurboQuant-GGUF-Q3_K_M-TQ-KV | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q4_K_M-TQ-KV | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q5_K_M-TQ-KV | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q8_0-TQ-KV | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| TurboQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-2bit-TQ-KV | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| TurboQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| TurboQuant-MLX-3bit-TQ-KV | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-4bit-TQ-KV | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| TurboQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-5bit-TQ-KV | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| TurboQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-6bit-TQ-KV | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| TurboQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| TurboQuant-MLX-8bit-TQ-KV | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |