---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier, matched-stack]
library_name: gguf
pipeline_tag: image-text-to-text
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF Q2_K + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at GGUF Q2_K.

**No new weights are published here.** This card describes a runtime configuration: load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K) and apply the KV-cache modifier documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

## Quickstart

This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). This card and the KV-cache modifier card are documentation-only; download the actual `.gguf` binaries from the parent weight repo, as in step 1 below.

```bash
# 1. Download the GGUF + the multimodal projector (this card ships no binaries)
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K Q2_K.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj

# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512

# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/Q2_K.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256

# Disable extended reasoning (default is on):
# add `--chat-template-kwargs '{"enable_thinking": false}'`
```

> ⚠️ Do NOT use llama.cpp built against CUDA 13.2; it produces gibberish. Pin CUDA 12.x or use Metal/CPU.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | Q2_K (this card's variant suffix) |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses the mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders and the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

## Runtime quirks

### llama.cpp

Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf` (see `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`). **Do NOT use CUDA 13.2**; it produces gibberish. Pin CUDA 12.x or use the Metal/CPU paths.

### Ollama

Text-only; multimodal is blocked because Ollama doesn't yet support the mmproj split-file pattern.

### Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning (e.g., for latency-sensitive cases), pass `enable_thinking=False` to the chat template / generate call, as in the sketch below.
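If you serve the model rather than running one-shot CLI calls, the same flag can be set server-wide. A minimal sketch, assuming a llama.cpp build recent enough for `llama-server` to support `--chat-template-kwargs` (the port is arbitrary; paths reuse the quickstart layout):

```bash
# Serve with extended reasoning disabled by default
# (sketch; assumes --chat-template-kwargs is available in your llama.cpp build)
llama-server \
  -m ./model/Q2_K.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --port 8080
```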
No separate "no-think" variant card exists — this is a runtime flag, not a model variant.
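Because it is a runtime flag, it can also be toggled per request when serving with `llama-server`. A hedged sketch: the `chat_template_kwargs` request field is assumed to be forwarded to the chat template by recent llama.cpp builds; verify against your build's server documentation.

```bash
# Per-request override via llama-server's OpenAI-compatible endpoint
# (sketch; `chat_template_kwargs` support depends on the llama.cpp build)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "chat_template_kwargs": {"enable_thinking": false},
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 256
      }'
```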