---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier, matched-stack]
library_name: gguf
pipeline_tag: image-text-to-text
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF Q2_K + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at GGUF Q2_K.

**No new weights are published here.** This card describes a runtime configuration:
load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K)
and apply the KV-cache modifier documented in
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

## Quickstart

This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both are documentation-only; load the parent weight repo for the actual `.gguf` binaries.

```bash
# 1. Download the GGUF + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K Q2_K.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj

# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
  -m ./model/Q2_K.gguf \
  --mmproj ./mmproj/mmproj-F16.gguf \
  --image cat.jpg \
  -p "Describe this image in detail" \
  --temp 0.6 --top-p 0.95 -n 512

# 3. Text-only inference (no mmproj needed)
llama-cli \
  -m ./model/Q2_K.gguf \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256

# Disable extended reasoning (default is on):
# add `--chat-template-kwargs '{"enable_thinking": false}'`
```
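
The commands above exercise only the weight half of the matched stack; the TurboQuant KV-cache modifier
itself is specified in the parent TurboQuant card. As a rough, unofficial stand-in you can experiment with
llama.cpp's built-in quantized KV-cache types via `--cache-type-k` / `--cache-type-v` (the `q8_0` choice
below is an illustrative assumption, not the TurboQuant recipe):

```bash
# Unofficial approximation of the KV-cache half of the stack: store the K cache
# in a quantized type. (--cache-type-v can be quantized as well, but llama.cpp
# then requires flash attention to be enabled.)
llama-cli \
  -m ./model/Q2_K.gguf \
  --cache-type-k q8_0 \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256
```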

> ⚠️ Do NOT use llama.cpp built against CUDA 13.2; it produces gibberish. Pin CUDA 12.x or use Metal/CPU.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | GGUF Q2_K (per the variant suffix) |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; the GGUF build ships it as the separate mmproj-F16 file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship.
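
To see what actually ships at which precision, the tensor list of any GGUF file can be inspected with the
`gguf-dump` tool from the `gguf` Python package (the paths below assume the Quickstart layout):

```bash
pip install gguf

# Dump metadata plus every tensor's name, shape and quantization type.
# In the projector file the encoder tensors should show up as F16/F32,
# not as a K-quant.
gguf-dump ./mmproj/mmproj-F16.gguf

# Same check on the main model: the bulk of the weights report Q2_K, while a
# few accuracy-critical tensors are typically kept at a higher-precision type.
gguf-dump ./model/Q2_K.gguf
```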

## Runtime quirks

### llama.cpp

Use `llama-mtmd-cli` for multimodal inference; pass `--mmproj mmproj-F16.gguf`
(see `majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16`).

**Do NOT use CUDA 13.2**; it produces gibberish. Pin CUDA 12.x or
use the Metal/CPU paths.
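
If you build llama.cpp from source, one way to make sure a CUDA 12.x toolkit is picked up is to point
CMake at it explicitly (the `/usr/local/cuda-12.4` path is just an example; adjust to your install):

```bash
# Build llama.cpp against an explicit CUDA 12.x toolkit rather than whichever
# nvcc happens to be first on PATH.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc
cmake --build build --config Release -j
```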

### Ollama

Text-only; multimodal is blocked because Ollama doesn't yet support
the mmproj split-file pattern.
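
For text-only use, a minimal sketch is to point a local `Modelfile` at the downloaded GGUF (the model
name `nemotron-q2k` is an arbitrary choice; the path and sampling parameters mirror the Quickstart):

```bash
# Register the local GGUF with Ollama and run it text-only.
cat > Modelfile <<'EOF'
FROM ./model/Q2_K.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95
EOF

ollama create nemotron-q2k -f Modelfile
ollama run nemotron-q2k "What is the capital of France?"
```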

### Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
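
With llama.cpp this is the `--chat-template-kwargs` flag already mentioned in the Quickstart; a complete
text-only call with extended reasoning turned off looks like:

```bash
# Same text-only invocation as the Quickstart, with thinking disabled.
llama-cli \
  -m ./model/Q2_K.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -p "What is the capital of France?" \
  --temp 0.6 --top-p 0.95 -n 256
```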