majentik committed on
Commit c720c7a · verified · 1 Parent(s): a2f6a39

docs: Tier 1 polish — frontmatter + quickstart + KV-root rewrite

Files changed (1): README.md (+35 -1)
README.md CHANGED
@@ -3,7 +3,13 @@ license: other
  license_name: nvidia-open-model-license
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
  base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
- tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card]
+ tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier,
+   matched-stack]
+ library_name: gguf
+ pipeline_tag: image-text-to-text
+ language: [en]
+ datasets: [nvidia/Nemotron-Image-Training-v3]
+ inference: false
  ---

  # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF IQ4_XS + TurboQuant KV-Cache (matched stack)
@@ -16,6 +22,34 @@ load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQua
  and apply the KV-cache modifier
  documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

+ ## Quickstart
+
+ This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both are documentation-only; load the parent weight repo for the actual `.gguf` binaries.
+
+ ```bash
+ # 1. Download the GGUF + the multimodal projector
+ huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS-TQ-KV IQ4_XS.gguf --local-dir ./model
+ huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj
+
+ # 2. Multimodal inference (text + image + audio + video)
+ llama-mtmd-cli \
+   -m ./model/IQ4_XS.gguf \
+   --mmproj ./mmproj/mmproj-F16.gguf \
+   --image cat.jpg \
+   -p "Describe this image in detail" \
+   --temp 0.6 --top-p 0.95 -n 512
+
+ # 3. Text-only inference (no mmproj needed)
+ llama-cli \
+   -m ./model/IQ4_XS.gguf \
+   -p "What is the capital of France?" \
+   --temp 0.6 --top-p 0.95 -n 256
+
+ # Disable extended reasoning (default is on):
+ # add `--chat-template-kwargs '{"enable_thinking": false}'`
+ ```
+
+ > ⚠️ Do NOT use llama.cpp built against CUDA 13.2; it produces gibberish. Pin CUDA 12.x or use Metal/CPU.
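
The CUDA caveat above can be enforced mechanically in launch scripts rather than trusted to memory. A minimal guard sketch, assuming only that you can obtain the build's CUDA version as a string; the `CUDA_VER` variable and its `12.4` default are illustrative placeholders (populate it from `nvcc --version` or your build logs), not something this repo provides:

```shell
#!/bin/sh
# Hypothetical pre-launch guard for the CUDA 13.x caveat above.
# CUDA_VER is a placeholder; set it from `nvcc --version` or your build config.
CUDA_VER="${CUDA_VER:-12.4}"

case "$CUDA_VER" in
  13.*)
    # Matches the warning: 13.x builds are reported to produce gibberish.
    echo "CUDA $CUDA_VER build is known-bad for this stack; pin 12.x or use Metal/CPU." >&2
    exit 1
    ;;
  *)
    echo "CUDA $CUDA_VER accepted"
    ;;
esac
```

With `CUDA_VER=13.2` the guard exits non-zero before any model is loaded; any other value falls through so the launch step can proceed.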
+
  ## Modality matrix

  | Modality | Encoder | Quantization in this variant |