docs: Tier 1 polish — frontmatter + quickstart + KV-root rewrite

README.md CHANGED
````diff
@@ -3,7 +3,13 @@ license: other
 license_name: nvidia-open-model-license
 license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
 base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
-tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card]
+tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier,
+  matched-stack]
+library_name: gguf
+pipeline_tag: image-text-to-text
+language: [en]
+datasets: [nvidia/Nemotron-Image-Training-v3]
+inference: false
 ---
 
 # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF Q2_K + TurboQuant KV-Cache (matched stack)
@@ -16,6 +22,34 @@ load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQua
 and apply the KV-cache modifier
 documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
 
+## Quickstart
+
+This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both are documentation-only — load the parent weight repo for actual `.gguf` binaries.
+```bash
+# 1. Download the GGUF + the multimodal projector
+huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-Q2_K-TQ-KV Q2_K.gguf --local-dir ./model
+huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj
+
+# 2. Multimodal inference (text + image + audio + video)
+llama-mtmd-cli \
+  -m ./model/Q2_K.gguf \
+  --mmproj ./mmproj/mmproj-F16.gguf \
+  --image cat.jpg \
+  -p "Describe this image in detail" \
+  --temp 0.6 --top-p 0.95 -n 512
+
+# 3. Text-only inference (no mmproj needed)
+llama-cli \
+  -m ./model/Q2_K.gguf \
+  -p "What is the capital of France?" \
+  --temp 0.6 --top-p 0.95 -n 256
+
+# Disable extended reasoning (default is on):
+# add `--chat-template-kwargs '{"enable_thinking": false}'`
+```
+
+> ⚠️ Do NOT use llama.cpp built against CUDA 13.2 — it produces gibberish. Pin CUDA 12.x or use Metal/CPU.
+
 ## Modality matrix
 
 | Modality | Encoder | Quantization in this variant |
````
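
The quickstart added above toggles extended reasoning by passing a JSON object to `--chat-template-kwargs`. Hand-writing JSON inside shell quotes is easy to get wrong (Python-style `False`, single quotes), so here is a minimal sketch of building that argument programmatically — the `thinking_flag` helper is hypothetical, not part of the card:

```python
import json

# Hypothetical helper (not from the card): build the llama.cpp argv fragment
# that toggles Nemotron's extended-reasoning mode via the chat template.
def thinking_flag(enable: bool) -> list[str]:
    # json.dumps guarantees valid JSON (lowercase true/false, double quotes),
    # which hand-written shell-quoted JSON often gets wrong.
    return ["--chat-template-kwargs", json.dumps({"enable_thinking": enable})]

args = thinking_flag(False)
print(args)  # ['--chat-template-kwargs', '{"enable_thinking": false}']
```

Passing the fragment via `subprocess.run(["llama-cli", "-m", "./model/Q2_K.gguf", *thinking_flag(False), ...])` sidesteps shell quoting entirely.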