majentik committed on
Commit c720c7a · verified · 1 Parent(s): a2f6a39

docs: Tier 1 polish — frontmatter + quickstart + KV-root rewrite

Files changed (1): README.md (+35 -1)
README.md CHANGED
@@ -3,7 +3,13 @@ license: other
  license_name: nvidia-open-model-license
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
  base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
- tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card]
+ tags: [nemotron, multimodal, turboquant, kv-cache, gguf, combo-card, llama.cpp, runtime-modifier,
+   matched-stack]
+ library_name: gguf
+ pipeline_tag: image-text-to-text
+ language: [en]
+ datasets: [nvidia/Nemotron-Image-Training-v3]
+ inference: false
  ---

  # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF IQ4_XS + TurboQuant KV-Cache (matched stack)
@@ -16,6 +22,34 @@ load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQua
  and apply the KV-cache modifier
  documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

+ ## Quickstart
+
+ This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both are documentation-only; load the parent weight repo for the actual `.gguf` binaries.
+
+ ```bash
+ # 1. Download the GGUF + the multimodal projector
+ huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS-TQ-KV IQ4_XS.gguf --local-dir ./model
+ huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj
+
+ # 2. Multimodal inference (text + image + audio + video)
+ llama-mtmd-cli \
+   -m ./model/IQ4_XS.gguf \
+   --mmproj ./mmproj/mmproj-F16.gguf \
+   --image cat.jpg \
+   -p "Describe this image in detail" \
+   --temp 0.6 --top-p 0.95 -n 512
+
+ # 3. Text-only inference (no mmproj needed)
+ llama-cli \
+   -m ./model/IQ4_XS.gguf \
+   -p "What is the capital of France?" \
+   --temp 0.6 --top-p 0.95 -n 256
+
+ # Disable extended reasoning (default is on):
+ # add `--chat-template-kwargs '{"enable_thinking": false}'`
+ ```
+
+ > ⚠️ Do NOT use llama.cpp built against CUDA 13.2; it produces gibberish. Pin CUDA 12.x or use Metal/CPU.
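
The CUDA caveat above can be enforced mechanically in launch scripts rather than trusted to memory. A minimal guard sketch, assuming only that you can obtain the build's CUDA version as a string; the `CUDA_VER` variable and its `12.4` default are illustrative placeholders (populate it from `nvcc --version` or your build logs), not something this repo provides:

```shell
#!/bin/sh
# Hypothetical pre-launch guard for the CUDA 13.x caveat above.
# CUDA_VER is a placeholder; set it from `nvcc --version` or your build config.
CUDA_VER="${CUDA_VER:-12.4}"

case "$CUDA_VER" in
  13.*)
    # Matches the warning: 13.x builds are reported to produce gibberish.
    echo "CUDA $CUDA_VER build is known-bad for this stack; pin 12.x or use Metal/CPU." >&2
    exit 1
    ;;
  *)
    echo "CUDA $CUDA_VER accepted"
    ;;
esac
```

With `CUDA_VER=13.2` the guard exits non-zero before any model is loaded; any other value falls through so the launch step can proceed.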
+
  ## Modality matrix

  | Modality | Encoder | Quantization in this variant |