majentik committed on
Commit 0b8aeb6 · verified · 1 Parent(s): e42fcaf

docs: Tier 1 polish — frontmatter + quickstart + KV-root rewrite

Files changed (1)
  1. README.md +36 -1
README.md CHANGED
@@ -3,7 +3,13 @@ license: other
3
  license_name: nvidia-open-model-license
4
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
5
  base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
6
- tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier]
7
  ---
8
 
9
  # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 4-bit + TurboQuant KV-Cache (matched stack)
@@ -16,6 +22,35 @@ of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 4-bit.
16
  and apply the TurboQuant KV-cache modifier documented in
17
  [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
18
19
  ## Modality matrix
20
 
21
  | Modality | Encoder | Quantization in this variant |
 
3
  license_name: nvidia-open-model-license
4
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
5
  base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
6
+ tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier,
7
+ apple-silicon, runtime-modifier, matched-stack]
8
+ library_name: mlx
9
+ pipeline_tag: text-generation
10
+ language: [en]
11
+ datasets: [nvidia/Nemotron-Image-Training-v3]
12
+ inference: false
13
  ---
14
 
15
  # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 4-bit + TurboQuant KV-Cache (matched stack)
 
22
  and apply the TurboQuant KV-cache modifier documented in
23
  [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
24
 
25
+ ## Quickstart
26
+
27
+ This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both are documentation-only; load the parent weight repo for the actual MLX shards.
28
+ ```python
29
+ # Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
30
+ # is not yet registered in mlx-lm. The cell below is the API shape that WILL
31
+ # work once upstream lands the class (track ml-explore/mlx-lm#386).
32
+
33
+ from mlx_lm import load, generate
34
+
35
+ model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit-TQ-KV")
36
+
37
+ prompt = tokenizer.apply_chat_template(
38
+     [{"role": "user", "content": "Solve: 17 * 23"}],
39
+     add_generation_prompt=True,
40
+     enable_thinking=False,  # set True (the default) to enable extended reasoning
41
+ )
42
+
43
+ response = generate(
44
+     model, tokenizer,
45
+     prompt=prompt,
46
+     max_tokens=512,
47
+     sampler=lambda x: x.argmax(axis=-1),  # greedy decoding; or use mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
48
+ )
49
+ print(response)
50
+ ```
51
+
52
+ > ⚠️ This variant covers the **text tower only**. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli`; see the GGUF cards in this family.
53
+
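The quickstart passes a `sampler` callable that maps log-probabilities to a token id, either a greedy argmax or a temperature/top-p sampler. The sketch below shows the two strategies in plain Python so the difference is easy to see. It is an illustration under simplifying assumptions, not mlx-lm's implementation: the function names `greedy` and `top_p` are hypothetical, and the input is a flat list of log-probabilities rather than a batched `mx.array`.

```python
import math
import random


def greedy(logprobs):
    """Return the index of the highest log-probability (what argmax does)."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])


def top_p(logprobs, p=0.95, temp=0.6, rng=random):
    """Nucleus sampling: draw from the smallest set of tokens whose
    temperature-scaled probabilities sum to at least p."""
    scaled = [lp / temp for lp in logprobs]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Walk tokens from most to least likely until the mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break

    # random.choices renormalises the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Toy distribution over four tokens (illustrative values only).
logprobs = [math.log(q) for q in (0.70, 0.20, 0.07, 0.03)]
print(greedy(logprobs))
print(top_p(logprobs, p=0.95, temp=0.6))
```

Greedy selection is deterministic, which suits short arithmetic prompts like the one in the quickstart; top-p with a moderate temperature trades that determinism for variety in longer generations.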
54
  ## Modality matrix
55
 
56
  | Modality | Encoder | Quantization in this variant |