majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit-TQ-KV

@@ -3,7 +3,13 @@ license: other
 license_name: nvidia-open-model-license
 license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
 base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
-tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier]
+tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier,
+  apple-silicon, runtime-modifier, matched-stack]
+library_name: mlx
+pipeline_tag: text-generation
+language: [en]
+datasets: [nvidia/Nemotron-Image-Training-v3]
+inference: false
 ---
 # Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 4-bit + TurboQuant KV-Cache (matched stack)
@@ -16,6 +22,35 @@ of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 4-bit.
 and apply the TurboQuant KV-cache modifier documented in
 [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
+## Quickstart
+This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both parts are documentation-only; load the parent weight repo for the actual MLX shards.
+```python
+# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
+# is not yet registered in mlx-lm. The cell below is the API shape that WILL
+# work once upstream lands the class (track ml-explore/mlx-lm#386).
+from mlx_lm import load, generate
+model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit-TQ-KV")
+prompt = tokenizer.apply_chat_template(
+    [{"role": "user", "content": "Solve: 17 * 23"}],
+    add_generation_prompt=True,
+    enable_thinking=False,  # True is the default and enables extended reasoning; False disables it
+)
+response = generate(
+    model, tokenizer,
+    prompt=prompt,
+    max_tokens=512,
+    sampler=lambda x: x.argmax(axis=-1),  # greedy; or use mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
+)
+print(response)
+```
+> ⚠️ This variant covers the **text tower only**. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli`; see the GGUF cards in this family.
 ## Modality matrix
 | Modality | Encoder | Quantization in this variant |