---
license: other
license_name: nvidia-open-model-license
license_link: >-
https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags:
- nemotron
- multimodal
- mamba2
- moe
- quantized
- turboquant
- mlx
- kv-cache-modifier
- apple-silicon
- runtime-modifier
- matched-stack
library_name: mlx
pipeline_tag: text-generation
language:
- en
datasets:
- nvidia/Nemotron-Image-Training-v3
inference: false
---
# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 3-bit + TurboQuant KV-Cache (matched stack)
Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
of Nemotron-3-Nano-Omni-30B-A3B-Reasoning at MLX 3-bit.

No new weights are published here. Load the weights from
`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit`
and apply the TurboQuant KV-cache modifier documented in
`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`.
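The parent cards document each half of the stack. As a shape reference, here is a minimal sketch of running the 3-bit weights with a quantized KV cache, assuming the TurboQuant modifier follows mlx-lm's generic quantized-KV keyword arguments (`kv_bits`, `kv_group_size`, `quantized_kv_start`, which `generate` forwards to the decoding loop). The actual modifier and its recommended settings live in the parent TurboQuant card, and the same upstream caveat as the quickstart below applies:

```python
from mlx_lm import load, generate

# Weights come from the parent repo; this card publishes none of its own.
model, tokenizer = load(
    "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit"
)

# The kv_* kwargs are mlx-lm's generic quantized-KV knobs, used here as a
# stand-in for the TurboQuant KV-cache modifier; values are illustrative.
text = generate(
    model, tokenizer,
    prompt="Hello",
    max_tokens=64,
    kv_bits=4,             # quantize the attention KV cache
    kv_group_size=64,      # quantization group size
    quantized_kv_start=0,  # quantize from the first token onward
)
```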
## Quickstart
This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Both cards are documentation-only; load the parent weight repo for the actual MLX shards.
```python
# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
# is not yet registered in mlx-lm. The cell below is the API shape that WILL
# work once upstream lands the class (track ml-explore/mlx-lm#386).
from mlx_lm import load, generate

# Load the parent weight repo; this card publishes no weights of its own.
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
    enable_thinking=False,  # defaults to True; False disables extended reasoning
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=lambda x: x.argmax(axis=-1),  # greedy; or mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
)
print(response)
```
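For sampled rather than greedy decoding, mlx-lm provides a sampler factory; continuing from the quickstart above with the values suggested in the comment:

```python
from mlx_lm.sample_utils import make_sampler

# Stochastic decoding at the suggested temperature / top-p.
sampler = make_sampler(temp=0.6, top_p=0.95)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)
```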
> ⚠️ This variant covers the text tower only. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli`; see the GGUF cards in this family.
## Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid, sparse MoE) | TurboQuant 3-bit (per the variant suffix) |
| Image | CRADIO v4-H | BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | BF16 (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | BF16 (≤ 2 min, 256 frames @ 2 FPS; see the budget sketch below) |
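The video limits above are mutually consistent: a maximal 2-minute clip at 2 FPS yields 120 s × 2 = 240 frames, under the 256-frame cap. A minimal budget check (hypothetical helper; the real sampler ships inside the encoder stack):

```python
def frame_budget(duration_s: float, fps: float = 2.0, max_frames: int = 256) -> int:
    """Frames a capped fixed-rate sampler would take.

    At the documented limits: frame_budget(120.0) == 240 <= 256.
    """
    return min(int(duration_s * fps), max_frames)
```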
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.
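In MLX terms, that selective recipe amounts to skipping the encoder and projector subtrees during weight quantization. A minimal sketch, assuming `mlx.nn.quantize`'s `class_predicate` hook and hypothetical module-path prefixes (real names depend on the model class once it lands upstream):

```python
import mlx.nn as nn

# Hypothetical module-path prefixes, for illustration only.
KEEP_BF16 = ("vision_encoder.", "audio_encoder.", "mm_projector.")

def quantize_text_tower_only(model: nn.Module, bits: int = 3, group_size: int = 64):
    # class_predicate is called with (path, module); returning False skips the
    # module, leaving encoders and cross-modal projectors in BF16.
    nn.quantize(
        model,
        group_size=group_size,
        bits=bits,
        class_predicate=lambda path, m: hasattr(m, "to_quantized")
        and not path.startswith(KEEP_BF16),
    )
```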
## Runtime quirks
### MLX-LM (text-only)
This variant covers the LLM backbone only. The vision and audio encoders are NOT included; an MLX-VLM model class for Nemotron-Omni is pending upstream support (no PR observed as of 2026-05-04).

Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag (see below).
### Reasoning mode
`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
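A minimal sketch of toggling the flag per request (same weights, no variant switch), assuming the chat template honors `enable_thinking` as in the quickstart:

```python
messages = [{"role": "user", "content": "Solve: 17 * 23"}]

# Default path: extended reasoning on.
thinking_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=True,
)

# Latency-sensitive path: same model, reasoning off.
fast_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
)
```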