---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, rotorquant, mlx, kv-cache-modifier, apple-silicon, runtime-modifier, matched-stack]
library_name: mlx
pipeline_tag: text-generation
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant MLX 8-bit + RotorQuant KV-Cache (matched stack)

Documentation card for the matched RotorQuant weight + RotorQuant KV-cache stack
of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 8-bit.

**No new weights are published here.** Load the weights from
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit)
and apply the RotorQuant KV-cache modifier documented in
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant).

## Quickstart

This card pairs the RotorQuant weights with the RotorQuant KV-cache modifier (the matched stack). Both this card and the modifier card are documentation-only; load the parent weight repo for the actual MLX shards.

```python
# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
# is not yet registered in mlx-lm. The cell below is the API shape that WILL
# work once upstream lands the class (track ml-explore/mlx-lm#386).

from mlx_lm import load, generate

# Load the parent weight repo; this card publishes no weights of its own.
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
    enable_thinking=False,  # defaults to True (extended reasoning on)
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=lambda x: x.argmax(axis=-1),  # greedy; or mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
)
print(response)
```
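
The RotorQuant KV-cache half of the stack is documented in the modifier card linked above; this card does not redistribute it. As a rough stand-in until you wire that in, stock mlx-lm exposes generic quantized-KV parameters on `generate` (`kv_bits`, `kv_group_size`, `quantized_kv_start`). The sketch below uses those knobs; it is plain affine KV quantization, not the RotorQuant rotation.

```python
# Stand-in only: stock mlx-lm affine KV-cache quantization, NOT the
# RotorQuant modifier (see the RotorQuant card for the actual recipe).
from mlx_lm import load, generate

model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=512,
    kv_bits=8,             # quantize attention KV-cache entries to 8 bits
    kv_group_size=64,      # mlx-lm's default quantization group size
    quantized_kv_start=0,  # quantize from the first cached token onward
)
print(response)
```

On this hybrid architecture only the Transformer attention layers carry a KV cache; the Mamba-2 state caches are left untouched by these flags.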

> ⚠️ This variant covers the **text tower only**. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli`; see the GGUF cards in this family.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses an mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship.
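
To check this convention against the shipped shards, you can read the parent repo's `config.json`. A minimal sketch, assuming the standard MLX export layout where a top-level `quantization` section records the bits / group size plus any per-module overrides (exact keys vary between exports):

```python
import json

from huggingface_hub import hf_hub_download

# Pull only the config, not the weight shards.
config_path = hf_hub_download(
    repo_id="majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit",
    filename="config.json",
)
with open(config_path) as f:
    config = json.load(f)

# MLX exports typically store {"group_size": ..., "bits": ...} plus
# per-module entries; modules kept in BF16 (encoders, projectors) are
# either marked false or simply absent.
print(json.dumps(config.get("quantization", {}), indent=2))
```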

## Runtime quirks

### MLX-LM (text-only)

This variant covers the LLM backbone only. The vision and audio encoders
are NOT included; the MLX-VLM Nemotron-Omni model class is
**pending upstream support** (no PR observed as of 2026-05-04).

Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag
(see below).
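
Until the class lands, `load()` will fail for this architecture. mlx-lm raises `ValueError` ("Model type ... not supported.") for unregistered model types, so a guard like the sketch below keeps scripts from dying opaquely:

```python
from mlx_lm import load

try:
    model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit")
except ValueError as err:
    # Raised while the NemotronH_Nano_Omni_Reasoning_V3 class is unregistered.
    print(f"mlx-lm cannot load this architecture yet: {err}")
```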

### Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
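
When extended reasoning is left on, you may want only the final answer. A minimal post-processing sketch, assuming the chat template wraps the scratchpad in `<think>...</think>` tags (verify against this model's tokenizer config before relying on it):

```python
import re

def strip_thinking(text: str) -> str:
    """Drop the reasoning trace, keeping only the final answer.

    Assumes <think>...</think> delimiters; confirm against the actual
    chat template before using in production.
    """
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>17 * 23 = 340 + 51 = 391</think>17 * 23 = 391."
print(strip_thinking(raw))  # -> 17 * 23 = 391.
```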