---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, rotorquant, mlx, kv-cache-modifier, apple-silicon, runtime-modifier, matched-stack]
library_name: mlx
pipeline_tag: text-generation
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant MLX 6-bit + RotorQuant KV-Cache (matched stack)

Documentation card for the matched RotorQuant weight + RotorQuant KV-cache stack of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 6-bit.

**No new weights are published here.** Load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-6bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-6bit) and apply the RotorQuant KV-cache modifier documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant).

## Quickstart

This card pairs the RotorQuant weights with the RotorQuant KV-cache modifier (matched stack). Neither this card nor the KV-cache modifier card ships weights; load the parent weight repo for the actual MLX shards.

```python
# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
# is not yet registered in mlx-lm. The cell below is the API shape that WILL
# work once upstream lands the class (track ml-explore/mlx-lm#386).
from mlx_lm import load, generate

# This card publishes no weights, so load the shards from the parent weight repo.
# The RotorQuant KV-cache modifier is applied separately; see the modifier card.
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-6bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
    enable_thinking=False,  # set True to enable extended reasoning (default)
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    # Greedy decoding; or use mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
    sampler=lambda x: x.argmax(axis=-1),
)
print(response)
```

> ⚠️ This variant covers the **text tower only**. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli`; see the GGUF cards in this family.

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders and the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

## Runtime quirks

### MLX-LM (text-only)

This variant covers the LLM backbone only. The vision and audio encoders are NOT included: the MLX-VLM Nemotron-Omni model class is **pending upstream support** (no PR observed as of 2026-05-04). Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag (see below).

### Reasoning mode

`enable_thinking` defaults to `True`. To disable extended reasoning (e.g., for latency-sensitive cases), pass `enable_thinking=False` when applying the chat template, as in the Quickstart. No separate "no-think" variant card exists; this is a runtime flag, not a model variant.
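The flag can be wrapped in a small helper so that each call site chooses a mode explicitly. The sketch below assumes only that the tokenizer exposes a Hugging Face-style `apply_chat_template` that forwards `enable_thinking` to the template, as the Quickstart does. `build_prompt` and `ChatTokenizer` are hypothetical names used for illustration, not part of mlx-lm.

```python
from typing import Any, Dict, List, Protocol


class ChatTokenizer(Protocol):
    """Any object exposing a Hugging Face-style apply_chat_template method."""

    def apply_chat_template(self, messages: List[Dict[str, str]], **kwargs: Any) -> Any: ...


def build_prompt(tokenizer: ChatTokenizer, user_text: str, thinking: bool = True) -> Any:
    """Hypothetical helper: render a single user turn with reasoning on or off."""
    messages = [{"role": "user", "content": user_text}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=thinking,  # runtime flag; True matches the documented default
    )
```

With a loaded tokenizer, `build_prompt(tokenizer, "Solve: 17 * 23", thinking=False)` would produce the same prompt as the Quickstart cell.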
## Variants in this family

(Showing 56 sibling variants under `majentik/nemotron3-nano-omni-30b-*`. The current variant, `RotorQuant-MLX-6bit-RQ-KV`, is **bolded**.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| [mmproj-F16](https://huggingface.co/majentik/nemotron3-nano-omni-30b-mmproj-f16) | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| [RotorQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [RotorQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [RotorQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [RotorQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| [RotorQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q3_K_M) | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [RotorQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [RotorQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [RotorQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| [RotorQuant-GGUF-IQ4_XS-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-iq4_xs-rq-kv) | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| [RotorQuant-GGUF-MXFP4_MOE-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-mxfp4_moe-rq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| [RotorQuant-GGUF-Q2_K-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q2_k-rq-kv) | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| [RotorQuant-GGUF-Q3_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q3_k_m-rq-kv) | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q4_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q4_k_m-rq-kv) | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q5_K_M-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q5_k_m-rq-kv) | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| [RotorQuant-GGUF-Q8_0-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-gguf-q8_0-rq-kv) | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| [RotorQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [RotorQuant-MLX-2bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-2bit-rq-kv) | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| [RotorQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [RotorQuant-MLX-3bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-3bit-rq-kv) | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| [RotorQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [RotorQuant-MLX-4bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-4bit-rq-kv) | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| [RotorQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [RotorQuant-MLX-5bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-5bit-rq-kv) | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| [RotorQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| **RotorQuant-MLX-6bit-RQ-KV** | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| [RotorQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [RotorQuant-MLX-8bit-RQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-8bit-rq-kv) | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| [RotorQuant-MLX-MXFP4](https://huggingface.co/majentik/nemotron3-nano-omni-30b-rotorquant-mlx-mxfp4) | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| [TurboQuant](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant) | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| [TurboQuant-GGUF-IQ4_XS](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-IQ4_XS) | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| [TurboQuant-GGUF-MXFP4_MOE](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-MXFP4_MOE) | llama.cpp | ~30 GB | MXFP4 MoE quant |
| [TurboQuant-GGUF-Q2_K](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q2_K) | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| [TurboQuant-GGUF-Q3_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q3_K_M) | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| [TurboQuant-GGUF-Q4_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q4_K_M) | llama.cpp | ~33 GB | Balanced default |
| [TurboQuant-GGUF-Q5_K_M](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q5_K_M) | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| [TurboQuant-GGUF-Q8_0](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-Q8_0) | llama.cpp | ~63 GB | Near-lossless reference |
| [TurboQuant-GGUF-IQ4_XS-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-iq4_xs-tq-kv) | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| [TurboQuant-GGUF-MXFP4_MOE-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-mxfp4_moe-tq-kv) | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| [TurboQuant-GGUF-Q2_K-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q2_k-tq-kv) | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| [TurboQuant-GGUF-Q3_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q3_k_m-tq-kv) | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q4_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q4_k_m-tq-kv) | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q5_K_M-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q5_k_m-tq-kv) | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| [TurboQuant-GGUF-Q8_0-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-gguf-q8_0-tq-kv) | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| [TurboQuant-MLX-2bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit) | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| [TurboQuant-MLX-2bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-2bit-tq-kv) | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| [TurboQuant-MLX-3bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit) | mlx-lm | ~14 GB | Apple Silicon, small |
| [TurboQuant-MLX-3bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-3bit-tq-kv) | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| [TurboQuant-MLX-4bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit) | mlx-lm | ~19 GB | Apple Silicon balanced |
| [TurboQuant-MLX-4bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-4bit-tq-kv) | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| [TurboQuant-MLX-5bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit) | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| [TurboQuant-MLX-5bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-5bit-tq-kv) | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| [TurboQuant-MLX-6bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit) | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| [TurboQuant-MLX-6bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-6bit-tq-kv) | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| [TurboQuant-MLX-8bit](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit) | mlx-lm | ~35 GB | Apple Silicon reference |
| [TurboQuant-MLX-8bit-TQ-KV](https://huggingface.co/majentik/nemotron3-nano-omni-30b-turboquant-mlx-8bit-tq-kv) | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |
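
One way to read the MLX rows of the table is as a lookup from available unified memory to the largest variant that fits. The sketch below copies the approximate sizes from the table (the MXFP4 row is omitted for brevity). `pick_variant` is a hypothetical helper, and the default 8 GB headroom for the KV cache, activations, and the OS is an assumed placeholder, not a measured figure.

```python
from typing import Optional

# Approximate weight sizes in GB, copied from the RotorQuant MLX rows of the table above.
MLX_SIZES_GB = {
    "RotorQuant-MLX-2bit": 9.6,
    "RotorQuant-MLX-3bit": 14.0,
    "RotorQuant-MLX-4bit": 19.0,
    "RotorQuant-MLX-5bit": 23.0,
    "RotorQuant-MLX-6bit": 27.0,
    "RotorQuant-MLX-8bit": 35.0,
}


def pick_variant(memory_gb: float, headroom_gb: float = 8.0) -> Optional[str]:
    """Return the largest listed variant whose weights fit in memory_gb minus headroom_gb."""
    budget = memory_gb - headroom_gb  # memory left for the weights themselves
    fitting = {name: size for name, size in MLX_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None  # nothing in the table fits this budget
    return max(fitting, key=fitting.get)


print(pick_variant(36))  # 36 GB machine, 8 GB assumed headroom
```

The `-RQ-KV` and `-TQ-KV` cards list the same approximate sizes as their weight-only counterparts, so the same lookup applies to them.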