---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier]
---

# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 3-bit + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 3-bit.

**No new weights are published here.** Load the weights from [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit) and apply the TurboQuant KV-cache modifier documented in [`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

## Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses an mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders and the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

## Runtime quirks

### MLX-LM (text-only)

This variant covers the LLM backbone only. Vision and audio encoders are NOT included — the MLX-VLM Nemotron-Omni model class is **pending upstream support** (no PR observed as of 2026-05-04). Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag (see below).

### Reasoning mode

`enable_thinking` defaults to `True`.
To disable extended reasoning (e.g., for latency-sensitive cases), pass `enable_thinking=False` to the chat template / generate call. No separate "no-think" variant card exists — this is a runtime flag, not a model variant.
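A minimal text-only usage sketch, assuming `mlx-lm` is installed on Apple silicon and that the model's chat template forwards `enable_thinking` as a template keyword (as described above). The repo name comes from this card; the prompt content is illustrative.

```python
# Sketch: text-only generation via mlx-lm with extended reasoning disabled.
# Assumes: `pip install mlx-lm`, Apple silicon, and a chat template that
# accepts an `enable_thinking` kwarg (per this card). Loading downloads
# the full 3-bit 30B weights, so this is not a lightweight test.
from mlx_lm import load, generate

model, tokenizer = load(
    "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit"
)

messages = [{"role": "user", "content": "Summarize Mamba-2 in one sentence."}]

# enable_thinking=False turns off the reasoning trace at runtime;
# the default (True) emits extended reasoning before the final answer.
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

Because this is a runtime flag, the same loaded weights serve both reasoning and latency-sensitive paths; no second checkpoint is needed.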