---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, rotorquant, mlx, kv-cache-modifier,
apple-silicon, runtime-modifier, matched-stack]
library_name: mlx
pipeline_tag: text-generation
language: [en]
datasets: [nvidia/Nemotron-Image-Training-v3]
inference: false
---
# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - RotorQuant MLX 8-bit + RotorQuant KV-Cache (matched stack)
Documentation card for the matched RotorQuant weight + RotorQuant KV-cache stack
of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 8-bit.
**No new weights are published here.** Load the weights from
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit)
and apply the RotorQuant KV-cache modifier documented in
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant).
## Quickstart
This card pairs the RotorQuant weights with the RotorQuant KV-cache modifier (the matched stack). Both cards are documentation-only; load the parent weight repo for the actual MLX shards.
```python
# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
# is not yet registered in mlx-lm. The cell below is the API shape that WILL
# work once upstream lands the class (track ml-explore/mlx-lm#386).
from mlx_lm import load, generate
# Load the parent weight repo; this card publishes no MLX shards of its own.
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
    enable_thinking=False,  # defaults to True; set False to skip extended reasoning
)
response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=lambda x: x.argmax(axis=-1),  # greedy; or mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
)
print(response)
```
> ⚠️ This variant covers the **text tower only**. For multimodal inference (vision + audio + video), use the GGUF variants with `llama-mtmd-cli` — see the GGUF cards in this family.
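The RotorQuant KV-cache modifier itself is specified in the parent RotorQuant card linked above. Until that modifier's entry point is wired into your stack, mlx-lm's built-in quantized KV cache gives a rough feel for the matched stack's memory footprint. The sketch below uses real mlx-lm generate options (`kv_bits`, `kv_group_size`, `quantized_kv_start`) but applies plain per-group quantization, **not** the RotorQuant rotation; the values shown are placeholders, not tuned settings, and (like the Quickstart) it assumes the upstream model class has landed.

```python
# Sketch: plain 8-bit KV-cache quantization via mlx-lm's built-in options.
# This only approximates the matched stack's memory profile; it does NOT
# apply the RotorQuant rotation (see the parent RotorQuant card for that).
from mlx_lm import load, generate

model, tokenizer = load(
    "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit"
)
response = generate(
    model, tokenizer,
    prompt="Solve: 17 * 23",
    max_tokens=512,
    kv_bits=8,             # quantize cached keys/values to 8 bits
    kv_group_size=64,      # placeholder: mlx-lm's default group size
    quantized_kv_start=0,  # quantize the cache from the first token
)
print(response)
```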
## Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | MLX 8-bit (per the variant suffix) |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders and the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship.
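For reference, this is roughly what that skip pattern looks like in MLX. The sketch uses the real `mlx.nn.quantize` API with a `class_predicate`; the module-path prefixes are assumptions about this model's parameter tree, not confirmed names, and `model` is the module returned by your loader.

```python
# Sketch: quantize only the LLM backbone, keeping encoders and projectors
# in BF16, mirroring the FP8/NVFP4 convention described above.
import mlx.nn as nn

# Assumed module-path prefixes; inspect the real parameter tree first.
KEEP_BF16_PREFIXES = ("vision_encoder", "audio_encoder", "mm_projector")

def backbone_only(path: str, module: nn.Module) -> bool:
    # Skip anything under the encoder/projector subtrees...
    if path.startswith(KEEP_BF16_PREFIXES):
        return False
    # ...and quantize only linear layers elsewhere.
    return isinstance(module, nn.Linear)

# `model` is the nn.Module returned by your loader (see the Quickstart).
nn.quantize(model, group_size=64, bits=8, class_predicate=backbone_only)
```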
## Runtime quirks
### MLX-LM (text-only)
This variant covers the LLM backbone only. Vision and audio encoders
are NOT included; the MLX-VLM Nemotron-Omni model class is
**pending upstream support** (no PR observed as of 2026-05-04).
Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag
(see below).
### Reasoning mode
`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists — this is a runtime flag, not a model variant.
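A minimal sketch of toggling the flag, again assuming the upstream mlx-lm model class has landed. The `temp=0.6, top_p=0.95` sampler mirrors the Quickstart comment and is a suggested setting, not a value confirmed by this card.

```python
# Sketch: one model serves both modes; only the chat-template flag changes.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load(
    "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-RotorQuant-MLX-8bit"
)
messages = [{"role": "user", "content": "Solve: 17 * 23"}]

# Default path: extended reasoning on, sampled decoding, larger token budget.
thinking = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=True
)
long_answer = generate(
    model, tokenizer, prompt=thinking, max_tokens=2048,
    sampler=make_sampler(temp=0.6, top_p=0.95),
)

# Latency-sensitive path: thinking off, greedy decode, smaller budget.
direct = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False
)
short_answer = generate(model, tokenizer, prompt=direct, max_tokens=512)
```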