license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags:
  - nemotron
  - multimodal
  - mamba2
  - moe
  - quantized
  - turboquant
  - mlx
  - kv-cache-modifier
  - apple-silicon
  - runtime-modifier
  - matched-stack
library_name: mlx
pipeline_tag: text-generation
language:
  - en
datasets:
  - nvidia/Nemotron-Image-Training-v3
inference: false

Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 4-bit + TurboQuant KV-Cache (matched stack)

Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack of Nemotron-3-Nano-Omni-30B-A3B-Reasoning at MLX 4-bit.

No new weights are published here. Load the weights from majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit and apply the TurboQuant KV-cache modifier documented in majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant.
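
In the variant table, each matched-stack card carries the name of its parent weight repo plus a `-TQ-KV` or `-RQ-KV` suffix, so the parent can be derived mechanically. The following is a minimal sketch of that naming rule; the helper name `parent_weight_repo` is hypothetical and not part of any library.

```python
KV_SUFFIXES = ("-TQ-KV", "-RQ-KV")  # suffixes used by the matched-stack cards in this family


def parent_weight_repo(card_id: str) -> str:
    """Return the weight repo a matched-stack card points at.

    Hypothetical helper: strips a trailing KV-cache suffix and leaves
    any other repo id unchanged.
    """
    for suffix in KV_SUFFIXES:
        if card_id.endswith(suffix):
            return card_id[: -len(suffix)]
    return card_id


card = "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit-TQ-KV"
print(parent_weight_repo(card))
```

The result is the repo id to pass to the loader; the KV-cache modifier is applied separately, as described in its own card.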

Quickstart

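This card pairs the TurboQuant weights with the TurboQuant KV-cache modifier (matched stack). Like the KV-cache modifier card, it is documentation-only; the actual MLX shards live in the parent weight repo, majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit.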

# Today (mlx-lm 0.31.x): the NemotronH_Nano_Omni_Reasoning_V3 model class
# is not yet registered in mlx-lm. The cell below is the API shape that WILL
# work once upstream lands the class (track ml-explore/mlx-lm#386).

from mlx_lm import load, generate

# No weights are published under this card, so load the parent weight repo.
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: 17 * 23"}],
    add_generation_prompt=True,
    enable_thinking=False,  # set True to enable extended reasoning (default)
)

response = generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=lambda x: x.argmax(axis=-1),  # or use mlx_lm.sample_utils.make_sampler(temp=0.6, top_p=0.95)
)
print(response)

⚠️ This variant covers the text tower only. For multimodal inference (vision + audio + video), use the GGUF variants with llama-mtmd-cli — see the GGUF cards in this family.

Modality matrix

| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | BF16 (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | BF16 (≤ 2 min, 256 frames @ 2 FPS) |

NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.

Runtime quirks

MLX-LM (text-only)

This variant covers the LLM backbone only. The vision and audio encoders are NOT included: the MLX-VLM Nemotron-Omni model class is still pending upstream support (no PR observed as of 2026-05-04).

Use the mlx_lm.generate API; enable_thinking is a runtime flag (see below).

Reasoning mode

enable_thinking defaults to True. To disable extended reasoning (e.g., for latency-sensitive cases), pass enable_thinking=False to tokenizer.apply_chat_template, as in the Quickstart. No separate "no-think" variant card exists; this is a runtime flag, not a model variant.

Variants in this family

(Showing 56 sibling variants under majentik/nemotron3-nano-omni-30b-*. The current variant — TurboQuant-MLX-4bit-TQ-KV — is bolded.)

| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| mmproj-F16 | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| RotorQuant-GGUF-IQ4_XS-RQ-KV | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| RotorQuant-GGUF-MXFP4_MOE-RQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| RotorQuant-GGUF-Q2_K-RQ-KV | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| RotorQuant-GGUF-Q3_K_M-RQ-KV | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q4_K_M-RQ-KV | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q5_K_M-RQ-KV | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q8_0-RQ-KV | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| RotorQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-2bit-RQ-KV | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| RotorQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| RotorQuant-MLX-3bit-RQ-KV | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-4bit-RQ-KV | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| RotorQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-5bit-RQ-KV | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| RotorQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-6bit-RQ-KV | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| RotorQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| RotorQuant-MLX-8bit-RQ-KV | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| RotorQuant-MLX-MXFP4 | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| TurboQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| TurboQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| TurboQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| TurboQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| TurboQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| TurboQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| TurboQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| TurboQuant-GGUF-IQ4_XS-TQ-KV | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| TurboQuant-GGUF-MXFP4_MOE-TQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| TurboQuant-GGUF-Q2_K-TQ-KV | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| TurboQuant-GGUF-Q3_K_M-TQ-KV | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q4_K_M-TQ-KV | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q5_K_M-TQ-KV | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q8_0-TQ-KV | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| TurboQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-2bit-TQ-KV | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| TurboQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| TurboQuant-MLX-3bit-TQ-KV | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| **TurboQuant-MLX-4bit-TQ-KV** | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| TurboQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-5bit-TQ-KV | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| TurboQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-6bit-TQ-KV | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| TurboQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| TurboQuant-MLX-8bit-TQ-KV | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |
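
To turn the size column into a quick selection rule, the sketch below picks the largest TurboQuant MLX weight variant whose approximate size fits a memory budget. The sizes are copied from the table; the function name `pick_mlx_variant` and the 25% default headroom reserved for KV cache and activations are illustrative assumptions, not measured requirements.

```python
from typing import Optional

# Approximate sizes (GB) of the TurboQuant MLX weight variants, from the table.
MLX_SIZES_GB = {
    "TurboQuant-MLX-2bit": 9.6,
    "TurboQuant-MLX-3bit": 14.0,
    "TurboQuant-MLX-4bit": 19.0,
    "TurboQuant-MLX-5bit": 23.0,
    "TurboQuant-MLX-6bit": 27.0,
    "TurboQuant-MLX-8bit": 35.0,
}


def pick_mlx_variant(memory_gb: float, headroom: float = 0.25) -> Optional[str]:
    """Return the largest variant that fits, or None if none does.

    `headroom` is the assumed fraction of memory kept free for the
    KV cache, activations, and the rest of the system.
    """
    usable = memory_gb * (1.0 - headroom)
    fitting = {name: size for name, size in MLX_SIZES_GB.items() if size <= usable}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for gb in (16, 32, 64):
    print(gb, pick_mlx_variant(gb))
```

The matching `-TQ-KV` card has the same approximate size as its weight variant in the table, so the same choice applies to the matched stack.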