Instructions to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

- Libraries
  - MLX

How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV with MLX:

```python
# Make sure mlx-lm is installed:
#   pip install --upgrade mlx-lm
# If on a CUDA device, also: pip install mlx[cuda]

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV")
prompt = "Once upon a time in"
text = generate(model, tokenizer, prompt=prompt, verbose=True)
```

- Notebooks
  - Google Colab
  - Kaggle
- Local Apps
  - LM Studio
  - MLX LM
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV with MLX LM:

Generate text or start a chat session:

```shell
# Install MLX LM
uv tool install mlx-lm

# Generate some text
mlx_lm.generate --model "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV" --prompt "Once upon a time"
```
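For an interactive session, a minimal sketch, assuming a recent mlx-lm release that ships the `mlx_lm.chat` entry point:

```shell
# Start an interactive chat session (assumes mlx_lm.chat is available
# in your installed mlx-lm version)
mlx_lm.chat --model "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit-TQ-KV"
```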
---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier]
---
# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 3-bit + TurboQuant KV-Cache (matched stack)
Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 3-bit.
**No new weights are published here.** Load the weights from
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit)
and apply the TurboQuant KV-cache modifier documented in
[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).
## Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid sparse MoE) | **3-bit TurboQuant** (per the variant suffix) |
| Image | CRADIO v4-H | **BF16** (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | **BF16** (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | **BF16** (≤ 2 min, 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal
MLP projectors in BF16 to preserve multimodal accuracy. We follow that
convention in every quantized variant we ship.
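For illustration, this is roughly how that convention can be expressed when quantizing with mlx-lm. A sketch only: it assumes the `quant_predicate` hook exposed by recent versions of `mlx_lm.convert`, and the module-path prefixes below are hypothetical, not the actual parameter names in this checkpoint.

```python
from mlx_lm import convert

# Hypothetical module-path prefixes for the parts kept in BF16
# (illustrative names, not this checkpoint's actual parameter paths).
FULL_PRECISION_PREFIXES = ("vision_encoder.", "audio_encoder.", "mm_projector.")

def keep_encoders_bf16(path, module, config):
    # Return False to leave a module un-quantized (BF16); True to quantize
    # it with the default q_bits / q_group_size settings.
    return not path.startswith(FULL_PRECISION_PREFIXES)

convert(
    "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
    mlx_path="nemotron-omni-3bit-mlx",
    quantize=True,
    q_bits=3,
    quant_predicate=keep_encoders_bf16,
)
```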
## Runtime quirks
### MLX-LM (text-only)
This variant covers the LLM backbone only. The vision and audio encoders
are **not** included; the MLX-VLM Nemotron-Omni model class is
**pending upstream support** (no PR observed as of 2026-05-04).
Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag
(see below).
### Reasoning mode
`enable_thinking` defaults to `True`. To disable extended reasoning
(e.g., for latency-sensitive cases), pass `enable_thinking=False`
to the chat template / generate call. No separate "no-think"
variant card exists; this is a runtime flag, not a model variant.
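A minimal sketch of toggling the flag via the chat template with mlx-lm, assuming the bundled chat template accepts `enable_thinking` as a keyword (extra kwargs to `apply_chat_template` are forwarded to the template):

```python
from mlx_lm import load, generate

# Weights live in the companion repo (no new weights are published here).
model, tokenizer = load("majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-3bit")

messages = [{"role": "user", "content": "Summarize Mamba-2 in one sentence."}]

# enable_thinking defaults to True; pass False for latency-sensitive calls.
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # runtime flag, not a separate model variant
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```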