majentik
/

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-8bit-TQ-KV

Text Generation

Mixture of Experts

kv-cache-modifier

runtime-modifier

Model card Files Files and versions

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-8bit-TQ-KV / README.md

majentik's picture

feat: upload TurboQuant-MLX-8bit-TQ-KV combo card

cdaa881 verified 3 days ago

|

2.24 kB

	---
	license: other
	license_name: nvidia-open-model-license
	license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
	base_model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
	tags: [nemotron, multimodal, mamba2, moe, quantized, turboquant, mlx, kv-cache-modifier]
	---

	# Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant MLX 8-bit + TurboQuant KV-Cache (matched stack)

	Documentation card for the matched TurboQuant weight + TurboQuant KV-cache stack
	of `Nemotron-3-Nano-Omni-30B-A3B-Reasoning` at MLX 8-bit.

	No new weights are published here. Load the weights from
	[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-8bit`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-MLX-8bit)
	and apply the TurboQuant KV-cache modifier documented in
	[`majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant`](https://huggingface.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant).

	## Modality matrix

	\| Modality \| Encoder \| Quantization in this variant \|
	\|---\|---\|---\|
	\| Text \| LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) \| per the variant suffix \|
	\| Image \| CRADIO v4-H \| BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) \|
	\| Audio \| Parakeet-TDT-0.6B-v2 \| BF16 (same rationale) \|
	\| Video \| Parakeet-TDT-0.6B-v2 + frame sampler \| BF16 (≤ 2 min, 256 frames @ 2 FPS) \|

	NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal
	MLP projectors in BF16 to preserve multimodal accuracy. We follow that
	convention in every quantized variant we ship.

	## Runtime quirks

	### MLX-LM (text-only)

	This variant covers the LLM backbone only. Vision + audio encoders
	are NOT included — MLX-VLM Nemotron-Omni model class is
	pending upstream support (no PR observed as of 2026-05-04).

	Use the `mlx_lm.generate` API; `enable_thinking` is a runtime flag
	(see below).

	### Reasoning mode

	`enable_thinking` defaults to `True`. To disable extended reasoning
	(e.g., for latency-sensitive cases), pass `enable_thinking=False`
	to the chat template / generate call. No separate "no-think"
	variant card exists — this is a runtime flag, not a model variant.