NVIDIA-Nemotron-3-Nano-30B-A3B — oQe Series

This repository contains Enhanced (oQe) MLX quants for NVIDIA-Nemotron-3-Nano-30B-A3B. These builds utilize a high-precision anchoring strategy specifically tuned for NVIDIA's Mixture-of-Experts (MoE) architecture to prevent "expert collapse" and routing errors at lower bitrates.

🚀 The oQe (Enhanced) Build Path

MoE models like Nemotron-3 Nano are particularly sensitive to quantization because small errors in the gating logic can lead to the wrong expert being activated. Our build path includes:

Expert Sensitivity Analysis: We identify which of the "Active-3" experts are most critical for specialized tasks (Code, Math, Reasoning) and protect their weights.
MoE Gating Protection: The router and attention heads are locked to a minimum of 6-bit or 8-bit precision to ensure token routing remains accurate.
Hessian-Based Tuning: Post-quantization adjustment is applied to the expert blocks to recover the "drift" caused by compressing the massive parameter space.

📋 oQ Build Performance Matrix

Tier	Target bpw	Actual bpw	Size	Precision Boosts	Hybrid Plan / Strategy
oQ8e	8.0	8.00	32.82 GB	0	Full 8-bit Static
oQ6e	6.0	6.57	27.21 GB	48	8bitX48 (Router & Head Anchors)
oQ5e	5.0	5.64	23.36 GB	118	8bitX48, 6bitX70
oQ4e	4.0	4.70	18.94 GB	118	8bitX48, 6bitX48, 5bitX22

🛠 Technical Build Audit

Calibration: Uses a 128-sample dataset ($128 \times 256$ tokens) covering the diverse language set (EN, ES, FR, DE, JA, IT).
Sensitivity Proxy: Nano-30B-A3B-oQ8.
MoE Strategy: Aggressive anchoring. We used 48 mandatory 8-bit anchors for the routing logic and early attention blocks to maintain the 30B parameter logic flow.

Model Highlights

Active-3 MoE: Only 3 billion parameters are active per token, providing the inference speed of a small model with the broad knowledge base of a 30B model.
Multilingual: Maintains strong performance across the 6 primary languages supported by the base model.
Architecture: Hybrid MoE structure optimized for efficiency on Apple Silicon.

Acknowledgments: These quants were built using the oMLX framework. The weight optimization process is based on the GPTQ algorithm by Frantar et al.

Verified via Splats Lab Vault v2.8. These models are standard mlx-lm compatible and work with any app supporting MLX safetensors.

Downloads last month: 28

Safetensors

Model size

7B params

Tensor type

BF16

U32

MLX

Hardware compatibility

6-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e

Base model

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Quantized

(47)

this model