NVIDIA-Nemotron-3-Nano-30B-A3B — oQe Series

This repository contains Enhanced (oQe) MLX quants for NVIDIA-Nemotron-3-Nano-30B-A3B. These builds utilize a high-precision anchoring strategy specifically tuned for NVIDIA's Mixture-of-Experts (MoE) architecture to prevent "expert collapse" and routing errors at lower bitrates.

🚀 The oQe (Enhanced) Build Path

MoE models like Nemotron-3 Nano are particularly sensitive to quantization because small errors in the gating logic can lead to the wrong expert being activated. Our build path includes:

  1. Expert Sensitivity Analysis: We identify which of the "Active-3" experts are most critical for specialized tasks (Code, Math, Reasoning) and protect their weights.
  2. MoE Gating Protection: The router and attention heads are locked to a minimum of 6-bit or 8-bit precision to ensure token routing remains accurate.
  3. Hessian-Based Tuning: Post-quantization adjustment is applied to the expert blocks to recover the "drift" caused by compressing the massive parameter space.

📋 oQ Build Performance Matrix

Tier Target bpw Actual bpw Size Precision Boosts Hybrid Plan / Strategy
oQ8e 8.0 8.00 32.82 GB 0 Full 8-bit Static
oQ6e 6.0 6.57 27.21 GB 48 8bitX48 (Router & Head Anchors)
oQ5e 5.0 5.64 23.36 GB 118 8bitX48, 6bitX70
oQ4e 4.0 4.70 18.94 GB 118 8bitX48, 6bitX48, 5bitX22

🛠 Technical Build Audit

  • Calibration: Uses a 128-sample dataset ($128 \times 256$ tokens) covering the diverse language set (EN, ES, FR, DE, JA, IT).
  • Sensitivity Proxy: Nano-30B-A3B-oQ8.
  • MoE Strategy: Aggressive anchoring. We used 48 mandatory 8-bit anchors for the routing logic and early attention blocks to maintain the 30B parameter logic flow.

Model Highlights

  • Active-3 MoE: Only 3 billion parameters are active per token, providing the inference speed of a small model with the broad knowledge base of a 30B model.
  • Multilingual: Maintains strong performance across the 6 primary languages supported by the base model.
  • Architecture: Hybrid MoE structure optimized for efficiency on Apple Silicon.

Acknowledgments: These quants were built using the oMLX framework. The weight optimization process is based on the GPTQ algorithm by Frantar et al.

Verified via Splats Lab Vault v2.8. These models are standard mlx-lm compatible and work with any app supporting MLX safetensors.

Downloads last month
28
Safetensors
Model size
7B params
Tensor type
BF16
·
U32
·
MLX
Hardware compatibility
Log In to add your hardware

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e

Quantized
(47)
this model