Instructions to use splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
NVIDIA-Nemotron-3-Nano-30B-A3B — oQe Series
This repository contains Enhanced (oQe) MLX quants for NVIDIA-Nemotron-3-Nano-30B-A3B. These builds utilize a high-precision anchoring strategy specifically tuned for NVIDIA's Mixture-of-Experts (MoE) architecture to prevent "expert collapse" and routing errors at lower bitrates.
🚀 The oQe (Enhanced) Build Path
MoE models like Nemotron-3 Nano are particularly sensitive to quantization because small errors in the gating logic can lead to the wrong expert being activated. Our build path includes:
- Expert Sensitivity Analysis: We identify which of the "Active-3" experts are most critical for specialized tasks (Code, Math, Reasoning) and protect their weights.
- MoE Gating Protection: The router and attention heads are locked to a minimum of 6-bit or 8-bit precision to ensure token routing remains accurate.
- Hessian-Based Tuning: Post-quantization adjustment is applied to the expert blocks to recover the "drift" caused by compressing the massive parameter space.
📋 oQ Build Performance Matrix
| Tier | Target bpw | Actual bpw | Size | Precision Boosts | Hybrid Plan / Strategy |
|---|---|---|---|---|---|
| oQ8e | 8.0 | 8.00 | 32.82 GB | 0 | Full 8-bit Static |
| oQ6e | 6.0 | 6.57 | 27.21 GB | 48 | 8bitX48 (Router & Head Anchors) |
| oQ5e | 5.0 | 5.64 | 23.36 GB | 118 | 8bitX48, 6bitX70 |
| oQ4e | 4.0 | 4.70 | 18.94 GB | 118 | 8bitX48, 6bitX48, 5bitX22 |
🛠 Technical Build Audit
- Calibration: Uses a 128-sample dataset ($128 \times 256$ tokens) covering the diverse language set (EN, ES, FR, DE, JA, IT).
- Sensitivity Proxy: Nano-30B-A3B-oQ8.
- MoE Strategy: Aggressive anchoring. We used 48 mandatory 8-bit anchors for the routing logic and early attention blocks to maintain the 30B parameter logic flow.
Model Highlights
- Active-3 MoE: Only 3 billion parameters are active per token, providing the inference speed of a small model with the broad knowledge base of a 30B model.
- Multilingual: Maintains strong performance across the 6 primary languages supported by the base model.
- Architecture: Hybrid MoE structure optimized for efficiency on Apple Silicon.
Acknowledgments: These quants were built using the oMLX framework. The weight optimization process is based on the GPTQ algorithm by Frantar et al.
Verified via Splats Lab Vault v2.8. These models are standard mlx-lm compatible and work with any app supporting MLX safetensors.
- Downloads last month
- 28
6-bit
Model tree for splats/NVIDIA-Nemotron-3-Nano-30B-A3B-oQ6e
Base model
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16