# Nemotron-3-Super-120B-A12B — MLX 6-bit

MLX quantization of `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` for Apple Silicon.

## Key Specs
| Detail | Value |
|---|---|
| Architecture | Hybrid Mamba-2 + Transformer Attention + Latent MoE |
| Total Parameters | 120B |
| Active Parameters | 12B per token |
| Context Length | 1M tokens (262,144 default) |
| Experts | 512 routed, 22 active per token, 1 shared |
| Quantization | 6-bit affine (6.507 BPW), group size 64 |
| Disk Size | ~92 GB |
| Peak Memory | ~98.4 GB |
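As a back-of-envelope sanity check (my own arithmetic, not from the card), the disk-size figure follows from the stored parameter count and the effective bits per weight:

```python
# Rough estimate of quantized model size on disk.
# Assumes ~121e9 stored parameters at an effective 6.507 bits per weight
# (the effective BPW exceeds 6 because of group-wise scales/offsets).
params = 121e9
bpw = 6.507

total_bytes = params * bpw / 8
print(f"{total_bytes / 2**30:.1f} GiB")  # ≈ 91.7 GiB, consistent with ~92 GB on disk
```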
## Requirements

- Apple Silicon Mac with 128 GB+ unified memory
- `mlx-lm >= 0.31.2` (install from git main for Latent MoE support):

```shell
pip install git+https://github.com/ml-explore/mlx-lm.git
```
## Usage

### CLI

```shell
mlx_lm.generate \
  --model FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit \
  --prompt "Hello!" \
  --max-tokens 256
```
### Python

```python
from mlx_lm import load, generate

model, tokenizer = load("FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=256)
print(response)
```
### LM Studio

This model is compatible with LM Studio on Apple Silicon. Search for `FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit` in the model browser and download it directly.
## Performance

Tested on an M5 Pro Max (128 GB unified memory):
| Metric | Value |
|---|---|
| Generation Speed | ~43.6 tok/s |
| Peak Memory | 98.4 GB |
## About the Architecture
Nemotron-H is a hybrid architecture combining three components:
- Mamba-2 layers — efficient state-space model for long-context processing
- Transformer attention layers — standard multi-head attention (GQA, 32 heads, 2 KV heads)
- Latent MoE — 512 experts with latent routing, 22 active per token, plus 1 shared expert
The layer pattern alternates between Mamba (M) and attention with MoE (E) blocks across 88 layers. This hybrid design achieves strong performance with only 12B active parameters per token despite having 120B total.
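The routing step described above can be sketched with a toy top-k selection using the card's numbers (512 routed experts, 22 active per token). This is an illustration of the general top-k MoE technique only, not Nemotron's actual latent-routing code; the router weights here are random placeholders:

```python
import numpy as np

# Toy top-k MoE routing sketch: pick 22 of 512 experts for one token
# and softmax-normalize their mixing weights. Illustration only.
rng = np.random.default_rng(0)

NUM_EXPERTS, TOP_K, D = 512, 22, 64
router = rng.normal(size=(D, NUM_EXPERTS))  # hypothetical router projection
token = rng.normal(size=D)                  # hidden state for one token

logits = token @ router
top = np.argsort(logits)[-TOP_K:]           # indices of the 22 selected experts
weights = np.exp(logits[top] - logits[top].max())
weights /= weights.sum()                    # softmax over the selected experts

print(len(top), round(weights.sum(), 6))    # 22 1.0
```

Only the selected experts' weights are loaded and evaluated for that token, which is why the per-token compute corresponds to 12B active parameters rather than the full 120B.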
## Reasoning Model

This is a reasoning model that outputs chain-of-thought before the final answer. The model uses `<think>` and `</think>` tags to delineate reasoning.
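If you only want the final answer, the tagged reasoning can be split off with a small helper. A minimal sketch, assuming the `<think>...</think>` convention described above (the function name is my own):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (chain-of-thought, final answer).

    Assumes reasoning is wrapped in <think>...</think>; if no tags are
    present, the whole text is treated as the answer.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

thinking, answer = split_reasoning("<think>2 + 2 is 4.</think> The answer is 4.")
print(answer)  # The answer is 4.
```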
## License

## Credits