Nemotron-3-Super-120B-A12B — MLX 6-bit

MLX quantization of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 for Apple Silicon.

Key Specs

| Detail | Value |
|---|---|
| Architecture | Hybrid Mamba-2 + Transformer Attention + Latent MoE |
| Total Parameters | 120B |
| Active Parameters | 12B per token |
| Context Length | 1M tokens (262,144 default) |
| Experts | 512 routed, 22 active per token, 1 shared |
| Quantization | 6-bit affine (6.507 BPW), group size 64 |
| Disk Size | ~92 GB |
| Peak Memory | ~98.4 GB |
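As a back-of-the-envelope check of the numbers above: with affine quantization, each group of weights also stores a scale and bias, so the effective bits per weight are slightly above 6. The sketch below assumes a 16-bit scale and 16-bit bias per group of 64, which is the typical MLX layout; the card's 6.507 BPW additionally reflects layers kept at higher precision.

```python
GROUP_SIZE = 64
BITS = 6

# Effective bits per weight: 6 payload bits plus per-group scale/bias overhead.
bpw = (GROUP_SIZE * BITS + 16 + 16) / GROUP_SIZE
print(bpw)  # 6.5 (the card's 6.507 includes unquantized layers)

# Rough disk footprint at the card's reported 6.507 BPW.
total_params = 120e9
disk_gib = total_params * 6.507 / 8 / 2**30
print(round(disk_gib, 1))  # ~90.9 GiB, consistent with the ~92 GB listed
```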

Requirements

  • Apple Silicon Mac with 128GB+ unified memory
  • mlx-lm >= 0.31.2 (install from git main for Latent MoE support):

```shell
pip install git+https://github.com/ml-explore/mlx-lm.git
```

Usage

CLI

```shell
mlx_lm.generate \
  --model FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit \
  --prompt "Hello!" \
  --max-tokens 256
```

Python

```python
from mlx_lm import load, generate

model, tokenizer = load("FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=256)
print(response)
```

LM Studio

This model is compatible with LM Studio on Apple Silicon. Search for `FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit` in the model browser and download directly.

Performance

Tested on M5 Pro Max (128GB):

| Metric | Value |
|---|---|
| Generation Speed | ~43.6 tok/s |
| Peak Memory | 98.4 GB |

About the Architecture

Nemotron-H is a hybrid architecture combining three components:

  • Mamba-2 layers — efficient state-space model for long-context processing
  • Transformer attention layers — standard multi-head attention (GQA, 32 heads, 2 KV heads)
  • Latent MoE — 512 experts with latent routing, 22 active per token, plus 1 shared expert

The layer pattern alternates between Mamba (M) and attention with MoE (E) blocks across 88 layers. This hybrid design achieves strong performance with only 12B active parameters per token despite having 120B total.
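The sparsity above comes from top-k expert routing: for each token, the router scores all 512 experts and only the 22 highest-scoring ones run. The toy sketch below illustrates plain top-k selection with renormalized softmax weights; the actual Latent MoE router in Nemotron differs in its details (latent routing, shared expert), so treat this as an illustration of the selection step only.

```python
import numpy as np

NUM_EXPERTS = 512  # routed experts (from the card)
TOP_K = 22         # experts active per token (from the card)

def route(router_logits):
    """Toy top-k routing: keep the TOP_K highest-scoring experts
    and renormalize their softmax weights to sum to 1."""
    idx = np.argpartition(router_logits, -TOP_K)[-TOP_K:]
    w = np.exp(router_logits[idx] - router_logits[idx].max())
    w /= w.sum()
    return idx, w

rng = np.random.default_rng(0)
logits = rng.standard_normal(NUM_EXPERTS)  # stand-in router scores for one token
experts, weights = route(logits)
print(len(experts), round(float(weights.sum()), 6))  # 22 1.0
```

Only the selected experts' FFNs execute for that token, which is why just 12B of the 120B parameters are active per step.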

Reasoning Model

This is a reasoning model that outputs chain-of-thought before the final answer. The model uses `<think>` and `</think>` tags to delineate reasoning.
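If you want to show only the final answer, you can split the output on those tags. The helper below is a hypothetical post-processing sketch (not part of mlx-lm), assuming a single `<think>...</think>` block as described above:

```python
import re

def split_reasoning(text: str):
    """Split model output into (reasoning, final_answer).
    Assumes at most one <think>...</think> block, per the card."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

out = "<think>2+2 is 4.</think>The answer is 4."
print(split_reasoning(out))  # ('2+2 is 4.', 'The answer is 4.')
```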

License

NVIDIA Open Model License

Credits
