Nemotron-3-Super-120B-A12B — MLX 6-bit

MLX quantization of nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 for Apple Silicon.

Key Specs

| Detail | Value |
|---|---|
| Architecture | Hybrid Mamba-2 + Transformer Attention + Latent MoE |
| Total Parameters | 120B |
| Active Parameters | 12B per token |
| Context Length | 1M tokens (262,144 default) |
| Experts | 512 routed, 22 active per token, 1 shared |
| Quantization | 6-bit affine (6.507 BPW), group size 64 |
| Disk Size | ~92 GB |
| Peak Memory | ~98.4 GB |
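As a back-of-the-envelope check of the numbers above: with affine quantization, each group of weights also stores a scale and bias, so the effective bits per weight are slightly above 6. The sketch below assumes a 16-bit scale and 16-bit bias per group of 64, which is the typical MLX layout; the card's 6.507 BPW additionally reflects layers kept at higher precision.

```python
GROUP_SIZE = 64
BITS = 6

# Effective bits per weight: 6 payload bits plus per-group scale/bias overhead.
bpw = (GROUP_SIZE * BITS + 16 + 16) / GROUP_SIZE
print(bpw)  # 6.5 (the card's 6.507 includes unquantized layers)

# Rough disk footprint at the card's reported 6.507 BPW.
total_params = 120e9
disk_gib = total_params * 6.507 / 8 / 2**30
print(round(disk_gib, 1))  # ~90.9 GiB, consistent with the ~92 GB listed
```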

Requirements

  • Apple Silicon Mac with 128GB+ unified memory
  • mlx-lm >= 0.31.2 (install from git main for Latent MoE support):

```shell
pip install git+https://github.com/ml-explore/mlx-lm.git
```

Usage

CLI

```shell
mlx_lm.generate \
  --model FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit \
  --prompt "Hello!" \
  --max-tokens 256
```

Python

```python
from mlx_lm import load, generate

model, tokenizer = load("FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit")
response = generate(model, tokenizer, prompt="Hello!", max_tokens=256)
print(response)
```

LM Studio

This model is compatible with LM Studio on Apple Silicon. Search for `FF-01/Nemotron-3-Super-120B-A12B-MLX-6bit` in the model browser and download directly.

Performance

Tested on M5 Pro Max (128GB):

| Metric | Value |
|---|---|
| Generation Speed | ~43.6 tok/s |
| Peak Memory | 98.4 GB |

About the Architecture

Nemotron-H is a hybrid architecture combining three components:

  • Mamba-2 layers — efficient state-space model for long-context processing
  • Transformer attention layers — standard multi-head attention (GQA, 32 heads, 2 KV heads)
  • Latent MoE — 512 experts with latent routing, 22 active per token, plus 1 shared expert

The layer pattern alternates between Mamba (M) and attention with MoE (E) blocks across 88 layers. This hybrid design achieves strong performance with only 12B active parameters per token despite having 120B total.
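The sparsity above comes from top-k expert routing: for each token, the router scores all 512 experts and only the 22 highest-scoring ones run. The toy sketch below illustrates plain top-k selection with renormalized softmax weights; the actual Latent MoE router in Nemotron differs in its details (latent routing, shared expert), so treat this as an illustration of the selection step only.

```python
import numpy as np

NUM_EXPERTS = 512  # routed experts (from the card)
TOP_K = 22         # experts active per token (from the card)

def route(router_logits):
    """Toy top-k routing: keep the TOP_K highest-scoring experts
    and renormalize their softmax weights to sum to 1."""
    idx = np.argpartition(router_logits, -TOP_K)[-TOP_K:]
    w = np.exp(router_logits[idx] - router_logits[idx].max())
    w /= w.sum()
    return idx, w

rng = np.random.default_rng(0)
logits = rng.standard_normal(NUM_EXPERTS)  # stand-in router scores for one token
experts, weights = route(logits)
print(len(experts), round(float(weights.sum()), 6))  # 22 1.0
```

Only the selected experts' FFNs execute for that token, which is why just 12B of the 120B parameters are active per step.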

Reasoning Model

This is a reasoning model that outputs chain-of-thought before the final answer. The model uses `<think>` and `</think>` tags to delineate reasoning.
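If you want to show only the final answer, you can split the output on those tags. The helper below is a hypothetical post-processing sketch (not part of mlx-lm), assuming a single `<think>...</think>` block as described above:

```python
import re

def split_reasoning(text: str):
    """Split model output into (reasoning, final_answer).
    Assumes at most one <think>...</think> block, per the card."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

out = "<think>2+2 is 4.</think>The answer is 4."
print(split_reasoning(out))  # ('2+2 is 4.', 'The answer is 4.')
```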

License

NVIDIA Open Model License

Credits
