MLX Studio

MLX Studio — the only app that natively supports JANG models with reasoning


Nemotron-3-Super-120B on a 64 GB Mac. The first working Nemotron-H quantization for Apple Silicon. 43 GB, 52 tok/s, 86% MMLU with reasoning. Hybrid Mamba-2 SSM + Latent MoE + Attention architecture.

LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or pip install "jang[mlx]>=2.1.5".


JANG

Nemotron-3-Super-120B-A12B — JANG_2L (2.8-bit, 8-bit attention) — Reasoning

JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX


GitHub  PyPI  Website  X/Twitter

JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.

Key Features

  • 86.0% MMLU (200 questions, reasoning mode)
  • 51.6 tok/s generation, 136 tok/s prefill
  • 43 GB on disk, 42.4 GB GPU RAM
  • Reasoning mode: <think>...</think> step-by-step problem solving
  • Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
  • bfloat16 compute: auto-detected for 512-expert models

Results: JANG vs MLX (200-question MMLU)

Per-subject comparison. All configurations were tested both with and without reasoning, using identical methodology.

| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 12/20 | 16/20 | 10/20 | 19/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 17/20 | 15/20 | 18/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 15/20 | 13/20 | 17/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 18/20 | 14/20 | 19/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 18/20 | 19/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 16/20 | 15/20 | 18/20 | 16/20 | 19/20 |
| HS Mathematics | 8/20 | 18/20 | 6/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 18/20 | 17/20 | 19/20 | 17/20 | 18/20 |
| World Religions | 18/20 | 17/20 | 17/20 | 19/20 | 16/20 | 19/20 |
| Total | 150/200 (75.0%) | 172/200 (86.0%) | 145/200 (72.5%) | 186/200 (93.0%) | 142/200 (71.0%) | 187/200 (93.5%) |
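The per-subject columns can be cross-checked against the reported totals with a few lines of Python (scores transcribed from the table above; each column is one configuration, 10 subjects at 20 questions each):

```python
# Per-subject correct counts, transcribed from the table above.
columns = {
    "JANG_2L no-think":    [12, 15, 19, 13, 14, 19, 15, 8, 17, 18],
    "JANG_2L reasoning":   [16, 17, 19, 15, 18, 18, 16, 18, 18, 17],
    "JANG_4M no-think":    [10, 15, 19, 13, 14, 19, 15, 6, 17, 17],
    "JANG_4M reasoning":   [19, 18, 19, 17, 19, 20, 18, 18, 19, 19],
    "MLX 4-bit no-think":  [9, 14, 19, 14, 13, 18, 16, 6, 17, 16],
    "MLX 4-bit reasoning": [19, 18, 19, 17, 20, 20, 19, 18, 18, 19],
}

for name, scores in columns.items():
    total = sum(scores)
    # 200 questions total, so percentage = total / 2
    print(f"{name}: {total}/200 = {total / 2:.1f}%")
```

Every column sums to the total reported in the last row of the table.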

Summary

| | JANG_2L | JANG_4M | MLX 3-bit | MLX 4-bit |
|---|---|---|---|---|
| MMLU (no-think) | 75.0% | 72.5% | Crashes | 71.0% |
| MMLU (reasoning) | 86.0% | 93.0% | Crashes | 93.5% |
| Size | 43 GB | 63 GB | N/A | 63 GB |
| GPU RAM | 42.4 GB | 61.2 GB | N/A | 63.3 GB |
| Speed | 51.6 tok/s | 55.1 tok/s | N/A | 59.8 tok/s |
| Fits 64 GB? | YES | YES | N/A | YES |

JANG_2L is 20 GB smaller than MLX 4-bit (43 GB vs 63 GB, a 32% storage saving) and scores 4 points higher on no-think MMLU (75.0% vs 71.0%). JANG_4M matches MLX 4-bit (93.0% vs 93.5% reasoning MMLU) at the same 63 GB size.

MLX 3-bit cannot be created: mlx_lm.convert crashes on Nemotron's mtp.* weights. Only JANG can produce sub-4-bit quantizations of this model.

JANG_4M at the same size as MLX 4-bit: 93.0% vs 93.5%, nearly identical quality, with 8-bit attention protection.
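A back-of-envelope check: the JANG_2L average of 2.76 bits per weight over the model's nominal 120B parameters lands close to the reported 43 GB on disk (the gap is plausibly unquantized tensors such as scales and embeddings, plus the nominal vs exact parameter count):

```python
# Rough size estimate: average bits-per-weight x parameter count.
params = 120e9   # nominal parameter count
bpw = 2.76       # JANG_2L average bits per weight

size_gb = params * bpw / 8 / 1e9   # bits -> bytes -> GB (decimal)
print(f"~{size_gb:.1f} GB")        # ~41.4 GB, in line with 43 GB on disk
```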

Specs

| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Latent MoE | fc1_latent_proj (4096→1024), experts in latent space |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 2.76 bpw |
| Disk size | 43 GB |
| GPU RAM | 42.4 GB |
| Speed | 51.6 tok/s generation, 136 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
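The top-22-of-512 expert selection can be illustrated with a generic top-k MoE routing sketch: softmax over the router logits, keep the k largest experts, renormalize their weights. This is an illustrative sketch only, not Nemotron's actual router code; only the expert count (512) and k (22) come from the specs above.

```python
import math
import random

def route_top_k(logits, k=22):
    """Generic top-k MoE routing: softmax over expert logits,
    keep the k largest, renormalize their weights to sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # (expert index, weight)

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(512)]  # one token's router logits
chosen = route_top_k(logits)
print(len(chosen))   # 22 experts active per token
```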

Requirements

  • Apple Silicon Mac with 64+ GB unified memory
  • MLX Studio (recommended) or pip install "jang[mlx]>=2.1.5"
  • Requires patched mlx-lm with latent MoE support (included in JANG loader)

Quick Start

```shell
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_2L")

# With reasoning (recommended for hard questions)
messages = [{"role": "user", "content": "Explain the difference between SSM and attention."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
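With enable_thinking=True the model emits its chain of thought inside <think>...</think> tags (see Key Features). A small helper, assuming those tags appear verbatim in the generated text, separates the reasoning from the final answer:

```python
import re

def split_reasoning(text: str):
    """Split generated text into (reasoning, answer), assuming the model
    wraps its chain of thought in literal <think>...</think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()   # no reasoning block found
    return m.group(1).strip(), text[m.end():].strip()

# Hypothetical output shape, for illustration only
sample = "<think>SSMs scan; attention compares all pairs.</think>SSMs are linear in sequence length."
reasoning, answer = split_reasoning(sample)
print(answer)   # SSMs are linear in sequence length.
```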

Technical Notes

  • Latent MoE: Nemotron-H compresses hidden states from 4096→1024 before expert routing, then decompresses back. This is unique among MoE models and requires a patched mlx-lm. The JANG loader handles this automatically.
  • bfloat16: 512-expert models overflow float16 (max 65,504). JANG auto-detects and uses bfloat16 (max 3.4×10^38). Zero quality impact.
  • Mamba-2 SSM: 40 of 88 layers use state-space models instead of attention, enabling efficient long-context processing.
  • trust_remote_code: Model requires custom Python files (modeling_nemotron_h.py, configuration_nemotron_h.py) which are included.
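The float16-overflow point above can be demonstrated with the standard library alone: IEEE 754 half precision tops out at 65,504, while truncating a float32 to its top two bytes emulates bfloat16's wide exponent range at the cost of mantissa precision. This is a sketch of the number formats only; real kernels use hardware bfloat16, not byte truncation.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip through IEEE 754 half precision; values above
    65,504 cannot be packed and overflow to inf."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return float('inf')

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 2 bytes of a float32:
    same 8-bit exponent (max ~3.4e38), only 8 bits of mantissa."""
    b = struct.pack('>f', x)                       # big-endian float32
    return struct.unpack('>f', b[:2] + b'\x00\x00')[0]

print(to_float16(70000.0))    # inf      (overflows half precision)
print(to_bfloat16(70000.0))   # 69632.0  (coarser, but stays finite)
```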

JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace

Korean

The first quantization that can run Nemotron-3-Super-120B on a 64 GB Mac. 43 GB, 52 tok/s, 86% MMLU.

| | JANG_2L | MLX 4-bit |
|---|---|---|
| MMLU (no reasoning) | 75.0% | 71.0% |
| MMLU (with reasoning) | 86.0% | 93.5% |
| Size | 43 GB | 63 GB |
| Speed | 51.6 tok/s | 59.8 tok/s |

pip install "jang[mlx]>=2.1.5"