MLX Studio — the only app that natively supports JANG models with reasoning
Nemotron-3-Super-120B on a 64 GB Mac. The first working Nemotron-H quantization for Apple Silicon. 43 GB, 52 tok/s, 86% MMLU with reasoning. Hybrid Mamba-2 SSM + Latent MoE + Attention architecture.
LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or `pip install "jang[mlx]>=2.1.5"`.
Nemotron-3-Super-120B-A12B — JANG_2L (2.8-bit, 8-bit attention) — Reasoning
JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
Key Features
- 86.0% MMLU (200 questions, reasoning mode)
- 51.6 tok/s generation, 136 tok/s prefill
- 43 GB on disk, 42.4 GB GPU RAM
- Reasoning mode: `<think>...</think>` step-by-step problem solving
- Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
- bfloat16 compute: auto-detected for 512-expert models
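The float16 overflow issue behind the bfloat16 auto-detection can be shown directly. A minimal numpy sketch (stock numpy has no bfloat16 dtype, so float32, which shares bfloat16's 8-bit exponent, stands in to show the wider range):

```python
import numpy as np

# float16 tops out at 65,504; accumulating activations across many
# experts can silently exceed it.
print(np.finfo(np.float16).max)              # 65504.0
overflowed = np.float16(50000) + np.float16(50000)
print(overflowed)                            # inf — silent overflow

# bfloat16 keeps an 8-bit exponent like float32 (max ~3.4e38),
# so the same sum stays finite.
still_finite = np.float32(50000) + np.float32(50000)
print(still_finite)                          # 100000.0
```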
Results: JANG vs MLX (200-question MMLU)
Per-subject comparison. All quantizations were tested with and without reasoning using identical methodology.
| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 12/20 | 16/20 | 10/20 | 19/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 17/20 | 15/20 | 18/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 15/20 | 13/20 | 17/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 18/20 | 14/20 | 19/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 18/20 | 19/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 16/20 | 15/20 | 18/20 | 16/20 | 19/20 |
| HS Mathematics | 8/20 | 18/20 | 6/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 18/20 | 17/20 | 19/20 | 17/20 | 18/20 |
| World Religions | 18/20 | 17/20 | 17/20 | 19/20 | 16/20 | 19/20 |
| Total | 150/200 (75.0%) | 172/200 (86.0%) | 145/200 (72.5%) | 186/200 (93.0%) | 142/200 (71.0%) | 187/200 (93.5%) |
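The column totals can be sanity-checked directly from the per-subject rows:

```python
# Per-subject scores from the table above; column order:
# (JANG_2L no-think, JANG_2L reasoning, JANG_4M no-think,
#  JANG_4M reasoning, MLX 4-bit no-think, MLX 4-bit reasoning)
rows = [
    (12, 16, 10, 19,  9, 19),  # Abstract Algebra
    (15, 17, 15, 18, 14, 18),  # Anatomy
    (19, 19, 19, 19, 19, 19),  # Astronomy
    (13, 15, 13, 17, 14, 17),  # College CS
    (14, 18, 14, 19, 13, 20),  # College Physics
    (19, 18, 19, 20, 18, 20),  # HS Biology
    (15, 16, 15, 18, 16, 19),  # HS Chemistry
    ( 8, 18,  6, 18,  6, 18),  # HS Mathematics
    (17, 18, 17, 19, 17, 18),  # Logical Fallacies
    (18, 17, 17, 19, 16, 19),  # World Religions
]
totals = [sum(col) for col in zip(*rows)]
print(totals)                           # [150, 172, 145, 186, 142, 187]
print([t / 200 * 100 for t in totals])  # [75.0, 86.0, 72.5, 93.0, 71.0, 93.5]
```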
Summary
| | JANG_2L | JANG_4M | MLX 3-bit | MLX 4-bit |
|---|---|---|---|---|
| MMLU (no-think) | 75.0% | 72.5% | Crashes | 71.0% |
| MMLU (reasoning) | 86.0% | 93.0% | Crashes | 93.5% |
| Size | 43 GB | 63 GB | N/A | 63 GB |
| GPU RAM | 42.4 GB | 61.2 GB | N/A | 63.3 GB |
| Speed | 51.6 tok/s | 55.1 tok/s | N/A | 59.8 tok/s |
| Fits 64 GB? | YES | YES | N/A | YES |
JANG_2L is 20 GB smaller than MLX 4-bit (32% less storage) and scores 4 points higher on no-think MMLU (75.0% vs 71.0%). JANG_4M matches MLX 4-bit on reasoning MMLU (93.0% vs 93.5%) at the same 63 GB size, with 8-bit attention protection.
MLX 3-bit cannot be created: `mlx_lm.convert` crashes on Nemotron's `mtp.*` weights. Only JANG can produce sub-4-bit quantizations of this model.
Specs
| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Latent MoE | fc1_latent_proj (4096→1024), experts in latent space |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 2.76 bpw |
| Disk size | 43 GB |
| GPU RAM | 42.4 GB |
| Speed | 51.6 tok/s generation, 136 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
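As a back-of-envelope check (assuming the nominal 120B total parameter count), the 2.76 bpw average lines up with the 43 GB disk footprint:

```python
# Rough size estimate: total params × average bits per weight.
total_params = 120e9   # nominal "120B" total parameter count (assumption)
avg_bits = 2.76        # JANG_2L average (CRITICAL=8, IMPORTANT=6, COMPRESS=2)

gb = total_params * avg_bits / 8 / 1e9
print(round(gb, 1))    # ~41.4 GB; quantization scales and metadata
                       # plausibly account for the rest of the 43 GB
```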
Requirements
- Apple Silicon Mac with 64+ GB unified memory
- MLX Studio (recommended) or `pip install "jang[mlx]>=2.1.5"`
- Patched mlx-lm with latent MoE support (included in the JANG loader)
Quick Start
```shell
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_2L")

# With reasoning (recommended for hard questions)
messages = [{"role": "user", "content": "Explain the difference between SSM and attention."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
Technical Notes
- Latent MoE: Nemotron-H compresses hidden states from 4096→1024 before expert routing, then decompresses back. This is unique among MoE models and requires a patched mlx-lm. The JANG loader handles this automatically.
- bfloat16: 512-expert models overflow float16 (max 65,504). JANG auto-detects and uses bfloat16 (max 3.4×10^38). Zero quality impact.
- Mamba-2 SSM: 40 of 88 layers use state-space models instead of attention, enabling efficient long-context processing.
- trust_remote_code: Model requires custom Python files (modeling_nemotron_h.py, configuration_nemotron_h.py) which are included.
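The latent MoE dataflow described above (compress 4096→1024, route, compute experts in latent space, decompress) can be sketched minimally in numpy. All weight names here are made up for illustration, and toy elementwise scales stand in for the real latent-space expert FFNs:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_LATENT = 4096, 1024   # dimensions from the model card
N_EXPERTS, TOP_K = 512, 22       # 512 experts, top-22 active

# Hypothetical weights for illustration only.
w_latent = rng.standard_normal((D_MODEL, D_LATENT)) * 0.02  # fc1_latent_proj analogue
w_out    = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02  # decompression
w_router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
experts  = rng.standard_normal((N_EXPERTS, D_LATENT))       # toy per-expert scales

def latent_moe(x):
    """Route in model space; run experts in the compressed latent space."""
    logits = x @ w_router                         # (512,) router scores
    top = np.argsort(logits)[-TOP_K:]             # top-22 expert indices
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over chosen experts
    z = x @ w_latent                              # compress 4096 -> 1024
    y = sum(g * (z * experts[e]) for g, e in zip(gates, top))
    return y @ w_out                              # decompress 1024 -> 4096

out = latent_moe(rng.standard_normal(D_MODEL))
print(out.shape)  # (4096,)
```

The point of the compression is that each expert works on 1024-dim vectors instead of 4096-dim ones, shrinking expert weights roughly 16× per matrix, which is why stock mlx-lm (expecting experts in model space) needs patching.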
JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace
Korean Summary
The first quantization that can run Nemotron-3-Super-120B on a 64 GB Mac: 43 GB, 52 tok/s, 86% MMLU.
| | JANG_2L | MLX 4-bit |
|---|---|---|
| MMLU (no reasoning) | 75.0% | 71.0% |
| MMLU (with reasoning) | 86.0% | 93.5% |
| Size | 43 GB | 63 GB |
| Speed | 51.6 tok/s | 59.8 tok/s |
`pip install "jang[mlx]>=2.1.5"`