MLX Studio

MLX Studio — the only app that natively supports JANG models with reasoning


Nemotron-3-Super-120B on a 64 GB Mac. The first working Nemotron-H quantization for Apple Silicon. 43 GB, 52 tok/s, 86% MMLU with reasoning. Hybrid Mamba-2 SSM + Latent MoE + Attention architecture.

LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or pip install "jang[mlx]>=2.1.5".


JANG

Nemotron-3-Super-120B-A12B — JANG_2L (2.8-bit, 8-bit attention) — Reasoning

JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX


GitHub  PyPI  Website  X/Twitter

JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.

Key Features

  • 86.0% MMLU (200 questions, reasoning mode)
  • 51.6 tok/s generation, 136 tok/s prefill
  • 43 GB on disk, 42.4 GB GPU RAM
  • Reasoning mode: <think>...</think> step-by-step problem solving
  • Hybrid architecture: 40 Mamba-2 SSM + 40 Latent MoE (512 experts) + 8 Dense Attention layers
  • bfloat16 compute: auto-detected for 512-expert models

Results: JANG vs MLX (200-question MMLU)

Per-subject comparison. All configurations were tested both with and without reasoning, using identical methodology.

| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning |
|---|---|---|---|---|---|---|
| Abstract Algebra | 12/20 | 16/20 | 10/20 | 19/20 | 9/20 | 19/20 |
| Anatomy | 15/20 | 17/20 | 15/20 | 18/20 | 14/20 | 18/20 |
| Astronomy | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 | 19/20 |
| College CS | 13/20 | 15/20 | 13/20 | 17/20 | 14/20 | 17/20 |
| College Physics | 14/20 | 18/20 | 14/20 | 19/20 | 13/20 | 20/20 |
| HS Biology | 19/20 | 18/20 | 19/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 15/20 | 16/20 | 15/20 | 18/20 | 16/20 | 19/20 |
| HS Mathematics | 8/20 | 18/20 | 6/20 | 18/20 | 6/20 | 18/20 |
| Logical Fallacies | 17/20 | 18/20 | 17/20 | 19/20 | 17/20 | 18/20 |
| World Religions | 18/20 | 17/20 | 17/20 | 19/20 | 16/20 | 19/20 |
| Total | 150/200 (75.0%) | 172/200 (86.0%) | 145/200 (72.5%) | 186/200 (93.0%) | 142/200 (71.0%) | 187/200 (93.5%) |
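The per-subject columns can be cross-checked against the reported totals with a few lines of Python (scores transcribed from the table above; each column is one configuration, 10 subjects at 20 questions each):

```python
# Per-subject correct counts, transcribed from the table above.
columns = {
    "JANG_2L no-think":    [12, 15, 19, 13, 14, 19, 15, 8, 17, 18],
    "JANG_2L reasoning":   [16, 17, 19, 15, 18, 18, 16, 18, 18, 17],
    "JANG_4M no-think":    [10, 15, 19, 13, 14, 19, 15, 6, 17, 17],
    "JANG_4M reasoning":   [19, 18, 19, 17, 19, 20, 18, 18, 19, 19],
    "MLX 4-bit no-think":  [9, 14, 19, 14, 13, 18, 16, 6, 17, 16],
    "MLX 4-bit reasoning": [19, 18, 19, 17, 20, 20, 19, 18, 18, 19],
}

for name, scores in columns.items():
    total = sum(scores)
    # 200 questions total, so percentage = total / 2
    print(f"{name}: {total}/200 = {total / 2:.1f}%")
```

Every column sums to the total reported in the last row of the table.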

Summary

| | JANG_2L | JANG_4M | MLX 3-bit | MLX 4-bit |
|---|---|---|---|---|
| MMLU (no-think) | 75.0% | 72.5% | Crashes | 71.0% |
| MMLU (reasoning) | 86.0% | 93.0% | Crashes | 93.5% |
| Size | 43 GB | 63 GB | N/A | 63 GB |
| GPU RAM | 42.4 GB | 61.2 GB | N/A | 63.3 GB |
| Speed | 51.6 tok/s | 55.1 tok/s | N/A | 59.8 tok/s |
| Fits 64 GB? | YES | YES | N/A | YES |

JANG_2L is 20 GB smaller than MLX 4-bit (43 GB vs 63 GB, a 32% storage saving) and scores 4 points higher on no-think MMLU (75.0% vs 71.0%). JANG_4M matches MLX 4-bit (93.0% vs 93.5% reasoning MMLU) at the same 63 GB size.

MLX 3-bit cannot be created: mlx_lm.convert crashes on Nemotron's mtp.* weights. Only JANG can produce sub-4-bit quantizations of this model.

JANG_4M at the same size as MLX 4-bit: 93.0% vs 93.5%, nearly identical quality, with 8-bit attention protection.
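A back-of-envelope check: the JANG_2L average of 2.76 bits per weight over the model's nominal 120B parameters lands close to the reported 43 GB on disk (the gap is plausibly unquantized tensors such as scales and embeddings, plus the nominal vs exact parameter count):

```python
# Rough size estimate: average bits-per-weight x parameter count.
params = 120e9   # nominal parameter count
bpw = 2.76       # JANG_2L average bits per weight

size_gb = params * bpw / 8 / 1e9   # bits -> bytes -> GB (decimal)
print(f"~{size_gb:.1f} GB")        # ~41.4 GB, in line with 43 GB on disk
```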

Specs

| Metric | Value |
|---|---|
| Source | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 |
| Architecture | Hybrid Mamba-2 SSM + Latent MoE + Dense Attention |
| Layers | 88 (40 Mamba-2 + 40 MoE + 8 Attention) |
| Experts | 512 per MoE layer, top-22 active (12B active params) |
| Latent MoE | fc1_latent_proj (4096→1024), experts in latent space |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 2.76 bpw |
| Disk size | 43 GB |
| GPU RAM | 42.4 GB |
| Speed | 51.6 tok/s generation, 136 tok/s prefill |
| Compute | bfloat16 (auto-detected) |
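The top-22-of-512 expert selection can be illustrated with a generic top-k MoE routing sketch: softmax over the router logits, keep the k largest experts, renormalize their weights. This is an illustrative sketch only, not Nemotron's actual router code; only the expert count (512) and k (22) come from the specs above.

```python
import math
import random

def route_top_k(logits, k=22):
    """Generic top-k MoE routing: softmax over expert logits,
    keep the k largest, renormalize their weights to sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # (expert index, weight)

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(512)]  # one token's router logits
chosen = route_top_k(logits)
print(len(chosen))   # 22 experts active per token
```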

Requirements

  • Apple Silicon Mac with 64+ GB unified memory
  • MLX Studio (recommended) or pip install "jang[mlx]>=2.1.5"
  • Requires patched mlx-lm with latent MoE support (included in JANG loader)

Quick Start

```shell
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-3-Super-120B-A12B-JANG_2L")

# With reasoning (recommended for hard questions)
messages = [{"role": "user", "content": "Explain the difference between SSM and attention."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
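With enable_thinking=True the model emits its chain of thought inside <think>...</think> tags (see Key Features). A small helper, assuming those tags appear verbatim in the generated text, separates the reasoning from the final answer:

```python
import re

def split_reasoning(text: str):
    """Split generated text into (reasoning, answer), assuming the model
    wraps its chain of thought in literal <think>...</think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()   # no reasoning block found
    return m.group(1).strip(), text[m.end():].strip()

# Hypothetical output shape, for illustration only
sample = "<think>SSMs scan; attention compares all pairs.</think>SSMs are linear in sequence length."
reasoning, answer = split_reasoning(sample)
print(answer)   # SSMs are linear in sequence length.
```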

Technical Notes

  • Latent MoE: Nemotron-H compresses hidden states from 4096→1024 before expert routing, then decompresses back. This is unique among MoE models and requires a patched mlx-lm. The JANG loader handles this automatically.
  • bfloat16: 512-expert models overflow float16 (max 65,504). JANG auto-detects and uses bfloat16 (max 3.4×10^38). Zero quality impact.
  • Mamba-2 SSM: 40 of 88 layers use state-space models instead of attention, enabling efficient long-context processing.
  • trust_remote_code: Model requires custom Python files (modeling_nemotron_h.py, configuration_nemotron_h.py) which are included.
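The float16-overflow point above can be demonstrated with the standard library alone: IEEE 754 half precision tops out at 65,504, while truncating a float32 to its top two bytes emulates bfloat16's wide exponent range at the cost of mantissa precision. This is a sketch of the number formats only; real kernels use hardware bfloat16, not byte truncation.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip through IEEE 754 half precision; values above
    65,504 cannot be packed and overflow to inf."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return float('inf')

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 2 bytes of a float32:
    same 8-bit exponent (max ~3.4e38), only 8 bits of mantissa."""
    b = struct.pack('>f', x)                       # big-endian float32
    return struct.unpack('>f', b[:2] + b'\x00\x00')[0]

print(to_float16(70000.0))    # inf      (overflows half precision)
print(to_bfloat16(70000.0))   # 69632.0  (coarser, but stays finite)
```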

JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace

Korean

The first quantization that can run Nemotron-3-Super-120B on a 64 GB Mac. 43 GB, 52 tok/s, 86% MMLU.

| | JANG_2L | MLX 4-bit |
|---|---|---|
| MMLU (no reasoning) | 75.0% | 71.0% |
| MMLU (with reasoning) | 86.0% | 93.5% |
| Size | 43 GB | 63 GB |
| Speed | 51.6 tok/s | 59.8 tok/s |

pip install "jang[mlx]>=2.1.5"