---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- mlx
- jang
- jangtq
- jangtq4
- minimax
- minimax_m2
- moe
- apple-silicon
- 4bit
- turboquant
---
# MiniMax M2.7 — JANGTQ4 (MLX)

TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 4-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine. Near-bf16 quality at ~25% of bf16 disk.

---
## Model Details
| Property | Value |
|---|---|
| **Base Model** | MiniMaxAI/MiniMax-M2.7 |
| **Architecture** | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| **Total Parameters** | 228.7 B |
| **Active per Token** | ~1.4 B |
| **Profile** | JANGTQ4 |
| **Format** | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| **Avg bits/param** | ~4.10 |
| **Codebook size** | 16 entries (4-bit) |
| **Disk** | ~113 GB |
| **Context length** | 192 K tokens |
| **Chat template** | Always-reasoning (`<think>` opened at assistant start) |
## What is JANGTQ4?
**JANGTQ** (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact **codebook +
Hadamard-rotated** form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed `uint32` weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# *rotate-input-once* math).
**JANGTQ4** uses a 16-entry Lloyd-Max codebook per routed expert tensor, which
captures the weight distribution near-losslessly. Quality approaches bf16 at
~25% of bf16 disk and runs at the full JANGTQ decode speed. Pick this profile
when RAM permits and you want the closest quality to bf16 on Apple Silicon;
pick JANGTQ (2-bit) for the smallest footprint.
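To make the decode math concrete, here is a minimal NumPy sketch of the idea: quantize each expert weight in a Hadamard-rotated basis against a 16-entry Lloyd-Max codebook, then at inference rotate the activation once and take dot products against looked-up centroids. Everything here (shapes, codebook training, per-tensor granularity) is a simplified assumption for illustration; the real packing and kernels live in `jang-tools`.
```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def lloyd_max(values, k=16, iters=25):
    """1-D Lloyd-Max codebook: k centroids plus nearest-centroid indices."""
    c = np.quantile(values, np.linspace(0, 1, k))           # init on quantiles
    for _ in range(iters):
        idx = np.abs(values[:, None] - c[None, :]).argmin(1)
        for j in range(k):
            if np.any(idx == j):
                c[j] = values[idx == j].mean()
    return c, idx

d = 64
H = hadamard(d) / np.sqrt(d)                                # orthonormal rotation
W = np.random.randn(d, d).astype(np.float32)                # one toy expert weight
codebook, idx = lloyd_max((W @ H).ravel())                  # quantize in the rotated domain
idx = idx.reshape(d, d)                                     # 4-bit indices (packed into uint32 on disk)

x = np.random.randn(d).astype(np.float32)
y_hat = codebook[idx] @ (H.T @ x)                           # rotate input once, look up, accumulate
print(np.abs(y_hat - W @ x).max())                          # small 4-bit reconstruction error
```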
## JANGTQ vs JANGTQ4 vs bf16
| | JANGTQ (2-bit) | **JANGTQ4** | bf16 |
|---|---|---|---|
| Disk | ~57 GB | **~113 GB** | ~457 GB |
| Routed expert bits | 2 | **4** | 16 |
| Codebook size | 4 entries | **16 entries** | — |
| Avg bits/param | ~2.15 | **~4.10** | 16 |
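The averages here follow almost directly from the bit split described in the next section (routed experts are ~98% of parameters). A rough back-of-envelope, ignoring codebook/sidecar overhead and the fp16 leftovers, so the numbers land near but not exactly on the listed figures:
```python
# Back-of-envelope only: routed experts at expert_bits, everything else
# counted at 8-bit affine; fp16 tensors and codebook overhead ignored.
total_params = 228.7e9
routed_frac = 0.98

for expert_bits in (2, 4):                       # JANGTQ vs JANGTQ4
    avg_bits = routed_frac * expert_bits + (1 - routed_frac) * 8
    disk_gb = total_params * avg_bits / 8 / 1e9
    print(f"{expert_bits}-bit experts -> {avg_bits:.2f} avg bits, ~{disk_gb:.0f} GB")
# 2-bit experts -> 2.12 avg bits, ~61 GB   (listed: ~2.15 bits, ~57 GB)
# 4-bit experts -> 4.08 avg bits, ~117 GB  (listed: ~4.10 bits, ~113 GB)
```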
## Bit Allocation
| Component | Bits | Format |
|---|:---:|---|
| Routed expert MLP (gate / up / down) | **4** | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (`nn.QuantizedLinear`, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized `nn.Linear` |
| RMSNorms / RoPE / biases | fp16 | Unquantized |
The routed experts account for roughly 98% of parameters and are the natural compression target. Everything else stays at 8-bit affine (or fp16 for the router gate, norms, and biases), so the quality-critical hot path keeps much higher precision than the experts.
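Read as a rule over parameter names, the allocation looks roughly like the sketch below. The names are hypothetical, and the real assignment is baked into the shipped checkpoint by the jang-tools converter, not decided at load time.
```python
# Hypothetical parameter names, for illustration only; the actual mapping is
# fixed in the shipped checkpoint.
def quant_spec(name: str) -> str:
    if ".experts." in name and name.endswith(
        ("gate_proj.weight", "up_proj.weight", "down_proj.weight")
    ):
        return "JANGTQ4: 16-entry codebook + Hadamard (4-bit)"  # routed expert MLP
    if name.endswith((".bias", "gate.weight")) or "norm" in name:
        return "fp16, unquantized"                              # router gate, norms, biases
    return "8-bit affine, group_size=64"                        # attention, shared expert, embed, lm_head
```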
## Important Settings
MiniMax M2.7 is an **always-reasoning** model. The chat template
unconditionally opens `<think>` at each assistant turn.
| Setting | Value | Notes |
|---|---|---|
| Temperature | **1.0** | Required — `temp=0` can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |
Strip the `<think>…</think>` block from the response before using the final answer.
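A minimal way to do that in Python (plain regex, assuming you have the full response text):
```python
import re

def strip_think(text: str) -> str:
    # Drop the reasoning block; keep only the final answer.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
```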
## Usage
This model requires the `jang-tools` loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).
```bash
pip install jang-tools
```
```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ4")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# mlx_lm takes sampling settings through a sampler; these match the
# recommended temperature / top-p / top-k from the table above.
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
out = generate(model, tokenizer, prompt, max_tokens=8192,
               sampler=sampler, verbose=True)
```
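For token-by-token output, recent `mlx_lm` releases also expose `stream_generate`; a sketch reusing the `model`, `tokenizer`, `prompt`, and `sampler` from above (yielded chunks carry the decoded text in `.text`):
```python
from mlx_lm import stream_generate

for chunk in stream_generate(model, tokenizer, prompt,
                             max_tokens=8192, sampler=sampler):
    print(chunk.text, end="", flush=True)
print()
```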
### Swift — Osaurus / MLX Studio
Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.
## What's In This Repo
| File | Role |
|---|---|
| `model-*.safetensors` (117 shards, ~113 GB) | Weights — 4-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 `capabilities` stamp (`reasoning=qwen3`, `tool=minimax`) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=4`) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png` | Branding asset |
## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)
```json
{
"reasoning_parser": "qwen3",
"tool_parser": "minimax",
"think_in_template": true,
"supports_tools": true,
"supports_thinking": true,
"family": "minimax_m2",
"modality": "text",
"cache_type": "kv"
}
```
`<think>` and `</think>` are non-special tokens by design — the
application layer parses them. Osaurus and the `vmlx` `CapabilityDetector` read
this block verbatim and wire the `qwen3` reasoning parser + `minimax` tool
parser automatically, so streamed responses route `reasoning_content` and
`tool_calls` into the OpenAI-compatible SSE fields instead of leaking into
`content`.
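If you are wiring this up yourself rather than through Osaurus / vmlx, the non-streaming equivalent is a simple split on the closing tag. This is a sketch, not the parsers' actual implementation; the field names only mirror the OpenAI-compatible response shape.
```python
def split_reasoning(text: str) -> dict:
    # Route the <think> block into reasoning_content, the rest into content.
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return {
            "reasoning_content": reasoning.replace("<think>", "", 1).strip(),
            "content": answer.strip(),
        }
    return {"reasoning_content": "", "content": text.strip()}
```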
## License
MIT — see [`LICENSE`](./LICENSE).
## Credits
Created by [Jinho Jang](https://twitter.com/jangq_ai) — `eric@jangq.ai`
Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.