---
language:
- en
- zh
library_name: mlx
license: mit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- mlx
- jang
- jangtq
- minimax
- minimax_m2
- moe
- apple-silicon
- 2bit
- turboquant
---

<p align="center">
  <a href="https://osaurus.ai"><img src="./osaurus-x-banner.png" alt="Osaurus AI"></a>
</p>

<h3 align="center">MiniMax M2.7 — JANGTQ (MLX)</h3>
<p align="center">TurboQuant codebook quantization of MiniMax's 228B agentic MoE — routed experts at 2-bit via Lloyd-Max codebooks + Hadamard rotation, attention / embed / shared-expert / lm_head at 8-bit affine.</p>

<p align="center">
  <a href="https://osaurus.ai"><img src="https://img.shields.io/badge/Web-osaurus.ai-blue" alt="Website"></a>
  <a href="https://huggingface.co/OsaurusAI"><img src="https://img.shields.io/badge/HF-OsaurusAI-yellow?logo=huggingface" alt="OsaurusAI"></a>
</p>

---

## Model Details

| Property | Value |
|---|---|
| **Base Model** | MiniMaxAI/MiniMax-M2.7 |
| **Architecture** | MoE (256 experts, top-8 active) + standard Q/K/V attention + partial RoPE |
| **Total Parameters** | 228.7 B |
| **Active per Token** | ~1.4 B |
| **Profile** | JANGTQ |
| **Format** | JANGTQ (codebook + Hadamard) — `weight_format: mxtq` in `jang_config.json` |
| **Avg bits/param** | ~2.15 |
| **Disk Size** | ~57 GB |
| **Context Length** | 192 K tokens |
| **Chat Template** | Always-reasoning (`<think>` opened at assistant start) |

## What is JANGTQ?

**JANGTQ** (JANG TurboQuant) is a codebook-based quantization format for MoE
models on Apple Silicon. Routed expert weights stay in a compact **codebook +
Hadamard-rotated** form at runtime — no decompression to affine — and the
matmul path uses custom Metal kernels that read packed `uint32` weights, look
up centroids in a small codebook, and accumulate dot products against a
Hadamard-rotated input (QuIP# *rotate-input-once* math).
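
To make the dataflow concrete, here is a minimal NumPy sketch of that read-lookup-accumulate path. It is illustrative only: the toy shapes, the single 4-entry codebook, and the explicit sign vector are assumptions for the sketch, not the layout the Metal kernels actually use.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Expand each uint32 word into its 16 two-bit codebook indices."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    return (packed[..., None] >> shifts) & 0x3              # values in {0, 1, 2, 3}

def jangtq_matvec(x, packed, codebook, signs):
    """2-bit codebook mat-vec using the rotate-input-once trick."""
    n = x.shape[0]
    x_rot = hadamard(n) @ (signs * x)                        # rotate the activation once
    idx = unpack_2bit(packed).reshape(packed.shape[0], -1)[:, :n]
    w_rot = codebook[idx]                                    # centroid lookup = rotated weight rows
    return w_rot @ x_rot                                     # accumulate dot products

# Toy shapes, just to check the plumbing.
rng = np.random.default_rng(0)
n_in, n_out = 64, 8
packed = rng.integers(0, 2**32, size=(n_out, n_in // 16), dtype=np.uint32)
codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)   # stand-in Lloyd-Max centroids
signs = rng.choice([-1.0, 1.0], size=n_in).astype(np.float32)
x = rng.standard_normal(n_in).astype(np.float32)
print(jangtq_matvec(x, packed, codebook, signs).shape)       # (8,)
```
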
**Result vs uniform 2-bit affine:** smaller on disk, higher quality, runs at
~89 % of affine 2-bit speed.

## Bit Allocation

| Component | Bits | Format |
|---|:---:|---|
| Routed expert MLP (gate / up / down) | **2** | JANGTQ codebook + Hadamard |
| Attention (Q / K / V / O) | 8 | Affine (`nn.QuantizedLinear`, group_size=64) |
| Shared expert | 8 | Affine |
| Embed tokens / LM head | 8 | Affine |
| Router gate | fp16 | Unquantized `nn.Linear` |
| RMSNorms / RoPE / biases | fp16 | Unquantized |

The routed experts hold ~98 % of the parameters and are the natural compression
target. Everything else stays at 8-bit affine (with the router gate, norms, and
biases in fp16), so the quality-critical hot path gives up very little precision.
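
Read as code, the table above is a per-layer routing rule. A hypothetical sketch of that rule (the function name `quant_plan` and the exact path patterns are illustrative, not the `jang-tools` API):

```python
def quant_plan(layer_path: str):
    """Map a parameter path to (format, bits) following the JANGTQ bit allocation."""
    if ".experts." in layer_path or "switch_mlp" in layer_path:
        return ("jangtq", 2)        # routed expert MLPs: codebook + Hadamard
    if any(k in layer_path for k in ("q_proj", "k_proj", "v_proj", "o_proj",
                                     "shared_expert", "embed_tokens", "lm_head")):
        return ("affine", 8)        # 8-bit affine, group_size=64
    return ("fp16", None)           # router gate, norms, RoPE, biases

print(quant_plan("model.layers.0.mlp.experts.3.up_proj"))   # ('jangtq', 2)
print(quant_plan("model.layers.0.self_attn.q_proj"))        # ('affine', 8)
print(quant_plan("model.layers.0.mlp.gate"))                 # ('fp16', None)
```
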
## Important Settings

MiniMax M2.7 is an **always-reasoning** model. The chat template
unconditionally opens `<think>` at each assistant turn.

| Setting | Value | Notes |
|---|---|---|
| Temperature | **1.0** | Required — `temp=0` can cause thinking loops |
| Top-P | 0.95 | |
| Top-K | 40 | |
| Repetition Penalty | 1.1 | Optional, helps prevent loops |
| `max_tokens` | ≥ 8192 | Give reasoning room to converge |

Strip `<think>…</think>` from the response before using the final answer.
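
For example, one minimal way to do that in Python (a plain regex over the finished response; the helper name is just for the sketch):

```python
import re

def strip_think(text: str) -> str:
    """Drop the <think>…</think> block, plus a dangling open tag if generation was cut off."""
    text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    return re.sub(r"<think>.*", "", text, flags=re.DOTALL).strip()

raw = "<think>reasoning goes here</think>\nPhotosynthesis converts light into chemical energy."
print(strip_think(raw))   # Photosynthesis converts light into chemical energy.
```
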
## Usage

This model requires the `jang-tools` loader — stock `mlx_lm.load()` does not
recognize `weight_format: mxtq`. The loader applies Metal kernel
monkey-patches at load time (fused gate+up+SwiGLU, gather TQ, multi-block
Hadamard, router compile, QKV fusion).

```bash
pip install jang-tools
```

```python
from huggingface_hub import snapshot_download
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model_path = snapshot_download("OsaurusAI/MiniMax-M2.7-JANGTQ")
model, tokenizer = load_jangtq_model(model_path)

messages = [{"role": "user", "content": "Explain photosynthesis in five sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Sampling settings from the table above; recent mlx_lm takes them via a sampler
# rather than as keyword arguments to generate().
sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
out = generate(model, tokenizer, prompt, max_tokens=8192,
               sampler=sampler, verbose=True)
```

### Swift — Osaurus / MLX Studio

Both clients auto-detect the JANGTQ runtime from `jang_config.json` and route
through the `MiniMaxJANGTQModel` class. Just load the repo — no extra flags.

## What's In This Repo

| File | Role |
|---|---|
| `model-*.safetensors` (61 shards, ~57 GB) | Weights — 2-bit routed TQ + 8-bit affine |
| `model.safetensors.index.json` | Shard index |
| `jangtq_runtime.safetensors` | Codebooks + Hadamard signs sidecar (Swift loader) |
| `jang_config.json` | JANG metadata + Tier-1 `capabilities` stamp (`reasoning=qwen3`, `tool=minimax`) |
| `config.json` | HF model config (`minimax_m2`, `weight_format=mxtq`, `mxtq_bits=2`) |
| `chat_template.jinja`, `tokenizer.*`, `vocab.json`, `merges.txt` | Tokenizer + chat template |
| `configuration_minimax_m2.py`, `modeling_minimax_m2.py` | HF custom code (untouched from upstream) |
| `osaurus-x-banner.png`, `mlx-studio-logo.png` | Branding assets |
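
To sanity-check a download before loading, the stamped fields can be read straight out of `config.json` (the local path below is a placeholder; the field names are the ones listed in the table):

```python
import json
from pathlib import Path

repo = Path("MiniMax-M2.7-JANGTQ")   # wherever the snapshot was downloaded
cfg = json.loads((repo / "config.json").read_text())

if cfg.get("weight_format") == "mxtq":
    print(f"JANGTQ export: {cfg.get('model_type')} with {cfg.get('mxtq_bits')}-bit routed experts")
else:
    print("Not a JANGTQ export; use a stock MLX loader.")
```
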
## Parser Capabilities (Tier-1 auto-detected by Osaurus / vmlx)

```json
{
  "reasoning_parser": "qwen3",
  "tool_parser": "minimax",
  "think_in_template": true,
  "supports_tools": true,
  "supports_thinking": true,
  "family": "minimax_m2",
  "modality": "text",
  "cache_type": "kv"
}
```

`<think>` and `<tool_call>` are non-special tokens by design — the
application layer parses them. Osaurus and `vmlx` `CapabilityDetector` read
this block verbatim and wire the `qwen3` reasoning parser + `minimax` tool
parser automatically, so streamed responses route `reasoning_content` and
`tool_calls` into the OpenAI-compatible SSE fields instead of leaking into
`content`.
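
On the client side this means an OpenAI-style stream arrives already split into those fields. A sketch with the `openai` Python client (the base URL, port, and model id are placeholders for wherever Osaurus or vmlx is serving):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1337/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="OsaurusAI/MiniMax-M2.7-JANGTQ",
    messages=[{"role": "user", "content": "Plan a three-step debugging strategy."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=8192,
    stream=True,
)

thinking, answer = [], []
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):   # reasoning arrives on its own field
        thinking.append(delta.reasoning_content)
    if delta.content:                               # final answer, with reasoning already routed away
        answer.append(delta.content)
    if delta.tool_calls:
        pass                                        # accumulate tool-call deltas here if using tools

print("".join(answer))
```
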
## License

MIT — see [`LICENSE`](./LICENSE).

## Credits

Created by [Jinho Jang](https://twitter.com/jangq_ai) — `eric@jangq.ai`

Based on MiniMaxAI's MiniMax M2.7. JANGTQ quantization © JANGQ-AI.