---
language:
- en
- zh
library_name: mlx
license: other
license_name: modified-mit
pipeline_tag: text-generation
base_model: MiniMaxAI/MiniMax-M2
tags:
- moe
- mixture-of-experts
- minimax_m2
- quantized
- apple-silicon
- mlx
- turboquant
- jangtq
- jangtq2
- reap
---
This is a ~138B-A10B MoE weighing 38 GB on disk, down from MiniMax M2's ~460 GB (~230B-parameter base), produced by a 40% routed-expert prune plus 2-bit JANGTQ quantization. Starting from MiniMax M2, experts are scored and pruned with REAP saliency, then quantized with the JANGTQ2 codebook scheme: routed experts at 2-bit via Lloyd-Max codebooks with Hadamard rotation; attention, embeddings, lm_head, and dense MLP at 8-bit affine; norms and router at 16-bit.
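The 2-bit path pairs a Hadamard rotation, which spreads outlier weights evenly across coordinates, with a Lloyd-Max codebook, the MSE-optimal 4-level scalar quantizer (equivalent to 1-D k-means). Below is a minimal numpy sketch of that combination; all names are illustrative, and the shipped jang_tools kernels may differ in detail (e.g. per-group codebooks, bit packing, fused dequant).

```python
# Illustrative sketch only -- NOT the jang_tools implementation.
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # H @ H.T == I, so the rotation is exactly invertible


def lloyd_max(x: np.ndarray, n_levels: int = 4, iters: int = 30) -> np.ndarray:
    """MSE-optimal scalar codebook via Lloyd iterations (1-D k-means)."""
    levels = np.quantile(x, np.linspace(0.05, 0.95, n_levels))  # spread init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)  # nearest level
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()  # move level to cluster centroid
    return np.sort(levels)


def quantize_2bit(W: np.ndarray):
    """Rotate a weight matrix, fit a 4-entry codebook, emit 2-bit codes."""
    H = hadamard(W.shape[1])
    Wr = W @ H                            # rotation Gaussianizes / de-outliers rows
    codebook = lloyd_max(Wr.ravel())
    codes = np.abs(Wr[..., None] - codebook).argmin(-1).astype(np.uint8)
    return codes, codebook, H


def dequantize(codes, codebook, H):
    return codebook[codes] @ H.T          # look up levels, undo the rotation


W = np.random.randn(8, 64).astype(np.float32)
codes, cb, H = quantize_2bit(W)
print("reconstruction MSE:", float(((W - dequantize(codes, cb, H)) ** 2).mean()))
```

The rotation matters at 2 bits because a single outlier weight would otherwise dominate the codebook fit; after rotation the per-coordinate distribution is much closer to Gaussian, which Lloyd-Max handles well, and since the Hadamard matrix is orthonormal the rotation is undone exactly at load time.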
---

## Model Details

Runs on Apple Silicon via the JANG toolchain + MLX.

```
MiniMax M2 (base)
  ↓ v3 calibration corpus (code · agentic · general · academic · science · CN · cyber · systems · long-context)
  ↓ REAP saliency observer (62 layers × 256 experts → scoring)
  ↓ 40% expert prune (154 of 256 kept per layer)
  ↓ JANGTQ2 quantization
      • 2-bit MXTQ on routed-expert weights (Hadamard-rotated Lloyd-Max codebook)
      • 8-bit affine on attention + dense MLP + embed + lm_head
      • 16-bit on norms and router weights
```

A sketch of the expert-pruning step appears at the end of this card.

| Property | Value |
|---|---|
| Parameters | **~138B total, ~10B active per token** |
| Routed experts kept | 154 of 256 (60%) |
| Top-k active experts | 8 per token |
| Layers | 62 |
| Bundle size | 38 GB |
| Dtype | bfloat16 activations |
| Attention | standard Q/K/V, GQA 6:1, head_dim=128, rope_theta=5M |
| Context | 196,608 tokens |

## Use

```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-Small-JANGTQ")

messages = [{"role": "user", "content": "Write a Python function that…"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Interleaved-thinking / always-reasoning model. Use MiniMax's
# official sampling: temp=1.0, top_p=0.95, top_k=40.
out = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=4096,
    sampler=make_sampler(temp=1.0, top_p=0.95, top_k=40),
)
```

## Evaluation

### HumanEval+ (code generation)

- **Dataset**: `evalplus/humanevalplus` test split (164 prompts with harder tests than HumanEval).
- **Protocol**: sampled pass@1 baseline plus a pass@5 retry on failures (sketched after this list).
- **Sampling (both pass@1 and pass@5 retry)**: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); max_tokens=5000 for pass@1, 1200 for pass@5; k=5 samples per failed problem, early stop on the first pass.
- **Grading**: each candidate is run in a subprocess with a 20 s timeout and must pass ALL EvalPlus tests.
- **Extractor**: `jang_tools.kimi_prune.bench_humaneval._extract_code` (≥ 2026-04-24). The earlier extractor mis-paired markdown fences when the model emitted token-boundary glitches at the language tag (e.g. `` ```python一致: ``, `` ```pythonfr ``) and when the chat template prefilled `
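For concreteness, here is a minimal numpy sketch of the saliency-scored expert prune referenced in Model Details. The scoring rule (mean router gate weight times mean expert-output norm) and all names are illustrative assumptions; the actual REAP observer in jang_tools may score experts differently.

```python
# Illustrative sketch of saliency-based expert pruning, not the REAP code.
import numpy as np

N_EXPERTS, KEEP_PER_LAYER = 256, 154  # 40% prune, applied per layer


def expert_saliency(gate_weights: np.ndarray, output_norms: np.ndarray) -> np.ndarray:
    """Per-expert saliency from calibration stats of shape (n_tokens, n_experts).

    Heuristic: an expert matters if the router selects it strongly AND its
    output meaningfully perturbs the residual stream.
    """
    return gate_weights.mean(axis=0) * output_norms.mean(axis=0)


def prune_layer(saliency: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` highest-saliency experts (sorted for a stable layout)."""
    return np.sort(np.argsort(saliency)[-keep:])


# Fake calibration statistics for a single layer, just to exercise the code.
rng = np.random.default_rng(0)
gates = rng.random((10_000, N_EXPERTS))  # router gate weight per token/expert
norms = rng.random((10_000, N_EXPERTS))  # ||expert output|| per token/expert
kept = prune_layer(expert_saliency(gates, norms), KEEP_PER_LAYER)
assert kept.shape == (KEEP_PER_LAYER,)   # 154 of 256 experts survive
```

After pruning, the router's output dimension must be re-indexed to the surviving experts; that bookkeeping is omitted here.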
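And a minimal sketch of the pass@1 + pass@5-retry grading loop from the Evaluation protocol. `generate_solution` and the problem records are hypothetical placeholders for the model call and the EvalPlus problem set; only the control flow and the 20 s subprocess grading mirror the protocol described above.

```python
# Illustrative sketch of the grading loop; placeholder I/O, real control flow.
import os
import subprocess
import sys
import tempfile


def passes(candidate_code: str, test_code: str, timeout_s: float = 20.0) -> bool:
    """Run candidate + EvalPlus tests in a fresh subprocess; timeout or nonzero exit fails."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], timeout=timeout_s, capture_output=True
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)


def evaluate(problems, generate_solution, k_retry: int = 5) -> float:
    solved = 0
    for prob in problems:
        # pass@1 baseline: one sample at max_tokens=5000
        if passes(generate_solution(prob, max_tokens=5000), prob["tests"]):
            solved += 1
            continue
        # pass@5 retry on failure: up to k samples at max_tokens=1200,
        # early-stopping on the first candidate that passes
        for _ in range(k_retry):
            if passes(generate_solution(prob, max_tokens=1200), prob["tests"]):
                solved += 1
                break
    return solved / len(problems)
```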