Osaurus AI

MiniMax M2.7 Small — 138B-A10B — JANGTQ (MLX)

This is a ~138B-A10B MoE at 38 GB on disk (down from MiniMax M2's ~460 GB, 230B-parameter base), produced by a 40% routed-expert prune plus 2-bit JANGTQ quantization. It is obtained from MiniMax M2 via REAP saliency pruning and JANGTQ2 codebook quantization: routed experts at 2-bit via Lloyd-Max codebooks with Hadamard rotation, attention / embed / lm_head / dense MLP at 8-bit affine, and norms and router at 16-bit.

Website · OsaurusAI · MiniMax M2


Model Details

Runs on Apple Silicon via the JANG toolchain + MLX.

MiniMax M2 (base)
    ↓  v3 calibration corpus  (code · agentic · general · academic · science · CN · cyber · systems · long-context)
    ↓
REAP saliency observer (62 layers × 256 experts → scoring)
    ↓  40% expert prune (154 of 256 kept per layer)
    ↓
JANGTQ2 quantization
    • 2-bit MXTQ on routed-expert weights (Hadamard-rotated Lloyd-Max codebook)
    • 8-bit affine on attention + dense MLP + embed + lm_head
    • 16-bit on norms and router weights
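
For illustration, here is a minimal sketch of the saliency-driven expert prune in the pipeline above, assuming a simple router-weighted output-norm score accumulated over calibration tokens; the exact REAP criterion and observer plumbing in the JANG toolchain may differ, so treat the names and formula below as placeholders.

import numpy as np

NUM_EXPERTS, KEEP = 256, 154   # per layer: keep 60%, prune 40%

def expert_saliency(gate_weights, expert_out_norms):
    # gate_weights:     (tokens, experts) router weights, 0 for unrouted experts
    # expert_out_norms: (tokens, experts) L2 norm of each expert's output per token
    # Stand-in saliency: mean router-weighted output norm over the calibration set.
    return (gate_weights * expert_out_norms).mean(axis=0)

def prune_layer(saliency, keep=KEEP):
    # Keep the highest-saliency experts, preserving their original ordering.
    return np.sort(np.argsort(saliency)[::-1][:keep])

# Toy calibration statistics for one layer (~8 of 256 experts active per token).
rng = np.random.default_rng(0)
gates = rng.random((4096, NUM_EXPERTS)) * (rng.random((4096, NUM_EXPERTS)) < 8 / NUM_EXPERTS)
norms = rng.random((4096, NUM_EXPERTS))
kept = prune_layer(expert_saliency(gates, norms))
print(len(kept), "of", NUM_EXPERTS, "experts kept")   # 154 of 256
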
Parameters            ~138B total, ~10B active per token
Routed experts kept   154 of 256 (60%)
Top-k active experts  8 per token
Layers                62
Bundle size           38 GB
Dtype                 bfloat16 activations
Attention             Standard Q/K/V + GQA 6:1, head_dim=128, rope_theta=5M
Context               196,608 tokens
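
Below is a minimal, self-contained sketch of the 2-bit routed-expert step, assuming a per-tensor codebook: Hadamard-rotate a weight matrix, fit a four-level Lloyd-Max codebook, store 2-bit indices, and undo the rotation on dequantization. Group sizes, code packing, and the fused MLX kernels of the real JANGTQ2 pipeline are not modelled; all names here are illustrative.

import numpy as np

def hadamard(n):
    # Sylvester-construction Hadamard matrix (n must be a power of two),
    # scaled so that H @ H.T == I, i.e. an orthonormal rotation.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_codebook(x, bits=2, iters=25):
    # Fit a 2**bits-level scalar codebook to x by Lloyd-Max iteration (MSE-optimal).
    levels = np.quantile(x, np.linspace(0.05, 0.95, 2 ** bits))
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)   # nearest level
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()                       # centroid update
    return levels

def quantize_expert(W, bits=2):
    # Rotate to spread outliers, quantize to codebook indices, then dequantize.
    H = hadamard(W.shape[1])
    W_rot = W @ H
    levels = lloyd_max_codebook(W_rot.ravel(), bits)
    codes = np.abs(W_rot[..., None] - levels).argmin(axis=-1).astype(np.uint8)  # 2-bit codes
    W_deq = levels[codes] @ H.T                                      # undo rotation at load time
    return codes, levels, W_deq

# Toy routed-expert weight matrix (last dim a power of two for this sketch).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)).astype(np.float32)
codes, codebook, W_hat = quantize_expert(W)
print("codebook:", np.round(codebook, 3), " MSE:", float(((W - W_hat) ** 2).mean()))
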

Use

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jangtq_model("OsaurusAI/MiniMax-M2.7-Small-JANGTQ")

messages = [{"role": "user", "content": "Write a Python function that…"}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Interleaved-thinking / always-reasoning. Use MiniMax's
# official sampling:  temp=1.0, top_p=0.95, top_k=40
out = generate(model, tokenizer, prompt=prompt, max_tokens=4096,
               sampler=make_sampler(temp=1.0, top_p=0.95, top_k=40))

Evaluation

HumanEval+ (code generation)

  • Dataset: evalplus/humanevalplus test split (164 prompts, harder tests than HumanEval).
  • Protocol: sampled pass@1 baseline + pass@5 retry on failures (a sketch of this grading loop appears at the end of this section).
  • Sampling for both pass@1 and pass@5 retry: temp=1.0, top_p=0.95, top_k=40 (MiniMax official); max_tokens=5000 on pass@1, 1200 on pass@5; k=5 samples per failed problem, early stop on first pass.
  • Grading: each candidate run with 20s subprocess timeout; must pass ALL EvalPlus tests.
  • Extractor: jang_tools.kimi_prune.bench_humaneval._extract_code (≥ 2026-04-24). The earlier extractor mis-paired markdown fences when the model emitted token-boundary glitches at the language tag (e.g. ```python一致: or ```pythonfr) and when the chat template prefilled a stray backtick at the prompt boundary, costing roughly nine points of pass@1.
Metric                               Score
pass@1 (sampled, temp=1.0)           81.10% (133/164)
pass@5 (sampled, retry of failures)  90.24% (148/164)

After the extractor fix, 30 of 46 originally-counted pass@1 failures resolve cleanly: 15 were correct answers eaten by fence-pairing, and another 15 recover under pass@5 sampling. The 16 residuals split into ~8 token-budget starvations (no_code_block), ~5 in-code 2-bit token-boundary glitches (return False言, Nonef, etc.), and ~3 genuine logic errors on EvalPlus hidden tests.
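
For reference, a minimal sketch of the grading protocol described above (20 s subprocess timeout, a candidate must pass all EvalPlus tests, k=5 retry with early stop on failures). The helpers sample_fn and extract_code are placeholders; the actual harness lives in jang_tools.kimi_prune.bench_humaneval.

import subprocess
import sys
import tempfile

def run_candidate(code: str, test_code: str, timeout: float = 20.0) -> bool:
    # A candidate passes only if the script (solution + ALL EvalPlus tests)
    # exits cleanly within the 20-second subprocess timeout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        return subprocess.run([sys.executable, path], capture_output=True,
                              timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        return False

def evaluate(problems, sample_fn, extract_code):
    # pass@1 baseline (max_tokens=5000), then up to 5 resamples per failure
    # (max_tokens=1200) with early stop on the first passing candidate.
    solved = {pid for pid, p in problems.items()
              if run_candidate(extract_code(sample_fn(p["prompt"], max_tokens=5000)), p["tests"])}
    retried = set(solved)
    for pid, p in problems.items():
        if pid in solved:
            continue
        for _ in range(5):
            if run_candidate(extract_code(sample_fn(p["prompt"], max_tokens=1200)), p["tests"]):
                retried.add(pid)
                break
    return len(solved) / len(problems), len(retried) / len(problems)
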

Variants

Variant              Prune  Size    HF
MiniMax-M2.7-Small   40%    38 GB   OsaurusAI/MiniMax-M2.7-Small-JANGTQ
MiniMax-M2.7-Med     25%    ~48 GB  OsaurusAI/MiniMax-M2.7-Med-JANGTQ (pending)
MiniMax-M2.7-Large   10%    ~57 GB  OsaurusAI/MiniMax-M2.7-Large-JANGTQ (pending)

Also released under JANGQ-AI/MiniMax-M2.7-*-JANGTQ.

Credits

Base model: MiniMax M2. Methodology: JANG toolchain — REAP saliency + JANGTQ codebook quantization. Served by: Osaurus — Apple-Silicon-native MLX inference.

License

Modified MIT — inherited from MiniMax M2.
