HumanEval+ (EvalPlus hidden tests, 164 Qs)
| pass@1 | pass@5 |
|---|---|
| 88.41% (145/164) | 95.12% (156/164) |
Sampled (temp=0.6, top_p=0.95) · max-tokens 5000/8000 · EvalPlus strict grading
Kimi-K2.6-Small-JANGTQ
This is a ~586B-A32B MoE that takes 153 GB on disk (down from Kimi K2.6's ~610 GB / ~1T-parameter base), produced by an aggressive 45% routed-expert prune plus 2-bit JANGTQ quantization, and runnable on a 256 GB Apple Silicon Mac.
- Source: moonshotai/Kimi-K2.6 (Moonshot AI's MLA/MoE flagship, INT4 pack-quantized base)
- Prune: 45% of routed experts removed via REAP saliency ranking, computed over our v3 calibration corpus (below)
- Quantization: JANGTQ2 — 2-bit MXTQ codebook (Hadamard-rotated) on routed-expert weights + 8-bit affine on attention / dense MLP / embed / lm_head + fp16 on norms & router
- Bundle size: 153 GB (down from ~610 GB source)
- Runs on: Mac Studio M3 Ultra 256 GB (recommended), or MacBook Pro M4 Max 128 GB+ with SSD swap
Variants
The Kimi K2.6 × JANGTQ line ships in three variants:
| Variant | Prune | Experts kept | Size | HF |
|---|---|---|---|---|
| Kimi-K2.6-Small (this card) | 45% | 211 of 384 | 153 GB | JANGQ-AI/Kimi-K2.6-Small-JANGTQ |
| Kimi-K2.6-Med | 35% | 250 of 384 | ~180 GB | JANGQ-AI/Kimi-K2.6-Med-JANGTQ (building) |
| Kimi-K2.6-Large | 25% | 288 of 384 | ~200 GB | JANGQ-AI/Kimi-K2.6-Large-JANGTQ (pending) |
Smaller % prune = more experts kept = bigger model = closer to the original Kimi K2.6.
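The expert counts in the table follow directly from the prune fraction; a quick check:

```python
# Experts kept per layer for each variant: round((1 - prune_fraction) * 384).
# Reproduces the "Experts kept" column above.
TOTAL_ROUTED_EXPERTS = 384

for name, prune in [("Small", 0.45), ("Med", 0.35), ("Large", 0.25)]:
    kept = round((1 - prune) * TOTAL_ROUTED_EXPERTS)
    print(f"Kimi-K2.6-{name}: keep {kept} of {TOTAL_ROUTED_EXPERTS} experts")
# Small: 211, Med: 250, Large: 288
```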
Pipeline
```
moonshotai/Kimi-K2.6 (INT4 pack-quantized, ~610 GB)
 │
 ├── v3 calibration corpus sampled across 8.6 M tokens
 │   (31,338 records: coding, agentic, general, academic_mc,
 │    science, Chinese, cybersec, long-context, systems)
 │
 ▼
REAP saliency observer
 S[L,e] = (1/count[L,e]) * Σ ||g · f||
 where g = router gate weight assigned to expert e (top-8 routing),
       f = expert output
 │
 ├── per-layer ranking → keep top 211 / 384 experts (55%)
 │
 ▼
Kimi-K2.6-REAP-45 (pruned FP8 source, ~315 GB)
 │
 ▼
JANGTQ2 quantization
 • 2-bit MXTQ (Multi-Codebook Turbo Quantization, Hadamard-rotated)
   on routed-expert weights
 • 8-bit affine (per-channel scale/bias) on attention, dense MLP,
   embed, lm_head
 • 16-bit on norms and router gate weights
 │
 ▼
Kimi-K2.6-Small-JANGTQ (this release, 153 GB)
```
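A minimal sketch of the saliency accumulator above, assuming access to the router's gate weights and per-expert outputs during calibration (illustrative only; the shapes and function names are assumptions, not the actual JANG observer):

```python
import numpy as np

NUM_LAYERS, NUM_EXPERTS = 61, 384

saliency = np.zeros((NUM_LAYERS, NUM_EXPERTS))
counts = np.zeros((NUM_LAYERS, NUM_EXPERTS))

def observe(layer, gate_weights, expert_ids, expert_outputs):
    """Accumulate ||g * f|| for each routed expert hit by one token.

    gate_weights   : (top_k,)          router gate values of the selected experts
    expert_ids     : (top_k,)          indices of the selected experts
    expert_outputs : (top_k, d_model)  outputs of those experts for this token
    """
    for g, e, f in zip(gate_weights, expert_ids, expert_outputs):
        saliency[layer, e] += np.linalg.norm(g * f)
        counts[layer, e] += 1

def keep_mask(keep_per_layer=211):
    """After calibration: mean saliency per (layer, expert), keep the top-k per layer."""
    mean_sal = saliency / np.maximum(counts, 1)
    kept = np.argsort(-mean_sal, axis=1)[:, :keep_per_layer]
    mask = np.zeros_like(mean_sal, dtype=bool)
    np.put_along_axis(mask, kept, True, axis=1)
    return mask
```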
Architecture
Kimi K2.6 inherits from Moonshot's DeepSeek-V3-family MLA stack:
| Component | Detail |
|---|---|
| Attention | Multi-head Latent Attention (MLA), dim-512 latent contraction |
| Routing | Sigmoid + bias-corrected router, top-8 of 384 routed experts + shared expert |
| Layers | 61 transformer blocks |
| Hidden size | 7,168 |
| Context | ≥256k |
| Original dtype | INT4 pack-quantized weights (compressed-tensors format) |
| Post-prune routed experts | 211 per layer (55% of 384) |
| Post-quant expert bits | 2-bit MXTQ |
| Activations dtype | bfloat16 (required — fp16 overflows for big MoE) |
The modeling code keeps model_type: kimi_k25 / architectures: KimiK25ForConditionalGeneration, unchanged from the Moonshot source, so the JANG loader + mlx-lm handle it via the DeepSeek V3 family adapter.
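As a rough illustration of what that adapter routing amounts to (the JANG loader's internals are not published; the alias table, module path, and load flow below are assumptions for intuition only):

```python
import importlib
import json

# Hypothetical sketch: route the kimi_k25 model_type onto mlx-lm's existing
# DeepSeek-V3 implementation, since Kimi K2.6 shares the MLA + sigmoid-routed
# MoE layout. The alias table and file path are assumptions.
MODEL_TYPE_ALIASES = {"kimi_k25": "deepseek_v3"}

with open("config.json") as f:
    cfg = json.load(f)

model_type = MODEL_TYPE_ALIASES.get(cfg["model_type"], cfg["model_type"])
arch = importlib.import_module(f"mlx_lm.models.{model_type}")

# Build the architecture; the quantized JANGTQ weights would then be loaded
# separately by the JANG loader.
model = arch.Model(arch.ModelArgs.from_dict(cfg))
```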
Calibration corpus (v3 mix)
REAP saliency was computed over a 31,338-sample stratified English+CJK mix (≈8.6 M tokens), bucketed to approximate a typical JANGQ-AI workload. Each dataset was sampled to the listed share of total tokens, filtered for length, and de-duplicated via Jaccard near-duplicate detection (a sketch of the de-dup step follows the table).
| Bucket | Share | Source datasets |
|---|---|---|
| Coding (22%) | 7% · 6% · 4% · 3% · 2% | ise-uiuc/Magicoder-OSS-Instruct-75K · nvidia/OpenCodeReasoning · m-a-p/CodeFeedback-Filtered-Instruction · HuggingFaceH4/CodeAlpaca_20K · iamtarun/python_code_instructions_18k_alpaca |
| Agentic / tool-use (19%) | 7% · 5% · 3% · 2% · 2% | NousResearch/hermes-function-calling-v1 · glaiveai/glaive-function-calling-v2 · lilacai/glaive-function-calling-v2-sharegpt · THUDM/AgentInstruct (os) · princeton-nlp/SWE-bench_oracle |
| General SFT (17%) | 7% · 4% · 3% · 3% | allenai/tulu-3-sft-mixture · open-thoughts/OpenThoughts-114k · teknium/OpenHermes-2.5 · HuggingFaceH4/ultrachat_200k |
| Academic / multiple-choice (11%) | 5% · 3% · 1% · 1% · 1% · 0.5% · 0.5% | cais/mmlu (all, auxiliary_train split — not the test split) · TIGER-Lab/MMLU-Pro · allenai/ai2_arc · allenai/openbookqa · allenai/sciq · tau/commonsense_qa · bigbio/med_qa |
| Science / math (10%) | 4% · 3% · 1.5% · 1.5% | AI-MO/NuminaMath-CoT · ccdv/arxiv-summarization · qiaojin/PubMedQA · camel-ai/physics |
| Chinese (9%) | 4% · 2.5% · 2.5% | silk-road/alpaca-data-gpt4-chinese · wangrui6/Zhihu-KOL · YeungNLP/firefly-train-1.1M |
| Cybersec (5%) | 3% · 2% | CyberNative/Code_Vulnerability_Security_DPO · Trendyol/cybersecurity-instruction-datasets |
| Long-context (3%) | 2% · 1% | emozilla/pg19 · ccdv/arxiv-summarization (long-doc subset) |
| Systems / SQL (3%) | 1.5% · 1.5% | b-mc2/sql-create-context · cognitivecomputations/dolphin-coder |
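A minimal sketch of the Jaccard near-duplicate filter mentioned above, over word shingles (illustrative; the 0.8 threshold and the brute-force pairwise comparison are assumptions, and a production pass would more likely use MinHash/LSH at this corpus size):

```python
def shingles(text, n=5):
    """Set of n-word shingles for a record."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def dedup(records, threshold=0.8):
    """Keep a record only if it is not a near-duplicate of anything kept so far.
    O(n^2) brute force, fine for a sketch."""
    kept, kept_shingles = [], []
    for rec in records:
        s = shingles(rec)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(rec)
            kept_shingles.append(s)
    return kept
```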
Training-data isolation: only cais/mmlu appears both in the calibration corpus (auxiliary_train split, for REAP saliency) and as an eval target (test split); the two splits are disjoint by dataset design. No HumanEval data (training or test) is in the calibration mix, so the eval is clean.
Evaluation
HumanEval+ (primary benchmark)
- Dataset: evalplus/humanevalplus (test split), the same 164 prompts as the original OpenAI HumanEval but with much harder hidden test cases from the EvalPlus project. Solutions that pass the original tests but miss edge cases (off-by-one, empty input, overflow, negative numbers, etc.) are caught here.
- Protocol: sampled pass@1 (seed=42) plus a pass@5 retry on failures (seeds 43–46, early-exiting once a question passes to avoid wasted compute).
- Sampling: temp=0.6, top_p=0.95 — Moonshot AI's recommended sampling for Kimi K2.6.
- Max tokens: 5000 (pass@1) · 8000 (pass@5 retry, extra room for harder Qs that exhausted the 5k budget on pass@1).
- Grading: each candidate is executed as a Python subprocess with a 20-second wall-clock timeout; passes only if ALL EvalPlus tests pass.
- Extractor: the final python-fenced code block containing def <entry_point>( is taken as the solution, which handles a reasoning preamble cleanly (sketched below).
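A hedged sketch of the extraction and grading loop described above (the actual harness and the EvalPlus test-code assembly are not shown; names below are illustrative):

```python
import re
import subprocess
import tempfile

def extract_solution(completion: str, entry_point: str):
    """Take the LAST ```python fenced block that defines the entry point,
    so earlier reasoning/preamble blocks are ignored."""
    blocks = re.findall(r"```python\n(.*?)```", completion, flags=re.DOTALL)
    for block in reversed(blocks):
        if f"def {entry_point}(" in block:
            return block
    return None

def run_candidate(code: str, test_code: str, timeout: float = 20.0) -> bool:
    """Execute candidate + EvalPlus tests in a subprocess; pass only if the
    process exits cleanly within the 20 s wall-clock budget."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```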
Results:
| Metric | Score | Notes |
|---|---|---|
| pass@1 | 88.41% (145 / 164) | Sampled, temp=0.6, seed=42. 1 no-code-block fail (overthink), 18 real code fails. |
| pass@5 | 95.12% (156 / 164) | 11 of 19 first-pass failures rescued by retries at seeds 43–46. 8 residual hard failures across all 5 seeds. |
| avg elapsed | ~16 s / question | M3 Ultra, 8-bit attention + 2-bit routed experts |
| avg generated | ~235 tokens / question | Kimi reasons concisely vs MiniMax |
Usage
```python
from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jangtq_model("JANGQ-AI/Kimi-K2.6-Small-JANGTQ")

messages = [{"role": "user", "content": "Write a Python function that..."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Moonshot's recommended sampling:
out = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=4096,
    sampler=make_sampler(temp=0.6, top_p=0.95),
)
```
In MLX Studio or vMLX HTTP server
Drop the bundle under ~/.mlxstudio/models/JANGQ-AI/Kimi-K2.6-Small-JANGTQ/ and it will be auto-detected. The JANG loader's auto-tuner sets the Metal wired_limit to ~70% of physical RAM and handles the MLA pre-warmup before first generation.
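Roughly, that auto-tune amounts to the following (a sketch only; the 70% figure is the loader heuristic stated above, and the Metal memory calls may live under mx.set_wired_limit in newer MLX releases):

```python
import mlx.core as mx

# Sketch: wire ~70% of physical unified memory for Metal so the expert
# weights stay resident during generation.
physical_bytes = mx.metal.device_info()["memory_size"]
wired_limit = int(0.70 * physical_bytes)
mx.metal.set_wired_limit(wired_limit)
print(f"Wired limit set to {wired_limit / 1e9:.0f} GB")
```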
JANGTQ format details
JANGTQ2 is our low-bit MoE format combining:
- MXTQ codebook quantization on routed-expert weights — 2-bit indices into a Hadamard-rotated codebook, with per-channel FP16 norms. Recovers most of the error vs plain RTN 2-bit.
- 8-bit affine per-channel on attention (Q/K/V/O), dense MLP, embedding, and lm_head — these see every forward pass and benefit from higher precision.
- 16-bit passthrough on RMSNorm weights and MoE router gates — tiny tensors, high sensitivity, no benefit from quantizing.
All of this plus the activation dtype (bfloat16 for 384-expert models to avoid fp16 overflow) is documented in the bundle's jang_config.json; the JANG loader reads this to reconstruct the appropriate TurboQuantLinear modules at load time.
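For intuition, a hedged sketch of what reconstructing the two weight classes could look like at load time (the real TurboQuantLinear kernels, multi-codebook layout, and Hadamard details are internal to the JANG toolchain; everything below is an assumption, collapsed to a single 4-entry codebook):

```python
import numpy as np

def dequantize_mxtq(indices, codebook, channel_scale, hadamard):
    """Illustrative 2-bit codebook reconstruction.

    indices       : (out, in)   uint8 codes in [0, 3] -- the 2-bit indices
    codebook      : (4,)        centroid values in the Hadamard-rotated domain
    channel_scale : (out, 1)    per-channel FP16 norms
    hadamard      : (in, in)    orthogonal rotation applied at quantization time
    """
    rotated = codebook[indices] * channel_scale   # codebook lookup + per-channel rescale
    return rotated @ hadamard.T                   # undo the Hadamard rotation

def dequantize_affine8(q, scale, bias):
    """8-bit affine per-channel: w = q * scale + bias (attention, dense MLP, embed, lm_head)."""
    return q.astype(np.float32) * scale + bias
```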
Credits
- Original model: Moonshot AI — Kimi K2.6.
- Calibration + prune + quant: JANGQ-AI (eric@jangq.ai) using the JANG toolchain: REAP saliency observer + JANGTQ2 codebook format.
- MLX kernels: Apple's MLX framework.
License
Modified MIT, inherited from Moonshot AI's Kimi K2.6 license. See LICENSE in the bundle. For commercial use, follow the original Kimi K2.6 license terms.