JANGQ-AI

HumanEval+ (EvalPlus hidden tests, 164 Qs)

| pass@1 | pass@5 |
| --- | --- |
| 88.41% (145/164) | 95.12% (156/164) |

Sampled (temp=0.6, top_p=0.95) · max-tokens 5000/8000 · EvalPlus strict grading

Kimi-K2.6-Small-JANGTQ

Kimi-K2.6-Small-JANGTQ is a ~586B-A32B MoE that takes 153 GB on disk (down from Kimi K2.6's ~610 GB / ~1T-parameter base). It was produced by an aggressive 45% routed-expert prune plus 2-bit JANGTQ quantization, and runs on a 256 GB Apple Silicon Mac.

  • Source: moonshotai/Kimi-K2.6 (Moonshot AI's MLA/MoE flagship, INT4 pack-quantized base)
  • Prune: 45% of routed experts removed via REAP saliency ranking, computed over our v3 calibration corpus (below)
  • Quantization: JANGTQ2 — 2-bit MXTQ codebook (Hadamard-rotated) on routed-expert weights + 8-bit affine on attention / dense MLP / embed / lm_head + fp16 on norms & router
  • Bundle size: 153 GB (down from ~610 GB source)
  • Runs on: Mac Studio M3 Ultra 256 GB (recommended), or MacBook Pro M4 Max 128 GB+ with SSD swap

Variants

The Kimi K2.6 × JANGTQ line ships in three variants:

| Variant | Prune | Experts kept | Size | HF |
| --- | --- | --- | --- | --- |
| Kimi-K2.6-Small (this card) | 45% | 211 of 384 | 153 GB | JANGQ-AI/Kimi-K2.6-Small-JANGTQ |
| Kimi-K2.6-Med | 35% | 250 of 384 | ~180 GB | JANGQ-AI/Kimi-K2.6-Med-JANGTQ (building) |
| Kimi-K2.6-Large | 25% | 288 of 384 | ~200 GB | JANGQ-AI/Kimi-K2.6-Large-JANGTQ (pending) |

Smaller % prune = more experts kept = bigger model = closer to the original Kimi K2.6.

Pipeline

moonshotai/Kimi-K2.6  (INT4 pack-quantized, ~610 GB)
    │
    ├── v3 calibration corpus sampled across 8.6 M tokens
    │   (31,338 records — coding, agentic, general, academic_mc,
    │    science, Chinese, cybersec, long-context, systems)
    │
    ▼
REAP saliency observer
    S[L,e] = (1/count[L,e]) * Σ ||g · f||
    where g = gradient of top-8 routing to expert e, f = expert output
    │
    ├── per-layer ranking → keep top 211 / 384 experts (55%)
    │
    ▼
Kimi-K2.6-REAP-45  (pruned FP8 source, ~315 GB)
    │
    ▼
JANGTQ2 quantization
    • 2-bit MXTQ (Multi-Codebook Turbo Quantization, Hadamard-rotated)
      on routed-expert weights
    • 8-bit affine (per-channel scale/bias) on attention, dense MLP,
      embed, lm_head
    • 16-bit on norms and router gate weights
    │
    ▼
Kimi-K2.6-Small-JANGTQ  (this release, 153 GB)
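
For concreteness, here is a minimal sketch of the saliency bookkeeping behind the REAP observer step above, written against a hypothetical per-token hook (the observe() callback, its argument names, and its shapes are illustrative, not the actual JANG toolchain API):

import numpy as np

NUM_LAYERS, NUM_EXPERTS, KEEP = 61, 384, 211   # Kimi K2.6 layout, 45% prune

# Accumulators for S[L, e] = (1 / count[L, e]) * sum ||g * f||
saliency_sum = np.zeros((NUM_LAYERS, NUM_EXPERTS))
count = np.zeros((NUM_LAYERS, NUM_EXPERTS))

def observe(layer, expert_ids, g, f):
    """Record one token's top-8 routing for one layer.

    expert_ids: (8,) indices of the routed experts
    g:          (8,) routing coefficients, as in the formula above
    f:          (8, hidden) the corresponding expert outputs
    """
    for e, g_e, f_e in zip(expert_ids, g, f):
        saliency_sum[layer, e] += np.linalg.norm(g_e * f_e)
        count[layer, e] += 1

def keep_mask():
    """Per-layer ranking: keep the 211 highest-saliency experts out of 384."""
    S = saliency_sum / np.maximum(count, 1)
    mask = np.zeros_like(S, dtype=bool)
    for layer in range(NUM_LAYERS):
        mask[layer, np.argsort(S[layer])[-KEEP:]] = True
    return mask

Experts the calibration corpus never routes to keep a saliency of zero and are the first to go; the resulting per-layer mask is what produces the pruned intermediate before quantization.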

Architecture

Kimi K2.6 inherits from Moonshot's DeepSeek-V3-family MLA stack:

| Component | Detail |
| --- | --- |
| Attention | Multi-head Latent Attention (MLA), dim-512 latent contraction |
| Routing | Sigmoid + bias-corrected router, top-8 of 384 routed experts + shared expert |
| Layers | 61 transformer blocks |
| Hidden size | 7,168 |
| Context | ≥256k |
| Original dtype | INT4 pack-quantized weights (compressed-tensors format) |
| Post-prune routed experts | 211 per layer (55% of 384) |
| Post-quant expert bits | 2-bit MXTQ |
| Activations dtype | bfloat16 (required: fp16 overflows for big MoE) |

The modeling code keeps model_type: kimi_k25 / architectures: KimiK25ForConditionalGeneration — unchanged from the Moonshot source, so the JANG loader + mlx-lm handle it via the DeepSeek V3 family adapter.
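
As a rough illustration of the sigmoid + bias-corrected routing in the table above (DeepSeek-V3-family style, where a load-balancing bias influences which experts are selected but not their mixing weights), here is a single-token sketch; the tensor names are ours, not Moonshot's:

import numpy as np

def route_top8(hidden, gate_w, gate_bias, k=8):
    """Pick k routed experts for one token and return their mixing weights.

    hidden:    (d,) token hidden state
    gate_w:    (num_experts, d) router projection
    gate_bias: (num_experts,) bias-correction term, used for selection only
    """
    scores = 1.0 / (1.0 + np.exp(-(gate_w @ hidden)))   # sigmoid affinities
    chosen = np.argsort(scores + gate_bias)[-k:]         # bias-corrected top-8 selection
    weights = scores[chosen] / scores[chosen].sum()      # mix with the raw (unbiased) scores
    return chosen, weights

The shared expert runs on every token in addition to the eight selected here, and after pruning the selection is over the 211 surviving experts per layer.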

Calibration corpus (v3 mix)

REAP saliency was computed over a 31,338-sample stratified English+CJK mix (≈8.6 M tokens), bucketed to approximate a typical JANGQ-AI workload. Each dataset was sampled to the listed share of total tokens, filtered for length, and de-duplicated via Jaccard near-dup detection.

| Bucket | Share | Source datasets (share of total tokens) |
| --- | --- | --- |
| Coding | 22% | ise-uiuc/Magicoder-OSS-Instruct-75K (7%) · nvidia/OpenCodeReasoning (6%) · m-a-p/CodeFeedback-Filtered-Instruction (4%) · HuggingFaceH4/CodeAlpaca_20K (3%) · iamtarun/python_code_instructions_18k_alpaca (2%) |
| Agentic / tool-use | 19% | NousResearch/hermes-function-calling-v1 (7%) · glaiveai/glaive-function-calling-v2 (5%) · lilacai/glaive-function-calling-v2-sharegpt (3%) · THUDM/AgentInstruct (os) (2%) · princeton-nlp/SWE-bench_oracle (2%) |
| General SFT | 17% | allenai/tulu-3-sft-mixture (7%) · open-thoughts/OpenThoughts-114k (4%) · teknium/OpenHermes-2.5 (3%) · HuggingFaceH4/ultrachat_200k (3%) |
| Academic / multiple-choice | 11% | cais/mmlu, all subjects, auxiliary_train split only, not the test split (5%) · TIGER-Lab/MMLU-Pro (3%) · allenai/ai2_arc (1%) · allenai/openbookqa (1%) · allenai/sciq (1%) · tau/commonsense_qa (0.5%) · bigbio/med_qa (0.5%) |
| Science / math | 10% | AI-MO/NuminaMath-CoT (4%) · ccdv/arxiv-summarization (3%) · qiaojin/PubMedQA (1.5%) · camel-ai/physics (1.5%) |
| Chinese | 9% | silk-road/alpaca-data-gpt4-chinese (4%) · wangrui6/Zhihu-KOL (2.5%) · YeungNLP/firefly-train-1.1M (2.5%) |
| Cybersec | 5% | CyberNative/Code_Vulnerability_Security_DPO (3%) · Trendyol/cybersecurity-instruction-datasets (2%) |
| Long-context | 3% | emozilla/pg19 (2%) · ccdv/arxiv-summarization (long-doc subset) (1%) |
| Systems / SQL | 3% | b-mc2/sql-create-context (1.5%) · cognitivecomputations/dolphin-coder (1.5%) |

Training-data isolation: only cais/mmlu appears both in the calibration corpus (auxiliary_train split, for REAP saliency) and as an eval target (test split). The two are disjoint per dataset design. No HumanEval data (training or test) is in the calibration mix — eval is clean.
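
The Jaccard near-dup filter mentioned above is easy to reproduce in spirit; this is a minimal word-shingle version (the shingle size and the 0.8 threshold are illustrative assumptions, not the exact v3 settings):

def shingles(text, n=5):
    """Lower-cased word n-grams used as the set for Jaccard comparison."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dedup(texts, threshold=0.8):
    """Keep a record only if it is not a near-duplicate of anything already kept."""
    kept, kept_sets = [], []
    for t in texts:
        s = shingles(t)
        if all(jaccard(s, prev) < threshold for prev in kept_sets):
            kept.append(t)
            kept_sets.append(s)
    return kept

In practice a MinHash or LSH index replaces this quadratic scan as the corpus grows.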

Evaluation

HumanEval+ (primary benchmark)

  • Dataset: evalplus/humanevalplus test split — same 164 prompts as the original OpenAI HumanEval but with much harder hidden test cases from the EvalPlus project. Solutions that pass the original tests but miss edge cases (off-by-one, empty input, overflow, negative numbers, etc.) are caught here.
  • Protocol: sampled pass@1 (seed=42) + pass@5 retry on fails (seeds 43–46, early-exit once a Q passes so we don't waste compute).
  • Sampling: temp=0.6, top_p=0.95 — Moonshot AI's recommended sampling for Kimi K2.6.
  • Max tokens: 5000 (pass@1) · 8000 (pass@5 retry, extra room for harder Qs that exhausted the 5k budget on pass@1).
  • Grading: each candidate is executed as a Python subprocess with a 20-second wall-clock timeout; passes only if ALL EvalPlus tests pass.
  • Extractor: the final python-fenced code block containing def <entry_point> (handles the reasoning preamble cleanly); see the sketch below.
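
A sketch of the extractor and subprocess grader described in the bullets above; the regex and harness wiring are illustrative (the real run drives the EvalPlus test suites), but the 20-second timeout and the take-the-final-block rule match the protocol:

import os
import re
import subprocess
import tempfile

def extract_solution(completion, entry_point):
    """Return the last ```python fenced block that defines the entry point, else None."""
    blocks = re.findall(r"```python\s*\n(.*?)```", completion, flags=re.S)
    for code in reversed(blocks):
        if f"def {entry_point}" in code:
            return code
    return None

def grade(candidate_code, evalplus_test_code, timeout=20.0):
    """Run candidate + hidden tests in a fresh subprocess; pass only if everything passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + evalplus_test_code)
        path = f.name
    try:
        done = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return done.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

The pass@5 pass then simply re-generates at seeds 43–46 for any question whose seed-42 candidate fails this grader, stopping at the first pass.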

Results:

| Metric | Score | Notes |
| --- | --- | --- |
| pass@1 | 88.41% (145 / 164) | Sampled, temp=0.6, seed=42. 1 no-code-block fail (overthink), 18 real code fails. |
| pass@5 | 95.12% (156 / 164) | 11 of 19 first-pass failures rescued by retries at seeds 43–46. 8 residual hard failures across all 5 seeds. |
| avg elapsed | ~16 s / question | M3 Ultra, 8-bit attention + 2-bit routed experts |
| avg generated | ~235 tokens / question | Kimi reasons concisely vs MiniMax |

Usage

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jangtq_model("JANGQ-AI/Kimi-K2.6-Small-JANGTQ")

messages = [{"role": "user", "content": "Write a Python function that..."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Moonshot's recommended sampling:
out = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=4096,
    sampler=make_sampler(temp=0.6, top_p=0.95),
)

In MLX Studio or vMLX HTTP server

Drop the bundle under ~/.mlxstudio/models/JANGQ-AI/Kimi-K2.6-Small-JANGTQ/ and it will be auto-detected. The JANG loader's auto-tuner sets Metal wired_limit to ~70% of physical RAM and handles the MLA pre-warmup before first generation.
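
If you prefer to fetch the bundle programmatically rather than clone it, a standard huggingface_hub snapshot download into that directory works (the path below assumes MLX Studio's default layout from above; expect a ~153 GB transfer):

from pathlib import Path
from huggingface_hub import snapshot_download

# Download the full bundle into MLX Studio's auto-detected models directory
target = Path.home() / ".mlxstudio" / "models" / "JANGQ-AI" / "Kimi-K2.6-Small-JANGTQ"
snapshot_download(repo_id="JANGQ-AI/Kimi-K2.6-Small-JANGTQ", local_dir=target)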

JANGTQ format details

JANGTQ2 is our low-bit MoE format combining:

  1. MXTQ codebook quantization on routed-expert weights — 2-bit indices into a Hadamard-rotated codebook, with per-channel FP16 norms. Removes most of the quantization error relative to plain RTN 2-bit.
  2. 8-bit affine per-channel on attention (Q/K/V/O), dense MLP, embedding, and lm_head — these see every forward pass and benefit from higher precision.
  3. 16-bit passthrough on RMSNorm weights and MoE router gates — tiny tensors, high sensitivity, no benefit from quantizing.

All of this plus the activation dtype (bfloat16 for 384-expert models to avoid fp16 overflow) is documented in the bundle's jang_config.json — the JANG loader reads this to reconstruct the appropriate TurboQuantLinear modules at load time.
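
To make item 1 concrete, here is a toy single-codebook version of the idea: Hadamard-rotate each row, then snap every value to a 4-entry (2-bit) codebook with a per-row FP16 scale. The codebook values, the scaling rule, and the use of one codebook instead of MXTQ's multiple codebooks are all simplifications, not the shipped format:

import numpy as np
from scipy.linalg import hadamard

CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5])   # 4 entries -> 2-bit indices

def quantize_2bit(W):
    """Rotate rows of W with an orthonormal Hadamard, then snap to CODEBOOK indices.

    W: (rows, d) weight matrix, d a power of two.
    Returns uint8 indices, per-row FP16 scales, and d (to rebuild the rotation).
    """
    d = W.shape[1]
    H = hadamard(d) / np.sqrt(d)                 # orthonormal, so H.T undoes it
    Wr = W @ H                                   # rotation spreads outliers across the row
    scale = (np.abs(Wr).max(axis=1, keepdims=True) / CODEBOOK.max()).astype(np.float16)
    idx = np.argmin(np.abs(Wr[..., None] / scale[..., None] - CODEBOOK), axis=-1)
    return idx.astype(np.uint8), scale, d

def dequantize_2bit(idx, scale, d):
    """Reconstruct an approximation of W from codebook indices and per-row scales."""
    H = hadamard(d) / np.sqrt(d)
    return (CODEBOOK[idx] * scale) @ H.T         # undo the rotation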

Credits

  • Original model: Moonshot AI — Kimi K2.6.
  • Calibration + prune + quant: JANGQ-AI (eric@jangq.ai) using the JANG toolchain — REAP saliency observer + JANGTQ2 codebook format.
  • MLX kernels: Apple's MLX framework.

License

Modified MIT — inherited from Moonshot AI's Kimi K2.6 license. See LICENSE in the bundle. For commercial use, follow the original Kimi K2.6 license terms.
