JANGQ-AI

HumanEval+ (EvalPlus hidden tests, 164 Qs)

| pass@1 | pass@5 |
| --- | --- |
| 88.41% (145/164) | 95.12% (156/164) |

Sampled (temp=0.6, top_p=0.95) · max-tokens 5000/8000 · EvalPlus strict grading

Kimi-K2.6-Small-JANGTQ

Kimi-K2.6-Small-JANGTQ is a ~586B-A32B MoE that takes 153 GB on disk (down from Kimi K2.6's ~610 GB / ~1T-parameter base). It was produced by an aggressive 45% routed-expert prune plus 2-bit JANGTQ quantization, and runs on a 256 GB Apple Silicon Mac.

  • Source: moonshotai/Kimi-K2.6 (Moonshot AI's MLA/MoE flagship, INT4 pack-quantized base)
  • Prune: 45% of routed experts removed via REAP saliency ranking, computed over our v3 calibration corpus (below)
  • Quantization: JANGTQ2 — 2-bit MXTQ codebook (Hadamard-rotated) on routed-expert weights + 8-bit affine on attention / dense MLP / embed / lm_head + fp16 on norms & router
  • Bundle size: 153 GB (down from ~610 GB source)
  • Runs on: Mac Studio M3 Ultra 256 GB (recommended), or MacBook Pro M4 Max 128 GB+ with SSD swap

Variants

The Kimi K2.6 × JANGTQ line ships in three variants:

| Variant | Prune | Experts kept | Size | HF |
| --- | --- | --- | --- | --- |
| Kimi-K2.6-Small (this card) | 45% | 211 of 384 | 153 GB | JANGQ-AI/Kimi-K2.6-Small-JANGTQ |
| Kimi-K2.6-Med | 35% | 250 of 384 | ~180 GB | JANGQ-AI/Kimi-K2.6-Med-JANGTQ (building) |
| Kimi-K2.6-Large | 25% | 288 of 384 | ~200 GB | JANGQ-AI/Kimi-K2.6-Large-JANGTQ (pending) |

Smaller % prune = more experts kept = bigger model = closer to the original Kimi K2.6.

Pipeline

moonshotai/Kimi-K2.6  (INT4 pack-quantized, ~610 GB)
    │
    ├── v3 calibration corpus sampled across 8.6 M tokens
    │   (31,338 records — coding, agentic, general, academic_mc,
    │    science, Chinese, cybersec, long-context, systems)
    │
    ▼
REAP saliency observer
    S[L,e] = (1/count[L,e]) * Σ ||g · f||
    where g = gradient of top-8 routing to expert e, f = expert output
    │
    ├── per-layer ranking → keep top 211 / 384 experts (55%)
    │
    ▼
Kimi-K2.6-REAP-45  (pruned FP8 source, ~315 GB)
    │
    ▼
JANGTQ2 quantization
    • 2-bit MXTQ (Multi-Codebook Turbo Quantization, Hadamard-rotated)
      on routed-expert weights
    • 8-bit affine (per-channel scale/bias) on attention, dense MLP,
      embed, lm_head
    • 16-bit on norms and router gate weights
    │
    ▼
Kimi-K2.6-Small-JANGTQ  (this release, 153 GB)
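
For concreteness, here is a minimal sketch of the saliency bookkeeping behind the REAP observer step above, written against a hypothetical per-token hook (the observe() callback, its argument names, and its shapes are illustrative, not the actual JANG toolchain API):

import numpy as np

NUM_LAYERS, NUM_EXPERTS, KEEP = 61, 384, 211   # Kimi K2.6 layout, 45% prune

# Accumulators for S[L, e] = (1 / count[L, e]) * sum ||g * f||
saliency_sum = np.zeros((NUM_LAYERS, NUM_EXPERTS))
count = np.zeros((NUM_LAYERS, NUM_EXPERTS))

def observe(layer, expert_ids, g, f):
    """Record one token's top-8 routing for one layer.

    expert_ids: (8,) indices of the routed experts
    g:          (8,) routing coefficients, as in the formula above
    f:          (8, hidden) the corresponding expert outputs
    """
    for e, g_e, f_e in zip(expert_ids, g, f):
        saliency_sum[layer, e] += np.linalg.norm(g_e * f_e)
        count[layer, e] += 1

def keep_mask():
    """Per-layer ranking: keep the 211 highest-saliency experts out of 384."""
    S = saliency_sum / np.maximum(count, 1)
    mask = np.zeros_like(S, dtype=bool)
    for layer in range(NUM_LAYERS):
        mask[layer, np.argsort(S[layer])[-KEEP:]] = True
    return mask

Experts the calibration corpus never routes to keep a saliency of zero and are the first to go; the resulting per-layer mask is what produces the pruned intermediate before quantization.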

Architecture

Kimi K2.6 inherits from Moonshot's DeepSeek-V3-family MLA stack:

| Component | Detail |
| --- | --- |
| Attention | Multi-head Latent Attention (MLA), dim-512 latent contraction |
| Routing | Sigmoid + bias-corrected router, top-8 of 384 routed experts + shared expert |
| Layers | 61 transformer blocks |
| Hidden size | 7,168 |
| Context | ≥256k |
| Original dtype | INT4 pack-quantized weights (compressed-tensors format) |
| Post-prune routed experts | 211 per layer (55% of 384) |
| Post-quant expert bits | 2-bit MXTQ |
| Activations dtype | bfloat16 (required: fp16 overflows for big MoE) |

The modeling code keeps model_type: kimi_k25 / architectures: KimiK25ForConditionalGeneration — unchanged from the Moonshot source, so the JANG loader + mlx-lm handle it via the DeepSeek V3 family adapter.
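
As a rough illustration of the sigmoid + bias-corrected routing in the table above (DeepSeek-V3-family style, where a load-balancing bias influences which experts are selected but not their mixing weights), here is a single-token sketch; the tensor names are ours, not Moonshot's:

import numpy as np

def route_top8(hidden, gate_w, gate_bias, k=8):
    """Pick k routed experts for one token and return their mixing weights.

    hidden:    (d,) token hidden state
    gate_w:    (num_experts, d) router projection
    gate_bias: (num_experts,) bias-correction term, used for selection only
    """
    scores = 1.0 / (1.0 + np.exp(-(gate_w @ hidden)))   # sigmoid affinities
    chosen = np.argsort(scores + gate_bias)[-k:]         # bias-corrected top-8 selection
    weights = scores[chosen] / scores[chosen].sum()      # mix with the raw (unbiased) scores
    return chosen, weights

The shared expert runs on every token in addition to the eight selected here, and after pruning the selection is over the 211 surviving experts per layer.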

Calibration corpus (v3 mix)

REAP saliency was computed over a 31,338-sample stratified English+CJK mix (≈8.6 M tokens), bucketed to approximate a typical JANGQ-AI workload. Each dataset was sampled to the listed share of total tokens, filtered for length, and de-duplicated via Jaccard near-dup detection.

| Bucket | Share | Source datasets (share of total tokens) |
| --- | --- | --- |
| Coding | 22% | ise-uiuc/Magicoder-OSS-Instruct-75K (7%) · nvidia/OpenCodeReasoning (6%) · m-a-p/CodeFeedback-Filtered-Instruction (4%) · HuggingFaceH4/CodeAlpaca_20K (3%) · iamtarun/python_code_instructions_18k_alpaca (2%) |
| Agentic / tool-use | 19% | NousResearch/hermes-function-calling-v1 (7%) · glaiveai/glaive-function-calling-v2 (5%) · lilacai/glaive-function-calling-v2-sharegpt (3%) · THUDM/AgentInstruct (os) (2%) · princeton-nlp/SWE-bench_oracle (2%) |
| General SFT | 17% | allenai/tulu-3-sft-mixture (7%) · open-thoughts/OpenThoughts-114k (4%) · teknium/OpenHermes-2.5 (3%) · HuggingFaceH4/ultrachat_200k (3%) |
| Academic / multiple-choice | 11% | cais/mmlu, all subjects, auxiliary_train split only, not the test split (5%) · TIGER-Lab/MMLU-Pro (3%) · allenai/ai2_arc (1%) · allenai/openbookqa (1%) · allenai/sciq (1%) · tau/commonsense_qa (0.5%) · bigbio/med_qa (0.5%) |
| Science / math | 10% | AI-MO/NuminaMath-CoT (4%) · ccdv/arxiv-summarization (3%) · qiaojin/PubMedQA (1.5%) · camel-ai/physics (1.5%) |
| Chinese | 9% | silk-road/alpaca-data-gpt4-chinese (4%) · wangrui6/Zhihu-KOL (2.5%) · YeungNLP/firefly-train-1.1M (2.5%) |
| Cybersec | 5% | CyberNative/Code_Vulnerability_Security_DPO (3%) · Trendyol/cybersecurity-instruction-datasets (2%) |
| Long-context | 3% | emozilla/pg19 (2%) · ccdv/arxiv-summarization (long-doc subset) (1%) |
| Systems / SQL | 3% | b-mc2/sql-create-context (1.5%) · cognitivecomputations/dolphin-coder (1.5%) |

Training-data isolation: only cais/mmlu appears both in the calibration corpus (auxiliary_train split, for REAP saliency) and as an eval target (test split). The two are disjoint per dataset design. No HumanEval data (training or test) is in the calibration mix — eval is clean.
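
The Jaccard near-dup filter mentioned above is easy to reproduce in spirit; this is a minimal word-shingle version (the shingle size and the 0.8 threshold are illustrative assumptions, not the exact v3 settings):

def shingles(text, n=5):
    """Lower-cased word n-grams used as the set for Jaccard comparison."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dedup(texts, threshold=0.8):
    """Keep a record only if it is not a near-duplicate of anything already kept."""
    kept, kept_sets = [], []
    for t in texts:
        s = shingles(t)
        if all(jaccard(s, prev) < threshold for prev in kept_sets):
            kept.append(t)
            kept_sets.append(s)
    return kept

In practice a MinHash or LSH index replaces this quadratic scan as the corpus grows.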

Evaluation

HumanEval+ (primary benchmark)

  • Dataset: evalplus/humanevalplus test split — same 164 prompts as the original OpenAI HumanEval but with much harder hidden test cases from the EvalPlus project. Solutions that pass the original tests but miss edge cases (off-by-one, empty input, overflow, negative numbers, etc.) are caught here.
  • Protocol: sampled pass@1 (seed=42) + pass@5 retry on fails (seeds 43–46, early-exit once a Q passes so we don't waste compute).
  • Sampling: temp=0.6, top_p=0.95 — Moonshot AI's recommended sampling for Kimi K2.6.
  • Max tokens: 5000 (pass@1) · 8000 (pass@5 retry, extra room for harder Qs that exhausted the 5k budget on pass@1).
  • Grading: each candidate is executed as a Python subprocess with a 20-second wall-clock timeout; passes only if ALL EvalPlus tests pass.
  • Extractor: the final python-fenced code block containing def <entry_point> (handles the reasoning preamble cleanly); see the sketch below.
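
A sketch of the extractor and subprocess grader described in the bullets above; the regex and harness wiring are illustrative (the real run drives the EvalPlus test suites), but the 20-second timeout and the take-the-final-block rule match the protocol:

import os
import re
import subprocess
import tempfile

def extract_solution(completion, entry_point):
    """Return the last ```python fenced block that defines the entry point, else None."""
    blocks = re.findall(r"```python\s*\n(.*?)```", completion, flags=re.S)
    for code in reversed(blocks):
        if f"def {entry_point}" in code:
            return code
    return None

def grade(candidate_code, evalplus_test_code, timeout=20.0):
    """Run candidate + hidden tests in a fresh subprocess; pass only if everything passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + evalplus_test_code)
        path = f.name
    try:
        done = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return done.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

The pass@5 pass then simply re-generates at seeds 43–46 for any question whose seed-42 candidate fails this grader, stopping at the first pass.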

Results:

| Metric | Score | Notes |
| --- | --- | --- |
| pass@1 | 88.41% (145 / 164) | Sampled, temp=0.6, seed=42. 1 no-code-block fail (overthink), 18 real code fails. |
| pass@5 | 95.12% (156 / 164) | 11 of 19 first-pass failures rescued by retries at seeds 43–46. 8 residual hard failures across all 5 seeds. |
| avg elapsed | ~16 s / question | M3 Ultra, 8-bit attention + 2-bit routed experts |
| avg generated | ~235 tokens / question | Kimi reasons concisely vs MiniMax |

Usage

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_jangtq_model("JANGQ-AI/Kimi-K2.6-Small-JANGTQ")

messages = [{"role": "user", "content": "Write a Python function that..."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Moonshot's recommended sampling:
out = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=4096,
    sampler=make_sampler(temp=0.6, top_p=0.95),
)

In MLX Studio or vMLX HTTP server

Drop the bundle under ~/.mlxstudio/models/JANGQ-AI/Kimi-K2.6-Small-JANGTQ/ and it will be auto-detected. The JANG loader's auto-tuner sets Metal wired_limit to ~70% of physical RAM and handles the MLA pre-warmup before first generation.
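
If you prefer to fetch the bundle programmatically rather than clone it, a standard huggingface_hub snapshot download into that directory works (the path below assumes MLX Studio's default layout from above; expect a ~153 GB transfer):

from pathlib import Path
from huggingface_hub import snapshot_download

# Download the full bundle into MLX Studio's auto-detected models directory
target = Path.home() / ".mlxstudio" / "models" / "JANGQ-AI" / "Kimi-K2.6-Small-JANGTQ"
snapshot_download(repo_id="JANGQ-AI/Kimi-K2.6-Small-JANGTQ", local_dir=target)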

JANGTQ format details

JANGTQ2 is our low-bit MoE format combining:

  1. MXTQ codebook quantization on routed-expert weights — 2-bit indices into a Hadamard-rotated codebook, with per-channel FP16 norms. Removes most of the quantization error relative to plain RTN 2-bit.
  2. 8-bit affine per-channel on attention (Q/K/V/O), dense MLP, embedding, and lm_head — these see every forward pass and benefit from higher precision.
  3. 16-bit passthrough on RMSNorm weights and MoE router gates — tiny tensors, high sensitivity, no benefit from quantizing.

All of this plus the activation dtype (bfloat16 for 384-expert models to avoid fp16 overflow) is documented in the bundle's jang_config.json — the JANG loader reads this to reconstruct the appropriate TurboQuantLinear modules at load time.
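
To make item 1 concrete, here is a toy single-codebook version of the idea: Hadamard-rotate each row, then snap every value to a 4-entry (2-bit) codebook with a per-row FP16 scale. The codebook values, the scaling rule, and the use of one codebook instead of MXTQ's multiple codebooks are all simplifications, not the shipped format:

import numpy as np
from scipy.linalg import hadamard

CODEBOOK = np.array([-1.5, -0.5, 0.5, 1.5])   # 4 entries -> 2-bit indices

def quantize_2bit(W):
    """Rotate rows of W with an orthonormal Hadamard, then snap to CODEBOOK indices.

    W: (rows, d) weight matrix, d a power of two.
    Returns uint8 indices, per-row FP16 scales, and d (to rebuild the rotation).
    """
    d = W.shape[1]
    H = hadamard(d) / np.sqrt(d)                 # orthonormal, so H.T undoes it
    Wr = W @ H                                   # rotation spreads outliers across the row
    scale = (np.abs(Wr).max(axis=1, keepdims=True) / CODEBOOK.max()).astype(np.float16)
    idx = np.argmin(np.abs(Wr[..., None] / scale[..., None] - CODEBOOK), axis=-1)
    return idx.astype(np.uint8), scale, d

def dequantize_2bit(idx, scale, d):
    """Reconstruct an approximation of W from codebook indices and per-row scales."""
    H = hadamard(d) / np.sqrt(d)
    return (CODEBOOK[idx] * scale) @ H.T         # undo the rotation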

Credits

  • Original model: Moonshot AI — Kimi K2.6.
  • Calibration + prune + quant: JANGQ-AI (eric@jangq.ai) using the JANG toolchain — REAP saliency observer + JANGTQ2 codebook format.
  • MLX kernels: Apple's MLX framework.

License

Modified MIT — inherited from Moonshot AI's Kimi K2.6 license. See LICENSE in the bundle. For commercial use, follow the original Kimi K2.6 license terms.
