⚠️ Important — config metadata fix (2026-04-24)
Earlier versions of this bundle shipped with an incorrect `config.json["quantization"]`: top-level `bits=2` only, with no per-module overrides and no `mode` field. MLX's loader therefore applied the 2-bit dequant kernel to weights stored as 8-bit affine (attention / embed / head / shared experts / compressor / indexer), producing silent garbage on long prompts. Outputs looked plausible on tiny prompts ("count to five"), but coding/reasoning tasks scored far below the model's true capability. The fix is metadata-only; weights on disk are unchanged.

`config.json` now ships with `bits=8 group_size=32 mode=affine` at the top level plus per-module overrides matching the actual on-disk bit-widths (matching reference quants like `Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine` and `mlx-community/DeepSeek-V4-Flash-4bit`).

If you downloaded this bundle before 2026-04-24, redownload `config.json`; nothing else changed. No runtime, decode, or weight changes are required in any consumer (Python `load_jangtq_model`, mlx-lm, Swift vMLX, etc.).

Verification: `pass@1` on the first 12 HumanEval problems jumped 42% → 67% from patching only the config. The packing identity was verified across all 522 (JANGTQ) / 129+ (JANGTQ4) / 34,314 (JANG_2L) quantized layers; every layer satisfies `32 × weight.packed_dim = bits × n_groups × group_size`.

Background: `research/JANGTQ-CONFIG-METADATA-BUG-2026-04-24.md` in the JANGQ-AI/jang source repo (private).
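A minimal sketch of that per-layer identity check, assuming MLX-style packing (each quantized module stores a packed-`uint32` `*.weight` plus a per-group `*.scales` tensor) and the usual per-module override layout in `config.json["quantization"]`; the shard filename is illustrative:

```python
# Sketch: verify 32 * weight.packed_dim == bits * n_groups * group_size on one shard.
import json
import mlx.core as mx

cfg = json.load(open("DSV4-Flash-JANGTQ/config.json"))
quant = cfg["quantization"]  # top-level bits/group_size/mode + per-module overrides

tensors = mx.load("DSV4-Flash-JANGTQ/model-00001-of-00080.safetensors")
for name, scales in tensors.items():
    if not name.endswith(".scales"):
        continue  # only quantized tensors carry a .scales sidecar
    module = name[: -len(".scales")]
    override = quant.get(module, {})
    if not isinstance(override, dict):
        override = {}  # e.g. a `false` override means "not quantized"
    bits = override.get("bits", quant["bits"])
    group_size = override.get("group_size", quant["group_size"])
    weight = tensors[module + ".weight"]  # packed uint32, last dim = packed_dim
    assert 32 * weight.shape[-1] == bits * scales.shape[-1] * group_size, module
```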
DeepSeek-V4-Flash JANGTQ
DeepSeek V4 Flash 284B/13B-active MoE — 2-bit codebook + Hadamard, 79.5 GB
The smallest, highest-quality DSV4-Flash on Apple Silicon.
⚠️ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles the DSV4 chat template (thinking + tool calling), and uses the custom Metal kernels this model needs. Stock `mlx_lm.load()` will NOT load this model; see the usage instructions below.
Follow development on Twitter: @jangq_ai
What is JANGTQ?
JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG
quantization format. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the matmul
path uses custom Metal kernels that read packed uint32 weights, look up
centroids in a 4-entry codebook (Lloyd-Max optimized), and accumulate dot
products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).
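The arithmetic on that hot path looks roughly like the sketch below. This is an illustrative NumPy version, not the Metal kernel: the little-endian 2-bit unpacking order, the one-scale-per-group layout, and the sign-flip-style rotation of the input are all assumptions.

```python
# Illustrative NumPy sketch of the JANGTQ matmul path (not the Metal kernel).
# Assumes: 2-bit indices packed little-endian into uint32 (16 weights per word),
# a 4-entry codebook, one scale per group of 32 weights, and an input that has
# already been Hadamard-rotated once before the dot product.
import numpy as np

def jangtq_row_dot(x_rot, packed_row, codebook, scales, group_size=32):
    """Dot product of a rotated input with one codebook-quantized weight row."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    idx = (packed_row[:, None] >> shifts) & 0b11        # (n_words, 16) 2-bit indices
    w = codebook[idx.reshape(-1)]                        # centroid lookup
    w = w.reshape(-1, group_size) * scales[:, None]      # per-group scale
    return float(x_rot @ w.reshape(-1))

# Tiny usage example: 128 weights = 8 packed uint32 words = 4 groups of 32.
rng = np.random.default_rng(0)
packed = rng.integers(0, 2**32, size=8, dtype=np.uint32)
codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)  # Lloyd-Max-style centroids
scales = (rng.standard_normal(4).astype(np.float32)) ** 2
x_rot = rng.standard_normal(128).astype(np.float32)            # input rotated once
print(jangtq_row_dot(x_rot, packed, codebook, scales))
```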
For DeepSeek-V4-Flash this is especially impactful: 256 routed experts × 43 layers × 3 projections per expert (gate, up, down) means MoE weights dominate the bundle. Codebook quant cuts the routed-expert footprint ~12% beyond uniform 2-bit affine while preserving signal-to-noise within ~0.01 logit RMS of fp16.
| | JANG_2L (affine) | JANGTQ | Δ |
|---|---|---|---|
| Disk size | 96.6 GB | 79.5 GB | −18% |
| GPU memory | ~96 GB | ~79 GB | −18% |
| Avg bits/param | 2.7 | ~2.25 | −0.45 |
| Decode speed (M3 Ultra) | ~22 tok/s | 20.5 tok/s | ~93% of affine |
| Routed-expert format | affine 2-bit gsz=32 | codebook 2-bit gsz=32 | smaller + tighter |
The codebook is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights. Same bit budget, more faithful reconstruction.
DSV4-Flash Architecture (what's special)
DeepSeek-V4-Flash introduces several departures from V3.x — all handled by the JANG runtime:
- MQA, not classic MLA: `n_kv_heads=1`, `head_dim=512`. Single shared K/V projection across all 64 heads. The KV cache stores 1024 floats per token (already minimal; no compression possible).
- mHC residual stream (`hc_mult=4`): the hidden state between layers is `(B, L, 4, 4096)`. Each block runs a Sinkhorn-balanced collapse → attn/ffn → expand. Substantial compute on top of the standard transformer block.
- Hybrid Compressor + Indexer attention: 41 of 43 layers run a pooled-KV compressor (`compress_ratio` 4 or 128, alternating). The `compress_ratio=4` layers also run an Indexer to top-k pooled positions for global context.
- Two RoPE configs per model: layers with `compress_ratio=0` use base `rope_theta=10000` (no YaRN); the others use `compress_rope_theta=160000` with YaRN scaling. Mixing these up blows the residual stream past bf16 inf by layer 40 (see the sketch after this list).
- Inverse-rope on attention output: required for coherent decode with our convention.
- MTP head (`num_nextn_predict_layers=1`): held back from the regular forward pass; intended for self-spec decode.
- 1M max position via YaRN scaling. The Compressor + Indexer make this affordable at decode.
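A minimal sketch of the per-layer RoPE selection implied above. Only the branching rule comes from the description; the config key name `compress_ratio_per_layer` is an illustrative assumption.

```python
# Hypothetical sketch of picking the RoPE base per layer from the DSV4 config.
# Rule from the model card: compress_ratio == 0 -> plain RoPE with theta=10000
# and no YaRN; otherwise theta=160000 with YaRN scaling.
def rope_params_for_layer(config: dict, layer_idx: int) -> dict:
    compress_ratio = config["compress_ratio_per_layer"][layer_idx]  # assumed key name
    if compress_ratio == 0:
        return {"theta": config.get("rope_theta", 10000), "yarn": False}
    return {"theta": config.get("compress_rope_theta", 160000), "yarn": True}

# Example: a toy 4-layer pattern alternating compressor ratios 0 / 4 / 128 / 4.
toy_config = {"compress_ratio_per_layer": [0, 4, 128, 4]}
print([rope_params_for_layer(toy_config, i) for i in range(4)])
```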
Speed Benchmarks (Mac Studio M3 Ultra)
| Test | Result |
|---|---|
| Load time | 8.8 sec |
| Prefill (cold compile) | ~23 sec |
| Decode steady-state | 20.54 tok/s |
| First-token sample | "One, two, three, four, five. Five, four, three, two, one." |
Bench: `python3 bench_dsv4.py <bundle> --tokens 25 --warmup 3`. Greedy decode, no chat template (raw prompt).
This is 93% of the public mlx-lm PR #1192 ceiling (21.86 tok/s) and within striking distance of the absolute architectural ceiling. Further speedups require the path documented in `research/DSV4-RUNTIME-ARCHITECTURE.md` §31:

- `lm_head` → `mx.quantized_matmul` swap (skip dequant materialization); see the sketch after this list
- Lazy Compressor matmul deferral (buffer x, fire only at boundary)
- Cross-Projection Batching in Swift (50-100% gain)
- MTP self-spec decode (1.5-1.7×)
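A rough sketch of the first item, assuming an affine-quantized `lm_head` whose packed `weight`, `scales`, and `biases` buffers live on the module (the attribute names and the `group_size`/`bits` defaults are assumptions):

```python
# Hypothetical sketch of the lm_head swap: instead of dequantizing the full
# vocab projection and calling mx.matmul, feed the packed weights straight to
# mx.quantized_matmul so the dense copy is never materialized.
import mlx.core as mx

def lm_head_logits(hidden, lm_head, group_size=32, bits=8):
    # hidden: (batch, hidden_dim); lm_head.weight is packed uint32,
    # lm_head.scales / lm_head.biases are the per-group affine parameters.
    return mx.quantized_matmul(
        hidden,
        lm_head.weight,
        lm_head.scales,
        lm_head.biases,
        transpose=True,
        group_size=group_size,
        bits=bits,
    )
```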
Usage
Easiest: MLX Studio
Download MLX Studio and pick this model from the JANGQ-AI catalog. It auto-installs the JANGTQ runtime, wires the DSV4 chat template, and routes through the custom Metal kernels.
Manual via jang-tools (Python)
```bash
pip install "jang>=2.5.8"
huggingface-cli download JANGQ-AI/DeepSeek-V4-Flash-JANGTQ --local-dir ./DSV4-Flash-JANGTQ
```
```python
import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("./DSV4-Flash-JANGTQ")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)

input_ids = mx.array(tokenizer.encode(prompt))[None, :]
cache = model.make_cache()
logits = model(input_ids, cache=cache)
# ... greedy decode loop
```
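A minimal greedy decode loop to finish the example; a sketch assuming `model(ids, cache=cache)` returns logits shaped `(batch, seq, vocab)` and updates the cache in place:

```python
# Minimal greedy decode sketch; sampling, stop strings, and streaming omitted.
max_new_tokens = 64
generated = []
next_token = mx.argmax(logits[:, -1, :], axis=-1)     # (1,)
for _ in range(max_new_tokens):
    token_id = int(next_token.item())
    if token_id == tokenizer.eos_token_id:
        break
    generated.append(token_id)
    logits = model(next_token[None, :], cache=cache)   # feed one token per step
    next_token = mx.argmax(logits[:, -1, :], axis=-1)
print(tokenizer.decode(generated))
```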
The JANGTQ runtime patches `SwitchGLU` with `TurboQuantSwitchGLU`, applies P15 `mx.compile` to the router, and folds `wq_a` + `wkv` into one quantized matmul per attention layer (P19 MLA fusion).
Bundle contents
| File | Purpose |
|---|---|
| `model-*.safetensors` | Quantized weights (80 shards, ~79 GB) |
| `model.safetensors.index.json` | Tensor → shard mapping |
| `config.json` | DSV4 architecture + Swift Codable keys (`routed_expert_bits`, `group_size`, `mxtq_seed`) |
| `jang_config.json` | JANG profile metadata + chat encoding pointer |
| `jangtq_runtime.safetensors` | Sidecar: signs + codebooks for Swift consumers (Python regenerates at load) |
| `tokenizer.json`, `tokenizer_config.json` | DeepSeek tokenizer + baked-in `chat_template` |
| `generation_config.json` | EOS / pad token defaults |
| `encoding/encoding_dsv4.py` | Canonical Python encoder (full tool calling, thinking mode) |
Chat template
The bundle ships a Jinja port of `encoding_dsv4.encode_messages` baked into `tokenizer_config.json::chat_template`. This covers:

- BOS / EOS tokens (`<|begin▁of▁sentence|>` / `<|end▁of▁sentence|>`)
- system / user / assistant / tool roles
- thinking mode (`enable_thinking=true` → `<think>…</think>`)
- `reasoning_effort='max'` system prefix
- tool result inline formatting
For full upstream features (function-call markup, DSML token, internal task tokens), use the `encoding/encoding_dsv4.py` module directly.
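For example, thinking mode can be requested through the baked-in template (a sketch; it assumes `apply_chat_template` forwards the extra keyword argument to the Jinja template, which recent tokenizer versions do):

```python
# Sketch: enabling thinking mode via the baked-in chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # template emits the <think>…</think> scaffold
)
print(prompt)
```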
Sister bundles
| Bundle | Format | Size | Use case |
|---|---|---|---|
| `JANGQ-AI/DeepSeek-V4-Flash-JANG_2L` | jang affine 2-bit gsz=32 | 96.6 GB | Vanilla mlx_lm compatible |
| `JANGQ-AI/DeepSeek-V4-Flash-JANGTQ` (this) | jangtq codebook 2-bit | 79.5 GB | Smallest + highest quality 2-bit |
| `JANGQ-AI/DeepSeek-V4-Flash-JANGTQ4` | jangtq codebook 4-bit | ~140 GB | 4-bit codebook for max quality |
Citation
If you use this model in research, please cite:
```bibtex
@misc{jang2026,
  title  = {JANG: Adaptive Mixed-Precision Quantization for Apple Silicon},
  author = {Jinho Jang},
  year   = {2026},
  url    = {https://huggingface.co/JANGQ-AI},
}
```
License
DeepSeek-V4-Flash retains its original license. See the LICENSE file
shipped in the bundle and the upstream model card at
deepseek-ai/DeepSeek-V4-Flash.