⚠️ Important — config metadata fix (2026-04-24)
Earlier versions of this bundle shipped with an incorrect `config.json["quantization"]`: top-level `bits=2` only, with no per-module overrides and no `mode` field. MLX's loader therefore applied the 2-bit dequant kernel to weights stored as 8-bit affine (attention / embed / head / shared experts / compressor / indexer), producing silent garbage on long prompts. Outputs looked plausible on tiny prompts ("count to five"), but coding/reasoning tasks scored far below the model's true capability. The fix is metadata-only; weights on disk are unchanged.

`config.json` now ships with `bits=8 group_size=32 mode=affine` at the top level plus per-module overrides matching the actual on-disk bit-widths (matching reference quants like `Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine` and `mlx-community/DeepSeek-V4-Flash-4bit`).

If you downloaded this bundle before 2026-04-24, redownload `config.json`; nothing else changed. No runtime, decode, or weight changes are required in any consumer (Python `load_jangtq_model`, mlx-lm, Swift vMLX, etc.).

Verification: `pass@1` on the first 12 HumanEval problems jumped 42% → 67% from patching only the config. The packing identity was verified across all 522 (JANGTQ) / 129+ (JANGTQ4) / 34,314 (JANG_2L) quantized layers; every layer satisfies `32 × weight.packed_dim = bits × n_groups × group_size`.

Background: `research/JANGTQ-CONFIG-METADATA-BUG-2026-04-24.md` in the JANGQ-AI/jang source repo (private).
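A minimal sketch of that per-layer identity check, assuming MLX-style packing (each quantized module stores a packed-`uint32` `*.weight` plus a per-group `*.scales` tensor) and the usual per-module override layout in `config.json["quantization"]`; the shard filename is illustrative:

```python
# Sketch: verify 32 * weight.packed_dim == bits * n_groups * group_size on one shard.
import json
import mlx.core as mx

cfg = json.load(open("DSV4-Flash-JANGTQ/config.json"))
quant = cfg["quantization"]  # top-level bits/group_size/mode + per-module overrides

tensors = mx.load("DSV4-Flash-JANGTQ/model-00001-of-00080.safetensors")
for name, scales in tensors.items():
    if not name.endswith(".scales"):
        continue  # only quantized tensors carry a .scales sidecar
    module = name[: -len(".scales")]
    override = quant.get(module, {})
    if not isinstance(override, dict):
        override = {}  # e.g. a `false` override means "not quantized"
    bits = override.get("bits", quant["bits"])
    group_size = override.get("group_size", quant["group_size"])
    weight = tensors[module + ".weight"]  # packed uint32, last dim = packed_dim
    assert 32 * weight.shape[-1] == bits * scales.shape[-1] * group_size, module
```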
DeepSeek-V4-Flash JANGTQ
DeepSeek V4 Flash 284B/13B-active MoE — 2-bit codebook + Hadamard, 79.5 GB
The smallest, highest-quality DSV4-Flash on Apple Silicon.
⚠️ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles the DSV4 chat template (thinking + tool calling), and uses the custom Metal kernels this model needs. Stock `mlx_lm.load()` will NOT load this model; see the usage instructions below.
Follow development on Twitter: @jangq_ai
What is JANGTQ?
JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG
quantization format. Routed expert weights stay in a compact codebook +
Hadamard-rotated form at runtime — no decompression to affine — and the matmul
path uses custom Metal kernels that read packed uint32 weights, look up
centroids in a 4-entry codebook (Lloyd-Max optimized), and accumulate dot
products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).
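The arithmetic on that hot path looks roughly like the sketch below. This is an illustrative NumPy version, not the Metal kernel: the little-endian 2-bit unpacking order, the one-scale-per-group layout, and the sign-flip-style rotation of the input are all assumptions.

```python
# Illustrative NumPy sketch of the JANGTQ matmul path (not the Metal kernel).
# Assumes: 2-bit indices packed little-endian into uint32 (16 weights per word),
# a 4-entry codebook, one scale per group of 32 weights, and an input that has
# already been Hadamard-rotated once before the dot product.
import numpy as np

def jangtq_row_dot(x_rot, packed_row, codebook, scales, group_size=32):
    """Dot product of a rotated input with one codebook-quantized weight row."""
    shifts = np.arange(16, dtype=np.uint32) * 2
    idx = (packed_row[:, None] >> shifts) & 0b11        # (n_words, 16) 2-bit indices
    w = codebook[idx.reshape(-1)]                        # centroid lookup
    w = w.reshape(-1, group_size) * scales[:, None]      # per-group scale
    return float(x_rot @ w.reshape(-1))

# Tiny usage example: 128 weights = 8 packed uint32 words = 4 groups of 32.
rng = np.random.default_rng(0)
packed = rng.integers(0, 2**32, size=8, dtype=np.uint32)
codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)  # Lloyd-Max-style centroids
scales = (rng.standard_normal(4).astype(np.float32)) ** 2
x_rot = rng.standard_normal(128).astype(np.float32)            # input rotated once
print(jangtq_row_dot(x_rot, packed, codebook, scales))
```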
For DeepSeek-V4-Flash this is especially impactful: 256 routed experts × 43 layers × 3 projections per expert (gate, up, down) means MoE weights dominate the bundle. Codebook quant cuts the routed-expert footprint ~12% beyond uniform 2-bit affine while preserving signal-to-noise within ~0.01 logit RMS of fp16.
| | JANG_2L (affine) | JANGTQ | Δ |
|---|---|---|---|
| Disk size | 96.6 GB | 79.5 GB | −18% |
| GPU memory | ~96 GB | ~79 GB | −18% |
| Avg bits/param | 2.7 | ~2.25 | −0.45 |
| Decode speed (M3 Ultra) | ~22 tok/s | 20.5 tok/s | ~93% of affine |
| Routed-expert format | affine 2-bit gsz=32 | codebook 2-bit gsz=32 | smaller + tighter |
The codebook is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights. Same bit budget, more faithful reconstruction.
DSV4-Flash Architecture (what's special)
DeepSeek-V4-Flash introduces several departures from V3.x — all handled by the JANG runtime:
- MQA, not classic MLA: `n_kv_heads=1`, `head_dim=512`. Single shared K/V projection across all 64 heads. The KV cache stores 1024 floats per token (already minimal; no compression possible).
- mHC residual stream (`hc_mult=4`): the hidden state between layers is `(B, L, 4, 4096)`. Each block runs a Sinkhorn-balanced collapse → attn/ffn → expand. Substantial compute on top of the standard transformer block.
- Hybrid Compressor + Indexer attention: 41 of 43 layers run a pooled-KV compressor (`compress_ratio` 4 or 128, alternating). The `compress_ratio=4` layers also run an Indexer to top-k pooled positions for global context.
- Two RoPE configs per model: layers with `compress_ratio=0` use base `rope_theta=10000` (no YaRN); the others use `compress_rope_theta=160000` with YaRN scaling. Mixing these up blows the residual stream past bf16 inf by layer 40 (see the sketch after this list).
- Inverse-rope on attention output: required for coherent decode with our convention.
- MTP head (`num_nextn_predict_layers=1`): held back from the regular forward pass; intended for self-spec decode.
- 1M max position via YaRN scaling. The Compressor + Indexer make this affordable at decode.
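A minimal sketch of the per-layer RoPE selection implied above. Only the branching rule comes from the description; the config key name `compress_ratio_per_layer` is an illustrative assumption.

```python
# Hypothetical sketch of picking the RoPE base per layer from the DSV4 config.
# Rule from the model card: compress_ratio == 0 -> plain RoPE with theta=10000
# and no YaRN; otherwise theta=160000 with YaRN scaling.
def rope_params_for_layer(config: dict, layer_idx: int) -> dict:
    compress_ratio = config["compress_ratio_per_layer"][layer_idx]  # assumed key name
    if compress_ratio == 0:
        return {"theta": config.get("rope_theta", 10000), "yarn": False}
    return {"theta": config.get("compress_rope_theta", 160000), "yarn": True}

# Example: a toy 4-layer pattern alternating compressor ratios 0 / 4 / 128 / 4.
toy_config = {"compress_ratio_per_layer": [0, 4, 128, 4]}
print([rope_params_for_layer(toy_config, i) for i in range(4)])
```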
Speed Benchmarks (Mac Studio M3 Ultra)
| Test | Result |
|---|---|
| Load time | 8.8 sec |
| Prefill (cold compile) | ~23 sec |
| Decode steady-state | 20.54 tok/s |
| First-token sample | "One, two, three, four, five. Five, four, three, two, one." |
Bench: `python3 bench_dsv4.py <bundle> --tokens 25 --warmup 3`. Greedy decode, no chat template (raw prompt).
This is 93% of the public mlx-lm PR #1192 ceiling (21.86 tok/s) and within striking distance of the absolute architectural ceiling. Further speedups require the path documented in `research/DSV4-RUNTIME-ARCHITECTURE.md` §31:

- `lm_head` → `mx.quantized_matmul` swap (skip dequant materialization); see the sketch after this list
- Lazy Compressor matmul deferral (buffer x, fire only at boundary)
- Cross-Projection Batching in Swift (50-100% gain)
- MTP self-spec decode (1.5-1.7×)
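A rough sketch of the first item, assuming an affine-quantized `lm_head` whose packed `weight`, `scales`, and `biases` buffers live on the module (the attribute names and the `group_size`/`bits` defaults are assumptions):

```python
# Hypothetical sketch of the lm_head swap: instead of dequantizing the full
# vocab projection and calling mx.matmul, feed the packed weights straight to
# mx.quantized_matmul so the dense copy is never materialized.
import mlx.core as mx

def lm_head_logits(hidden, lm_head, group_size=32, bits=8):
    # hidden: (batch, hidden_dim); lm_head.weight is packed uint32,
    # lm_head.scales / lm_head.biases are the per-group affine parameters.
    return mx.quantized_matmul(
        hidden,
        lm_head.weight,
        lm_head.scales,
        lm_head.biases,
        transpose=True,
        group_size=group_size,
        bits=bits,
    )
```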
Usage
Easiest: MLX Studio
Download MLX Studio and pick this model from the JANGQ-AI catalog. It auto-installs the JANGTQ runtime, wires the DSV4 chat template, and routes through the custom Metal kernels.
Manual via jang-tools (Python)
```bash
pip install "jang>=2.5.8"
huggingface-cli download JANGQ-AI/DeepSeek-V4-Flash-JANGTQ --local-dir ./DSV4-Flash-JANGTQ
```
```python
import mlx.core as mx
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("./DSV4-Flash-JANGTQ")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)

input_ids = mx.array(tokenizer.encode(prompt))[None, :]
cache = model.make_cache()
logits = model(input_ids, cache=cache)
# ... greedy decode loop
```
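A minimal greedy decode loop to finish the example; a sketch assuming `model(ids, cache=cache)` returns logits shaped `(batch, seq, vocab)` and updates the cache in place:

```python
# Minimal greedy decode sketch; sampling, stop strings, and streaming omitted.
max_new_tokens = 64
generated = []
next_token = mx.argmax(logits[:, -1, :], axis=-1)     # (1,)
for _ in range(max_new_tokens):
    token_id = int(next_token.item())
    if token_id == tokenizer.eos_token_id:
        break
    generated.append(token_id)
    logits = model(next_token[None, :], cache=cache)   # feed one token per step
    next_token = mx.argmax(logits[:, -1, :], axis=-1)
print(tokenizer.decode(generated))
```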
The JANGTQ runtime patches `SwitchGLU` with `TurboQuantSwitchGLU`, applies P15 `mx.compile` to the router, and folds `wq_a` + `wkv` into one quantized matmul per attention layer (P19 MLA fusion).
Bundle contents
| File | Purpose |
|---|---|
| `model-*.safetensors` | Quantized weights (80 shards, ~79 GB) |
| `model.safetensors.index.json` | Tensor → shard mapping |
| `config.json` | DSV4 architecture + Swift Codable keys (`routed_expert_bits`, `group_size`, `mxtq_seed`) |
| `jang_config.json` | JANG profile metadata + chat encoding pointer |
| `jangtq_runtime.safetensors` | Sidecar: signs + codebooks for Swift consumers (Python regenerates at load) |
| `tokenizer.json`, `tokenizer_config.json` | DeepSeek tokenizer + baked-in `chat_template` |
| `generation_config.json` | EOS / pad token defaults |
| `encoding/encoding_dsv4.py` | Canonical Python encoder (full tool calling, thinking mode) |
Chat template
The bundle ships a Jinja port of `encoding_dsv4.encode_messages` baked into `tokenizer_config.json::chat_template`. This covers:

- BOS / EOS tokens (`<|begin▁of▁sentence|>` / `<|end▁of▁sentence|>`)
- system / user / assistant / tool roles
- thinking mode (`enable_thinking=true` → `<think>…</think>`)
- `reasoning_effort='max'` system prefix
- tool result inline formatting
For full upstream features (function-call markup, DSML token, internal task tokens), use the `encoding/encoding_dsv4.py` module directly.
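For example, thinking mode can be requested through the baked-in template (a sketch; it assumes `apply_chat_template` forwards the extra keyword argument to the Jinja template, which recent tokenizer versions do):

```python
# Sketch: enabling thinking mode via the baked-in chat template.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,   # template emits the <think>…</think> scaffold
)
print(prompt)
```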
Sister bundles
| Bundle | Format | Size | Use case |
|---|---|---|---|
| `JANGQ-AI/DeepSeek-V4-Flash-JANG_2L` | jang affine 2-bit gsz=32 | 96.6 GB | Vanilla mlx_lm compatible |
| `JANGQ-AI/DeepSeek-V4-Flash-JANGTQ` (this) | jangtq codebook 2-bit | 79.5 GB | Smallest + highest quality 2-bit |
| `JANGQ-AI/DeepSeek-V4-Flash-JANGTQ4` | jangtq codebook 4-bit | ~140 GB | 4-bit codebook for max quality |
Citation
If you use this model in research, please cite:
```bibtex
@misc{jang2026,
  title  = {JANG: Adaptive Mixed-Precision Quantization for Apple Silicon},
  author = {Jinho Jang},
  year   = {2026},
  url    = {https://huggingface.co/JANGQ-AI},
}
```
License
DeepSeek-V4-Flash retains its original license. See the LICENSE file
shipped in the bundle and the upstream model card at
deepseek-ai/DeepSeek-V4-Flash.