⚠️ Important — config metadata fix (2026-04-24)

Earlier versions of this bundle shipped with an incorrect config.json["quantization"] block — top-level bits=2 only, with no per-module overrides and no mode field. MLX's loader therefore applied the 2-bit dequant kernel to weights stored as 8-bit affine (attention / embed / head / shared experts / compressor / indexer), producing silent garbage on long prompts. Output looked plausible on tiny prompts ("count to five"), but coding/reasoning tasks scored far below the model's true capability.

The fix is metadata-only; weights on disk are unchanged. config.json now ships with bits=8, group_size=32, mode=affine at the top level plus per-module overrides matching the actual on-disk bit-widths (consistent with reference quants such as Thump604/DeepSeek-V4-Flash-MLX-Q2-mixed-gs128-affine and mlx-community/DeepSeek-V4-Flash-4bit).
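For reference, the corrected metadata has roughly this shape (shown as a Python dict; the module path below is illustrative, not an exact key from this bundle's config.json):

quantization = {
    "bits": 8, "group_size": 32, "mode": "affine",   # top-level default (attention / embed / head / ...)
    # per-module overrides for tensors stored at other bit-widths, e.g.:
    "model.layers.3.mlp.switch_mlp.gate_proj": {"bits": 2, "group_size": 32},
}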

If you downloaded this bundle before 2026-04-24, re-download config.json — nothing else changed. No runtime, decode, or weight changes are required in any consumer (Python load_jangtq_model, mlx-lm, Swift vMLX, etc.).

Verification: pass@1 on the first 12 HumanEval problems jumped from 42% to 67% after patching only the config. The math identity was verified across all 522 (JANGTQ) / 129+ (JANGTQ4) / 34,314 (JANG_2L) quantized layers — every layer satisfies 32 × weight.packed_dim = bits × n_groups × group_size.
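A minimal sketch of that identity check (assuming the usual MLX layout of a packed weight tensor plus a scales tensor whose last dim is the number of groups; bits_for is a hypothetical helper that returns the per-module bits and group_size from the config.json overrides):

import glob
from safetensors import safe_open

def check_identity(bundle_dir, bits_for):
    for shard in glob.glob(f"{bundle_dir}/model-*.safetensors"):
        with safe_open(shard, framework="numpy") as f:
            for name in f.keys():
                if not name.endswith(".weight"):
                    continue
                scales_name = name.replace(".weight", ".scales")
                if scales_name not in f.keys():
                    continue                     # unquantized tensor, skip
                packed_dim = f.get_slice(name).get_shape()[-1]
                n_groups = f.get_slice(scales_name).get_shape()[-1]
                bits, group_size = bits_for(name)
                assert 32 * packed_dim == bits * n_groups * group_size, name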

Background: research/JANGTQ-CONFIG-METADATA-BUG-2026-04-24.md in the JANGQ-AI/jang source repo (private).

DeepSeek-V4-Flash JANGTQ

DeepSeek V4 Flash 284B/13B-active MoE — 2-bit codebook + Hadamard, 79.5 GB

The smallest, highest-quality DSV4-Flash on Apple Silicon.

⚠️ Recommended: Run in MLX Studio for the best experience. MLX Studio bundles the JANGTQ runtime, handles the DSV4 chat template (thinking + tool calling), and uses the custom Metal kernels this model needs. Stock mlx_lm.load() will NOT load this model — see usage instructions below.

Follow development on Twitter: @jangq_ai


What is JANGTQ?

JANGTQ (JANG TurboQuant) is the most-compressed, highest-quality JANG quantization format. Routed expert weights stay in a compact codebook + Hadamard-rotated form at runtime — no decompression to affine — and the matmul path uses custom Metal kernels that read packed uint32 weights, look up centroids in a 4-entry codebook (Lloyd-Max optimized), and accumulate dot products against a Hadamard-rotated input (QuIP# "rotate-input-once" math).
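A rough NumPy sketch of that lookup math (not the Metal kernel; packing order, scale granularity, and names are illustrative only):

import numpy as np

def unpack_2bit(packed):
    # packed: (n // 16,) uint32 words, 16 two-bit codebook indices per word
    shifts = np.arange(16, dtype=np.uint32) * 2
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)

def rotated_dot(packed_row, codebook, scales, x_rot, group_size=32):
    # codebook: (4,) Lloyd-Max centroids; scales: one scale per group of 32
    # x_rot: the layer input already multiplied by the Hadamard rotation
    w = codebook[unpack_2bit(packed_row)] * np.repeat(scales, group_size)
    return w @ x_rot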

For DeepSeek-V4-Flash this is especially impactful: 256 routed experts × 43 layers × 3 projections per expert (gate, up, down) means MoE weights dominate the bundle. Codebook quantization cuts the routed-expert footprint ~12% beyond uniform 2-bit affine while keeping outputs within ~0.01 logit RMS of fp16.

|                         | JANG_2L (affine)    | JANGTQ                | Δ                 |
|-------------------------|---------------------|-----------------------|-------------------|
| Disk size               | 96.6 GB             | 79.5 GB               | −18%              |
| GPU memory              | ~96 GB              | ~79 GB                | −18%              |
| Avg bits/param          | 2.7                 | ~2.25                 | −0.45             |
| Decode speed (M3 Ultra) | ~22 tok/s           | 20.5 tok/s            | ~93% of affine    |
| Routed-expert format    | affine 2-bit gsz=32 | codebook 2-bit gsz=32 | smaller + tighter |

The codebook is strictly more expressive than uniform 2-bit affine for the Gaussian-ish distribution of Hadamard-rotated weights. Same bit budget, more faithful reconstruction.
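As a toy illustration of that claim (not the JANG pipeline, which also applies the Hadamard rotation and per-group codebook scaling), the sketch below compares reconstruction error of a per-group uniform 2-bit affine grid against a 4-entry Lloyd-Max codebook on Gaussian samples:

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(131072).reshape(-1, 32)      # groups of 32, Gaussian-ish weights

# Uniform 2-bit affine, per group: 4 evenly spaced levels over [min, max].
lo = w.min(axis=1, keepdims=True)
hi = w.max(axis=1, keepdims=True)
step = (hi - lo) / 3
affine = lo + step * np.round((w - lo) / step)

# 4-entry Lloyd-Max codebook: plain Lloyd iterations over all values.
flat = w.ravel()
centroids = np.array([-1.5, -0.5, 0.5, 1.5])
for _ in range(30):
    idx = np.abs(flat[:, None] - centroids).argmin(axis=1)
    centroids = np.array([flat[idx == k].mean() for k in range(4)])
codebook = centroids[idx].reshape(w.shape)

print("affine 2-bit MSE:  ", float(np.mean((w - affine) ** 2)))
print("codebook 2-bit MSE:", float(np.mean((w - codebook) ** 2)))

On this toy data the codebook reconstruction lands measurably below the uniform grid at the same 2-bit budget, which is the intuition behind the footprint/quality numbers above.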


DSV4-Flash Architecture (what's special)

DeepSeek-V4-Flash introduces several departures from V3.x — all handled by the JANG runtime:

  • MQA, not classic MLA: n_kv_heads=1, head_dim=512. Single shared K/V projection across all 64 heads. KV cache stores 1024 floats per token — already minimal, no compression possible.
  • mHC residual stream (hc_mult=4): hidden state between layers is (B, L, 4, 4096). Each block runs a Sinkhorn-balanced collapse → attn/ffn → expand. Substantial compute on top of the standard transformer block.
  • Hybrid Compressor + Indexer attention: 41 of 43 layers run a pooled-KV compressor (compress_ratio 4 or 128, alternating). The compress_ratio=4 layers also run an Indexer that selects the top-k pooled positions for global context.
  • Two RoPE configs per model: layers with compress_ratio=0 use base rope_theta=10000 (no YaRN); the others use compress_rope_theta=160000 with YaRN scaling. Mixing these up drives the residual stream past bf16 range by layer 40 (see the sketch after this list).
  • Inverse-rope on attention output: required for coherent decode with our convention.
  • MTP head (num_nextn_predict_layers=1): held back from regular forward; intended for self-spec decode.
  • 1M max position via YaRN scaling. Compressor + Indexer make this affordable at decode.
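A minimal sketch of the per-layer RoPE selection described above (field names such as compress_ratios and rope_scaling are hypothetical; only rope_theta and compress_rope_theta are named in this card):

def rope_params_for_layer(cfg: dict, layer_idx: int) -> dict:
    ratio = cfg["compress_ratios"][layer_idx]        # e.g. 0, 4, or 128
    if ratio == 0:
        # uncompressed layers: base theta, no YaRN
        return {"theta": cfg["rope_theta"], "yarn": None}
    # compressed layers: long-context theta with YaRN scaling
    return {"theta": cfg["compress_rope_theta"], "yarn": cfg["rope_scaling"]}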

Speed Benchmarks (Mac Studio M3 Ultra)

| Test                    | Result                                                      |
|-------------------------|-------------------------------------------------------------|
| Load time               | 8.8 sec                                                     |
| Prefill (cold compile)  | ~23 sec                                                     |
| Decode steady-state     | 20.54 tok/s                                                 |
| First-token sample      | "One, two, three, four, five. Five, four, three, two, one." |

Bench: python3 bench_dsv4.py <bundle> --tokens 25 --warmup 3. Greedy decode, no chat template (raw prompt).

This is 93% of the public mlx-lm PR #1192 ceiling (21.86 tok/s) and within striking distance of the absolute architectural ceiling. Further speedups require the path documented in research/DSV4-RUNTIME-ARCHITECTURE.md §31:

  • lm_head mx.quantized_matmul swap (skip dequant materialization; see the sketch after this list)
  • Lazy Compressor matmul deferral (buffer x, fire only at boundary)
  • Cross-Projection Batching in Swift (50-100% gain)
  • MTP self-spec decode (1.5-1.7×)
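A minimal sketch of the first item (tensor names are hypothetical; group_size/bits must match the lm_head override in config.json): instead of materializing a dequantized fp16 weight and running a dense matmul, feed the packed weight, scales, and biases straight to mx.quantized_matmul.

import mlx.core as mx

def lm_head_logits(hidden, packed_w, scales, biases, group_size=32, bits=8):
    # hidden: (B, L, D); packed_w/scales/biases as stored in the bundle
    return mx.quantized_matmul(
        hidden, packed_w, scales, biases,
        transpose=True, group_size=group_size, bits=bits,
    )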

Usage

Easiest: MLX Studio

Download MLX Studio and pick this model from the JANGQ-AI catalog. It auto-installs the JANGTQ runtime, wires the DSV4 chat template, and routes through the custom Metal kernels.

Manual via jang-tools (Python)

pip install "jang>=2.5.8"
huggingface-cli download JANGQ-AI/DeepSeek-V4-Flash-JANGTQ --local-dir ./DSV4-Flash-JANGTQ
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("./DSV4-Flash-JANGTQ")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)

import mlx.core as mx
input_ids = mx.array(tokenizer.encode(prompt))[None, :]
cache = model.make_cache()
logits = model(input_ids, cache=cache)
# ... greedy decode loop
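# For completeness, a minimal greedy decode sketch (assumes model(ids, cache=cache)
# returns logits of shape (1, T, vocab) and updates the cache in place):
generated = []
next_id = mx.argmax(logits[:, -1, :], axis=-1)       # shape (1,)
for _ in range(256):
    tok = int(next_id.item())
    if tok == tokenizer.eos_token_id:
        break
    generated.append(tok)
    logits = model(next_id[None, :], cache=cache)     # feed one token back
    next_id = mx.argmax(logits[:, -1, :], axis=-1)
print(tokenizer.decode(generated))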

The JANGTQ runtime patches SwitchGLU with TurboQuantSwitchGLU, applies P15 mx.compile to the router, and folds wq_a + wkv into one quantized matmul per attention layer (P19 MLA fusion).

Bundle contents

| File                                  | Purpose                                                                             |
|---------------------------------------|-------------------------------------------------------------------------------------|
| model-*.safetensors                   | Quantized weights (80 shards, ~79 GB)                                               |
| model.safetensors.index.json          | Tensor → shard mapping                                                              |
| config.json                           | DSV4 architecture + Swift Codable keys (routed_expert_bits, group_size, mxtq_seed)  |
| jang_config.json                      | JANG profile metadata + chat encoding pointer                                       |
| jangtq_runtime.safetensors            | Sidecar: signs + codebooks for Swift consumers (Python regenerates at load)         |
| tokenizer.json, tokenizer_config.json | DeepSeek tokenizer + baked-in chat_template                                         |
| generation_config.json                | EOS / pad token defaults                                                            |
| encoding/encoding_dsv4.py             | Canonical Python encoder (full tool calling, thinking mode)                         |

Chat template

The bundle ships a Jinja port of encoding_dsv4.encode_messages baked into tokenizer_config.json::chat_template. This covers:

  • BOS / EOS tokens (<|begin▁of▁sentence|> / <|end▁of▁sentence|>)
  • system / user / assistant / tool roles
  • thinking mode (enable_thinking=true emits <think>…</think> blocks)
  • reasoning_effort='max' system prefix
  • Tool result inline formatting

For full upstream features (function-call markup, DSML token, internal task tokens), use the encoding/encoding_dsv4.py module directly.
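For example, assuming enable_thinking and reasoning_effort are exposed as template kwargs (as the feature list above suggests), a prompt with both enabled might look like:

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Refactor this function to be tail-recursive."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,        # emit the <think>…</think> scaffold
    reasoning_effort="max",      # add the reasoning-effort system prefix
)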


Sister bundles

| Bundle                                    | Format                   | Size    | Use case                         |
|-------------------------------------------|--------------------------|---------|----------------------------------|
| JANGQ-AI/DeepSeek-V4-Flash-JANG_2L        | jang affine 2-bit gsz=32 | 96.6 GB | Vanilla mlx_lm compatible        |
| JANGQ-AI/DeepSeek-V4-Flash-JANGTQ (this)  | jangtq codebook 2-bit    | 79.5 GB | Smallest + highest-quality 2-bit |
| JANGQ-AI/DeepSeek-V4-Flash-JANGTQ4        | jangtq codebook 4-bit    | ~140 GB | 4-bit codebook for max quality   |

Citation

If you use this model in research, please cite:

@misc{jang2026,
  title={JANG: Adaptive Mixed-Precision Quantization for Apple Silicon},
  author={Jinho Jang},
  year={2026},
  url={https://huggingface.co/JANGQ-AI},
}

License

DeepSeek-V4-Flash retains its original license. See the LICENSE file shipped in the bundle and the upstream model card at deepseek-ai/DeepSeek-V4-Flash.
