Qwen3.6-27B-JANG_4M-MTP

Qwen3.6-27B (dense) quantized with the JANG_4M importance-weighted mixed-precision profile for Apple Silicon, with the vision tower and the native Multi-Token-Prediction head preserved and enabled.


Source	Qwen/Qwen3.6-27B
License	Apache-2.0, inherited from upstream
Format	JANG v2 — `JANG_4M` profile (`mx.quantize`, asymmetric, `block_size=64`)
Architecture	`qwen3_5` dense — hybrid GatedDeltaNet + full attention, has vision
Modality	image + video + text
Bundle size	16.6 GB
Effective bits	4.45 avg (4-bit floor, 8-bit on important tensors)
MTP	native head preserved, enabled (`num_nextn_predict_layers=1`)

Why JANG_4M

JANG_4M is JANG's standard importance-weighted profile. Instead of applying one flat bit width, it scores each tensor by weight magnitude, keeps the tensors that score as important at 8 bits, and quantizes the rest at a 4-bit floor. Quantization is MSE-calibrated, asymmetric affine, done with MLX-native `mx.quantize`. The result here is 4.45 effective bits: sharper than a flat MXFP4 bundle and materially smaller than flat MXFP8 (16.6 GB versus 27.1 GB). Norms and control tensors stay in fp16 passthrough.
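To make the two ideas in this section concrete, here is a minimal sketch in plain Python. It is not JANG's code and not the MLX kernel: `quantize_group`, `dequantize_group`, and `eight_bit_fraction` are hypothetical helpers, MSE calibration is left out (the sketch uses the plain min/max range of each group), and the last function assumes the 4.45-bit average counts only the 4-bit and 8-bit weight payloads, without per-group scale and offset overhead.

```python
def quantize_group(values, bits):
    """Asymmetric affine quantization of one group of weights.

    Maps the group's [min, max] range onto the integer codes 0 .. 2**bits - 1.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate weights from codes, scale, and offset."""
    return [c * scale + lo for c in codes]


def eight_bit_fraction(average_bits, low_bits=4, high_bits=8):
    """Share of weights that must sit at high_bits to reach average_bits.

    Assumption: every weight is stored at either low_bits or high_bits.
    """
    return (average_bits - low_bits) / (high_bits - low_bits)


# One group of 64 synthetic weights, matching the block size in the Format row.
group = [((i * 37) % 64) / 64 - 0.5 for i in range(64)]

for bits in (4, 8):
    codes, scale, lo = quantize_group(group, bits)
    restored = dequantize_group(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"{bits}-bit: step={scale:.5f}, worst error={worst:.5f}")

print(f"8-bit share for a 4.45-bit average: {eight_bit_fraction(4.45):.4f}")
```

Under that assumption, about 11% of the weights would be stored at 8 bits. The worst-case rounding error is at most half the step size, and for the same range the 8-bit step is 17 times smaller than the 4-bit step, which is why spending the extra bits on a small set of tensors can pay off.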

Multi-Token Prediction

This bundle keeps Qwen3.6's native MTP module and runs it as a self-speculative draft head. The MTP head proposes tokens, and the main model verifies them in a single pass, so decoded output stays bit-identical to plain autoregressive decoding while running faster.
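The draft-and-verify loop can be illustrated with a toy example. This is a sketch under stated assumptions, not the vMLX implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the MTP head, decoding is greedy, and the "single verification pass" is simulated by counting one pass per round even though the toy calls `target_next` position by position.

```python
def target_next(tokens):
    """Toy stand-in for the main model: a deterministic next-token function."""
    return (sum(tokens) * 31 + len(tokens)) % 17


def draft_next(tokens):
    """Toy stand-in for the draft head: right most of the time, wrong sometimes."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 17


def greedy_decode(prompt, n):
    """Plain autoregressive decoding: one main-model pass per token."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, depth):
    """Draft `depth` tokens, verify them, keep only what the main model agrees with."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        draft = []
        for _ in range(depth):
            draft.append(draft_next(out + draft))

        passes += 1  # a real runtime scores all drafted positions in one batched pass
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            accepted.append(expected)  # the main model's token is always the one kept
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            # every drafted token matched, so the same pass yields one extra token
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n], passes


prompt = [3, 1, 4]
baseline = greedy_decode(prompt, 32)
for depth in (1, 2, 3):
    tokens, passes = speculative_decode(prompt, 32, depth)
    print(f"depth {depth}: identical={tokens == baseline}, main-model passes={passes}")
```

Every emitted token comes from the main model's own prediction, so the output matches plain greedy decoding by construction. The gain comes from each verification pass emitting between 1 and `depth + 1` tokens instead of exactly one.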

Recorded on an M5 Max (vMLX runtime, 96-token deterministic prompt, output verified equal to baseline at every depth):

Draft depth	tok/s	Speedup
Baseline (MTP off)	24.2	1.00×
D1	37.6	1.55×
D2	43.3	1.79×
D3 (default)	44.1	1.82×

Absolute tok/s depends on free memory and system load. The speedup ratio, taken from baseline and MTP runs measured back-to-back under identical conditions, is the stable figure.
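The Speedup column is each row's throughput divided by the baseline throughput. A short check using only the figures from the table above (the `speedups` helper is a hypothetical name for illustration):

```python
# Throughput figures copied from the table above, in tokens per second.
measurements = {
    "Baseline (MTP off)": 24.2,
    "D1": 37.6,
    "D2": 43.3,
    "D3": 44.1,
}


def speedups(tok_per_s, baseline_key="Baseline (MTP off)"):
    """Divide every throughput by the baseline and round to two decimals."""
    base = tok_per_s[baseline_key]
    return {name: round(rate / base, 2) for name, rate in tok_per_s.items()}


ratios = speedups(measurements)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

In these numbers, going from D2 to D3 adds less than 1 tok/s, so the returns from deeper drafting flatten quickly.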

Vision, MTP and caching together

This bundle runs image/video input, native MTP speculative decode, and prefix/KV caching in the same session, a combination that not every MTP-enabled Qwen build exposes.
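For readers unfamiliar with prefix caching, the idea can be sketched as follows. This is a toy illustration only: `PrefixCache` is a hypothetical class, it stores a placeholder string where a runtime would keep real KV tensors, and it says nothing about how vMLX implements caching.

```python
class PrefixCache:
    """Toy prefix cache: remembers token sequences whose KV state was already computed."""

    def __init__(self):
        self._entries = {}  # tuple of tokens -> placeholder for the stored KV state

    def store(self, tokens):
        self._entries[tuple(tokens)] = f"kv-state[{len(tokens)} tokens]"

    def longest_prefix(self, tokens):
        """Return the length of the longest stored sequence that is a prefix of `tokens`."""
        best = 0
        for cached in self._entries:
            n = len(cached)
            if best < n <= len(tokens) and tuple(tokens[:n]) == cached:
                best = n
        return best


cache = PrefixCache()
shared_context = [101, 7, 7, 7, 42, 42]  # e.g. system prompt plus image tokens
cache.store(shared_context)

follow_up = shared_context + [900, 901, 902]  # same context, new user turn
reused = cache.longest_prefix(follow_up)
print(f"reuse {reused} cached tokens, prefill {len(follow_up) - reused} new ones")
```

Only the tokens after the cached prefix need a fresh prefill, which matters most when the shared context includes long image or video token runs.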

Loading

The bundle loads with stock mlx-lm or mlx-vlm on Apple Silicon. JANG_4M weights are stored in the native `mx.quantize` affine format, so the core model needs no custom JANG runtime.

```python
from mlx_vlm import load, generate
model, processor = load("JANGQ-AI/Qwen3.6-27B-JANG_4M-MTP")
```

The MTP draft path is exercised by an MTP-aware runtime (vMLX); other runtimes load and decode the main model normally and ignore the MTP head.

Related bundles

Flat-precision MXFP siblings of this model are published on OsaurusAI:

Variant	Format	Size	Best MTP speedup
`Qwen3.6-27B-MXFP4-MTP`	flat mxfp4	14.4 GB	1.85× (D2)
`Qwen3.6-27B-JANG_4M-MTP` (this)	JANG_4M mixed	16.6 GB	1.82× (D3)
`Qwen3.6-27B-MXFP8-MTP`	flat mxfp8	27.1 GB	1.83× (D3)

Credits

Quantization toolchain: JANG by Jinho Jang <eric@jangq.ai>
Base model: Qwen3.6-27B by Qwen
