File size: 12,108 Bytes
e7c8b12
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
99c0d4d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
library_name: transformers
pipeline_tag: image-text-to-text
language:
  - en
  - zh
tags:
  - prismaquant
  - compressed-tensors
  - nvfp4
  - mxfp8
  - quantized
  - multimodal
  - vision-language
  - mtp
  - speculative-decoding
  - vllm
  - qwen3.6
---

# Qwen3.6-27B — PrismaQuant 5.5 bpp

[![PrismaQuant source](https://img.shields.io/badge/PrismaQuant-GitHub-blue?logo=github)](https://github.com/RobTand/prismaquant)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE)
[![vLLM native](https://img.shields.io/badge/vLLM-compressed--tensors-orange)](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

Mixed-precision quantization of `Qwen/Qwen3.6-27B` produced by
[**PrismaQuant**](https://github.com/RobTand/prismaquant) — a per-Linear
sensitivity-driven allocator that chooses each Linear module's format
individually under a total-bit budget. Same allocator + activation-aware
export stack as the 35B-A3B sibling; sibling-coupling is pre-aggregated
into the DP so the achieved bpp hits the target exactly (5.500 not 5.28).

This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve —
see **[Why 5.5 bpp](#why-55-bpp)** below for the full sweep and
selection rationale.

---

## At a glance

| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 54 GB | **~19 GB** | **−65 %** |
| Fraction of original weights | 100 % | **35 %** | |
| Average bits per param | 16 | **5.50** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding head | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |

---

## Precision mix

Selected per-Linear by the allocator from measured Fisher sensitivity.
On this dense 27B the allocator hit the 5.5 bpp budget exactly:

| Format | W | A | Use | Count (after expansion) |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | **349** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **35** |
| **BF16** | 16-bit | 16-bit | Router-free dense top-k sensitivity + norms + biases + embed / lm_head / pos_embed | **112 (linear) + 352 (layer_passthrough)** |

The allocator **pre-aggregates fused-projection siblings** — `qkv_proj`
(q/k/v share one format) and `gate_up_proj` (gate+up share one format) —
as single DP items. Previously sibling coupling was enforced as a post-
pass that inflated the achieved bpp by up to 0.5 above target; the new
pre-aggregation path collapses each group into one multi-choice item so
the DP's solution is already sibling-consistent.

### Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

1. **GPTQ-OBS one-shot rounding** — block-wise error propagation along
   the group-quant structure using the calibration Hessian. Closed-form,
   not iterative.
2. **Closed-form per-group scale sweep** — for each 16-weight NVFP4
   group, enumerate `grid=32` candidate scales spanning
   `[0.5·s₀, 1.5·s₀]`, round each weight to its nearest codebook
   neighbor at every candidate scale, pick the (scale, rounding-set)
   configuration minimizing activation-weighted per-group MSE. Sub-second
   per Linear. Closed-form analog of Intel's AutoRound.

**Measured per-Linear output-MSE vs RTN baseline (family-level
measurement on Qwen3.6-35B-A3B; same pipeline applied here):**

| Pipeline variant | out_mse ratio vs RTN |
|---|---:|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| **GPTQ + scale_sweep (this artifact)** | **0.33** |

---

## Why 5.5 bpp

Before quantizing we ran the allocator across the full target sweep
`{4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25}` on the same Fisher-
probed + RTN-costed stats this artifact was built from. Thanks to
allocator pre-aggregation of fused siblings + convergence-based
tightening, every target lands its budget exactly — achieved = target
within 0.001 bpp — so the curve below is a true Δloss-vs-bpp trade-off
across the Pareto frontier, not an apples-to-oranges approximation.

| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---:|---:|---:|---:|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| **5.5** | **5.500** | **477** | **300 / 30 / 87** | **← this artifact** |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |

(Layer counts are at the un-expanded allocator level — per-Linear
expansion inflates each count 1.0-1.4× after broadcasting sibling-group
formats to members.)

**Selection rationale.** The Kneedle algorithm (Satopää et al.) places
the knee at **5.5 bpp**: on the normalized Δloss-vs-bpp curve, the
farthest point below the chord from `(min_bpp, max_Δloss)` to
`(max_bpp, min_Δloss)` is target 5.5. Reading across the frontier
instead of committing to a single anchor like "4.75" or "6" makes the
trade-off explicit:

- **Below 5.5** the loss curve steepens: 4.75 bpp saves 14% disk but
  pays **+48% Δloss**; 4.5 bpp saves 18% and pays **+99%**. Dense 27B
  can't be aggressively NVFP4'd the way MoE-A3B can, because every
  body Linear is active for every token — there are no "cheap"
  low-utilization experts to compress hard.
- **Above 5.5** the loss curve flattens: jumping to 6.0 bpp costs
  +9% disk for only −18% Δloss — a softer marginal gain than the
  knee's 5.25→5.5 step (−5% size, −12% Δloss in the right direction).
- **At the knee**, 5.5 bpp strikes the maximum distance from the
  chord — the point where further bit-budget buys less marginal
  Δloss reduction than the bits already spent.

PrismaQuant's precision mix at this knee: 300 Linears at NVFP4 (bulk
dense MLP + medium-sensitivity attention + visual), 30 at MXFP8 (high-
sensitivity dense Linears the allocator won't risk at 4-bit), 87 at
BF16 (highest-sensitivity Linears preserved lossless).

---

## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MLP, 64 layers)

- **Full attention** Linears (`q_proj` / `k_proj` / `v_proj` / `o_proj`):
  qkv siblings share one format per layer (pre-aggregated)
- **DeltaNet linear-attention** Linears (`in_proj_qkv` / `in_proj_z` /
  `in_proj_a` / `in_proj_b` / `in_proj_ba` / `out_proj`): each Linear's
  format chosen independently
- **Dense MLP** (`gate_proj` / `up_proj` / `down_proj`): gate+up
  siblings share one format per layer; down chosen independently

### Multi-token-prediction (MTP) head

- One full-attention + dense-MLP decoder layer at the model tail,
  quantized by the same per-Linear policy — so
  `--speculative-config method=mtp` drafts at the same precision
  profile as the body.

### Visual encoder (27 blocks — Qwen3.6-VL vision tower)

- **Fisher-driven per-Linear allocation:** 108 of 110 visual Linears
  got placed by the full DP allocator on the basis of per-Linear
  activation-weighted cost (8 multimodal calibration samples).
- **Remaining 2 un-probed visual Linears** (`patch_embed.proj` edges
  the probe didn't tap) stamped at NVFP4 uniformly.
- **`model.visual.pos_embed`** stays BF16 — it's a learnable Parameter,
  not an `nn.Linear`, and vLLM's compressed-tensors loader cannot
  consume a quantized Parameter layout.

### Passthrough (unquantized)

- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only
  accepts a single `weight` parameter. The allocator measures
  lm_head's Fisher sensitivity and would pick NVFP4 for it, but the
  compressed-tensors runtime rejects a compressed lm_head with
  `KeyError: lm_head.input_global_scale`. This is a vLLM runtime
  limitation, not a PrismaQuant design decision.
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed`

---

## Serving (vLLM only)

This artifact is **only** runnable via vLLM's stock `compressed-tensors`
support — there is no transformers-native runtime path for mixed NVFP4 +
MXFP8 today. vLLM 0.11+ or equivalent is required.

```bash
vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
    --trust-remote-code \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

- **FlashInfer** NVFP4 attention is picked up automatically; set
  `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` to make the preference
  explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for
  this family on DGX Spark (n=2 leaves ~10% tok/s on the table, n=4
  regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat
  API — no special flags.

A full recipe with the flashinfer-cutlass backends, reasoning/tool
parsers and chat-template pinning is available at
[`spark-vllm-fresh/recipes/qwen3.6-27b.yaml`](https://github.com/RobTand/prismaquant).

---

## Reproducing this artifact

Full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

1. **Sensitivity probe** — streaming per-shard empirical-Fisher trace
   (diagonal) across body + MTP + visual Linears. Shard granularity
   and layer-cache budget are auto-derived from available RAM via
   `prismaquant.autoscale`. Checkpoint-level reuse (per-Linear stats
   are pooled across prior shard pickles) means mid-run crashes resume
   cleanly regardless of `LAYERS_PER_SHARD` changes.
2. **Per-(Linear, format) cost measurement** — for each Linear and each
   candidate format, the per-group RTN error weighted by cached input
   activations.
3. **Multi-choice knapsack allocator** — picks one format per Linear
   minimizing total predicted Δloss under the bit budget. Fused-sibling
   groups pre-aggregated into DP items to avoid post-pass overshoot.
   Target 5.5 bpp; achieved 5.500 bpp.
4. **Export** — streams each body / visual / MTP shard, applies GPTQ +
   scale_sweep to its NVFP4 entries, writes the compressed-tensors
   format. `lm_head` passthrough at BF16 enforced at this stage.

Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe +
~15 min cost + ~20 min export. Subsequent iterations at different bpp
targets reuse probe + cost artifacts and complete in minutes.

---

## Known issues / limitations

- **vLLM only at serve time.** No transformers-runtime path for this
  precision mix today.
- **lm_head stays BF16** because vLLM's `ParallelLMHead` does not
  register the NVFP4/MXFP8 compressed-tensors schemes. Allocator
  measured it and would have picked NVFP4; the runtime limitation
  forces BF16. Costs ~770 MB on the disk footprint.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you
  verify against the draft-head acceptance-rate trace.

---

## Links

- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Sibling 35B-A3B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)
- **Sibling 122B-A10B:** [Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm)

## Citation

```bibtex
@software{prismaquant2026,
  title        = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
                  quantization for LLMs},
  author       = {Tand, Rob},
  year         = 2026,
  url          = {https://github.com/RobTand/prismaquant},
}
```