---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- prismaquant
- compressed-tensors
- nvfp4
- mxfp8
- quantized
- multimodal
- vision-language
- mtp
- speculative-decoding
- vllm
- qwen3.6
---

# Qwen3.6-27B — PrismaQuant 5.5 bpp

[![PrismaQuant source](https://img.shields.io/badge/PrismaQuant-GitHub-blue?logo=github)](https://github.com/RobTand/prismaquant)
[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE)
[![vLLM native](https://img.shields.io/badge/vLLM-compressed--tensors-orange)](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

Mixed-precision quantization of `Qwen/Qwen3.6-27B` produced by [**PrismaQuant**](https://github.com/RobTand/prismaquant) — a per-Linear, sensitivity-driven allocator that chooses each Linear module's format individually under a total-bit budget. It uses the same allocator and activation-aware export stack as the 35B-A3B sibling; sibling coupling is pre-aggregated into the DP, so the achieved bpp hits the target exactly (5.500, not 5.28). This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve — see **[Why 5.5 bpp](#why-55-bpp)** below for the full sweep and selection rationale.

---

## At a glance

| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 54 GB | **~19 GB** | **−65 %** |
| Fraction of original weights | 100 % | **35 %** | |
| Average bits per param | 16 | **5.50** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding head | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |

---

## Precision mix

Selected per-Linear by the allocator from measured Fisher sensitivity. On this dense 27B the allocator hit the 5.5 bpp budget exactly:

| Format | W | A | Use | Count (after expansion) |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | **349** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **35** |
| **BF16** | 16-bit | 16-bit | Top-k highest-sensitivity dense Linears (no MoE router in this dense model) + norms + biases + embed / lm_head / pos_embed | **112 (linear) + 352 (layer_passthrough)** |

The allocator **pre-aggregates fused-projection siblings** — `qkv_proj` (q/k/v share one format) and `gate_up_proj` (gate+up share one format) — as single DP items. Previously, sibling coupling was enforced as a post-pass that inflated the achieved bpp by up to 0.5 above target; the new pre-aggregation path collapses each group into one multi-choice item, so the DP's solution is already sibling-consistent.

### Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

1. **GPTQ-OBS one-shot rounding** — block-wise error propagation along the group-quant structure using the calibration Hessian. Closed-form, not iterative.
2. **Closed-form per-group scale sweep** — for each 16-weight NVFP4 group, enumerate `grid=32` candidate scales spanning `[0.5·s₀, 1.5·s₀]`, round each weight to its nearest codebook neighbor at every candidate scale, and pick the (scale, rounding-set) configuration minimizing activation-weighted per-group MSE. Sub-second per Linear; a closed-form analog of Intel's AutoRound (a minimal sketch follows below).
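The scale sweep in step 2 amounts to a brute-force search over a small grid of per-group scales. Below is a minimal NumPy sketch of the idea under stated assumptions — the E2M1 value set is the standard FP4 codebook, but the activation weighting, baseline-scale choice and helper names are illustrative, not PrismaQuant's implementation:

```python
import numpy as np

# Standard FP4 (E2M1) representable magnitudes, mirrored to a signed codebook.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_CODEBOOK = np.concatenate([-FP4_LEVELS[::-1], FP4_LEVELS])

def nearest_codebook(x):
    """Round every entry of x to its nearest FP4 codebook value."""
    idx = np.abs(x[..., None] - FP4_CODEBOOK).argmin(axis=-1)
    return FP4_CODEBOOK[idx]

def sweep_group_scale(w_group, act_weight, grid=32):
    """Closed-form per-group scale sweep (illustrative sketch).

    w_group    : (16,) weights of one NVFP4 group
    act_weight : (16,) activation-derived importance weights (assumption:
                 e.g. mean squared input magnitude feeding these weights)
    Returns the (scale, quantized group) minimizing activation-weighted MSE.
    """
    s0 = np.abs(w_group).max() / FP4_LEVELS.max()   # RTN baseline scale s0
    best_s, best_q, best_err = s0, w_group, np.inf
    for s in np.linspace(0.5 * s0, 1.5 * s0, grid):  # the [0.5*s0, 1.5*s0] grid
        q = nearest_codebook(w_group / s) * s        # re-round at this scale
        err = np.sum(act_weight * (w_group - q) ** 2)  # activation-weighted MSE
        if err < best_err:
            best_s, best_q, best_err = s, q, err
    return best_s, best_q

# Toy usage: one 16-weight group with uniform activation weights.
rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
scale, w_q = sweep_group_scale(w, act_weight=np.ones(16))
print(scale, float(np.mean((w - w_q) ** 2)))
```

The exporter additionally runs the GPTQ-OBS pass before this sweep and stores the per-group scales in FP8; the sketch omits both.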
**Measured per-Linear output-MSE vs RTN baseline (family-level measurement on Qwen3.6-35B-A3B; the same pipeline is applied here):**

| Pipeline variant | out_mse ratio vs RTN |
|---|---:|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| **GPTQ + scale_sweep (this artifact)** | **0.33** |

---

## Why 5.5 bpp

Before quantizing, we ran the allocator across the full target sweep `{4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25}` on the same Fisher-probed + RTN-costed stats this artifact was built from. Thanks to allocator pre-aggregation of fused siblings and convergence-based tightening, every target lands its budget exactly — achieved = target within 0.001 bpp — so the curve below is a true Δloss-vs-bpp trade-off across the Pareto frontier, not an apples-to-oranges approximation.

| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---:|---:|---:|---:|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| **5.5** | **5.500** | **477** | **300 / 30 / 87** | **← this artifact** |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |

(Layer counts are at the un-expanded allocator level — per-Linear expansion inflates each count 1.0-1.4× after broadcasting sibling-group formats to members.)

**Selection rationale.** The Kneedle algorithm (Satopää et al.) places the knee at **5.5 bpp**: on the normalized Δloss-vs-bpp curve, the farthest point below the chord from `(min_bpp, max_Δloss)` to `(max_bpp, min_Δloss)` is target 5.5 (a minimal sketch of the knee computation follows at the end of this section). Reading across the frontier instead of committing to a single anchor like "4.75" or "6" makes the trade-off explicit:

- **Below 5.5** the loss curve steepens: 4.75 bpp saves 14% disk but pays **+48% Δloss**; 4.5 bpp saves 18% and pays **+99%**. A dense 27B can't be NVFP4'd as aggressively as the MoE-A3B sibling, because every body Linear is active for every token — there are no "cheap" low-utilization experts to compress hard.
- **Above 5.5** the loss curve flattens: jumping to 6.0 bpp costs +9% disk for only −18% Δloss — a softer marginal gain than the 5.25→5.5 step just below the knee (+5% size for roughly −10% Δloss).
- **At the knee**, 5.5 bpp sits at the maximum distance from the chord — the point beyond which additional bit budget buys less marginal Δloss reduction than the bits already spent.

PrismaQuant's precision mix at this knee: 300 Linears at NVFP4 (bulk dense MLP + medium-sensitivity attention + visual), 30 at MXFP8 (high-sensitivity dense Linears the allocator won't risk at 4-bit), 87 at BF16 (highest-sensitivity Linears preserved lossless).
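For reference, the knee selection reduces to a few lines: normalize both axes to `[0, 1]` and take the point with the greatest vertical distance below the chord between the curve's endpoints. A minimal sketch using the sweep table above (a plain reimplementation of the Kneedle chord-distance idea, not PrismaQuant's code):

```python
# Knee of the Δloss-vs-bpp frontier via the Kneedle chord-distance idea
# (Satopää et al.), on the sweep table from this card.
bpp   = [4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25]
dloss = [948, 704, 604, 532, 477, 393, 276, 180]

def knee(xs, ys):
    # Normalize both axes to [0, 1].
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]
    ny = [(y - min(ys)) / (max(ys) - min(ys)) for y in ys]
    # After normalization the chord from (min_bpp, max_Δloss) to
    # (max_bpp, min_Δloss) is the line ny = 1 - nx; the knee is the
    # point farthest *below* that chord.
    dist = [(1 - x) - y for x, y in zip(nx, ny)]
    return xs[dist.index(max(dist))]

print(knee(bpp, dloss))  # -> 5.5
```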
---

## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MLP, 64 layers)

- **Full attention** Linears (`q_proj` / `k_proj` / `v_proj` / `o_proj`): qkv siblings share one format per layer (pre-aggregated)
- **DeltaNet linear-attention** Linears (`in_proj_qkv` / `in_proj_z` / `in_proj_a` / `in_proj_b` / `in_proj_ba` / `out_proj`): each Linear's format chosen independently
- **Dense MLP** (`gate_proj` / `up_proj` / `down_proj`): gate+up siblings share one format per layer; down chosen independently

### Multi-token-prediction (MTP) head

- One full-attention + dense-MLP decoder layer at the model tail, quantized by the same per-Linear policy — so `--speculative-config method=mtp` drafts at the same precision profile as the body.

### Visual encoder (27 blocks — Qwen3.6-VL vision tower)

- **Fisher-driven per-Linear allocation:** 108 of 110 visual Linears were placed by the full DP allocator on the basis of per-Linear activation-weighted cost (8 multimodal calibration samples).
- **The remaining 2 un-probed visual Linears** (`patch_embed.proj` edges the probe didn't tap) are stamped NVFP4 uniformly.
- **`model.visual.pos_embed`** stays BF16 — it's a learnable Parameter, not an `nn.Linear`, and vLLM's compressed-tensors loader cannot consume a quantized Parameter layout.

### Passthrough (unquantized)

- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only accepts a single `weight` parameter. The allocator measures lm_head's Fisher sensitivity and would pick NVFP4 for it, but the compressed-tensors runtime rejects a compressed lm_head with `KeyError: lm_head.input_global_scale`. This is a vLLM runtime limitation, not a PrismaQuant design decision.
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed`

---

## Serving (vLLM only)

This artifact is **only** runnable via vLLM's stock `compressed-tensors` support — there is no transformers-native runtime path for mixed NVFP4 + MXFP8 today. vLLM 0.11+ or equivalent is required.

```bash
vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

- **FlashInfer** NVFP4 attention is picked up automatically; set `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` to make the preference explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for this family on DGX Spark (n=2 leaves ~10% tok/s on the table; n=4 regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat API — no special flags.

A full recipe with the flashinfer-cutlass backends, reasoning/tool parsers and chat-template pinning is available at [`spark-vllm-fresh/recipes/qwen3.6-27b.yaml`](https://github.com/RobTand/prismaquant).

---

## Reproducing this artifact

The full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

1. **Sensitivity probe** — streaming per-shard empirical-Fisher trace (diagonal) across body + MTP + visual Linears. Shard granularity and layer-cache budget are auto-derived from available RAM via `prismaquant.autoscale`. Checkpoint-level reuse (per-Linear stats are pooled across prior shard pickles) means mid-run crashes resume cleanly regardless of `LAYERS_PER_SHARD` changes. (A sketch of the diagonal-Fisher accumulation appears after this list.)
2. **Per-(Linear, format) cost measurement** — for each Linear and each candidate format, the per-group RTN error weighted by cached input activations.
3. **Multi-choice knapsack allocator** — picks one format per Linear minimizing total predicted Δloss under the bit budget. Fused-sibling groups are pre-aggregated into DP items to avoid post-pass overshoot. Target 5.5 bpp; achieved 5.500 bpp. (A sketch of the multi-choice knapsack DP also follows this list.)
4. **Export** — streams each body / visual / MTP shard, applies GPTQ + scale_sweep to its NVFP4 entries, and writes the compressed-tensors format. `lm_head` passthrough at BF16 is enforced at this stage.
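Conceptually, the probe in step 1 accumulates squared gradients of a calibration loss over each Linear's weights. A minimal PyTorch sketch of that accumulation, assuming a `transformers`-style model that returns a loss when labels are passed and enough memory to hold the stats in one piece (the real probe streams per-shard stats; the function name and pooling below are illustrative, not the repo's API):

```python
import torch
import torch.nn as nn

@torch.enable_grad()
def diagonal_fisher(model, calib_batches):
    """Diagonal empirical-Fisher estimate per nn.Linear weight.

    calib_batches: iterable of dicts accepted by model(**batch) that include
    labels, so the forward pass returns a causal-LM loss (assumption).
    Returns {linear_name: mean of grad**2 over batches}.
    """
    linears = {n: m for n, m in model.named_modules() if isinstance(m, nn.Linear)}
    fisher = {n: torch.zeros_like(m.weight, dtype=torch.float32)
              for n, m in linears.items()}
    n_batches = 0
    for batch in calib_batches:
        model.zero_grad(set_to_none=True)
        model(**batch).loss.backward()          # calibration loss
        for name, mod in linears.items():
            if mod.weight.grad is not None:
                fisher[name] += mod.weight.grad.detach().float() ** 2
        n_batches += 1
    return {name: f / max(n_batches, 1) for name, f in fisher.items()}

# A per-Linear sensitivity scalar is then the trace of its diagonal Fisher:
# sensitivity = {name: f.sum().item() for name, f in fisher_stats.items()}
```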
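Step 3 is a multiple-choice knapsack: every pre-aggregated item (a Linear or a fused sibling group) must take exactly one format, each format has a bit cost and a predicted Δloss, and the DP minimizes total Δloss under the total-bit budget. A small sketch over a discretized budget follows; the discretization, tie-breaking and item construction are illustrative assumptions, not PrismaQuant's code:

```python
import math

def allocate(items, bit_budget, step=1 << 20):
    """Multiple-choice knapsack by DP over a discretized bit budget.

    items      : list of [(bits, delta_loss), ...], one choice list per
                 pre-aggregated item (a Linear or fused sibling group)
    bit_budget : total weight-bit budget (e.g. target_bpp * n_params)
    step       : budget grid in bits (coarser = faster, less exact)
    Returns (chosen format index per item, total predicted delta_loss).
    """
    cap = bit_budget // step
    INF = math.inf
    dp = [0.0] + [INF] * cap          # dp[b] = min Δloss with b grid units used
    choice = [[None] * (cap + 1) for _ in items]

    for i, options in enumerate(items):
        new = [INF] * (cap + 1)
        for b in range(cap + 1):
            if dp[b] == INF:
                continue
            for k, (bits, dl) in enumerate(options):
                nb = b + (bits + step - 1) // step     # round cost up to grid
                if nb <= cap and dp[b] + dl < new[nb]:
                    new[nb] = dp[b] + dl
                    choice[i][nb] = (k, b)
        dp = new

    best_b = min(range(cap + 1), key=lambda b: dp[b])
    if dp[best_b] == INF:
        raise ValueError("budget too small for the cheapest format mix")
    picks, b = [None] * len(items), best_b    # backtrack one format per item
    for i in range(len(items) - 1, -1, -1):
        k, b_prev = choice[i][b]
        picks[i] = k
        b = b_prev
    return picks, dp[best_b]

# Toy usage: two items, choices = (bits, predicted Δloss) for 4/8/16-bit formats.
toy = [[(4 * 10**6, 9.0), (8 * 10**6, 3.0), (16 * 10**6, 0.0)],
       [(4 * 10**6, 1.0), (8 * 10**6, 0.5), (16 * 10**6, 0.0)]]
print(allocate(toy, bit_budget=14 * 10**6))   # -> ([1, 0], 4.0)
```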
Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe + ~15 min cost + ~20 min export. Subsequent iterations at different bpp targets reuse the probe + cost artifacts and complete in minutes.

---

## Known issues / limitations

- **vLLM only at serve time.** No transformers-runtime path for this precision mix today.
- **lm_head stays BF16** because vLLM's `ParallelLMHead` does not register the NVFP4/MXFP8 compressed-tensors schemes. The allocator measured it and would have picked NVFP4; the runtime limitation forces BF16. This adds ~770 MB to the disk footprint.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you verify against the draft-head acceptance-rate trace.

---

## Links

- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Sibling 35B-A3B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)
- **Sibling 122B-A10B:** [Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm)

## Citation

```bibtex
@software{prismaquant2026,
  title  = {PrismaQuant: per-Linear sensitivity-driven mixed-precision quantization for LLMs},
  author = {Tand, Rob},
  year   = 2026,
  url    = {https://github.com/RobTand/prismaquant},
}
```