---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- prismaquant
- compressed-tensors
- nvfp4
- mxfp8
- quantized
- multimodal
- vision-language
- mtp
- speculative-decoding
- vllm
- qwen3.6
---
# Qwen3.6-27B — PrismaQuant 5.5 bpp
[PrismaQuant on GitHub](https://github.com/RobTand/prismaquant) ·
[Base model license](https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE) ·
[vLLM compressed-tensors docs](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)
Mixed-precision quantization of `Qwen/Qwen3.6-27B` produced by
[**PrismaQuant**](https://github.com/RobTand/prismaquant) — a per-Linear
sensitivity-driven allocator that chooses each Linear module's format
individually under a total-bit budget. It uses the same allocator and
activation-aware export stack as the 35B-A3B sibling. Sibling coupling is
pre-aggregated into the DP, so the achieved bpp hits the target exactly
(5.500 not 5.28).
This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve —
see **[Why 5.5 bpp](#why-55-bpp)** below for the full sweep and
selection rationale.
---
## At a glance
| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 54 GB | **~19 GB** | **−65 %** |
| Fraction of original weights | 100 % | **35 %** | |
| Average bits per param | 16 | **5.50** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding head | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |
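
The size figures in the table follow from the average bit width alone. The sketch below redoes that arithmetic; `PARAMS` is an assumption (a nominal 27e9 read off the model name, not a measured parameter count), so the results only approximate the rounded values above.

```python
# Back-of-envelope check of the "At a glance" numbers.
# PARAMS is an assumed nominal count taken from the model name.
PARAMS = 27e9


def size_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Weight storage in decimal gigabytes for an average bit width."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = size_gb(16.0)        # BF16 source
quant_gb = size_gb(5.5)        # this artifact's average bits per param
fraction = quant_gb / bf16_gb  # share of the original footprint

print(f"BF16 source : {bf16_gb:.1f} GB")
print(f"5.5 bpp     : {quant_gb:.1f} GB")
print(f"fraction    : {fraction:.1%}")
```

The small gap to the table's ~19 GB and 35 % comes from rounding and from the nominal parameter count.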
---
## Precision mix
Selected per-Linear by the allocator from measured Fisher sensitivity.
On this dense 27B the allocator hit the 5.5 bpp budget exactly:
| Format | W | A | Use | Count (after expansion) |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | **349** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **35** |
| **BF16** | 16-bit | 16-bit | Highest-sensitivity dense Linears + norms + biases + embed / lm_head / pos_embed | **112 (linear) + 352 (layer_passthrough)** |
The allocator **pre-aggregates fused-projection siblings** as single DP
items: `qkv_proj` (q/k/v share one format) and `gate_up_proj` (gate+up
share one format). Previously, sibling coupling was enforced as a
post-pass that inflated the achieved bpp by up to 0.5 above target. The
new pre-aggregation path collapses each group into one multi-choice item,
so the DP's solution is already sibling-consistent.
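
A minimal sketch of the idea, not PrismaQuant's actual code: sibling Linears are merged into one item whose size and per-format cost are the sums over members, and a multi-choice knapsack then picks one format per item. The function names, the toy parameter counts and the toy costs are all hypothetical. The per-weight bit widths are derived from the table above (4 + 8/16 for NVFP4, 8 + 8/32 for MXFP8), ignoring the per-tensor global scale.

```python
# Bits per weight including per-group scale overhead, in quarter-bits so
# the arithmetic stays integral: NVFP4 = 4.5, MXFP8 = 8.25, BF16 = 16.
QBITS = {"NVFP4": 18, "MXFP8": 33, "BF16": 64}


def aggregate(linears, groups):
    """Collapse each sibling group into one item; sizes and costs add up."""
    member_to_group = {m: g for g, members in groups.items() for m in members}
    items = {}
    for name, (params, costs) in linears.items():
        key = member_to_group.get(name, name)
        p, c = items.get(key, (0, {f: 0.0 for f in QBITS}))
        items[key] = (p + params, {f: c[f] + costs[f] for f in QBITS})
    return items


def allocate(items, target_bpp):
    """Multi-choice knapsack: one format per item, minimal total cost."""
    total_params = sum(p for p, _ in items.values())
    budget = int(target_bpp * 4 * total_params)
    states = {0: (0.0, {})}  # used quarter-bits -> (cost, assignment)
    for name, (params, costs) in items.items():
        nxt = {}
        for used, (cost, assign) in states.items():
            for fmt, qb in QBITS.items():
                u = used + qb * params
                if u > budget:
                    continue  # this choice would overshoot the budget
                cand = cost + costs[fmt]
                if u not in nxt or cand < nxt[u][0]:
                    nxt[u] = (cand, {**assign, name: fmt})
        states = nxt
    used, (cost, assign) = min(states.items(), key=lambda kv: kv[1][0])
    return assign, cost, used / 4 / total_params


# Toy inputs: name -> (parameter count, predicted cost per format).
linears = {
    "q_proj": (4, {"NVFP4": 9.0, "MXFP8": 2.0, "BF16": 0.0}),
    "k_proj": (1, {"NVFP4": 1.0, "MXFP8": 0.3, "BF16": 0.0}),
    "v_proj": (1, {"NVFP4": 3.0, "MXFP8": 0.5, "BF16": 0.0}),
    "down_proj": (6, {"NVFP4": 2.0, "MXFP8": 0.4, "BF16": 0.0}),
}
groups = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}

items = aggregate(linears, groups)
assign, cost, bpp = allocate(items, target_bpp=6.5)
print(assign, round(cost, 3), bpp)
```

Because q/k/v enter the knapsack as one item, no solution can give them different formats, so no corrective post-pass is needed and the budget check already accounts for the whole group.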
### Activation-aware passes applied during export
On every NVFP4 weight the exporter runs, in order:
1. **GPTQ-OBS one-shot rounding** — block-wise error propagation along
the group-quant structure using the calibration Hessian. Closed-form,
not iterative.
2. **Closed-form per-group scale sweep** — for each 16-weight NVFP4
group, enumerate `grid=32` candidate scales spanning
`[0.5·s₀, 1.5·s₀]`, round each weight to its nearest codebook
neighbor at every candidate scale, pick the (scale, rounding-set)
configuration minimizing activation-weighted per-group MSE. Sub-second
per Linear. Closed-form analog of Intel's AutoRound.
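
The per-group scale sweep in step 2 can be sketched in plain Python. This is an illustration under stated assumptions rather than the exporter's code: `act_weight` stands in for a per-input-channel activation statistic (for example a mean squared activation), the group scale is kept in full precision instead of being rounded to FP8, and all function names are hypothetical.

```python
import random

# Positive FP4 (E2M1) magnitudes; the signed codebook mirrors them.
FP4_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
CODEBOOK = sorted({s * v for v in FP4_LEVELS for s in (-1.0, 1.0)})


def quantize(weights, scale):
    """Round each weight to the nearest codebook value at this scale."""
    return [min(CODEBOOK, key=lambda c: abs(w / scale - c)) for w in weights]


def weighted_mse(weights, codes, scale, act_weight):
    """Per-group squared error, weighted per input channel."""
    return sum(a * (w - scale * c) ** 2
               for w, c, a in zip(weights, codes, act_weight)) / len(weights)


def sweep_scale(weights, act_weight, grid=32):
    """Try `grid` scales in [0.5*s0, 1.5*s0]; keep the lowest-error one."""
    s0 = max(abs(w) for w in weights) / FP4_LEVELS[-1] or 1.0
    best = None
    for i in range(grid):
        scale = s0 * (0.5 + i / (grid - 1))
        codes = quantize(weights, scale)
        err = weighted_mse(weights, codes, scale, act_weight)
        if best is None or err < best[0]:
            best = (err, scale, codes)
    return best


# Toy 16-weight group with made-up activation weights.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(16)]
act_weight = [random.uniform(0.1, 2.0) for _ in range(16)]

err, scale, codes = sweep_scale(weights, act_weight)
s0 = max(abs(w) for w in weights) / 6.0
print(f"s0={s0:.6f}  chosen scale={scale:.6f}  weighted MSE={err:.3e}")
```

Each candidate scale is evaluated independently, so the sweep needs no iteration or gradients: 32 quantize-and-score passes per group.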
**Measured per-Linear output-MSE vs RTN baseline (family-level
measurement on Qwen3.6-35B-A3B; same pipeline applied here):**
| Pipeline variant | out_mse ratio vs RTN |
|---|---:|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| **GPTQ + scale_sweep (this artifact)** | **0.33** |
---
## Why 5.5 bpp
Before quantizing, we ran the allocator across the full target sweep
`{4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25}` on the same Fisher-probed,
RTN-costed stats this artifact was built from. Because the allocator
pre-aggregates fused siblings and tightens to convergence, every target
lands on its budget (achieved = target within 0.001 bpp), so the curve
below is a true Δloss-vs-bpp trade-off along the Pareto frontier rather
than a comparison between mismatched budgets.
| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---:|---:|---:|---:|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| **5.5** | **5.500** | **477** | **300 / 30 / 87** | **← this artifact** |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |
(Layer counts are at the un-expanded allocator level — per-Linear
expansion inflates each count 1.0-1.4× after broadcasting sibling-group
formats to members.)
**Selection rationale.** The Kneedle algorithm (Satopää et al.) places
the knee at **5.5 bpp**: on the normalized Δloss-vs-bpp curve, the
farthest point below the chord from `(min_bpp, max_Δloss)` to
`(max_bpp, min_Δloss)` is target 5.5. Reading across the frontier
instead of committing to a single anchor like "4.75" or "6" makes the
trade-off explicit:
- **Below 5.5** the loss curve steepens: 4.75 bpp saves 14% disk but
pays **+48% Δloss**; 4.5 bpp saves 18% and pays **+99%**. Dense 27B
can't be aggressively NVFP4'd the way MoE-A3B can, because every
body Linear is active for every token — there are no "cheap"
low-utilization experts to compress hard.
- **Above 5.5** the loss curve flattens: jumping to 6.0 bpp costs
+9% disk for only −18% Δloss. Per bit that is about 168 Δloss per bpp
(477→393 over 0.5 bpp), a softer marginal gain than the 5.25→5.5
step just below the knee (532→477 over 0.25 bpp, about 220 per bpp).
- **At the knee**, 5.5 bpp strikes the maximum distance from the
chord — the point where further bit-budget buys less marginal
Δloss reduction than the bits already spent.
PrismaQuant's precision mix at this knee, in un-expanded allocator
counts: 300 Linears at NVFP4 (bulk dense MLP + medium-sensitivity
attention + visual), 30 at MXFP8 (high-sensitivity dense Linears the
allocator won't risk at 4-bit), 87 at BF16 (highest-sensitivity Linears
preserved lossless). The Precision mix table lists the same mix after
expansion (349 / 35 / 112).
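
The knee placement can be checked directly from the sweep table. The sketch below implements only the core step of the selection rationale, normalizing both axes and measuring how far each point sits below the chord. It omits the smoothing and thresholding of the full Kneedle algorithm, and the function name is hypothetical.

```python
# (target bpp, predicted Δloss) pairs copied from the sweep table.
SWEEP = [(4.5, 948), (4.75, 704), (5.0, 604), (5.25, 532),
         (5.5, 477), (6.0, 393), (7.0, 276), (8.25, 180)]


def knee(points):
    """Return the bpp farthest below the chord on the normalized curve."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    gaps = {}
    for x, y in points:
        xn = (x - x_lo) / (x_hi - x_lo)
        yn = (y - y_lo) / (y_hi - y_lo)
        # After normalization the chord runs from (0, 1) to (1, 0),
        # i.e. y = 1 - x, so the vertical gap below it is (1 - xn) - yn.
        gaps[x] = (1.0 - xn) - yn
    return max(gaps, key=gaps.get), gaps


best_bpp, gaps = knee(SWEEP)
for bpp, gap in gaps.items():
    print(f"{bpp:>5}: {gap:.4f}")
print("knee at", best_bpp)
```

The vertical gap differs from the perpendicular distance to the chord only by a constant factor, so both pick the same point. The margin over the 5.25 neighbor is small (about 0.347 against 0.342), which is worth keeping in mind when reading the knee as a recommendation.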
---
## Which layers are quantized
### Text body (DeltaNet linear-attention + dense MLP, 64 layers)
- **Full attention** Linears (`q_proj` / `k_proj` / `v_proj` / `o_proj`):
qkv siblings share one format per layer (pre-aggregated)
- **DeltaNet linear-attention** Linears (`in_proj_qkv` / `in_proj_z` /
`in_proj_a` / `in_proj_b` / `in_proj_ba` / `out_proj`): each Linear's
format chosen independently
- **Dense MLP** (`gate_proj` / `up_proj` / `down_proj`): gate+up
siblings share one format per layer; down chosen independently
### Multi-token-prediction (MTP) head
- One full-attention + dense-MLP decoder layer at the model tail,
quantized by the same per-Linear policy — so
`--speculative-config method=mtp` drafts at the same precision
profile as the body.
### Visual encoder (27 blocks — Qwen3.6-VL vision tower)
- **Fisher-driven per-Linear allocation:** 108 of 110 visual Linears
got placed by the full DP allocator on the basis of per-Linear
activation-weighted cost (8 multimodal calibration samples).
- **Remaining 2 un-probed visual Linears** (`patch_embed.proj` edges
the probe didn't tap) stamped at NVFP4 uniformly.
- **`model.visual.pos_embed`** stays BF16 — it's a learnable Parameter,
not an `nn.Linear`, and vLLM's compressed-tensors loader cannot
consume a quantized Parameter layout.
### Passthrough (unquantized)
- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only
accepts a single `weight` parameter. The allocator measures
lm_head's Fisher sensitivity and would pick NVFP4 for it, but the
compressed-tensors runtime rejects a compressed lm_head with
`KeyError: lm_head.input_global_scale`. This is a vLLM runtime
limitation, not a PrismaQuant design decision.
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed`
---
## Serving (vLLM only)
This artifact is **only** runnable via vLLM's stock `compressed-tensors`
support — there is no transformers-native runtime path for mixed NVFP4 +
MXFP8 today. vLLM 0.11+ or equivalent is required.
```bash
vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
--trust-remote-code \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
- **FlashInfer** NVFP4 GEMM kernels are picked up automatically; set
`VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` to make the preference
explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for
this family on DGX Spark (n=2 leaves ~10% tok/s on the table, n=4
regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat
API — no special flags.
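
As an illustration of the visual-input path, the sketch below builds an image-plus-text request for the OpenAI-compatible chat endpoint that `vllm serve` exposes. It assumes the server runs on vLLM's default port 8000; the image URL is a placeholder and the helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm"


def build_request(image_url, question, base_url="http://localhost:8000"):
    """Build a chat request with one image part and one text part."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder image URL; substitute one the server can reach.
req = build_request("https://example.com/cat.jpg",
                    "Describe this image in one sentence.")
print(req.full_url)
# reply = ask(req)  # uncomment once the server is running
```

Only `build_request` runs here; `ask` needs a live server, so the call is left commented out.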
A full recipe with the flashinfer-cutlass backends, reasoning/tool
parsers and chat-template pinning is available at
[`spark-vllm-fresh/recipes/qwen3.6-27b.yaml`](https://github.com/RobTand/prismaquant).
---
## Reproducing this artifact
Full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):
1. **Sensitivity probe** — streaming per-shard empirical-Fisher trace
(diagonal) across body + MTP + visual Linears. Shard granularity
and layer-cache budget are auto-derived from available RAM via
`prismaquant.autoscale`. Checkpoint-level reuse (per-Linear stats
are pooled across prior shard pickles) means mid-run crashes resume
cleanly regardless of `LAYERS_PER_SHARD` changes.
2. **Per-(Linear, format) cost measurement** — for each Linear and each
candidate format, the per-group RTN error weighted by cached input
activations.
3. **Multi-choice knapsack allocator** — picks one format per Linear
minimizing total predicted Δloss under the bit budget. Fused-sibling
groups pre-aggregated into DP items to avoid post-pass overshoot.
Target 5.5 bpp; achieved 5.500 bpp.
4. **Export** — streams each body / visual / MTP shard, applies GPTQ +
scale_sweep to its NVFP4 entries, writes the compressed-tensors
format. `lm_head` passthrough at BF16 enforced at this stage.
Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe +
~15 min cost + ~20 min export. Subsequent iterations at different bpp
targets reuse probe + cost artifacts and complete in minutes.
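
For intuition on step 1, here is one common way an empirical-Fisher trace can be accumulated for a single Linear `y = W x`. It is a sketch, not the repository's probe. It assumes the gradient of the loss with respect to the layer output (`grad_out`) is available per sample, and the function names and toy data are hypothetical. Since `dL/dW = outer(grad_out, x)`, the sum of squared weight gradients factorizes into two squared norms, so the full gradient matrix never has to be materialized.

```python
def fisher_trace(samples):
    """Mean over samples of sum_ij (dL/dW_ij)^2, using the factorization."""
    total = 0.0
    for x, grad_out in samples:
        # Squared entries of outer(grad_out, x) sum to |grad_out|^2 * |x|^2.
        total += sum(g * g for g in grad_out) * sum(v * v for v in x)
    return total / len(samples)


def fisher_trace_explicit(samples):
    """Same quantity, computed entry by entry as a cross-check."""
    total = 0.0
    for x, grad_out in samples:
        total += sum((g * v) ** 2 for g in grad_out for v in x)
    return total / len(samples)


# Toy samples: (layer input x, gradient of the loss w.r.t. layer output).
samples = [
    ([1.0, 2.0], [0.5, -0.5, 1.0]),
    ([0.0, -1.0], [2.0, 0.0, 1.0]),
]

print(fisher_trace(samples), fisher_trace_explicit(samples))
```

The factorized form costs one pass over the inputs and one over the output gradients per sample, which is what makes a streaming, per-shard probe practical on large layers.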
---
## Known issues / limitations
- **vLLM only at serve time.** No transformers-runtime path for this
precision mix today.
- **lm_head stays BF16** because vLLM's `ParallelLMHead` does not
register the NVFP4/MXFP8 compressed-tensors schemes. Allocator
measured it and would have picked NVFP4; the runtime limitation
forces BF16. Costs ~770 MB on the disk footprint.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you
verify against the draft-head acceptance-rate trace.
---
## Links
- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Sibling 35B-A3B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)
- **Sibling 122B-A10B:** [Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm)
## Citation
```bibtex
@software{prismaquant2026,
title = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
quantization for LLMs},
author = {Tand, Rob},
year = 2026,
url = {https://github.com/RobTand/prismaquant},
}
```