	---
	license: apache-2.0
	base_model: Qwen/Qwen3.6-27B
	base_model_relation: quantized
	library_name: transformers
	pipeline_tag: image-text-to-text
	language:
	- en
	- zh
	tags:
	- prismaquant
	- compressed-tensors
	- nvfp4
	- mxfp8
	- quantized
	- multimodal
	- vision-language
	- mtp
	- speculative-decoding
	- vllm
	- qwen3.6
	---

	# Qwen3.6-27B — PrismaQuant 5.5 bpp

	[![PrismaQuant source](https://img.shields.io/badge/PrismaQuant-GitHub-blue?logo=github)](https://github.com/RobTand/prismaquant)
	[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE)
	[![vLLM native](https://img.shields.io/badge/vLLM-compressed--tensors-orange)](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

	Mixed-precision quantization of `Qwen/Qwen3.6-27B` produced by
	[PrismaQuant](https://github.com/RobTand/prismaquant), a per-Linear
	sensitivity-driven allocator that chooses each Linear module's format
	individually under a total-bit budget. It uses the same allocator and
	activation-aware export stack as the 35B-A3B sibling. Sibling coupling is
	pre-aggregated into the allocator's dynamic program (DP), so the achieved
	bits per parameter (bpp) hit the target exactly (5.500, not 5.28).

	This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve —
	see [Why 5.5 bpp](#why-55-bpp) below for the full sweep and
	selection rationale.

	---

	## At a glance

	| Metric | BF16 source | This artifact | Delta |
	|---|---:|---:|---:|
	| Size on disk | 54 GB | ~19 GB | −65 % |
	| Size relative to BF16 source | 100 % | 35 % | |
	| Average bits per param | 16 | 5.50 | |
	| Multimodal (vision + text) | ✓ | ✓ | |
	| MTP speculative decoding head | ✓ | ✓ | |
	| Loads in vLLM (stock `compressed-tensors`) | ✓ | ✓ | |
	| Runtime backend | any | vLLM only | |

	---

	## Precision mix

	Selected per-Linear by the allocator from measured Fisher sensitivity.
	On this dense 27B the allocator hit the 5.5 bpp budget exactly:

	| Format | Weights | Activations | Use | Count (after expansion) |
	|---|---|---|---|---:|
	| NVFP4 | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global scale) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | 349 |
	| MXFP8 | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | 35 |
	| BF16 | 16-bit | 16-bit | Highest-sensitivity dense Linears + norms + biases + embed / lm_head / pos_embed | 112 (linear) + 352 (layer_passthrough) |
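
As a rough check on the formats in the table above, the storage cost per weight follows from the element width plus the per-group scale amortized over the group. The sketch below assumes the per-tensor global scale is negligible; `effective_bits` is an illustrative helper, not part of PrismaQuant.

```python
def effective_bits(element_bits: int, scale_bits: int, group_size: int) -> float:
    """Bits stored per weight: element bits plus the group scale spread over the group."""
    return element_bits + scale_bits / group_size

# NVFP4: 4-bit elements, one 8-bit FP8 scale per 16 weights.
nvfp4_bits = effective_bits(4, 8, 16)
# MXFP8: 8-bit elements, one 8-bit E8M0 scale per 32 weights.
mxfp8_bits = effective_bits(8, 8, 32)
# BF16 has no group scale.
bf16_bits = 16.0

print(nvfp4_bits, mxfp8_bits, bf16_bits)  # 4.5 8.25 16.0
```

Under these assumptions, 4.5 and 8.25 bits per weight would be consistent with the lowest and highest targets of the sweep listed under "Why 5.5 bpp".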

	The allocator pre-aggregates fused-projection siblings, `qkv_proj`
	(q/k/v share one format) and `gate_up_proj` (gate+up share one format),
	as single DP items. Previously, sibling coupling was enforced as a
	post-pass that inflated the achieved bpp by up to 0.5 bpp above target;
	the new pre-aggregation path collapses each group into one multi-choice
	item, so the DP's solution is already sibling-consistent.
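
The pre-aggregation idea can be sketched as follows. This is a minimal illustration, not PrismaQuant's code: the function name, the `(bits, predicted loss increase)` option layout, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-Linear options: format name -> (storage bits, predicted loss increase).
Options = Dict[str, Tuple[int, float]]

def aggregate_siblings(members: List[Options]) -> Options:
    """Collapse sibling Linears into one item whose formats apply to every member."""
    shared = set(members[0])
    for m in members[1:]:
        shared &= set(m)
    return {
        fmt: (sum(m[fmt][0] for m in members), sum(m[fmt][1] for m in members))
        for fmt in sorted(shared)
    }

# Toy numbers, not measurements from this model.
q = {"nvfp4": (45, 3.0), "mxfp8": (82, 1.0), "bf16": (160, 0.0)}
k = {"nvfp4": (9, 0.5), "mxfp8": (16, 0.25), "bf16": (32, 0.0)}
v = {"nvfp4": (9, 2.0), "mxfp8": (16, 0.5), "bf16": (32, 0.0)}

qkv_item = aggregate_siblings([q, k, v])
print(qkv_item["nvfp4"])  # (63, 5.5)
```

Because the fused group enters the allocator as one item with summed bits and summed cost, any format the allocator picks is automatically shared by q, k, and v, and no post-pass is needed to reconcile them.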

	### Activation-aware passes applied during export

	On every NVFP4 weight the exporter runs, in order:

	1. GPTQ-OBS one-shot rounding — block-wise error propagation along
	the group-quant structure using the calibration Hessian. Closed-form,
	not iterative.
	2. Closed-form per-group scale sweep — for each 16-weight NVFP4
	group, enumerate `grid=32` candidate scales spanning
	`[0.5·s₀, 1.5·s₀]`, round each weight to its nearest codebook
	neighbor at every candidate scale, pick the (scale, rounding-set)
	configuration minimizing activation-weighted per-group MSE. Sub-second
	per Linear. Closed-form analog of Intel's AutoRound.
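
A minimal sketch of the per-group scale sweep in step 2, written in plain Python for a single 16-weight group. Several things here are assumptions rather than the exporter's actual behavior: the function names are hypothetical, `act_weight` stands in for a per-input-channel activation statistic, the scale is kept as a float instead of being quantized to FP8, and the preceding GPTQ pass is not modeled.

```python
# Magnitudes representable in FP4 E2M1, the NVFP4 element format.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_fp4(x: float) -> float:
    """Round x to the nearest signed FP4 codebook value, saturating at +/-6."""
    mag = min(FP4_MAGNITUDES, key=lambda c: abs(abs(x) - c))
    return mag if x >= 0 else -mag

def group_error(weights, act_weight, scale):
    """Activation-weighted squared error of quantizing one group at `scale`."""
    return sum(
        a * (w - scale * quantize_fp4(w / scale)) ** 2
        for w, a in zip(weights, act_weight)
    )

def sweep_scale(weights, act_weight, grid=32):
    """Try `grid` scales in [0.5*s0, 1.5*s0]; return (best_scale, best_error)."""
    s0 = max(abs(w) for w in weights) / FP4_MAGNITUDES[-1]  # RTN scale: max maps to 6
    if s0 == 0.0:
        return 0.0, 0.0  # an all-zero group quantizes exactly
    candidates = [s0 * (0.5 + i / (grid - 1)) for i in range(grid)]
    return min(
        ((s, group_error(weights, act_weight, s)) for s in candidates),
        key=lambda pair: pair[1],
    )

# Toy group of 16 weights with made-up activation weights.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.09, 0.28,
           0.15, -0.36, 0.02, 0.19, -0.41, 0.31, -0.05, 0.25]
act_weight = [1.0, 4.0, 1.0, 0.5] * 4

best_scale, best_err = sweep_scale(weights, act_weight)
print(best_scale, best_err)
```

With nearest-neighbor rounding, the rounding set is fully determined by the candidate scale, so the search reduces to one error evaluation per candidate.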

	**Measured per-Linear output-MSE vs RTN baseline (family-level
	measurement on Qwen3.6-35B-A3B; same pipeline applied here):**

	| Pipeline variant | out_mse ratio vs RTN |
	|---|---:|
	| RTN (no passes) | 1.00 |
	| GPTQ only | 0.41 |
	| GPTQ + scale_sweep (this artifact) | 0.33 |

	---

	## Why 5.5 bpp

	Before quantizing we ran the allocator across the full target sweep
	`{4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25}` on the same Fisher-
	probed + RTN-costed stats this artifact was built from. Thanks to
	allocator pre-aggregation of fused siblings + convergence-based
	tightening, every target lands its budget exactly — achieved = target
	within 0.001 bpp — so the curve below is a true Δloss-vs-bpp trade-off
	across the Pareto frontier, not an apples-to-oranges approximation.

	| Target bpp | Achieved bpp | Predicted Δloss | Linears at NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
	|---:|---:|---:|---:|---|
	| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
	| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
	| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
	| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
	| 5.5 | 5.500 | 477 | 300 / 30 / 87 | ← this artifact |
	| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
	| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
	| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |

	(Layer counts are at the un-expanded allocator level — per-Linear
	expansion inflates each count 1.0-1.4× after broadcasting sibling-group
	formats to members.)

	Selection rationale. The Kneedle algorithm (Satopää et al.) places
	the knee at 5.5 bpp: on the normalized Δloss-vs-bpp curve, the
	farthest point below the chord from `(min_bpp, max_Δloss)` to
	`(max_bpp, min_Δloss)` is target 5.5. Reading across the frontier
	instead of committing to a single anchor like "4.75" or "6" makes the
	trade-off explicit:

	- Below 5.5 the loss curve steepens: 4.75 bpp saves 14% disk but
	pays +48% Δloss; 4.5 bpp saves 18% and pays +99%. Dense 27B
	can't be aggressively NVFP4'd the way MoE-A3B can, because every
	body Linear is active for every token — there are no "cheap"
	low-utilization experts to compress hard.
	- Above 5.5 the loss curve flattens: jumping to 6.0 bpp costs
	+9% disk for only −18% Δloss (about 168 Δloss per bpp), a softer
	marginal gain than the 5.25→5.5 step, where about 5% more disk cuts
	Δloss from 532 to 477 (about 220 Δloss per bpp).
	- At the knee, 5.5 bpp strikes the maximum distance from the
	chord — the point where further bit-budget buys less marginal
	Δloss reduction than the bits already spent.
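
The chord-distance criterion can be checked directly against the sweep table above. The sketch below is a simplified form of Kneedle: it skips the smoothing and sensitivity threshold of the full algorithm and measures the vertical gap below the chord, which on a normalized chord of slope −1 differs from perpendicular distance only by a constant factor. The function name is illustrative.

```python
# (target bpp, predicted delta-loss) pairs copied from the sweep table.
SWEEP = [
    (4.5, 948), (4.75, 704), (5.0, 604), (5.25, 532),
    (5.5, 477), (6.0, 393), (7.0, 276), (8.25, 180),
]

def knee_by_chord_distance(points):
    """Return (x, gap) for the normalized point farthest below the endpoint chord."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    best_x, best_gap = None, float("-inf")
    for x, y in points:
        xn = (x - x_min) / (x_max - x_min)
        yn = (y - y_min) / (y_max - y_min)
        gap = (1.0 - xn) - yn  # chord runs from (0, 1) to (1, 0)
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x, best_gap

knee_bpp, gap = knee_by_chord_distance(SWEEP)
print(knee_bpp)  # 5.5
```

On these numbers the gap at 5.5 bpp is only slightly larger than at 5.25 bpp, so the knee is shallow rather than sharp.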

	PrismaQuant's precision mix at this knee: 300 Linears at NVFP4 (bulk
	dense MLP + medium-sensitivity attention + visual), 30 at MXFP8 (high-
	sensitivity dense Linears the allocator won't risk at 4-bit), 87 at
	BF16 (highest-sensitivity Linears preserved lossless).

	---

	## Which layers are quantized

	### Text body (DeltaNet linear-attention + dense MLP, 64 layers)

	- Full attention Linears (`q_proj` / `k_proj` / `v_proj` / `o_proj`):
	qkv siblings share one format per layer (pre-aggregated)
	- DeltaNet linear-attention Linears (`in_proj_qkv` / `in_proj_z` /
	`in_proj_a` / `in_proj_b` / `in_proj_ba` / `out_proj`): each Linear's
	format chosen independently
	- Dense MLP (`gate_proj` / `up_proj` / `down_proj`): gate+up
	siblings share one format per layer; down chosen independently

	### Multi-token-prediction (MTP) head

	- One full-attention + dense-MLP decoder layer at the model tail,
	quantized by the same per-Linear policy — so
	`--speculative-config method=mtp` drafts at the same precision
	profile as the body.

	### Visual encoder (27 blocks — Qwen3.6-VL vision tower)

	- Fisher-driven per-Linear allocation: 108 of 110 visual Linears
	got placed by the full DP allocator on the basis of per-Linear
	activation-weighted cost (8 multimodal calibration samples).
	- Remaining 2 un-probed visual Linears (`patch_embed.proj` edges
	the probe didn't tap) stamped at NVFP4 uniformly.
	- `model.visual.pos_embed` stays BF16 — it's a learnable Parameter,
	not an `nn.Linear`, and vLLM's compressed-tensors loader cannot
	consume a quantized Parameter layout.

	### Passthrough (unquantized)

	- `lm_head` — kept at BF16 because vLLM's `ParallelLMHead` module only
	accepts a single `weight` parameter. The allocator measures
	lm_head's Fisher sensitivity and would pick NVFP4 for it, but the
	compressed-tensors runtime rejects a compressed lm_head with
	`KeyError: lm_head.input_global_scale`. This is a vLLM runtime
	limitation, not a PrismaQuant design decision.
	- RMSNorm weights (all layers + MTP + visual)
	- All biases
	- `embed_tokens`
	- `model.visual.pos_embed`

	---

	## Serving (vLLM only)

	This artifact is only runnable via vLLM's stock `compressed-tensors`
	support — there is no transformers-native runtime path for mixed NVFP4 +
	MXFP8 today. vLLM 0.11+ or equivalent is required.

	```bash
	vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
	  --trust-remote-code \
	  --max-model-len 32768 \
	  --gpu-memory-utilization 0.90 \
	  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
	```

	- FlashInfer's NVFP4 GEMM kernels are picked up automatically; set
	`VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` to make the preference
	explicit.
	- MTP speculative decoding at `n=3` is the measured optimum for
	this family on DGX Spark (n=2 leaves ~10% tok/s on the table, n=4
	regresses).
	- Visual inputs work via vLLM's standard `image-text-to-text` chat
	API — no special flags.
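
For illustration, a request with an image can be sent to the running server through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on the default `http://localhost:8000`; the helper names and the image URL are placeholders.

```python
import json
import urllib.request

MODEL_ID = "rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm"

def build_chat_payload(prompt: str, image_url: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat request with one image part and one text part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to a running server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_chat_payload(
    "Describe this image in one sentence.",
    "https://example.com/photo.jpg",  # placeholder URL, replace with a real image
)
# print(chat(payload))  # needs the vllm serve command above to be running
```

The network call is left commented out so the snippet can be run without a server; uncomment it once `vllm serve` is up.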

	A full recipe with the flashinfer-cutlass backends, reasoning/tool
	parsers and chat-template pinning is available at
	[`spark-vllm-fresh/recipes/qwen3.6-27b.yaml`](https://github.com/RobTand/prismaquant).

	---

	## Reproducing this artifact

	Full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

	1. Sensitivity probe — streaming per-shard empirical-Fisher trace
	(diagonal) across body + MTP + visual Linears. Shard granularity
	and layer-cache budget are auto-derived from available RAM via
	`prismaquant.autoscale`. Checkpoint-level reuse (per-Linear stats
	are pooled across prior shard pickles) means mid-run crashes resume
	cleanly regardless of `LAYERS_PER_SHARD` changes.
	2. Per-(Linear, format) cost measurement — for each Linear and each
	candidate format, the per-group RTN error weighted by cached input
	activations.
	3. Multi-choice knapsack allocator — picks one format per Linear
	minimizing total predicted Δloss under the bit budget. Fused-sibling
	groups pre-aggregated into DP items to avoid post-pass overshoot.
	Target 5.5 bpp; achieved 5.500 bpp.
	4. Export — streams each body / visual / MTP shard, applies GPTQ +
	scale_sweep to its NVFP4 entries, writes the compressed-tensors
	format. `lm_head` passthrough at BF16 enforced at this stage.
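
Step 3 above is a multi-choice knapsack, which can be sketched as a small dynamic program. This is an illustrative toy, not the repo's allocator: it assumes integer bit costs, keeps one table entry per exact bit total, and uses invented items. A production allocator would likely bucket the bit axis to keep the table small.

```python
def allocate(items, budget):
    """Multi-choice knapsack: pick one (bits, cost) option per item, minimizing cost.

    items  : list of option lists, each option a (bits, cost) tuple with integer bits
    budget : total integer bit budget
    Returns (total_cost, chosen option index per item), or None if infeasible.
    """
    inf = float("inf")
    # best[b] = (cost, picks) for the items seen so far using exactly b bits
    best = {0: (0.0, [])}
    for options in items:
        nxt = {}
        for used, (cost, picks) in best.items():
            for idx, (bits, delta) in enumerate(options):
                b = used + bits
                if b > budget:
                    continue
                cand = cost + delta
                if cand < nxt.get(b, (inf, None))[0]:
                    nxt[b] = (cand, picks + [idx])
        best = nxt
        if not best:
            return None
    return min(best.values(), key=lambda entry: entry[0])

# Toy items; option order is NVFP4, MXFP8, BF16. Numbers are invented.
items = [
    [(45, 8.0), (82, 2.0), (160, 0.0)],   # sensitive Linear
    [(45, 1.0), (82, 0.5), (160, 0.0)],   # robust Linear
    [(90, 3.0), (164, 1.5), (320, 0.0)],  # larger fused group
]

total_cost, picks = allocate(items, 300)
print(total_cost, picks)  # 4.0 [2, 0, 0]
```

In this toy run the sensitive Linear stays at BF16 while the other two drop to NVFP4, which mirrors the intent of spending bits where predicted Δloss is highest.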

	Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe +
	~15 min cost + ~20 min export. Subsequent iterations at different bpp
	targets reuse probe + cost artifacts and complete in minutes.

	---

	## Known issues / limitations

	- vLLM only at serve time. No transformers-runtime path for this
	precision mix today.
	- lm_head stays BF16 because vLLM's `ParallelLMHead` does not
	register the NVFP4/MXFP8 compressed-tensors schemes. Allocator
	measured it and would have picked NVFP4; the runtime limitation
	forces BF16. Costs ~770 MB on the disk footprint.
	- MTP n=4 regresses on this family. Stick to `n=3` unless you
	verify against the draft-head acceptance-rate trace.

	---

	## Links

	- Source: [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
	- Base model: [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
	- Sibling 35B-A3B: [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)
	- Sibling 122B-A10B: [Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm)

	## Citation

	```bibtex
	@software{prismaquant2026,
	  title  = {PrismaQuant: per-Linear sensitivity-driven mixed-precision quantization for LLMs},
	  author = {Tand, Rob},
	  year   = {2026},
	  url    = {https://github.com/RobTand/prismaquant},
	}
	```