---
license: apache-2.0
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
library_name: transformers
pipeline_tag: image-text-to-text
language:
- en
- zh
tags:
- prismaquant
- compressed-tensors
- nvfp4
- mxfp8
- quantized
- multimodal
- vision-language
- mtp
- speculative-decoding
- vllm
- qwen3.6
---

# Qwen3.6-27B: PrismaQuant 5.5 bpp

[PrismaQuant](https://github.com/RobTand/prismaquant) ·
[License](https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE) ·
[vLLM compressed-tensors](https://docs.vllm.ai/en/latest/features/quantization/compressed_tensors.html)

Mixed-precision quantization of `Qwen/Qwen3.6-27B` produced by
[**PrismaQuant**](https://github.com/RobTand/prismaquant), a per-Linear
sensitivity-driven allocator that chooses each Linear module's format
individually under a total-bit budget. It uses the same allocator and
activation-aware export stack as the 35B-A3B sibling; sibling coupling is
pre-aggregated into the dynamic program (DP) so the achieved bpp hits the
target exactly (5.500, not 5.28).

This checkpoint sits at the Pareto knee of the Δloss-vs-bpp curve;
see **[Why 5.5 bpp](#why-55-bpp)** below for the full sweep and
selection rationale.

---

## At a glance

| Metric | BF16 source | **This artifact** | Delta |
|---|---:|---:|---:|
| Size on disk | 54 GB | **~19 GB** | **−65 %** |
| Fraction of original weights | 100 % | **35 %** | |
| Average bits per param | 16 | **5.50** | |
| Multimodal (vision + text) | ✓ | **✓** | |
| MTP speculative decoding head | ✓ | **✓** | |
| Loads in vLLM (stock `compressed-tensors`) | ✓ | **✓** | |
| Runtime backend | any | **vLLM only** | |
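
As a rough sanity check on the table, the size ratio follows from the average bit width alone. The sketch below ignores file-format and metadata overhead, and the helper name `estimated_size_gb` is illustrative rather than part of PrismaQuant.

```python
def estimated_size_gb(bf16_size_gb: float, avg_bits: float,
                      source_bits: float = 16.0) -> float:
    """Scale the BF16 checkpoint size by the ratio of average bit widths."""
    return bf16_size_gb * avg_bits / source_bits


fraction = 5.5 / 16.0                    # about 0.344, listed as 35 % above
size_gb = estimated_size_gb(54.0, 5.5)   # about 18.6 GB, listed as ~19 GB above
print(f"{fraction:.1%} of BF16, about {size_gb:.1f} GB")
```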

---

## Precision mix

Selected per-Linear by the allocator from measured Fisher sensitivity.
On this dense 27B the allocator hit the 5.5 bpp budget exactly:

| Format | W | A | Use | Count (after expansion) |
|---|---|---|---|---:|
| **NVFP4** | 4-bit (FP4, group_size=16 with per-group FP8 scale + per-tensor global scale) | 4-bit (dynamic) | Bulk dense MLPs + medium-sensitivity attention + most visual Linears | **349** |
| **MXFP8** | 8-bit (E4M3, group_size=32 with per-group E8M0 scale) | 8-bit (dynamic) | High-sensitivity dense Linears the allocator won't risk at 4-bit | **35** |
| **BF16** | 16-bit | 16-bit | Highest-sensitivity dense Linears + norms + biases + embed / lm_head / pos_embed | **112 (linear) + 352 (layer_passthrough)** |

The allocator **pre-aggregates fused-projection siblings**, `qkv_proj`
(q/k/v share one format) and `gate_up_proj` (gate+up share one format),
as single DP items. Previously, sibling coupling was enforced as a
post-pass that inflated the achieved bpp by up to 0.5 above target; the new
pre-aggregation path collapses each group into one multi-choice item so
the DP's solution is already sibling-consistent.
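
The pre-aggregation idea can be illustrated with a toy multi-choice knapsack. This is a minimal sketch, not PrismaQuant's allocator: the effective bits per weight (4.5 for NVFP4, 8.25 for MXFP8, counting only the per-group scale), the parameter counts, and the cost numbers are all assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Tuple

# Assumed effective bits per weight, including per-group scale overhead:
# NVFP4 = 4 + 8/16, MXFP8 = 8 + 8/32 (per-tensor scales ignored).
BITS = {"NVFP4": 4.5, "MXFP8": 8.25, "BF16": 16.0}

Option = Tuple[str, int, float]  # (format, total bits, predicted cost)


def aggregate(group: List[Dict]) -> List[Option]:
    """Collapse sibling Linears into one item with one option per format."""
    params = sum(m["params"] for m in group)
    return [(fmt, int(params * bpw), sum(m["cost"][fmt] for m in group))
            for fmt, bpw in BITS.items()]


def allocate(items: List[List[Option]], budget_bits: int) -> Tuple[List[str], float]:
    """Multi-choice knapsack: one format per item, minimum total cost."""
    dp = [0.0] * (budget_bits + 1)  # dp[w]: best cost using at most w bits
    picks = []
    for options in items:
        new = [math.inf] * (budget_bits + 1)
        pick = [-1] * (budget_bits + 1)
        for w in range(budget_bits + 1):
            for i, (_, bits, cost) in enumerate(options):
                if bits <= w and dp[w - bits] + cost < new[w]:
                    new[w], pick[w] = dp[w - bits] + cost, i
        dp = new
        picks.append(pick)
    if math.isinf(dp[budget_bits]):
        raise ValueError("budget too small for any assignment")
    # Walk back through the recorded choices to recover the formats.
    chosen, w = [], budget_bits
    for options, pick in zip(reversed(items), reversed(picks)):
        fmt, bits, _ = options[pick[w]]
        chosen.append(fmt)
        w -= bits
    return chosen[::-1], dp[budget_bits]


# Toy data: a q/k/v sibling group and an independent down_proj.
qkv = [{"params": 16, "cost": {"NVFP4": 4.0, "MXFP8": 1.0, "BF16": 0.0}},
       {"params": 16, "cost": {"NVFP4": 2.0, "MXFP8": 0.5, "BF16": 0.0}},
       {"params": 16, "cost": {"NVFP4": 6.0, "MXFP8": 1.5, "BF16": 0.0}}]
down = [{"params": 32, "cost": {"NVFP4": 1.0, "MXFP8": 0.2, "BF16": 0.0}}]

formats, total_cost = allocate([aggregate(qkv), aggregate(down)], budget_bits=560)
print(formats, total_cost)
```

Because q, k and v enter the DP as one item, whatever format the solver picks applies to all three, so no post-pass is needed to reconcile them. A real budget would be far too large for a per-bit table; discretizing the budget into coarser units is one common workaround.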

### Activation-aware passes applied during export

On every NVFP4 weight the exporter runs, in order:

1. **GPTQ-OBS one-shot rounding**: block-wise error propagation along
   the group-quant structure using the calibration Hessian. Closed-form,
   not iterative.
2. **Closed-form per-group scale sweep**: for each 16-weight NVFP4
   group, enumerate `grid=32` candidate scales spanning
   `[0.5·s₀, 1.5·s₀]` around the initial scale `s₀`, round each weight to
   its nearest codebook neighbor at every candidate scale, and pick the
   (scale, rounding-set) configuration that minimizes activation-weighted
   per-group MSE. Sub-second per Linear. Closed-form analog of Intel's
   AutoRound.

**Measured per-Linear output-MSE vs RTN baseline (family-level
measurement on Qwen3.6-35B-A3B; same pipeline applied here):**

| Pipeline variant | out_mse ratio vs RTN |
|---|---:|
| RTN (no passes) | 1.00 |
| GPTQ only | 0.41 |
| **GPTQ + scale_sweep (this artifact)** | **0.33** |
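
The per-group scale sweep in step 2 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the exporter's code: it takes `s₀ = max|w| / 6`, uses the FP4 (E2M1) magnitudes as the codebook, keeps the scale in full precision rather than FP8, skips the GPTQ step, and adds `s₀` itself as an extra candidate so the result is never worse than plain RTN. The names `sweep_scale` and `act_weight` are hypothetical.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes, mirrored to get the signed codebook.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
CODEBOOK = np.concatenate([-FP4_MAGNITUDES[:0:-1], FP4_MAGNITUDES])


def quantize_group(w, scale):
    """Round each weight to its nearest codebook value at the given scale."""
    idx = np.abs(w[:, None] / scale - CODEBOOK[None, :]).argmin(axis=1)
    return scale * CODEBOOK[idx]


def weighted_sq_error(w, q, act_weight):
    """Activation-weighted squared error of one group."""
    return float(np.sum(act_weight * (w - q) ** 2))


def sweep_scale(w, act_weight, grid=32):
    """Try `grid` scales in [0.5*s0, 1.5*s0] plus s0; keep the best one."""
    s0 = float(np.abs(w).max()) / 6.0
    if s0 == 0.0:
        return 1.0, np.zeros_like(w), 0.0
    candidates = np.append(np.linspace(0.5 * s0, 1.5 * s0, grid), s0)
    best = None
    for s in candidates:
        q = quantize_group(w, s)
        err = weighted_sq_error(w, q, act_weight)
        if best is None or err < best[2]:
            best = (float(s), q, err)
    return best


rng = np.random.default_rng(0)
w = rng.normal(size=16)                       # one 16-weight group
act_weight = rng.uniform(0.5, 2.0, size=16)   # per-input-channel importance
scale, dequant, err = sweep_scale(w, act_weight)
rtn_err = weighted_sq_error(w, quantize_group(w, np.abs(w).max() / 6.0), act_weight)
print(f"sweep error {err:.4f} vs RTN error {rtn_err:.4f}")
```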

---

## Why 5.5 bpp

Before quantizing we ran the allocator across the full target sweep
`{4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25}` on the same Fisher-probed +
RTN-costed stats this artifact was built from. Thanks to
allocator pre-aggregation of fused siblings + convergence-based
tightening, every target lands its budget exactly (achieved = target
within 0.001 bpp), so the curve below is a true Δloss-vs-bpp trade-off
across the Pareto frontier, not an apples-to-oranges approximation.

| Target bpp | Achieved bpp | Predicted Δloss | NVFP4 / MXFP8 / BF16 | vs 5.5 bpp |
|---:|---:|---:|---:|---|
| 4.5 | 4.500 | 948 | 416 / 1 / 0 | +99% Δloss, −18% size |
| 4.75 | 4.750 | 704 | 373 / 12 / 32 | +48% Δloss, −14% size |
| 5.0 | 5.000 | 604 | 347 / 14 / 56 | +27% Δloss, −9% size |
| 5.25 | 5.250 | 532 | 321 / 20 / 76 | +12% Δloss, −5% size |
| **5.5** | **5.500** | **477** | **300 / 30 / 87** | **← this artifact** |
| 6.0 | 6.000 | 393 | 270 / 35 / 112 | −18% Δloss, +9% size |
| 7.0 | 7.000 | 276 | 211 / 62 / 144 | −42% Δloss, +27% size |
| 8.25 | 8.249 | 180 | 152 / 73 / 192 | −62% Δloss, +50% size |

(Layer counts are at the un-expanded allocator level; per-Linear
expansion inflates each count 1.0-1.4× after broadcasting sibling-group
formats to members.)

**Selection rationale.** The Kneedle algorithm (Satopää et al.) places
the knee at **5.5 bpp**: on the normalized Δloss-vs-bpp curve, the
farthest point below the chord from `(min_bpp, max_Δloss)` to
`(max_bpp, min_Δloss)` is target 5.5. Reading across the frontier
instead of committing to a single anchor like "4.75" or "6" makes the
trade-off explicit:

- **Below 5.5** the loss curve steepens: 4.75 bpp saves 14% disk but
  pays **+48% Δloss**; 4.5 bpp saves 18% and pays **+99%**. Dense 27B
  can't be aggressively NVFP4'd the way MoE-A3B can, because every
  body Linear is active for every token, so there are no "cheap"
  low-utilization experts to compress hard.
- **Above 5.5** the loss curve flattens: jumping to 6.0 bpp costs
  +9% disk for only −18% Δloss (84 Δloss units over 0.5 bpp), a softer
  marginal gain than the 5.25→5.5 step, which buys 55 Δloss units for
  0.25 bpp (about +5% disk for −10% Δloss).
- **At the knee**, 5.5 bpp sits at the maximum distance from the
  chord: beyond it, each additional bit of budget buys less Δloss
  reduction than the bits already spent.

PrismaQuant's precision mix at this knee, in un-expanded allocator
items: 300 Linears at NVFP4 (bulk dense MLP + medium-sensitivity
attention + visual), 30 at MXFP8 (high-sensitivity dense Linears the
allocator won't risk at 4-bit), 87 at BF16 (highest-sensitivity Linears
preserved lossless).
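
The knee placement can be re-derived from the sweep table. The snippet below is a simplified stand-in for Kneedle (no smoothing and no sensitivity threshold): it normalizes both axes to `[0, 1]` and takes the point with the largest gap below the chord `y = 1 − x`. The function name `knee` is illustrative.

```python
targets = [4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 7.0, 8.25]
dloss = [948, 704, 604, 532, 477, 393, 276, 180]


def knee(xs, ys):
    """Return the x with the largest gap below the chord of the normalized curve."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    gaps = []
    for x, y in zip(xs, ys):
        xn = (x - x_lo) / (x_hi - x_lo)
        yn = (y - y_lo) / (y_hi - y_lo)
        gaps.append(1.0 - xn - yn)  # vertical distance below y = 1 - x
    best = max(range(len(xs)), key=gaps.__getitem__)
    return xs[best], gaps


knee_bpp, gaps = knee(targets, dloss)
for t, g in zip(targets, gaps):
    print(f"{t:5.2f} bpp  gap {g:.3f}")
print("knee at", knee_bpp)
```

The perpendicular distance to the chord is the vertical gap divided by √2, so both measures select the same point.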

---

## Which layers are quantized

### Text body (DeltaNet linear-attention + dense MLP, 64 layers)

- **Full attention** Linears (`q_proj` / `k_proj` / `v_proj` / `o_proj`):
  qkv siblings share one format per layer (pre-aggregated)
- **DeltaNet linear-attention** Linears (`in_proj_qkv` / `in_proj_z` /
  `in_proj_a` / `in_proj_b` / `in_proj_ba` / `out_proj`): each Linear's
  format chosen independently
- **Dense MLP** (`gate_proj` / `up_proj` / `down_proj`): gate+up
  siblings share one format per layer; down chosen independently

### Multi-token-prediction (MTP) head

- One full-attention + dense-MLP decoder layer at the model tail,
  quantized by the same per-Linear policy, so
  `--speculative-config method=mtp` drafts at the same precision
  profile as the body.

### Visual encoder (27 blocks, Qwen3.6-VL vision tower)

- **Fisher-driven per-Linear allocation:** 108 of 110 visual Linears
  were placed by the full DP allocator on the basis of per-Linear
  activation-weighted cost (8 multimodal calibration samples).
- **Remaining 2 un-probed visual Linears** (`patch_embed.proj` edges
  the probe didn't tap) are stamped at NVFP4 uniformly.
- **`model.visual.pos_embed`** stays BF16: it's a learnable Parameter,
  not an `nn.Linear`, and vLLM's compressed-tensors loader cannot
  consume a quantized Parameter layout.

### Passthrough (unquantized)

- `lm_head`: kept at BF16 because vLLM's `ParallelLMHead` module only
  accepts a single `weight` parameter. The allocator measures
  lm_head's Fisher sensitivity and would pick NVFP4 for it, but the
  compressed-tensors runtime rejects a compressed lm_head with
  `KeyError: lm_head.input_global_scale`. This is a vLLM runtime
  limitation, not a PrismaQuant design decision.
- RMSNorm weights (all layers + MTP + visual)
- All biases
- `embed_tokens`
- `model.visual.pos_embed`

---

## Serving (vLLM only)

This artifact is **only** runnable via vLLM's stock `compressed-tensors`
support; there is no transformers-native runtime path for mixed NVFP4 +
MXFP8 today. vLLM 0.11+ or equivalent is required.

```bash
vllm serve rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

- **FlashInfer** NVFP4 GEMM kernels are picked up automatically; set
  `VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass` to make the preference
  explicit.
- **MTP speculative decoding** at `n=3` is the measured optimum for
  this family on DGX Spark (n=2 leaves ~10% tok/s on the table, n=4
  regresses).
- **Visual inputs** work via vLLM's standard `image-text-to-text` chat
  API; no special flags are needed.

A full recipe with the flashinfer-cutlass backends, reasoning/tool
parsers and chat-template pinning is available at
[`spark-vllm-fresh/recipes/qwen3.6-27b.yaml`](https://github.com/RobTand/prismaquant).
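
Once the server is up, a request with an image can be sent through the OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library; it assumes the server listens on vLLM's default `http://localhost:8000`, and the image URL is a placeholder to replace with a real one.

```python
import json
import urllib.request

MODEL = "rdtand/Qwen3.6-27B-PrismaQuant-5.5bit-vllm"


def build_request(prompt: str, image_url: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder image URL; swap in a reachable image before sending.
payload = build_request("Describe this image in one sentence.",
                        "https://example.com/photo.jpg")
# print(chat(payload))  # uncomment once `vllm serve` is running
```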

---

## Reproducing this artifact

The full pipeline is in the [PrismaQuant repo](https://github.com/RobTand/prismaquant):

1. **Sensitivity probe**: streaming per-shard empirical-Fisher trace
   (diagonal) across body + MTP + visual Linears. Shard granularity
   and layer-cache budget are auto-derived from available RAM via
   `prismaquant.autoscale`. Checkpoint-level reuse (per-Linear stats
   are pooled across prior shard pickles) means mid-run crashes resume
   cleanly regardless of `LAYERS_PER_SHARD` changes.
2. **Per-(Linear, format) cost measurement**: for each Linear and each
   candidate format, the per-group RTN error weighted by cached input
   activations.
3. **Multi-choice knapsack allocator**: picks one format per Linear,
   minimizing total predicted Δloss under the bit budget. Fused-sibling
   groups are pre-aggregated into DP items to avoid post-pass overshoot.
   Target 5.5 bpp; achieved 5.500 bpp.
4. **Export**: streams each body / visual / MTP shard, applies GPTQ +
   scale_sweep to its NVFP4 entries, and writes the compressed-tensors
   format. `lm_head` passthrough at BF16 is enforced at this stage.

Wall-clock on a DGX Spark (128 GB unified memory): ~2 h cold probe +
~15 min cost + ~20 min export. Subsequent iterations at different bpp
targets reuse probe + cost artifacts and complete in minutes.
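
For readers unfamiliar with the quantity probed in step 1, the diagonal empirical Fisher is the mean of squared per-sample gradients, and its trace gives one sensitivity scalar per Linear. The toy below illustrates this on a single linear map with a squared-error loss; it is an assumption-laden illustration (the real probe streams over model shards with the language-model loss), and `empirical_fisher_diag` is a hypothetical name.

```python
import numpy as np


def empirical_fisher_diag(W, X, Y):
    """Mean of squared per-sample gradients of 0.5 * ||W x - y||^2 w.r.t. W."""
    fisher = np.zeros_like(W)
    for x, y in zip(X, Y):
        grad = np.outer(W @ x - y, x)  # per-sample gradient, same shape as W
        fisher += grad ** 2
    return fisher / len(X)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))     # toy Linear weight (out_features, in_features)
X = rng.normal(size=(32, 8))    # 32 calibration inputs
Y = rng.normal(size=(32, 4))    # 32 targets

diag = empirical_fisher_diag(W, X, Y)
sensitivity = float(diag.sum())  # trace of the diagonal Fisher
print(diag.shape, sensitivity)
```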

---

## Known issues / limitations

- **vLLM only at serve time.** No transformers-runtime path for this
  precision mix today.
- **lm_head stays BF16** because vLLM's `ParallelLMHead` does not
  register the NVFP4/MXFP8 compressed-tensors schemes. The allocator
  measured it and would have picked NVFP4; the runtime limitation
  forces BF16. Costs ~770 MB on the disk footprint.
- **MTP n=4 regresses on this family.** Stick to `n=3` unless you
  verify against the draft-head acceptance-rate trace.

---

## Links

- **Source:** [github.com/RobTand/prismaquant](https://github.com/RobTand/prismaquant)
- **Base model:** [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B)
- **Sibling 35B-A3B:** [Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm)
- **Sibling 122B-A10B:** [Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm](https://huggingface.co/rdtand/Qwen3.5-122B-A10B-PrismaQuant-4.75bit-vllm)

## Citation

```bibtex
@software{prismaquant2026,
  title  = {PrismaQuant: per-Linear sensitivity-driven mixed-precision
            quantization for LLMs},
  author = {Tand, Rob},
  year   = 2026,
  url    = {https://github.com/RobTand/prismaquant},
}
```