Qwen3.6-27B – PrismaSCOUT (Blackwell, NVFP4 + BF16)
PrismaQuant export of Qwen/Qwen3.6-27B for vLLM compressed-tensors serving on NVIDIA Blackwell. This is the 5.31 bpp PrismaSCOUT artifact – selected by end-to-end held-out KL on the validated Pareto frontier, replacing the prior 5.5 bpp PrismaQuant artifact.
Same source weights, same export-time quantization tricks (HALO, GPTQ, block-output match, scale sweeps). The only thing that changed was the bit-allocation routine: PrismaSCOUT selects on real measured divergence from the original model, not on summed per-layer cost surrogates.
Smaller and better than the prior 5.5 bpp artifact
| Artifact | Size | bpp | Held-out KL |
|---|---|---|---|
| PrismaQuant v1 (5.5 bpp) | 22.67 GB | 5.50 | 0.0475 |
| PrismaSCOUT (this artifact) | 20.17 GB | 5.31 | 0.0151 |
| Change | −2.5 GB (−11%) | −0.19 | −0.0324 (−68%) |
Practical impact: smaller serving footprint with substantially more VRAM headroom for KV cache, longer contexts, and concurrent request batches.
About KL
KL divergence (Kullback-Leibler divergence) measures how different two probability distributions are. In quantization, the two distributions are: what the original full-precision model would predict at a given token, and what the quantized model predicts at that same token. Average across many tokens of held-out text and you get a single number where 0 means the quantized model behaves identically to the original, and larger numbers mean it has drifted further.
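As a concrete illustration, here is a minimal sketch of that per-token KL average, assuming PyTorch and that you already have logits from both models at the same held-out token positions; it is not the exact measurement harness behind the numbers in this card.

```python
import torch
import torch.nn.functional as F

def mean_token_kl(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Mean KL(original || quantized) over held-out token positions.

    ref_logits, quant_logits: [num_tokens, vocab_size] logits produced by the
    full-precision and quantized models on the same held-out text.
    """
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logp = F.log_softmax(quant_logits.float(), dim=-1)
    # KL(p || q) = sum_v p_v * (log p_v - log q_v), averaged across tokens;
    # 0 means the quantized model reproduces the original distribution exactly.
    kl_per_token = (ref_logp.exp() * (ref_logp - quant_logp)).sum(dim=-1)
    return kl_per_token.mean().item()
```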
KL is the natural metric for "did we preserve what the original model does?" because it captures the entire output distribution, not just the top guess. Two quantized models can both pick the same most-likely next token while producing very different distributions over the rest of the vocabulary – and that difference matters for sampling, instruction-following, and especially tool-use, where small shifts in token probability at decision points can flip a tool call or change an argument.
A 68% reduction in KL means PrismaSCOUT preserves substantially more of the original model's distribution at every prediction step, not just at the most-confident ones.
What PrismaSCOUT does
PrismaSCOUT was always the goal – an allocator that selects on real end-to-end behavior, not on summed per-layer surrogates. We released the first PrismaQuant artifacts (built on the standard per-layer toolkit) to get the ball rolling and to see whether mixed-format quantization really had juice in it before pouring engineering effort into the harder pipeline. The community reaction made the answer clear: tens of thousands of downloads across the family in the first few weeks. That was the signal to commit to delivering on the original promise.
A modern LLM has thousands of weight matrices, and each can be stored at one of roughly six precision formats. The previous version of PrismaQuant assigned formats per-layer using cost surrogates from the standard quantization toolkit – Hessian sensitivity (HAWQ), HALO, AutoRound, GPTQ, block-output match. Each of those techniques is good at one job: take a single matrix and quantize it well. The pipeline composed them per-layer and trusted that "each layer is well-quantized" would imply "the whole model is well-quantized."
It usually did. But sometimes a small perturbation in one layer would compound through later layers and produce a noticeably worse model than the per-layer surrogates predicted. The reverse also happened: an allocation that looked expensive layer-by-layer turned out to be fine end-to-end, because the perturbations cancelled or got absorbed downstream. Neither effect was visible to summed per-layer cost.
PrismaSCOUT (Surrogate-Cascaded Optimization Under Tradeoff) was designed to fix this. It generates many candidate per-layer format assignments, measures each one's true end-to-end KL divergence from the original model on held-out text, filters them to the empirical Pareto frontier (achieved bpp vs. measured KL), and selects the kneedle – the point on the frontier past which extra bits stop buying meaningful KL improvement. A coordinate-descent polish step then perturbs the selected assignment locally and accepts only moves that strictly improve real held-out KL.
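A minimal sketch of the frontier-filtering and knee-selection steps, assuming candidates are dicts with measured "bpp" and "kl" fields and that the frontier has at least two points with distinct values; the names and the kneedle-style chord heuristic are illustrative, not the production code.

```python
def pareto_frontier(candidates):
    """Keep only non-dominated (bpp, kl) points: after sorting by bpp,
    a candidate survives only if it improves on the best KL seen so far."""
    frontier = []
    for c in sorted(candidates, key=lambda x: (x["bpp"], x["kl"])):
        if not frontier or c["kl"] < frontier[-1]["kl"]:
            frontier.append(c)
    return frontier

def knee_point(frontier):
    """Kneedle-style knee: normalize both axes to [0, 1], then pick the
    frontier point that sits farthest below the chord joining the
    lowest-bpp and highest-bpp points."""
    b = [p["bpp"] for p in frontier]
    k = [p["kl"] for p in frontier]
    bn = [(x - b[0]) / (b[-1] - b[0]) for x in b]
    kn = [(x - k[-1]) / (k[0] - k[-1]) for x in k]
    # the chord runs from (0, 1) to (1, 0); the gap below it is (1 - bn) - kn
    i = max(range(len(frontier)), key=lambda j: (1.0 - bn[j]) - kn[j])
    return frontier[i]
```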
Three things make this practical at LLM scale:
- A three-stage cost cascade. L1 ranks formats per matrix in milliseconds; L2 refines via more careful per-matrix loss; L3 actually rebuilds candidate models and measures their end-to-end behavior. Each stage feeds candidates to the next, so the slow measurement only lands on points that already look promising cheaply.
- A held-out KL gate. The kneedle is selected on text the cost surrogates never saw, so the choice is an empirical sweet spot rather than an artifact of the calibration set.
- Non-regressive polish. The coordinate-descent step is provably no worse than the chosen knee – every move that doesn't strictly improve real held-out KL is rolled back (see the sketch after this list).
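A minimal sketch of that non-regressive polish loop, assuming a measure_kl callable that rebuilds a candidate and returns its held-out KL, and a formats map listing each layer's allowed alternatives; a real pass would also respect the bit budget, which is omitted here.

```python
def polish(assignment, formats, measure_kl):
    """Coordinate-descent polish: try each layer's alternative formats one at a
    time and keep a change only if measured held-out KL strictly improves.
    Every non-improving move is rolled back, so the result can never be worse
    than the knee assignment it started from."""
    best_kl = measure_kl(assignment)
    improved = True
    while improved:
        improved = False
        for layer, current in list(assignment.items()):
            for fmt in formats[layer]:
                if fmt == current:
                    continue
                trial = dict(assignment, **{layer: fmt})
                kl = measure_kl(trial)
                if kl < best_kl:  # strict improvement only; otherwise discard
                    assignment, best_kl = trial, kl
                    current = fmt
                    improved = True
    return assignment, best_kl
```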
Under the hood: L2 builds a sparse pairwise interaction model over per-layer format choices and solves it as a Lagrangian-relaxed QUBO – to our knowledge, the first time this game-theoretic decomposition has been applied at LLM scale for mixed-precision allocation. L3 rebuilds candidate models in a small neighborhood around L2's converged point and re-solves using actual end-to-end KL. Sitting above the cascade, a λ-sweep with one-pass Pareto archive DP traces the achievable size-vs-quality frontier directly from L3 measurements: we deliberately abandon the traditional "fix a target bpp, pack to fit" formulation and let the kneedle pick the best point on the full frontier instead. The L3 measurement loop went through roughly half a dozen design rounds – fused NVFP4 Triton kernels, multi-lane CUDA graphs, a replay cache, aggressive memory management – before we landed on something with both respectable accuracy and tractable wall-clock at 27B+ scale.
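To make the λ-sweep concrete, here is a minimal sketch that drops the pairwise interaction terms and treats each layer independently: for a given λ, every layer picks the format minimizing surrogate loss + λ · bits, and sweeping λ traces out a size-vs-loss curve. The data layout, equal layer weighting, and names are assumptions for illustration only.

```python
def lambda_sweep(layers, lambdas):
    """layers: {layer_name: [(bits_per_weight, surrogate_loss), ...]} listing
    the candidate formats for each weight matrix. For each lambda, every layer
    independently minimizes loss + lambda * bits; larger lambdas penalize size
    more and push the solution toward cheaper formats."""
    frontier = []
    for lam in lambdas:
        assignment, total_bits, total_loss = {}, 0.0, 0.0
        for name, options in layers.items():
            bits, loss = min(options, key=lambda o: o[1] + lam * o[0])
            assignment[name] = bits
            total_bits += bits
            total_loss += loss
        # equal weighting across layers for brevity; a real bpp figure would
        # weight each layer by its parameter count
        frontier.append({"lambda": lam,
                         "bpp": total_bits / len(layers),
                         "surrogate_loss": total_loss,
                         "assignment": assignment})
    return frontier
```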
The output is a per-matrix format assignment plus a full audit trail: every candidate measured, every dominance decision, the held-out KL of the chosen point, and a leave-one-out stability check confirming the knee isn't an artifact of one or two frontier points.
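A minimal sketch of that leave-one-out stability check, reusing a knee-selection function like the one sketched earlier; the tol_bpp tolerance is a hypothetical parameter for illustration.

```python
def knee_is_stable(frontier, knee_fn, tol_bpp=0.1):
    """Drop each frontier point in turn, reselect the knee, and require that
    the chosen bpp stays within tol_bpp of the original choice. If removing
    any single point moves the knee by more than that, the selection was
    leaning on one measurement and should be treated with suspicion."""
    baseline = knee_fn(frontier)["bpp"]
    for i in range(len(frontier)):
        reduced = frontier[:i] + frontier[i + 1:]
        if len(reduced) < 2:
            continue  # too few points left to define a knee
        if abs(knee_fn(reduced)["bpp"] - baseline) > tol_bpp:
            return False
    return True
```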
PrismaSCOUT draws on the mixed-precision quantization literature – Hessian-aware allocation (HAWQ), cross-layer error coupling (CLADO), Pareto-optimal bit budgeting (ParoQuant, IMPQ, AMQ), and the geometry-aware quantization line – much of which had not been run end-to-end at LLM scale before. We adapted those ideas, added a few new techniques of our own to make the measurement loop tractable on real models, and built the pipeline around directly measuring how each candidate assignment behaves end-to-end. The result is that bits land where they actually preserve model behavior, including in places per-layer scoring would have missed.
Artifact details
- Source model: Qwen/Qwen3.6-27B
- Export format: vLLM compressed-tensors, mixed precision
- Main quantized format: NVFP4 (with selected layers held at BF16 per PrismaSCOUT allocation)
- Target hardware: NVIDIA Blackwell (NVFP4-native)
- MTP tensors: included
- Size on disk: 20.17 GB
- Passthrough dtype policy: source dtype preserved (no silent FP32 upcasting of normalization passthrough tensors)
- MXFP4 variant: forthcoming, targeting hardware without NVFP4 acceleration
Local validation
Correctness smoke/eval results from local vLLM serving with MTP k=3:
- GSM8K strict exact match: 96.66%
- GSM8K flexible exact match: 96.59%
- IFEval prompt strict: 85.40%
- IFEval prompt loose: 88.72%
- IFEval instruction strict: 89.93%
- IFEval instruction loose: 92.45%
- MMLU 5-shot, limit 20 per subject: 86.23% +/- 1.00%
- tool-eval-bench full sequential: 88/100 (vs 85/100 for the prior 5.5 bpp artifact)
Same-harness comparison against the prior shipped 5.5 bpp PrismaQuant artifact:
| Eval metric | Prior 5.5 bpp | PrismaSCOUT 5.31 bpp | Delta |
|---|---|---|---|
| GSM8K strict exact match | 96.74% | 96.66% | -0.08 pp |
| GSM8K flexible exact match | 96.66% | 96.59% | -0.07 pp |
| IFEval prompt strict | 84.66% | 85.40% | +0.74 pp |
| IFEval prompt loose | 87.80% | 88.72% | +0.92 pp |
| IFEval instruction strict | 89.81% | 89.93% | +0.12 pp |
| IFEval instruction loose | 92.09% | 92.45% | +0.36 pp |
| MMLU 5-shot, limit 20 per subject | 87.19% | 86.23% | -0.96 pp |
These GSM8K/IFEval numbers used the same local lm-eval setup for both artifacts: vLLM compressed-tensors serving, MTP k=3, temperature=0, enable_thinking=False, and the chat-completions API. MMLU used the same standard 5-shot loglikelihood setup with limit=20 per subject. The result is effectively tied on GSM8K, slightly ahead on IFEval, and within about one percentage point on sampled MMLU, while using about 11% less disk footprint than the prior 5.5 bpp artifact.
Serving
Example vLLM command:
```bash
vllm serve rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
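For a quick smoke test against the server started above, something like the following works (a sketch assuming vLLM's default port 8000 and its OpenAI-compatible chat-completions endpoint; adjust base_url and the prompt to your deployment):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key can be any placeholder
# unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm",
    messages=[{"role": "user", "content": "In one sentence, what is NVFP4?"}],
    temperature=0,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```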