Gemma 4 31B IT PrismaQuant 5.5bit vLLM
This is a PrismaQuant 5.5-bit mixed-native quantization of
google/gemma-4-31b-it, exported for vLLM using the
compressed-tensors format. Text-only: the multimodal vision and
audio towers from the source checkpoint are not included in this
artifact (this is a language-model-only export). The artifact is
intended for vLLM/FlashInfer serving and is not a vanilla
Transformers checkpoint.
The 5.5 bpp recipe was selected at the Pareto knee of the allocator's predicted-Δloss vs bits-per-weight curve — the smallest size at which the predicted quality penalty has not yet bent upward. See the Pareto curve below.
What is PrismaQuant?
PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Rather
than forcing the whole model into one dtype, it predicts the loss penalty of
quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W,
where H_trace is the Linear's Fisher diagonal trace from a calibration probe
and MSE_W is the format-specific weight reconstruction error. A small ILP
then picks per-Linear formats from a menu of {NVFP4, MXFP8_E4M3, FP8_SOURCE, BF16} to minimize total predicted Δloss under a target bits-per-weight budget.
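Because the cost model is additive across Linears, allocation reduces to a budgeted selection problem. A minimal sketch, with hypothetical names (FORMATS, allocate) and a greedy upgrade loop standing in for the toolkit's actual ILP:

```python
FORMATS = {  # bits-per-weight for each menu entry, from the figures below
    "NVFP4": 4.5,
    "MXFP8_E4M3": 8.25,
    "BF16": 16.0,
}

def predicted_dloss(fisher_trace, mse_w):
    """Delta-loss proxy: 0.5 * Fisher-diagonal trace * weight reconstruction MSE."""
    return 0.5 * fisher_trace * mse_w

def allocate(linears, budget_bpw):
    """linears: list of (name, n_weights, fisher_trace, {format: mse}).
    Start everything at NVFP4, then greedily upgrade whichever Linear buys
    the most predicted-dloss reduction per extra bit, staying under budget."""
    ladder = ["NVFP4", "MXFP8_E4M3", "BF16"]
    assign = {name: "NVFP4" for name, _, _, _ in linears}
    total_w = sum(n for _, n, _, _ in linears)

    def bpw():  # currently achieved bits-per-weight
        return sum(n * FORMATS[assign[name]] for name, n, _, _ in linears) / total_w

    while True:
        best = None  # (gain per extra bit, name, next format)
        for name, n, tr, mse in linears:
            step = ladder.index(assign[name]) + 1
            if step >= len(ladder):
                continue  # already at BF16, nothing to upgrade to
            cur, nxt = assign[name], ladder[step]
            gain = predicted_dloss(tr, mse[cur]) - predicted_dloss(tr, mse[nxt])
            extra = n * (FORMATS[nxt] - FORMATS[cur]) / total_w
            if bpw() + extra <= budget_bpw and (best is None or gain / extra > best[0]):
                best = (gain / extra, name, nxt)
        if best is None:
            return assign, bpw()
        assign[best[1]] = best[2]
```

The greedy loop is only an illustration of the objective; an ILP can beat it on knapsack-style instances where the best single upgrade blocks a better pair of smaller ones.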
Because the cost model is principled and additive across Linears, it lets the toolkit stack several layers of optimization without each one stepping on the others. The shipped checkpoint was produced with the full stack below.
1. Sensitivity-driven format allocation
- Fisher-weighted ILP picks one of NVFP4 / MXFP8 / FP8_SOURCE / BF16 per Linear under the global bpp budget.
- Pareto sweep across bpp targets surfaces the knee — the smallest size at which predicted Δloss has not yet bent upward.
- Fused-sibling joint scales: q/k/v projections (and gate/up) share one NVFP4 weight_global_scale, matching vLLM's per-tensor expectation.
- Per-Linear input_global_scale calibration from cached activations (max_abs / 6.0), so vLLM's runtime activation quantization uses a correctly sized dynamic range.
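The input_global_scale rule fits in a few lines. An illustrative sketch (function name assumed), using 6.0 as FP4 E2M1's largest representable magnitude:

```python
FP4_MAX = 6.0  # largest |value| representable in FP4 E2M1

def input_global_scale(cached_activations):
    """Scale so the observed activation max maps onto FP4's full range."""
    max_abs = max(abs(x) for x in cached_activations)
    return max_abs / FP4_MAX
```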
2. Per-Linear act-aware optimization passes
For each NVFP4-assigned Linear, in order, each gated to "improve or keep":
- GPTQ one-shot OBS rounding with block-wise error propagation.
- Activation clipping at the 99.9th-percentile per token before computing the Hessian, to bound its condition number. Validated as the largest single win on a Qwen3-0.6B audit (≈ −0.91 PPL on the validator suite).
- GPTQ damping sweep — try 5 candidate Hessian regularizers, keep the one with smallest activation-weighted reconstruction error.
- Closed-form scale sweep — joint (per-group scale, rounding) search on the NVFP4 codebook (a closed-form analog of AutoRound's SGD-based search).
- Block-output match — greedy per-Linear scale refinement against a surrounding-block FP16 reference forward, capturing inter-Linear composition error that per-Linear MSE can't see.
- Do-no-harm gate — for every Linear, compare the post-pass weight to a pure RTN baseline on the cached activation distribution. If RTN is better, revert. This guarantees no Linear ships worse than RTN; the only cost is the compute spent on the extra passes.
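The gate might look like the following sketch, with a deliberately simplified activation-weighted error metric and assumed names (the toolkit's actual metric may differ):

```python
def recon_error(w_q, w_ref, acts):
    """Toy activation-weighted error: sum of ((w_q - w_ref) * a)^2 over the cache."""
    return sum(((q - r) * a) ** 2 for q, r in zip(w_q, w_ref) for a in acts)

def do_no_harm(w_opt, w_rtn, w_ref, acts):
    """Keep the optimized weights unless plain RTN reconstructs better."""
    if recon_error(w_rtn, w_ref, acts) < recon_error(w_opt, w_ref, acts):
        return w_rtn  # revert: the passes made this Linear worse than RTN
    return w_opt
```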
3. Numerical hygiene
- Norm parameters (RMSNorm γ, LayerNorm γ/β) are kept at FP32 — they multiply every token's hidden state at every block, so BF16 rounding compounds aggressively. The size cost is a few MB total.
- Activation cache stored at FP32 so downstream Hessian computations don't inherit BF16 rounding.
4. Format-level details
- NVFP4 — 4-bit FP4 + per-16-group FP8 scale (~4.5 bpp effective).
- MXFP8_E4M3 — 8-bit FP8 + per-32-group E8M0 scale (~8.25 bpp).
- FP8_SOURCE — for natively-FP8 source checkpoints, the source FP8 weights and weight_scale_inv are copied verbatim; the BF16 view of these tensors is a lossless dequant, so the allocator treats this format as Δloss = 0 for any Linear that ships in the source at FP8.
- BF16 — passthrough for tensors the allocator marks ineligible.
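The effective-bpp figures above follow from amortizing each group's shared scale over its weights; a quick arithmetic check:

```python
def effective_bpp(element_bits, scale_bits, group_size):
    """Per-weight cost once a group's shared scale is amortized."""
    return element_bits + scale_bits / group_size

assert effective_bpp(4, 8, 16) == 4.5    # NVFP4: FP4 elements + FP8 scale per 16
assert effective_bpp(8, 8, 32) == 8.25   # MXFP8_E4M3: FP8 elements + E8M0 scale per 32
```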
5. Pre-ship validation
- Generation-sanity check (filters NaN, repetition loops, nonsense).
- Chat-deployment perplexity gate — measures NLL on the model's assistant-response tokens given chat-formatted user turns, which is the deployment-relevant signal for instruction-tuned models. This artifact passed at PPL = 1.14, mean NLL = 0.13 nats/token, p99 NLL = 0.20 across factual continuation tasks.
- Cache-fingerprint manifest on the per-layer export cache: every quality-affecting flag is hashed in, and a mismatched flag set on resume invalidates the cache rather than silently emitting a layer that was quantized under a different recipe.
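As a sanity check on the gate numbers above: perplexity is exp(mean NLL), so the reported PPL and mean NLL are mutually consistent.

```python
import math

mean_nll = 0.13  # reported nats per assistant-response token
ppl = math.exp(mean_nll)
assert round(ppl, 2) == 1.14  # matches the reported chat-deployment PPL
```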
6. vLLM-native serving
The output is compressed-tensors format, runnable in stock vLLM. No custom
kernels needed. Mixed-precision per Linear is preserved at load — NVFP4
Linears use FP4 kernels, MXFP8 Linears use FP8 kernels, FP8_SOURCE Linears
use the source's native FP8 weights with their original block scales.
Project repository: https://github.com/RobTand/prismaquant
Pareto curve
The allocator builds a Pareto curve over feasible target_bits and reports
the knee as the recommended ship target — the smallest size at which
predicted Δloss has not yet bent upward. For Gemma 4 31B IT the knee
landed cleanly at 5.5 bpp; this artifact ships that point.
| target_bits | achieved | predicted Δloss | NVFP4 | MXFP8 | BF16 |
|---|---|---|---|---|---|
| 4.50 | 4.500 | 26 203 | 100% (29.3G) | 0% (0.0G) | 0% (0.0G) |
| 4.60 | 4.597 | 21 708 | 98% (28.7G) | 2% (0.5G) | 0% (0.1G) |
| 4.70 | 4.700 | 19 759 | 96% (28.1G) | 3% (1.0G) | 1% (0.2G) |
| 4.75 | 4.750 | 18 983 | 96% (28.2G) | 2% (0.7G) | 1% (0.4G) |
| 4.85 | 4.849 | 17 631 | 95% (27.8G) | 3% (0.9G) | 2% (0.6G) |
| 5.00 | 4.999 | 15 893 | 93% (27.4G) | 3% (1.0G) | 3% (1.0G) |
| 5.25 | 5.249 | 13 499 | 90% (26.3G) | 6% (1.7G) | 5% (1.4G) |
| 5.50 | 5.498 | 11 815 | 86% (25.2G) | 8% (2.3G) | 6% (1.8G) |
| 6.00 | 5.999 | 9 133 | 79% (23.1G) | 12% (3.5G) | 9% (2.7G) |
| 7.00 | 6.998 | 5 142 | 69% (20.2G) | 14% (4.1G) | 17% (5.0G) |
| 8.25 | 8.249 | 1 896 | 61% (18.0G) | 9% (2.6G) | 30% (8.7G) |
predicted Δloss is the allocator's ILP objective in arbitrary units — useful
for comparing points on the same curve, not for absolute interpretation.
The full curve is also included in this repository as pareto.csv.
The knee is at the largest predicted-Δloss reduction per bit-per-weight added: moving from 4.50 → 5.50 bpp halves the predicted Δloss (26 203 → 11 815); moving from 5.50 → 7.00 bpp halves it again (11 815 → 5 142) but costs an additional 1.5 bpp. 5.5 sits at the diminishing-returns inflection.
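One way to locate such a knee programmatically, sketched under the assumption that the rule is "first point where the local Δloss-per-bit slope drops below the curve's average slope" (not necessarily the allocator's exact heuristic):

```python
def find_knee(points):
    """points: list of (bpp, predicted_dloss), sorted by bpp ascending.
    Returns the bpp after which per-bit returns fall below the curve average."""
    (b0, d0), (b1, d1) = points[0], points[-1]
    avg_slope = (d0 - d1) / (b1 - b0)  # average dloss reduction per added bit
    for (ba, da), (bb, db) in zip(points, points[1:]):
        local = (da - db) / (bb - ba)
        if local < avg_slope:  # returns have flattened past this point
            return ba
    return points[-1][0]
```

Run on the table above, this heuristic also lands on 5.50 bpp.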
Quantization
- Source model: google/gemma-4-31b-it (Apache 2.0, BF16 dense, 60 layers, hidden 5376, 32 heads × 256 kv-head-dim, multimodal — vision + audio towers present in source but not quantized here)
- Export target: 5.5 bits/weight budget (Pareto knee)
- Achieved: 5.498 bpp
- Export format: mixed native compressed-tensors
- Shards: 6 safetensors shards (~23 GB total)
- Per-Linear assignment summary across 410 quantizable Linears:
  - linear/NVFP4: 321 modules
  - linear/MXFP8_E4M3: 43 modules
  - linear/BF16 passthrough: 46 modules
- Calibration: 16 samples × 1536 tokens from a heterogeneous English text mix, FP32 activation cache, per-token 99.9-percentile activation clipping
- Tied embeddings preserved (lm_head shares its weight tensor with embed_tokens)
The model mixes NVFP4 (most attention/MLP Linears), MXFP8 (sensitivity-elevated Linears the allocator pushed up the ladder), and BF16 (a small set of norm/passthrough tensors plus the highest-Fisher-trace Linears). It should not be treated as a uniform FP4 checkpoint.
Validation Status
Smoke tested with:
- vLLM 0.19.2rc1.dev86+g9a6a66f3b.d20260421
- FlashInfer kernels for FP8 and NVFP4
- --quantization compressed-tensors
- text-only serving via --language-model-only --kv-cache-dtype fp8
Smoke checks passed:
- /v1/models registration
- Multi-domain /v1/chat/completions (math, code, reasoning, factual recall, long-form continuation): all coherent, no NaN, no repetition loops, refusal boundaries respected on harmful prompts.
- Thinking-mode regression: chat_template_kwargs={"enable_thinking": true} correctly emits the <|think|> channel; the system-prompt path correctly triggers structured reasoning per the Gemma 4 chat template.
- Tool-calling sanity: the model emits well-formed tool_calls when given a function definition in the request.
- Chat-deployment perplexity: PPL = 1.14, mean NLL = 0.13 nats/token, p99 NLL = 0.20 (measured on assistant-response tokens given chat-formatted factual-continuation prompts).
No formal benchmark or downstream evaluation is included with this upload. Treat this as an experimental serving artifact.
Important: chat-template required
Gemma 4 31B IT is heavily chat-template-saturated. The model expects every
prompt to be wrapped in chat-template structure (<bos><start_of_turn>user ...<end_of_turn><start_of_turn>model) and reliably falls into degenerate
patterns (onononon..., repetitive tautologies) when given raw text via
plain /v1/completions without <bos> and turn markers.
Use /v1/chat/completions for all generation. This is also the API path
the model card examples below assume. If you need per-token logprobs for
analysis, send the prompt through /v1/chat/completions with logprobs: true — that path applies the chat template and returns logprobs on the
assistant's generated tokens.
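A minimal Python sketch of that logprobs path (request shape follows the standard OpenAI-compatible API; the helper names here are made up for illustration):

```python
def build_logprob_request(model, user_text, max_tokens=16):
    """Standard OpenAI-compatible chat payload with logprobs enabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0,
        "logprobs": True,  # logprobs returned on the assistant's generated tokens
    }

def extract_token_logprobs(response):
    """response: parsed JSON body of a /v1/chat/completions reply."""
    content = response["choices"][0]["logprobs"]["content"]
    return [(tok["token"], tok["logprob"]) for tok in content]
```

POST the payload to http://localhost:8000/v1/chat/completions with any HTTP client; the chat path applies the template before scoring, which is what keeps the logprobs deployment-relevant for this model.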
vLLM Startup
This is the tested launch shape. Replace paths as needed for your environment. The image must contain vLLM support for compressed-tensors mixed FP8/NVFP4 serving.
docker run -d \
--name vllm-gemma4-31b-prismaquant \
--gpus all \
--ipc=host \
--shm-size=16g \
-p 8000:8000 \
-e HF_HOME=/hfcache \
-e HF_HUB_CACHE=/hfcache/hub \
-v /home/rob/.cache/huggingface:/hfcache \
vllm-fresh-b12x:latest \
vllm serve rdtand/Gemma4-31B-IT-PrismaQuant-5.5bit-vllm \
--host 0.0.0.0 \
--port 8000 \
--served-model-name gemma4-31b-prismaquant-5p5bit \
--quantization compressed-tensors \
--trust-remote-code \
--max-model-len 8192 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8 \
--language-model-only
Thinking mode
Gemma 4's chat template supports a <|think|> channel that activates when:
- The request passes chat_template_kwargs={"enable_thinking": true}, OR
- The request includes a tools field, OR
- The first message is a system or developer turn.
Without any of those triggers, the model produces direct answers. With thinking enabled, expect a structured-reasoning preamble before the final answer.
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma4-31b-prismaquant-5p5bit",
"messages": [{"role":"user","content":"What is 17 * 23 * 11? Show your reasoning."}],
"max_tokens": 400,
"temperature": 0,
"chat_template_kwargs": {"enable_thinking": true}
}'
Tool Calling
Gemma 4's chat template includes tool-calling support. Send the function
definitions in the standard OpenAI tools field and use tool_choice: "auto" (or a specific function name) to surface tool calls in the
assistant message's tool_calls field:
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gemma4-31b-prismaquant-5p5bit",
"messages": [{"role":"user","content":"What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"max_tokens": 200
}'
Reproducibility notes
The export pipeline used PrismaQuant on the feat/quality-wins-batch1 branch.
Two artifact-side amendments were applied after the export to make the
checkpoint loadable in stock vLLM; both have since been turned into tracked
fixes in the PrismaQuant codebase:
- config.json qkv_proj target additions — the export's build_quantization_config originally skipped emitting fused qkv_proj target patterns when any expected sibling was absent from the assignment. Gemma 4's full_attention layers (every 6th layer, with attention_k_eq_v=True) have no v_proj weights on disk because K doubles as V at runtime, so vLLM never got a matching scheme for the fused QKVParallelLinear it constructs at load time. Fixed in code: the emitter now requires ≥2 present siblings agreeing on format and emits the fused target from those.
- Aliased v_proj weight tensors for k_eq_v layers — this artifact includes a small additional safetensors shard (model-00006-of-00006.safetensors, ~35 MB) containing 24 v_proj.* tensors (4 per layer × 6 full_attention layers) that mirror each layer's k_proj.* weights. This makes vLLM's QKVParallelLinear concatenation (q + k + v) behave identically to the source model's runtime k_eq_v alias (V = K). Tracked for automation in a future PrismaQuant export step so subsequent Gemma 4 exports include these tensors without manual intervention.
Both amendments are baked into the uploaded checkpoint — no user action is required to load it. The notes are included for transparency about how the artifact was produced.
Limitations
- This is an experimental quantized artifact.
- Text-only. The multimodal vision and audio towers from the source Gemma 4 checkpoint are not included. Image and audio input will not work. Use the --language-model-only vLLM flag.
- Use /v1/chat/completions rather than /v1/completions. The model is heavily chat-template-saturated and produces degenerate output on raw text continuation.
- Startup can be slow on first launch because vLLM compiles graphs, autotunes FlashInfer kernels, and captures CUDA graphs. Expect ~3 minutes on a fresh image, faster on subsequent runs (compile cache).
- The original Gemma 4 license applies to this derivative.
License
This checkpoint is a derivative of google/gemma-4-31b-it and is provided
under the same Apache 2.0 license. See the original model repository and
Gemma 4 license for details.