Gemma 4 31B IT PrismaQuant 5.5bit vLLM

This is a PrismaQuant 5.5-bit mixed-native quantization of google/gemma-4-31b-it, exported for vLLM using the compressed-tensors format. Text-only: the multimodal vision and audio towers from the source checkpoint are not included in this artifact (this is a language-model-only export). The artifact is intended for vLLM/FlashInfer serving and is not a vanilla Transformers checkpoint.

The 5.5 bits-per-weight (bpp) recipe was selected at the Pareto knee of the allocator's predicted-Δloss vs bpp curve — the smallest size at which the predicted quality penalty has not yet bent upward. See the Pareto curve below.

What is PrismaQuant?

PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Rather than forcing the whole model into one dtype, it predicts the loss penalty of quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W, where H_trace is the Linear's Fisher diagonal trace from a calibration probe and MSE_W is the format-specific weight reconstruction error. A small ILP then picks per-Linear formats from a menu of {NVFP4, MXFP8_E4M3, FP8_SOURCE, BF16} to minimize total predicted Δloss under a target bits-per-weight budget.
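
To make this concrete, here is a minimal sketch of the cost model together with a greedy stand-in for the ILP. It is illustrative only — quantize_fn and fisher_trace are assumed interfaces, and the real allocator adds fused-sibling and eligibility constraints on top:

import torch

BPP = {"NVFP4": 4.5, "MXFP8_E4M3": 8.25, "BF16": 16.0}

def predicted_delta_loss(weight, fisher_trace, quantize_fn):
    # Δloss ≈ 0.5 · H_trace · MSE_W for one (Linear, format) pair.
    w_hat = quantize_fn(weight)  # format-specific round-trip
    mse_w = (weight - w_hat).float().pow(2).mean().item()
    return 0.5 * fisher_trace * mse_w

def allocate(linears, budget_bpp):
    # Greedy stand-in for the ILP: start every Linear at NVFP4, then spend
    # the remaining bit budget on the upgrades with the best predicted-Δloss
    # reduction per extra bit. Each entry of `linears` is a pair
    # (n_params, {format: predicted Δloss}).
    total = sum(n for n, _ in linears)
    assign = ["NVFP4"] * len(linears)
    used = BPP["NVFP4"] * total
    upgrades = []
    for i, (n, dloss) in enumerate(linears):
        for fmt in ("MXFP8_E4M3", "BF16"):
            extra_bits = (BPP[fmt] - BPP["NVFP4"]) * n
            saved = dloss["NVFP4"] - dloss[fmt]
            if saved > 0:
                upgrades.append((saved / extra_bits, i, fmt, extra_bits))
    for _, i, fmt, extra_bits in sorted(upgrades, reverse=True):
        if assign[i] == "NVFP4" and used + extra_bits <= budget_bpp * total:
            assign[i], used = fmt, used + extra_bits
    return assign, used / total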

Because the cost model is principled and additive across Linears, it lets the toolkit stack several layers of optimization without each one stepping on the others. The shipped checkpoint was produced with the full stack below.

1. Sensitivity-driven format allocation

  • Fisher-weighted ILP picks one of NVFP4 / MXFP8 / FP8_SOURCE / BF16 per Linear under the global bpp budget.
  • Pareto sweep across bpp targets surfaces the knee — the smallest size at which predicted Δloss has not yet bent upward.
  • Fused-sibling joint scales: q/k/v projections (and gate/up) share one NVFP4 weight_global_scale, matching vLLM's per-tensor expectation.
  • Per-Linear input_global_scale calibration from cached activations (max_abs / 6.0), so vLLM's runtime activation quantization uses a correctly-sized dynamic range.
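
The input_global_scale calibration in the last bullet is essentially a one-liner. A minimal sketch, assuming acts is the cached FP32 activation tensor for one Linear — 6.0 is the largest magnitude FP4 E2M1 can represent, so the observed max maps exactly onto the format's range:

import torch

def input_global_scale(acts: torch.Tensor) -> float:
    # scale = max_abs / 6.0, so acts / scale lands inside FP4's [-6, 6] range
    return acts.abs().max().item() / 6.0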

2. Per-Linear act-aware optimization passes

For each NVFP4-assigned Linear, the following passes run in order, each gated to "improve or keep":

  • GPTQ one-shot OBS rounding with block-wise error propagation.
  • Activation clipping at the 99.9th-percentile per token before computing the Hessian, to bound its condition number. Validated as the largest single win on a Qwen3-0.6B audit (≈ −0.91 PPL on the validator suite).
  • GPTQ damping sweep — try 5 candidate Hessian regularizers, keep the one with smallest activation-weighted reconstruction error.
  • Closed-form scale sweep — joint (per-group scale, rounding) search on the NVFP4 codebook (a closed-form analog of AutoRound's SGD-based search).
  • Block-output match — greedy per-Linear scale refinement against a surrounding-block FP16 reference forward, capturing inter-Linear composition error that per-Linear MSE can't see.
  • Do-no-harm gate — for every Linear, compare the post-pass weight to a pure RTN baseline on the cached activation distribution. If RTN is better, revert. This guarantees no Linear ships worse than plain RTN; the only cost is the compute spent on passes that get reverted. A minimal sketch follows this list.
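
A minimal illustration of that gate, assuming an activation-weighted reconstruction error over the cached calibration activations (rtn_quantize is an assumed interface, not PrismaQuant's actual API):

import torch

def act_weighted_error(w, w_hat, acts):
    # ||X (W − W_hat)^T||² over cached calibration activations X —
    # reconstruction error weighted by what the Linear actually sees.
    diff = (w - w_hat).float()
    return (acts.float() @ diff.T).pow(2).sum().item()

def do_no_harm(w_fp, w_passes, acts, rtn_quantize):
    # Keep the optimization-pass output only if it beats plain
    # round-to-nearest on the cached activation distribution.
    w_rtn = rtn_quantize(w_fp)
    if act_weighted_error(w_fp, w_rtn, acts) < act_weighted_error(w_fp, w_passes, acts):
        return w_rtn  # revert: the passes made this Linear worse than RTN
    return w_passes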

3. Numerical hygiene

  • Norm parameters (RMSNorm γ, LayerNorm γ/β) are kept at FP32 — they multiply every token's hidden state at every block, so BF16 rounding compounds aggressively. The size cost is a few MB total.
  • Activation cache stored at FP32 so downstream Hessian computations don't inherit BF16 rounding.
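
Both rules reduce to explicit casts at export time. A minimal sketch — the traversal is illustrative, not the toolkit's actual code, and torch.nn.RMSNorm requires PyTorch ≥ 2.4:

import torch

def export_norms_fp32(model, state_dict):
    # Norm parameters multiply every token's hidden state at every block,
    # so BF16 rounding compounds with depth — keep them FP32 in the shard.
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.LayerNorm, torch.nn.RMSNorm)):
            for pname, p in module.named_parameters(recurse=False):
                state_dict[f"{name}.{pname}"] = p.detach().float()

def cache_activations(acts: torch.Tensor) -> torch.Tensor:
    # FP32 cache so downstream Hessian computations don't inherit BF16 rounding.
    return acts.detach().float()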

4. Format-level

  • NVFP4 — 4-bit FP4 + per-16-group FP8 scale (~4.5 bpp effective).
  • MXFP8_E4M3 — 8-bit FP8 + per-32-group E8M0 scale (~8.25 bpp).
  • FP8_SOURCE — for natively-FP8 source checkpoints, the source FP8 weights and weight_scale_inv are copied verbatim; the BF16 view of these tensors is a lossless dequant, so the allocator treats this format as Δloss = 0 for any Linear that ships in the source at FP8.
  • BF16 — passthrough for tensors the allocator marks ineligible.
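
The effective bpp figures follow directly from the group sizes (ignoring the per-tensor global scales, which are negligible at these tensor sizes — hence the "~"):

NVFP4:       4 bits/weight + 8-bit scale per 16 weights = 4 + 8/16 = 4.5 bpp
MXFP8_E4M3:  8 bits/weight + 8-bit scale per 32 weights = 8 + 8/32 = 8.25 bpp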

5. Pre-ship validation

  • Generation-sanity check (filters NaN, repetition loops, nonsense).
  • Chat-deployment perplexity gate — measures NLL on the model's assistant-response tokens given chat-formatted user turns, which is the deployment-relevant signal for instruction-tuned models. This artifact passed at PPL = 1.14, mean NLL = 0.13 nats/token, p99 NLL = 0.20 across factual continuation tasks.
  • Cache-fingerprint manifest on the per-layer export cache: every quality-affecting flag is hashed in, and a mismatched flag set on resume invalidates the cache rather than silently emitting a layer that was quantized under a different recipe.
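
The fingerprint idea sketches simply: hash every quality-affecting flag into one key and refuse to resume from a cache entry whose key differs. A minimal illustration (the flag set shown is hypothetical):

import hashlib, json

def recipe_fingerprint(flags: dict) -> str:
    # Stable hash over every quality-affecting flag; any recipe change
    # (formats, clipping percentile, damping sweep, ...) changes the key.
    blob = json.dumps(flags, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def cache_is_valid(flags: dict, stored_fingerprint: str) -> bool:
    # On resume, a mismatch invalidates the per-layer cache instead of
    # silently emitting a layer quantized under a different recipe.
    return recipe_fingerprint(flags) == stored_fingerprint

flags = {"target_bits": 5.5, "clip_pct": 99.9, "gptq": True,
         "scale_sweep": True, "block_output_match": True}  # hypothetical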

6. vLLM-native serving

The output is compressed-tensors format, runnable in stock vLLM. No custom kernels needed. Mixed-precision per Linear is preserved at load — NVFP4 Linears use FP4 kernels, MXFP8 Linears use FP8 kernels, FP8_SOURCE Linears use the source's native FP8 weights with their original block scales.

Project repository: https://github.com/RobTand/prismaquant

Pareto curve

The allocator builds a Pareto curve over feasible target_bits and reports the knee as the recommended ship target — the smallest size at which predicted Δloss has not yet bent upward. For Gemma 4 31B IT the knee landed cleanly at 5.5 bpp; this artifact ships that point.

target_bits   achieved   predicted Δloss   NVFP4          MXFP8        BF16
4.50          4.500      26 203            100% (29.3G)   0% (0.0G)    0% (0.0G)
4.60          4.597      21 708            98% (28.7G)    2% (0.5G)    0% (0.1G)
4.70          4.700      19 759            96% (28.1G)    3% (1.0G)    1% (0.2G)
4.75          4.750      18 983            96% (28.2G)    2% (0.7G)    1% (0.4G)
4.85          4.849      17 631            95% (27.8G)    3% (0.9G)    2% (0.6G)
5.00          4.999      15 893            93% (27.4G)    3% (1.0G)    3% (1.0G)
5.25          5.249      13 499            90% (26.3G)    6% (1.7G)    5% (1.4G)
5.50          5.498      11 815            86% (25.2G)    8% (2.3G)    6% (1.8G)
6.00          5.999      9 133             79% (23.1G)    12% (3.5G)   9% (2.7G)
7.00          6.998      5 142             69% (20.2G)    14% (4.1G)   17% (5.0G)
8.25          8.249      1 896             61% (18.0G)    9% (2.6G)    30% (8.7G)

predicted Δloss is the allocator's ILP objective in arbitrary units — useful for comparing points on the same curve, not for absolute interpretation. The full curve is also included in this repository as pareto.csv.

The knee is where the predicted-Δloss reduction per added bit-per-weight falls off: moving from 4.50 → 5.50 bpp halves the predicted Δloss (26 203 → 11 815) in 1.0 bpp; moving from 5.50 → 7.00 bpp halves it again (11 815 → 5 142) but needs an additional 1.5 bpp. 5.5 sits at the diminishing-returns inflection.
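
Quantified from the table: the marginal return below the knee is (26 203 − 11 815) / 1.0 ≈ 14 400 predicted-Δloss units per bpp, versus (11 815 − 5 142) / 1.5 ≈ 4 450 per bpp above it — roughly a 3× drop in marginal return at 5.5.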

Quantization

  • Source model: google/gemma-4-31b-it (Apache 2.0, BF16 dense, 60 layers, hidden 5376, 32 heads × 256 kv-head-dim, multimodal — vision + audio towers present in source but not quantized here)
  • Export target: 5.5 bits/weight budget (Pareto knee)
  • Achieved: 5.498 bpp
  • Export format: mixed native compressed-tensors
  • Shards: 6 safetensors shards (~23 GB total)
  • Per-Linear assignment summary across 410 quantizable Linears:
    • linear/NVFP4: 321 modules
    • linear/MXFP8_E4M3: 43 modules
    • linear/BF16 passthrough: 46 modules
  • Calibration: 16 samples × 1536 tokens from a heterogeneous English text mix, FP32 activation cache, per-token 99.9-percentile activation clipping
  • Tied embeddings preserved (lm_head shares weight tensor with embed_tokens)

The model mixes NVFP4 (most attention/MLP Linears), MXFP8 (sensitivity-elevated Linears the allocator pushed up the ladder), and BF16 (a small set of norm/passthrough tensors plus the highest-Fisher-trace Linears). It should not be treated as a uniform FP4 checkpoint.

Validation Status

Smoke tested with:

  • vLLM 0.19.2rc1.dev86+g9a6a66f3b.d20260421
  • FlashInfer kernels for FP8 and NVFP4
  • --quantization compressed-tensors
  • text-only serving via --language-model-only
  • --kv-cache-dtype fp8

Smoke checks passed:

  • /v1/models registration
  • Multi-domain /v1/chat/completions (math, code, reasoning, factual recall, long-form continuation): all coherent, no NaN, no repetition loops, refusal boundaries respected on harmful prompts.
  • Thinking-mode regression: chat_template_kwargs={"enable_thinking": true} correctly emits the <|think|> channel; system-prompt path correctly triggers structured reasoning per the Gemma 4 chat template.
  • Tool-calling sanity: model emits well-formed tool_calls when given a function definition in the request.
  • Chat-deployment perplexity: PPL = 1.14, mean NLL = 0.13 nats/token, p99 NLL = 0.20 (measured on assistant response tokens given chat-formatted factual-continuation prompts).

No formal benchmark or downstream evaluation is included with this upload. Treat this as an experimental serving artifact.

Important: chat-template required

Gemma 4 31B IT is heavily chat-template-saturated. The model expects every prompt to be wrapped in chat-template structure (<bos><start_of_turn>user ...<end_of_turn><start_of_turn>model) and reliably falls into degenerate patterns (onononon..., repetitive tautologies) when given raw text via plain /v1/completions without <bos> and turn markers.

Use /v1/chat/completions for all generation. This is also the API path the model card examples below assume. If you need per-token logprobs for analysis, send the prompt through /v1/chat/completions with logprobs: true — that path applies the chat template and returns logprobs on the assistant's generated tokens.
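
For example, a logprobs request through the chat path (Python with the requests package; the served model name matches the launch command below):

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "gemma4-31b-prismaquant-5p5bit",
        "messages": [{"role": "user", "content": "The capital of France is"}],
        "max_tokens": 8,
        "temperature": 0,
        "logprobs": True,   # per-token logprobs on the generated tokens
        "top_logprobs": 5,  # also return the 5 most likely alternatives
    },
)
for tok in resp.json()["choices"][0]["logprobs"]["content"]:
    print(f"{tok['token']!r}: {tok['logprob']:.3f}")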

vLLM Startup

This is the tested launch shape. Replace paths as needed for your environment. The image must contain vLLM support for compressed-tensors mixed FP8/NVFP4 serving.

docker run -d \
  --name vllm-gemma4-31b-prismaquant \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -p 8000:8000 \
  -e HF_HOME=/hfcache \
  -e HF_HUB_CACHE=/hfcache/hub \
  -v /home/rob/.cache/huggingface:/hfcache \
  vllm-fresh-b12x:latest \
  vllm serve rdtand/Gemma4-31B-IT-PrismaQuant-5.5bit-vllm \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name gemma4-31b-prismaquant-5p5bit \
    --quantization compressed-tensors \
    --trust-remote-code \
    --max-model-len 8192 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --kv-cache-dtype fp8 \
    --language-model-only

Thinking mode

Gemma 4's chat template supports a <|think|> channel that activates when:

  • The request passes chat_template_kwargs={"enable_thinking": true}, OR
  • The request includes a tools field, OR
  • The first message is a system or developer turn.

Without any of those triggers, the model produces direct answers. With thinking enabled, expect a structured-reasoning preamble before the final answer.

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4-31b-prismaquant-5p5bit",
    "messages": [{"role":"user","content":"What is 17 * 23 * 11? Show your reasoning."}],
    "max_tokens": 400,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": true}
  }'

Tool Calling

Gemma 4's chat template includes tool-calling support. Send the function definitions in the standard OpenAI tools field and use tool_choice: "auto" (or a specific function name) to surface tool calls in the assistant message's tool_calls field:

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4-31b-prismaquant-5p5bit",
    "messages": [{"role":"user","content":"What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 200
  }'

Reproducibility notes

The export pipeline used PrismaQuant on the feat/quality-wins-batch1 branch. Two artifact-side amendments were applied after the export to make the checkpoint loadable in stock vLLM; both have since been turned into tracked fixes in the PrismaQuant codebase:

  1. config.json qkv_proj target additions — the export's build_quantization_config originally skipped emitting fused qkv_proj target patterns when any expected sibling was absent from the assignment. Gemma 4's full_attention layers (every 6th layer, with attention_k_eq_v=True) have no v_proj weights on disk because k doubles as v at runtime, so vLLM never got a matching scheme for the fused QKVParallelLinear it constructs at load time. Fixed in code: the emitter now requires ≥2 present siblings agreeing on format and emits the fused target from those.
  2. Aliased v_proj weight tensors for k_eq_v layers — this artifact includes a small additional safetensors shard (model-00006-of-00006.safetensors, ~35 MB) containing 24 v_proj.* tensors (4 per layer × 6 full_attention layers) that mirror the layer's k_proj.* weights. This makes vLLM's QKVParallelLinear concatenation (q + k + v) behave identically to the source model's runtime k_eq_v alias (V = K). Tracked for automation in a future PrismaQuant export step so subsequent Gemma 4 exports include these tensors without manual intervention.

Both amendments are baked into the uploaded checkpoint — no user action is required to load it. The notes are included for transparency about how the artifact was produced.
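
For the curious, the v_proj aliasing in amendment 2 amounts to re-saving each k_eq_v layer's k_proj tensors under v_proj names. A minimal sketch with the safetensors API — the layer indices, shard names, and tensor suffixes here are illustrative assumptions, not a record of the exact commands used:

import torch
from safetensors.torch import load_file, save_file

FULL_ATTN_LAYERS = [5, 11, 17, 23, 29, 35]               # hypothetical indices
SUFFIXES = ["weight_packed", "weight_scale",              # "4 per layer",
            "weight_global_scale", "input_global_scale"]  # per the note above

src = load_file("model-00001-of-00006.safetensors")  # wherever k_proj lives
alias = {}
for i in FULL_ATTN_LAYERS:
    for s in SUFFIXES:
        k = f"model.layers.{i}.self_attn.k_proj.{s}"
        v = f"model.layers.{i}.self_attn.v_proj.{s}"
        alias[v] = src[k].clone()                    # V = K at runtime

save_file(alias, "model-00006-of-00006.safetensors")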

Limitations

  • This is an experimental quantized artifact.
  • Text-only. The multimodal vision and audio towers from the source Gemma 4 checkpoint are not included. Image and audio input will not work. Use the --language-model-only vLLM flag.
  • Use /v1/chat/completions rather than /v1/completions. The model is heavily chat-template-saturated and produces degenerate output on raw text continuation.
  • Startup can be slow on first launch because vLLM compiles graphs, autotunes FlashInfer kernels, and captures CUDA graphs. Expect ~3 minutes on a fresh image, faster on subsequent runs (compile cache).
  • The original Gemma 4 license applies to this derivative.

License

This checkpoint is a derivative of google/gemma-4-31b-it and is provided under the same Apache 2.0 license. See the original model repository and Gemma 4 license for details.
