# Mistral Medium 3.5 128B PrismaQuant 4.75 vLLM
This is a PrismaQuant 4.75-bit mixed-native quantization of
`mistralai/Mistral-Medium-3.5-128B`, exported for vLLM in the
`compressed-tensors` format.

This artifact is intended for vLLM/FlashInfer serving. It is not a vanilla Transformers checkpoint.
## What is PrismaQuant?
PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Rather
than forcing the whole model into one dtype, it predicts the loss penalty of
quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W,
where H_trace is the Linear's Fisher diagonal trace from a calibration probe
and MSE_W is the format-specific weight reconstruction error. A small ILP
then picks per-Linear formats from a menu of {NVFP4, MXFP8_E4M3, FP8_SOURCE, BF16} to minimize total predicted Δloss under a target bits-per-weight budget.
Because the cost model is principled and additive across Linears, it lets the toolkit stack several layers of optimization without each one stepping on the others. The shipped checkpoint was produced with the full stack below.
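The additive cost model lends itself to a short sketch. Everything below is illustrative only: the format table, Fisher traces, and the greedy upgrade loop (a stand-in for the real ILP solver) are invented for the example.

```python
# Sketch of the PrismaQuant-style allocation: predict a per-Linear loss
# penalty via delta_loss ~ 0.5 * H_trace * MSE_W, then spend a global
# bits-per-weight budget on the most sensitive Linears first.
# (Hypothetical numbers; the shipped toolkit solves this as an ILP.)

FORMATS = {  # format -> (effective bits/weight, relative weight MSE)
    "NVFP4": (4.5, 1.0),
    "MXFP8_E4M3": (8.25, 0.05),
    "BF16": (16.0, 0.0),
}

def predict_delta_loss(h_trace, mse_w):
    """Second-order estimate of the loss penalty of quantizing one Linear."""
    return 0.5 * h_trace * mse_w

def allocate(linears, budget_bpp):
    """Greedy stand-in for the ILP: start every Linear at NVFP4, then upgrade
    the worst-penalty Linears until the bpp budget is exhausted.
    `linears` is a list of (param_count, fisher_trace) pairs."""
    total_params = sum(n for n, _ in linears)
    assign = {i: "NVFP4" for i in range(len(linears))}

    def total_bpp():
        return sum(n * FORMATS[assign[i]][0]
                   for i, (n, _) in enumerate(linears)) / total_params

    # most sensitive (largest predicted penalty at NVFP4) first
    order = sorted(range(len(linears)),
                   key=lambda i: -predict_delta_loss(linears[i][1],
                                                     FORMATS["NVFP4"][1]))
    for i in order:
        for fmt in ("MXFP8_E4M3", "BF16"):
            prev = assign[i]
            assign[i] = fmt
            if total_bpp() > budget_bpp:
                assign[i] = prev  # over budget: revert and stop upgrading
                break
    return assign

# two equal-size Linears; the second has a far larger Fisher trace
plan = allocate([(1000, 0.01), (1000, 5.0)], budget_bpp=6.5)
```

Under this toy budget the sensitive Linear is upgraded to MXFP8 while the insensitive one stays at NVFP4, which is the qualitative behavior the cost model is meant to produce.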
### 1. Sensitivity-driven format allocation
- Fisher-weighted ILP picks one of NVFP4 / MXFP8 / FP8_SOURCE / BF16 per Linear under the global bpp budget.
- Pareto sweep across bpp targets surfaces the knee — the smallest size at which predicted Δloss has not yet bent upward.
- Fused-sibling joint scales: q/k/v projections (and gate/up) share one NVFP4 `weight_global_scale`, matching vLLM's per-tensor expectation.
- Per-Linear `input_global_scale` calibration from cached activations (`max_abs / 6.0`), so vLLM's runtime activation quantization uses a correctly sized dynamic range.
### 2. Per-Linear act-aware optimization passes
The following passes run on each NVFP4-assigned Linear, in order, each gated to "improve or keep":
- GPTQ one-shot OBS rounding with block-wise error propagation.
- Activation clipping at the 99.9th-percentile per token before computing the Hessian, to bound its condition number. Validated as the largest single win on a Qwen3-0.6B audit (≈ −0.91 PPL on the validator suite).
- GPTQ damping sweep — try 5 candidate Hessian regularizers, keep the one with smallest activation-weighted reconstruction error.
- Closed-form scale sweep — joint (per-group scale, rounding) search on the NVFP4 codebook (a closed-form analog of AutoRound's SGD-based search).
- Block-output match — greedy per-Linear scale refinement against a surrounding-block FP16 reference forward, capturing inter-Linear composition error that per-Linear MSE can't see.
- Do-no-harm gate — for every Linear, compare the post-pass weight to a pure RTN baseline on the cached activation distribution. If RTN is better, revert. This guarantees no Linear ships worse than RTN, whatever the earlier passes did.
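The do-no-harm gate reduces to one comparison on the cached activations. A minimal numpy sketch; the RTN baseline and the error metric here are simplified stand-ins for the toolkit's internals:

```python
import numpy as np

def rtn(w: np.ndarray, scale: float) -> np.ndarray:
    """Plain round-to-nearest baseline on a uniform grid of the given scale."""
    return np.round(w / scale) * scale

def act_weighted_err(w_hat, w_ref, x) -> float:
    """Activation-weighted reconstruction error ||X (W_hat - W_ref)^T||_F^2,
    measured on the cached calibration activations X."""
    return float(np.sum((x @ (w_hat - w_ref).T) ** 2))

def do_no_harm(w_ref, w_candidate, x, scale):
    """Keep the optimized candidate only if it beats RTN on the cached
    activation distribution; otherwise revert to the RTN baseline."""
    w_rtn = rtn(w_ref, scale)
    if act_weighted_err(w_candidate, w_ref, x) <= act_weighted_err(w_rtn, w_ref, x):
        return w_candidate
    return w_rtn

# a candidate that an earlier pass made worse than RTN gets reverted
w_ref = np.array([[0.4, -0.6]])
kept = do_no_harm(w_ref, w_ref + 10.0, np.eye(2), scale=1.0)
```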
### 3. Numerical hygiene
- Norm parameters (RMSNorm γ, LayerNorm γ/β) are kept at FP32 — they multiply every token's hidden state at every block, so BF16 rounding compounds aggressively. The size cost is a few MB total.
- Activation cache stored at FP32 so downstream Hessian computations don't inherit BF16 rounding.
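To see why the norm parameters stay at FP32: BF16 keeps only 7 explicit mantissa bits, so near 1.0 its rounding step is 2⁻⁷ ≈ 0.008, and a typical near-1 γ loses most of its fractional precision. A small numpy demonstration, simulating BF16 by truncating a float32 to its top 16 bits (true round-to-nearest-even would be slightly more accurate, but the magnitude is the same):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Simulate BF16 by zeroing the low 16 bits of a float32."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

gamma = np.float32(1.0039)  # a plausible near-1 RMSNorm weight (made up)
err = abs(float(to_bf16(np.array([gamma]))[0]) - float(gamma))
# err is on the order of the 2^-7 step: the entire fractional part is lost,
# and this rounding is re-applied at every block for every token.
```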
### 4. Format-level details
- NVFP4 — 4-bit FP4 + per-16-group FP8 scale (~4.5 bpp effective).
- MXFP8_E4M3 — 8-bit FP8 + per-32-group E8M0 scale (~8.25 bpp).
- FP8_SOURCE — for natively-FP8 source checkpoints, the source FP8 weights and `weight_scale_inv` are copied verbatim; the BF16 view of these tensors is a lossless dequant, so the allocator treats this format as Δloss = 0 for any Linear that ships in the source at FP8.
- BF16 — passthrough for tensors the allocator marks ineligible.
### 5. Pre-ship validation
- Generation-sanity check (filters NaN, repetition loops, nonsense).
- Bimodal-failure perplexity gate with a hard p99 threshold on per-prompt NLL, not just the mean. This caught a real regression during development of an earlier artifact, where 2 of 10 prompts scored normally and 8 scored at NLL ≈ 10 — a mean-only check would have passed it.
- Cache-fingerprint manifest on the per-layer export cache: every quality-affecting flag is hashed in, and a mismatched flag set on resume invalidates the cache rather than silently emitting a layer that was quantized under a different recipe.
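The tail-gate idea can be sketched directly. The thresholds below are invented for illustration; the point is that a percentile check catches a broken tail that a mean check on its own can miss:

```python
import numpy as np

def ppl_gate(per_prompt_nll, mean_max=3.0, p99_max=6.0) -> bool:
    """Pass only if BOTH the mean and the 99th percentile of per-prompt NLL
    clear their thresholds (hypothetical thresholds)."""
    nll = np.asarray(per_prompt_nll, dtype=np.float64)
    return bool(nll.mean() <= mean_max and np.percentile(nll, 99) <= p99_max)

healthy = [2.0] * 10        # uniform scores: passes both checks
tail = [2.0] * 9 + [10.0]   # mean is still 2.8, but the tail is broken
```

`healthy` passes the gate; `tail` would slip through a mean-only threshold of 3.0 yet fails the p99 check.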
### 6. vLLM-native serving
The output is in the `compressed-tensors` format and runs in stock vLLM; no
custom kernels are needed. Mixed precision per Linear is preserved at load —
NVFP4 Linears use FP4 kernels, MXFP8 Linears use FP8 kernels, and FP8_SOURCE
Linears use the source's native FP8 weights with their original block scales.
Project repository: https://github.com/RobTand/prismaquant
## Quantization
- Source model: `mistralai/Mistral-Medium-3.5-128B`
- Export target: 4.75 bits/weight budget
- Export format: mixed-native `compressed-tensors`
- Shards: 17 safetensors shards
- Approximate local size: 79 GiB
- Quantized linear assignment summary:
  - linear/NVFP4: 478 modules
  - linear/FP8_SOURCE_STATIC: 89 modules
  - passthrough heads: BF16/FP32
  - passthrough non-linear/layer tensors: FP32
The model mixes NVFP4 and static FP8 source-preserving layers. It should not be treated as a uniform FP4 checkpoint.
## Validation Status
Smoke tested with:
- vLLM `0.20.1rc1.dev55+g3f1a4bb63.d20260429`
- FlashInfer kernels for FP8 and NVFP4
- `--quantization compressed-tensors`
- text-only serving via `--language-model-only`
- optional EAGLE speculative decoding with
  `mistralai/Mistral-Medium-3.5-128B-EAGLE`
Smoke checks passed: `/v1/models`, `/v1/completions`, `/v1/chat/completions`.
No formal benchmark, perplexity, or safety evaluation is included with this upload. Treat this as an experimental serving artifact.
## vLLM Startup
This is the tested launch shape. Replace paths as needed for your environment. The image must contain vLLM support for compressed-tensors mixed FP8/NVFP4 serving.
```shell
docker run -d \
  --name vllm-mistral-medium-35-prismaquant-475-vllm \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  --security-opt label=disable \
  -p 8000:8000 \
  -e HF_HOME=/hfcache \
  -e HF_HUB_CACHE=/hfcache/hub \
  -e HF_MODULES_CACHE=/work/hf_modules \
  -e TRANSFORMERS_CACHE=/work/tf_cache \
  -v /home/rob/.cache/huggingface:/hfcache \
  -v /models:/models:ro \
  vllm-eugr-v020:latest \
  vllm serve rdtand/Mistral-Medium-3.5-128B-PrismaQuant-4.75-vllm \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name mistral-medium-3.5-prismaquant-4.75 \
    --config-format hf \
    --tokenizer mistralai/Mistral-Medium-3.5-128B \
    --tokenizer-mode mistral \
    --trust-remote-code \
    --quantization compressed-tensors \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --language-model-only \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```
## Optional EAGLE
The local smoke test used Mistral's EAGLE draft model
`mistralai/Mistral-Medium-3.5-128B-EAGLE`. At the time of testing, that EAGLE
repository needed a Hugging Face `config.json` sidecar for this vLLM path. If
you have a local EAGLE checkout with that config present, add:
```shell
--speculative-config '{"model":"/models/Mistral-Medium-3.5-128B-EAGLE","method":"eagle","num_speculative_tokens":3,"draft_tensor_parallel_size":1,"quantization":"fp8"}'
```
In a tiny smoke probe, vLLM reported average draft acceptance around 40%. That number is not a benchmark.
## Tool Calling
Tool calling is not broken in the weights, but vLLM needs the Mistral tool-call
parser enabled. Without these flags, the model can emit raw Mistral tool markup
such as `[TOOL_CALLS]...` in `message.content` while `tool_calls` remains empty:

```shell
--enable-auto-tool-choice \
--tool-call-parser mistral
```
The startup command above includes these flags.
## Limitations
- This is an experimental quantized artifact.
- Text-only serving was validated. Multimodal/image serving was not validated.
- Startup can be slow on first launch because vLLM compiles graphs, autotunes FlashInfer kernels, and captures CUDA graphs.
- EAGLE support depends on the vLLM build and the draft model config format.
- The original Mistral license applies to this derivative.
## License
This checkpoint is a derivative of mistralai/Mistral-Medium-3.5-128B and is
provided under the same Modified MIT license terms. See the included LICENSE
file and the original model repository for details.