Mistral Medium 3.5 128B PrismaQuant 4.75 vLLM

This is a PrismaQuant 4.75-bit mixed-native quantization of mistralai/Mistral-Medium-3.5-128B, exported for vLLM using the compressed-tensors format.

This artifact is intended for vLLM/FlashInfer serving. It is not a vanilla Transformers checkpoint.

What is PrismaQuant?

PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Rather than forcing the whole model into one dtype, it predicts the loss penalty of quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W, where H_trace is the Linear's Fisher diagonal trace from a calibration probe and MSE_W is the format-specific weight reconstruction error. A small ILP then picks per-Linear formats from a menu of {NVFP4, MXFP8_E4M3, FP8_SOURCE, BF16} to minimize total predicted Δloss under a target bits-per-weight budget.
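
The shipped allocator solves a small ILP; the sketch below illustrates the same cost model with a greedy budget-constrained stand-in. Every name in it (numel, h_trace, mse, the format table) is illustrative rather than PrismaQuant's actual API, and FP8_SOURCE is omitted because the allocator treats it as Δloss = 0 wherever it is available.

    # Greedy stand-in for the Fisher-weighted format-allocation ILP.
    # Field names and the format table are illustrative, not the toolkit's API.
    FORMAT_BPP = {"NVFP4": 4.5, "MXFP8_E4M3": 8.25, "BF16": 16.0}

    def predicted_dloss(h_trace, mse_w):
        # Second-order estimate: dloss ~ 0.5 * Fisher-diagonal trace * weight MSE
        return 0.5 * h_trace * mse_w

    def allocate(linears, budget_bpw):
        """linears: [{'name', 'numel', 'h_trace', 'mse': {fmt: mse}}, ...]
        Returns {name: fmt} keeping mean bits/weight <= budget_bpw."""
        total = sum(l["numel"] for l in linears)
        assign = {l["name"]: "NVFP4" for l in linears}   # start at the cheapest format
        bits = sum(FORMAT_BPP["NVFP4"] * l["numel"] for l in linears)

        def cost(l, fmt):
            return predicted_dloss(l["h_trace"], l["mse"][fmt])

        while True:
            best = None
            for l in linears:
                cur = assign[l["name"]]
                for fmt, bpw in FORMAT_BPP.items():
                    extra = (bpw - FORMAT_BPP[cur]) * l["numel"]
                    if extra <= 0 or bits + extra > budget_bpw * total:
                        continue                          # not an upgrade, or over budget
                    gain = cost(l, cur) - cost(l, fmt)    # predicted dloss saved
                    if gain > 0 and (best is None or gain / extra > best[0]):
                        best = (gain / extra, l, fmt, extra)
            if best is None:
                return assign                             # no affordable upgrade left
            _, l, fmt, extra = best
            assign[l["name"]] = fmt
            bits += extra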

Because the cost model is principled and additive across Linears, it lets the toolkit stack several layers of optimization without each one stepping on the others. The shipped checkpoint was produced with the full stack below.

1. Sensitivity-driven format allocation

  • Fisher-weighted ILP picks one of NVFP4 / MXFP8 / FP8_SOURCE / BF16 per Linear under the global bpp budget.
  • Pareto sweep across bpp targets surfaces the knee — the smallest budget at which predicted Δloss has not yet started to climb sharply.
  • Fused-sibling joint scales: q/k/v projections (and gate/up) share one NVFP4 weight_global_scale, matching vLLM's per-tensor expectation.
  • Per-Linear input_global_scale calibration from cached activations (max_abs / 6.0), so vLLM's runtime activation quantization uses a correctly-sized dynamic range. See the sketch after this list.
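
A minimal sketch of those two calibration steps, assuming weights of shape [out, in] and cached activations of shape [tokens, in]. Only the max_abs / 6.0 rule is taken from the recipe above; the helper names are illustrative, and the exact weight_global_scale convention is elided.

    import torch

    FP4_MAX = 6.0  # largest magnitude representable in FP4 (E2M1)

    def input_global_scale(cached_acts: torch.Tensor) -> torch.Tensor:
        # Calibrate from cached activations: max |x| / 6.0, so the runtime
        # activation quantizer's dynamic range matches what the Linear sees.
        return cached_acts.abs().amax().float() / FP4_MAX

    def fused_group_amax(sibling_weights: list[torch.Tensor]) -> torch.Tensor:
        # q/k/v (or gate/up) are fused into one matmul at serve time, so they
        # must share a single weight_global_scale; derive it from the joint
        # amax of the whole group so no sibling clips.
        return torch.stack([w.abs().amax() for w in sibling_weights]).amax()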

2. Per-Linear act-aware optimization passes

The following passes run for each NVFP4-assigned Linear, in order; each pass is gated to "improve or keep":

  • GPTQ one-shot OBS rounding with block-wise error propagation.
  • Activation clipping at the 99.9th-percentile per token before computing the Hessian, to bound its condition number. Validated as the largest single win on a Qwen3-0.6B audit (≈ −0.91 PPL on the validator suite).
  • GPTQ damping sweep — try 5 candidate Hessian regularizers, keep the one with smallest activation-weighted reconstruction error.
  • Closed-form scale sweep — joint (per-group scale, rounding) search on the NVFP4 codebook (a closed-form analog of AutoRound's SGD-based search).
  • Block-output match — greedy per-Linear scale refinement against a surrounding-block FP16 reference forward, capturing inter-Linear composition error that per-Linear MSE can't see.
  • Do-no-harm gate — for every Linear, compare the post-pass weight to a pure RTN baseline on the cached activation distribution. If RTN is better, revert. This guarantees no Linear ends up worse than RTN; the only cost is the compute spent on the passes. A minimal version is sketched after this list.
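
A minimal sketch of that final gate, assuming weights of shape [out, in] and cached calibration activations of shape [tokens, in]; the function names are illustrative.

    import torch

    def act_weighted_err(w_q: torch.Tensor, w_ref: torch.Tensor, x: torch.Tensor) -> float:
        # Activation-weighted reconstruction error: how far the quantized
        # Linear's outputs drift from the reference outputs on calibration data.
        return (x @ (w_q - w_ref).T).pow(2).mean().item()

    def do_no_harm(w_ref, w_after_passes, w_rtn, cached_x):
        # Keep the optimized weight only if it is at least as good as plain
        # RTN on the cached activation distribution; otherwise revert.
        if act_weighted_err(w_after_passes, w_ref, cached_x) <= act_weighted_err(w_rtn, w_ref, cached_x):
            return w_after_passes
        return w_rtn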

3. Numerical hygiene

  • Norm parameters (RMSNorm γ, LayerNorm γ/β) are kept at FP32 — they multiply every token's hidden state at every block, so BF16 rounding compounds aggressively (see the quick check after this list). The size cost is a few MB total.
  • Activation cache stored at FP32 so downstream Hessian computations don't inherit BF16 rounding.
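
A quick numeric check of the norm-parameter point (illustrative values):

    import torch

    # BF16 keeps only 8 significand bits, so a gamma near 1.0 can pick up
    # roughly 0.4% relative rounding error, re-applied at every block.
    gamma = torch.full((4096,), 1.0123456789, dtype=torch.float32)
    rel_err = ((gamma.to(torch.bfloat16).float() - gamma).abs() / gamma).max().item()
    print(f"max relative BF16 rounding error: {rel_err:.2e}")  # on the order of 1e-3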

4. Format-level

  • NVFP4 — 4-bit FP4 + per-16-group FP8 scale (~4.5 bpp effective; arithmetic below).
  • MXFP8_E4M3 — 8-bit FP8 + per-32-group E8M0 scale (~8.25 bpp).
  • FP8_SOURCE — for natively-FP8 source checkpoints, the source FP8 weights and weight_scale_inv are copied verbatim; the BF16 view of these tensors is a lossless dequant, so the allocator treats this format as Δloss = 0 for any Linear that ships in the source at FP8.
  • BF16 — passthrough for tensors the allocator marks ineligible.
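
The effective bits-per-weight figures above are just element bits plus the amortized per-group scale:

    def effective_bpw(elem_bits, scale_bits, group_size):
        # element bits plus the per-group scale amortized over the group
        return elem_bits + scale_bits / group_size

    print(effective_bpw(4, 8, 16))   # NVFP4:  4 + 8/16 = 4.5
    print(effective_bpw(8, 8, 32))   # MXFP8:  8 + 8/32 = 8.25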

5. Pre-ship validation

  • Generation-sanity check (filters NaN, repetition loops, nonsense).
  • Bimodal-failure perplexity gate with a hard p99 threshold on per-prompt NLL, not just the mean. It caught a real regression during development of an earlier artifact, where 8 of 10 prompts scored normally and 2 scored at NLL ≈ 10 — the mean check alone would have passed it. A minimal version is sketched after this list.
  • Cache-fingerprint manifest on the per-layer export cache: every quality-affecting flag is hashed in, and a mismatched flag set on resume invalidates the cache rather than silently emitting a layer that was quantized under a different recipe.
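
A minimal sketch of the bimodal-failure gate; the thresholds here are illustrative, not the shipped values.

    import math

    def ppl_gate(per_prompt_nll, mean_max=4.0, p99_max=6.0):
        # Fail on the p99 of per-prompt NLL, not just the mean, so a few
        # catastrophic prompts cannot hide inside a plausible average.
        nlls = sorted(per_prompt_nll)
        mean = sum(nlls) / len(nlls)
        p99 = nlls[min(len(nlls) - 1, math.ceil(0.99 * len(nlls)) - 1)]
        return mean <= mean_max and p99 <= p99_max

    # The failure mode described above: 8 normal prompts, 2 catastrophic ones.
    # The mean (~3.8) would pass; the p99 check fails it.
    print(ppl_gate([2.0, 2.1, 2.1, 2.2, 2.2, 2.3, 2.3, 2.4, 10.1, 10.3]))  # False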

6. vLLM-native serving

The output is compressed-tensors format, runnable in stock vLLM. No custom kernels needed. Mixed-precision per Linear is preserved at load — NVFP4 Linears use FP4 kernels, MXFP8 Linears use FP8 kernels, FP8_SOURCE Linears use the source's native FP8 weights with their original block scales.

Project repository: https://github.com/RobTand/prismaquant

Quantization

  • Source model: mistralai/Mistral-Medium-3.5-128B
  • Export target: 4.75 bits/weight budget
  • Export format: mixed native compressed-tensors
  • Shards: 17 safetensors shards
  • Approximate local size: 79 GiB
  • Quantized linear assignment summary:
    • linear/NVFP4: 478 modules
    • linear/FP8_SOURCE_STATIC: 89 modules
    • passthrough heads: BF16/FP32
    • passthrough non-linear/layer tensors: FP32

The model mixes NVFP4 and static FP8 source-preserving layers. It should not be treated as a uniform FP4 checkpoint.

Validation Status

Smoke tested with:

  • vLLM 0.20.1rc1.dev55+g3f1a4bb63.d20260429
  • FlashInfer kernels for FP8 and NVFP4
  • --quantization compressed-tensors
  • text-only serving via --language-model-only
  • optional EAGLE speculative decoding with mistralai/Mistral-Medium-3.5-128B-EAGLE

Smoke checks passed:

  • /v1/models
  • /v1/completions
  • /v1/chat/completions

No formal benchmark, perplexity, or safety evaluation is included with this upload. Treat this as an experimental serving artifact.

vLLM Startup

This is the tested launch shape. Replace paths as needed for your environment. The image must contain vLLM support for compressed-tensors mixed FP8/NVFP4 serving.

docker run -d \
  --name vllm-mistral-medium-35-prismaquant-475-vllm \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  --security-opt label=disable \
  -p 8000:8000 \
  -e HF_HOME=/hfcache \
  -e HF_HUB_CACHE=/hfcache/hub \
  -e HF_MODULES_CACHE=/work/hf_modules \
  -e TRANSFORMERS_CACHE=/work/tf_cache \
  -v /home/rob/.cache/huggingface:/hfcache \
  -v /models:/models:ro \
  vllm-eugr-v020:latest \
  vllm serve rdtand/Mistral-Medium-3.5-128B-PrismaQuant-4.75-vllm \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name mistral-medium-3.5-prismaquant-4.75 \
    --config-format hf \
    --tokenizer mistralai/Mistral-Medium-3.5-128B \
    --tokenizer-mode mistral \
    --trust-remote-code \
    --quantization compressed-tensors \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --kv-cache-dtype fp8 \
    --language-model-only \
    --enable-auto-tool-choice \
    --tool-call-parser mistral

Optional EAGLE

The local smoke test used Mistral's EAGLE draft model: mistralai/Mistral-Medium-3.5-128B-EAGLE.

At the time of testing, that EAGLE repository needed a Hugging Face config.json sidecar for this vLLM path. If you have a local EAGLE checkout with that config present, add:

    --speculative-config '{"model":"/models/Mistral-Medium-3.5-128B-EAGLE","method":"eagle","num_speculative_tokens":3,"draft_tensor_parallel_size":1,"quantization":"fp8"}'

In a tiny smoke probe, vLLM reported average draft acceptance around 40%. That number is not a benchmark.

Tool Calling

Tool calling is not broken in the weights, but vLLM needs the Mistral tool-call parser enabled. Without these flags, the model can emit raw Mistral tool markup such as [TOOL_CALLS]... in message.content while tool_calls remains empty:

--enable-auto-tool-choice \
--tool-call-parser mistral

The startup command above includes these flags.
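
A quick client-side check that the parser is active, against the server started above. The openai package and the example tool schema are assumptions, not part of this repository.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="mistral-medium-3.5-prismaquant-4.75",
        messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
        tools=tools,
    )

    msg = resp.choices[0].message
    # With the parser flags set, the call lands in tool_calls; without them,
    # raw [TOOL_CALLS] markup can show up in msg.content instead.
    print(msg.tool_calls)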

Limitations

  • This is an experimental quantized artifact.
  • Text-only serving was validated. Multimodal/image serving was not validated.
  • Startup can be slow on first launch because vLLM compiles graphs, autotunes FlashInfer kernels, and captures CUDA graphs.
  • EAGLE support depends on the vLLM build and the draft model config format.
  • The original Mistral license applies to this derivative.

License

This checkpoint is a derivative of mistralai/Mistral-Medium-3.5-128B and is provided under the same Modified MIT license terms. See the included LICENSE file and the original model repository for details.
