FusenAI/gemma-4-26B-A4B-it-NVFP4

NVFP4-quantized version of google/gemma-4-26B-A4B-it in modelopt format, optimized for vLLM on NVIDIA Blackwell (SM120) GPUs.

What makes this different

This model follows NVIDIA's quantization pattern from Gemma-4-31B-IT-NVFP4:

  • MoE experts + MLP layers: Quantized to NVFP4 (4-bit floating point)
  • Self-attention (q/k/v/o projections): Kept in BF16 (not quantized)
  • Router, vision tower, lm_head: Kept in BF16
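The exclusion pattern above can be expressed as a simple module-name filter. A minimal sketch — the regexes and module names here are illustrative, not the exact checkpoint keys:

```python
import re

# Hypothetical name filter mirroring the quantization pattern above.
# Attention projections, router, vision tower, and lm_head stay BF16;
# everything else (MoE experts, MLP layers) carries NVFP4 weights.
EXCLUDE_PATTERNS = [
    r"self_attn\.(q|k|v|o)_proj",  # attention kept in BF16
    r"\.router\b",                 # MoE router kept in BF16
    r"vision_tower",               # vision encoder kept in BF16
    r"lm_head",                    # output head kept in BF16
]

def is_quantized(module_name: str) -> bool:
    """True if the module would carry NVFP4 weights under this scheme."""
    return not any(re.search(p, module_name) for p in EXCLUDE_PATTERNS)

# MoE expert MLPs are quantized; attention projections are not.
assert is_quantized("model.layers.0.mlp.experts.3.gate_proj")
assert not is_quantized("model.layers.0.self_attn.q_proj")
```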

Other available NVFP4 quantizations of this model also quantize the attention layers, which causes:

  1. Quality degradation from aggressive attention quantization
  2. vLLM's QKV fusion scale bug (fused projections take max() of global scales, causing underflow)
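The underflow failure mode can be shown numerically. This is a deliberately simplified sketch — it uses a single per-tensor scale rather than NVFP4's two-level (per-block FP8 + global) scaling — but it shows why sharing the max() of global scales across fused q/k/v projections zeroes out a low-magnitude tensor:

```python
import numpy as np

# Representable FP4-E2M1 magnitudes (max normal value is 6.0).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w, scale):
    """Round w/scale to the nearest FP4-E2M1 level (sign handled separately)."""
    mags = np.abs(w) / scale
    idx = np.abs(mags[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)
    return np.sign(w) * FP4_LEVELS[idx] * scale

# Two projections with very different dynamic ranges, as q/k/v can have.
q_w = np.array([3.0, -5.0, 2.0])     # large-magnitude weights
k_w = np.array([0.02, -0.03, 0.01])  # small-magnitude weights

# Each tensor with its own scale: k's values survive quantization.
own = quantize_fp4(k_w, np.abs(k_w).max() / 6.0)

# Fused scale: max() of the per-projection scales (the buggy path).
fused_scale = max(np.abs(q_w).max() / 6.0, np.abs(k_w).max() / 6.0)
fused = quantize_fp4(k_w, fused_scale)

print(own)    # k weights round-trip exactly with their own scale
print(fused)  # k weights underflow to 0 under the shared (max) scale
```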

Performance

Tested on RTX 5090 (Blackwell SM120, 32 GB VRAM) with vLLM 0.19.x:

Best config: CUDA graphs + BF16 KV cache

vllm serve FusenAI/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --max-model-len 4096

Batch    Gen tok/s   Avg latency
B=1          127     2.0 s
B=4          383     2.7 s
B=16       1,221     3.3 s
B=32       2,071     3.9 s
B=128      2,059     4.0 s

Config comparison (single request)

Config                  tok/s   Notes
--enforce-eager            18   No CUDA graphs
Default (CUDA graphs)     127   7x faster than eager
--kv-cache-dtype fp8       31   2x KV capacity but 4x slower (FlashInfer FP8 overhead)

Model details

Property               Value
VRAM (model weights)   17.2 GiB
KV cache (BF16)        ~10 GiB -> ~43K tokens (~15x concurrency at 4096 ctx)
KV cache (FP8)         ~10 GiB -> ~87K tokens (~30x concurrency at 4096 ctx)
Size on disk           17 GB
Quantization           NVFP4 (MLP/MoE only), BF16 (attention)
Group size             16
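The two KV-cache rows follow from simple arithmetic: FP8 halves the bytes per cached token, so the same memory budget holds twice as many tokens. A back-of-envelope check using the table's figures:

```python
# Back-of-envelope KV-cache capacity from the figures in the table above.
GIB = 1024**3

kv_budget = 10 * GIB   # approximate VRAM left for KV cache
bf16_tokens = 43_000   # measured capacity at BF16 (from the table)

bytes_per_token_bf16 = kv_budget / bf16_tokens   # ~250 KiB per token
bytes_per_token_fp8 = bytes_per_token_bf16 / 2   # FP8 halves the per-token cost

fp8_tokens = kv_budget / bytes_per_token_fp8
print(round(fp8_tokens))  # 86000 -- matching the ~87K tokens in the table
```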

Quick Start

# Recommended (CUDA graphs enabled by default)
vllm serve FusenAI/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --max-model-len 4096

# For maximum KV cache capacity (2x tokens, slower per-request)
vllm serve FusenAI/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --max-model-len 4096
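Once the server is up, requests go to vLLM's OpenAI-compatible endpoint. A minimal stdlib-only client sketch — the prompt, port, and sampling parameters are illustrative:

```python
import json
import urllib.request

def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Chat-completions payload for the served model."""
    return {
        "model": "FusenAI/gemma-4-26B-A4B-it-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send one chat request to the OpenAI-compatible endpoint."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain NVFP4 quantization in one sentence."))
```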

How this was created

  1. Started from RedHatAI/gemma-4-26B-A4B-it-NVFP4 (compressed-tensors format)
  2. Dequantized all 115 self-attention projections (q/k/v/o across 30 layers) back to BF16 using FP4-E2M1 lookup table
  3. Converted from compressed-tensors to modelopt format:
    • weight_packed -> weight (same data)
    • weight_global_scale (divisor) -> weight_scale_2 (1/divisor)
    • input_global_scale (divisor) -> input_scale (1/divisor)
  4. Updated quantization config to match NVIDIA's modelopt pattern with proper exclude_modules
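Step 3's tensor mapping is mechanical; a sketch of the per-tensor rename and divisor-to-reciprocal conversion (the real conversion operates over safetensors shards, and the helper name here is illustrative):

```python
import numpy as np

def convert_tensor(name: str, tensor: np.ndarray):
    """Map one compressed-tensors entry to its modelopt equivalent."""
    if name.endswith("weight_packed"):
        return name.replace("weight_packed", "weight"), tensor  # same bytes
    if name.endswith("weight_global_scale"):
        # Stored as a divisor; modelopt expects the reciprocal multiplier.
        return name.replace("weight_global_scale", "weight_scale_2"), 1.0 / tensor
    if name.endswith("input_global_scale"):
        return name.replace("input_global_scale", "input_scale"), 1.0 / tensor
    return name, tensor  # block scales and BF16 weights pass through unchanged

name, t = convert_tensor("layer.mlp.weight_global_scale", np.array(8.0))
print(name, t)  # layer.mlp.weight_scale_2 0.125
```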

Tensor Format

NVFP4 layers (MoE/MLP):

  • weight: uint8 (packed FP4-E2M1, two values per byte)
  • weight_scale: float8_e4m3fn (per-block scale, group_size=16)
  • weight_scale_2: float32 (global scale, scalar)
  • input_scale: float32 (activation global scale, scalar)

BF16 layers (attention):

  • weight: bfloat16 (standard dense weight)
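Putting the pieces together, an NVFP4 weight dequantizes as lut[code] * weight_scale * weight_scale_2. A NumPy sketch — the low-nibble-first packing order is an assumption; real checkpoints may pack high-nibble first:

```python
import numpy as np

# FP4-E2M1 code -> value lookup: codes 0-7 are positive magnitudes,
# codes 8-15 are the same magnitudes with the sign bit set.
E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_nvfp4(packed, block_scale, weight_scale_2, group_size=16):
    """Unpack two FP4 codes per byte and apply block + global scales.

    `packed` is uint8 with shape (rows, cols // 2); `block_scale` has one
    entry per group of `group_size` output values along each row.
    Assumed layout: low nibble first.
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    vals = E2M1_LUT[codes]
    scales = np.repeat(block_scale.astype(np.float32), group_size, axis=1)
    return vals * scales * np.float32(weight_scale_2)
```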

vLLM Compatibility

  • Tested with: vLLM 0.19.x with --quantization modelopt
  • Required patches (for Gemma 4 heterogeneous attention): PRs #38891, #39084, #39406 (merged in vLLM 0.19.x)
  • GPU: NVIDIA Blackwell (SM120) or later for native NVFP4 support
  • Backends: FlashAttention v2 (sliding) + Triton (global attention) + FlashInfer+Cutlass (NVFP4 linear) + VLLM_CUTLASS (NVFP4 MoE)

License

This model inherits the Apache 2.0 license from Google's Gemma 4.
