Gemma 4 26B-A4B-it Uncensored NVFP4

NVFP4-quantized version of TrevorJS/gemma-4-26B-A4B-it-uncensored (an abliterated Gemma 4 26B MoE), optimized for deployment on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell-architecture GPUs.

Now running on native Blackwell FP4 tensor cores (FLASHINFER_CUTLASS for GEMM, VLLM_CUTLASS for MoE) instead of the Marlin W4A16 software fallback. Peak aggregate throughput: 1,812 tok/s at 256 concurrent requests.

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (Mixture of Experts) |
| Total Parameters | 26B |
| Active Parameters | ~4B per token (top-8 of 128 experts) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 per MoE layer, top-8 routing |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| K=V Sharing | attention_k_eq_v=true on full-attention layers |
| Vision Encoder | 27-layer ViT (1152 hidden, BF16) |
| Audio Encoder | Gemma4 audio (BF16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (compressed-tensors format) |
| Quantized Model Size | ~15.3 GB (single safetensors file) |
| VRAM (loaded) | ~16.25 GB |

Quantization Details

  • Method: llmcompressor NVFP4 quantization
  • Format: compressed-tensors with nvfp4-pack-quantized layout
  • Quantized layers: All language model attention projections (q/k/v/o), dense MLP layers, and MoE expert weights (46,080 expert tensors across 128 experts x 30 layers)
  • BF16 layers (intentionally not quantized): Vision tower (27 ViT layers), audio encoder, vision embedding projection, MoE router projections, layer norms, embeddings
  • Tensor naming: weight_packed (uint8), weight_scale (FP8 e4m3fn per-block), weight_global_scale + input_global_scale (FP32 per-tensor)
  • Calibration: Performed on NVIDIA H200 NVL (RunPod)
  • Total safetensors keys: 47,648

Performance Benchmarks

All benchmarks performed on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) using the pre-built container image ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest with native Blackwell FP4 tensor-core kernels (FlashInfer CUTLASS for linear, VLLM_CUTLASS for MoE).

Throughput Scaling (Native FP4, max-model-len=2048)

Each concurrency level sends N simultaneous chat completion requests (mixed prompts: code, math, QA, creative), each generating up to 150 tokens with streaming enabled. Zero errors across all concurrency levels.

| Concurrent Requests | Aggregate tok/s | Per-Request tok/s | Min Per-Request tok/s | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|
| 1 | 35.1 | 35.1 | 35.1 | 69ms | 69ms | 69ms |
| 2 | 51.5 | 40.0 | 31.1 | 75ms | 75ms | 75ms |
| 4 | 97.5 | 33.0 | 22.8 | 75ms | 119ms | 119ms |
| 8 | 93.8 | 12.8 | 2.1 | 3,617ms | 3,618ms | 3,618ms |
| 16 | 271.3 | 21.4 | 15.7 | 170ms | 171ms | 171ms |
| 32 | 432.7 | 17.2 | 13.1 | 228ms | 228ms | 228ms |
| 64 | 725.6 | 14.2 | 9.2 | 399ms | 401ms | 402ms |
| 128 | 1,161.1 | 11.1 | 6.7 | 627ms | 631ms | 632ms |
| 256 | 1,811.8 | 8.6 | 4.4 | 1,052ms | 1,061ms | 1,063ms |

Key Performance Metrics

| Metric | Value |
|---|---|
| Single-request decode | 35.1 tok/s (streaming, mixed prompts) |
| Peak aggregate throughput | 1,812 tok/s @ 256 concurrent |
| Peak server-reported generation | 1,848 tok/s (vLLM engine stats) |
| Model load time | ~118 seconds (with FP4 autotune) |
| Model memory footprint | 16.25 GB |
| KV cache capacity | 703,824 tokens (FP8) |
| GEMM backend | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | VLLM_CUTLASS (native FP4 MoE) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| FP4 AutoTune | Enabled (FlashInfer kernel auto-profiling at startup) |
| Prefix cache hit rate | ~72% (sustained, mixed workload) |
| CUDA graph sizes | 1-512 (covers up to 256 concurrent sequences) |

Backend Upgrade Impact: Marlin (Software) vs Native FP4 (Hardware)

These benchmarks were originally run with the Marlin W4A16 software fallback. After switching to native Blackwell FP4 tensor cores (auto-selected by the eugr-nightly vLLM image), every throughput and capacity metric improved, at the cost of a slightly longer startup:

| Metric | Marlin W4A16 (old) | Native FP4 CUTLASS (new) | Improvement |
|---|---|---|---|
| Peak aggregate throughput | 1,430 tok/s @ 128 | 1,812 tok/s @ 256 | +27% |
| Peak server-reported gen | - | 1,848 tok/s | - |
| Max concurrency tested | 128 | 256 | 2x |
| GEMM backend | Marlin (software dequant) | FLASHINFER_CUTLASS (hw tensor cores) | Hardware |
| MoE backend | VLLM_CUTLASS | VLLM_CUTLASS | Same |
| KV cache capacity | 375K tokens | 703K tokens | 1.9x |
| Model load time | ~90s | ~118s (includes FP4 autotune) | Slower startup, faster inference |
| FP4 GEMM AutoTune | N/A | Enabled (kernel auto-profiling) | Better kernel selection |

Important: Do not set VLLM_NVFP4_GEMM_BACKEND=marlin on images with native FP4 support. It forces the slower software path. Let vLLM auto-detect.

Scaling Analysis

| Concurrency | Efficiency vs 1-req | Throughput Gain |
|---|---|---|
| 1 | 100% | 1.0x |
| 4 | 69% | 2.8x |
| 16 | 48% | 7.7x |
| 64 | 32% | 20.7x |
| 128 | 26% | 33.1x |
| 256 | 20% | 51.6x |

Aggregate throughput scales 51.6x from 1 to 256 concurrent requests, demonstrating excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 35 tok/s (1-req) to 8.6 tok/s (256-req) — still very usable for agentic workloads with many short-lived subagents.
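The efficiency and gain columns follow directly from the throughput table. A short sketch of the arithmetic (baseline and aggregate figures are the measured values from the table above; with this definition the 256-request efficiency rounds to roughly 20%):

```python
# Derive scaling efficiency and throughput gain from measured aggregate tok/s.
single_req_tok_s = 35.1  # single-request decode baseline

# concurrency -> measured aggregate tok/s (from the throughput table)
aggregate = {4: 97.5, 16: 271.3, 64: 725.6, 128: 1161.1, 256: 1811.8}

for n, agg in aggregate.items():
    gain = agg / single_req_tok_s              # speedup vs. one request
    efficiency = agg / (n * single_req_tok_s)  # fraction of ideal linear scaling
    print(f"{n:>3} concurrent: {efficiency:>4.0%} efficiency, {gain:.1f}x gain")
```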

Why MoE is Fast on DGX Spark

The GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces bandwidth demand per token:

| Model Type | Params Read/Token | Bandwidth Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |

With only ~4B parameters active per token (top-8 of 128 experts), this MoE model reads ~2.8 GB per token vs ~13.5 GB for an equivalently quantized dense model. The remaining bandwidth headroom enables efficient batching across concurrent requests.
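The bandwidth figures above are simple arithmetic: bytes read per token times target decode rate. A back-of-envelope sketch (the ~2.8 GB/token MoE figure is taken from the table; it includes attention, scales, and shared weights beyond the raw active-expert bytes):

```python
# Decode bandwidth = bytes read per token x target tokens/s.
GB = 1e9
target_tok_s = 50

dense_bf16 = 27e9 * 2.0   # 27B params x 2 bytes (BF16)  = 54 GB/token
dense_nvfp4 = 27e9 * 0.5  # 27B params x 0.5 bytes (FP4) = 13.5 GB/token
moe_nvfp4 = 2.8 * GB      # ~4B active params + scales/shared weights (table figure)

for name, bytes_per_tok in [("dense BF16", dense_bf16),
                            ("dense NVFP4", dense_nvfp4),
                            ("MoE NVFP4", moe_nvfp4)]:
    bw = bytes_per_tok * target_tok_s / GB
    print(f"{name}: {bytes_per_tok / GB:.1f} GB/token -> {bw:.0f} GB/s @ {target_tok_s} tok/s")
```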

Native FP4 vs Marlin Backend Comparison

The eugr-nightly vLLM image (built from eugr/spark-vllm-docker) includes sm_120-compiled NVFP4 kernels from FlashInfer. Auto-selection picks native FP4 when available:

| Backend | Type | Peak Aggregate tok/s | Notes |
|---|---|---|---|
| FLASHINFER_CUTLASS | Native FP4 tensor cores | 1,812 | Auto-selected; includes FP4 GEMM autotune |
| Marlin W4A16 | Software dequant | ~1,430 | Fallback; set VLLM_NVFP4_GEMM_BACKEND=marlin to force |

Do not set VLLM_NVFP4_GEMM_BACKEND=marlin — it forces the slower software path. Let vLLM auto-detect the native kernels.

Requirements

Hardware

  • Minimum: Any NVIDIA GPU with >= 20 GB VRAM (weights are ~16.25 GB)
  • Recommended: NVIDIA DGX Spark (GB10), RTX 5090, or any Blackwell/Ada GPU
  • Tested on: NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory)

Software

  • vLLM >= 0.19.1 compiled for your GPU architecture
  • transformers >= 5.5.0 (Gemma 4 support requires transformers v5+)
  • PyTorch >= 2.11 with CUDA 13.0+
  • FlashInfer with sm_120 FP4 kernels (included in eugr-nightly image)

DGX Spark users: Use the eugr/spark-vllm-docker build system with the --tf5 flag to get a vLLM image compiled for SM 12.1 with transformers v5 support and native FP4 kernels.

vLLM Patched Model File

This model uses compressed-tensors NVFP4 format (from llmcompressor), which requires a patched gemma4.py model file for vLLM's weight loader. The patch handles:

  1. Per-expert tensor path remapping: Compressed-tensors names experts as layers.X.experts.{id}.{proj}.{suffix} -- the patch adds the missing .moe. segment that vLLM's FusedMoE expects
  2. NVFP4 suffix handling: Maps compressed-tensors suffixes to vLLM's FusedMoE weight_loader format
  3. K=V sharing: Duplicates k_proj as v_proj for full_attention layers when attention_k_eq_v=true
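The first patch behavior is essentially string rewriting on checkpoint tensor names. A minimal sketch of the idea; the exact target layout is an assumption for illustration, so consult gemma4_patched.py for the real mapping:

```python
import re

def remap_expert_name(name: str) -> str:
    """Insert the '.moe.' segment that vLLM's FusedMoE loader expects.

    compressed-tensors: model.layers.3.experts.17.gate_proj.weight_packed
    target (assumed):   model.layers.3.moe.experts.17.gate_proj.weight_packed
    """
    return re.sub(r"(\.layers\.\d+)\.experts\.", r"\1.moe.experts.", name)

print(remap_expert_name("model.layers.3.experts.17.gate_proj.weight_packed"))
```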

The patched file is included in this repo as gemma4_patched.py. Mount it into your vLLM container:

```yaml
volumes:
  - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
```

Serving with vLLM

Pre-built Docker Image (DGX Spark / Blackwell SM 12.1)

```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
```

Image contents:

  • vLLM 0.19.1rc1.dev110 compiled for SM 12.1 (Blackwell GB10)
  • PyTorch 2.12.0.dev + CUDA 13.0
  • transformers 5.5.x
  • FlashInfer with native FP4 sm_120 kernels (FLASHINFER_CUTLASS, VLLM_CUTLASS, etc.)
  • 7 NVFP4 backends: FLASHINFER_CUTLASS, VLLM_CUTLASS, FLASHINFER_TRTLLM, FLASHINFER_CUDNN, FBGEMM, MARLIN, EMULATION

Docker Compose (Recommended)

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
    container_name: vllm-gemma4
    restart: unless-stopped
    network_mode: host
    environment:
      # Native FP4 kernels auto-select. DO NOT set VLLM_NVFP4_GEMM_BACKEND=marlin.
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    volumes:
      - ./model:/models/gemma4-uncensored
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/gemma4-uncensored \
          --served-model-name gemma4-26b-uncensored \
          --host 0.0.0.0 --port 8000 \
          --tensor-parallel-size 1 \
          --dtype auto \
          --quantization compressed-tensors \
          --load-format safetensors \
          --max-model-len 65536 \
          --max-num-seqs 128 \
          --max-num-batched-tokens 16384 \
          --gpu-memory-utilization 0.85 \
          --kv-cache-dtype fp8 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4
```

Key Flags

| Flag | Purpose |
|---|---|
| --quantization compressed-tensors | Required for this model's NVFP4 format |
| --kv-cache-dtype fp8 | FP8 KV cache saves memory, enables longer contexts or more concurrent seqs |
| --enable-chunked-prefill | Chunked prefill for large prompts without blocking decode |
| --enable-auto-tool-choice --tool-call-parser gemma4 | Native Gemma 4 tool/function calling |
| --enable-prefix-caching | Big win for agent workloads (shared system prompts) |
| --load-format safetensors | Required for single-file safetensors |
| --max-num-seqs N | Tune for your workload: 8 for long-context, 256 for many short agents |
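Once the server is up, requests go to the standard OpenAI-compatible endpoint. A sketch of the payload shape (the served-model-name and port come from the compose file above; only the payload is built here, sending it requires a running server):

```python
import json

# Chat completion payload for the OpenAI-compatible endpoint.
# POST this to http://localhost:8000/v1/chat/completions
payload = {
    "model": "gemma4-26b-uncensored",
    "messages": [{"role": "user", "content": "Write a haiku about FP4."}],
    "max_tokens": 150,
    "stream": True,
    # Optional: disable thinking for this request (see Thinking / Reasoning)
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload, indent=2))
```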

Scaling Your Config

| Workload | max-model-len | max-num-seqs | KV cache tokens | Best for |
|---|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 8 | ~375K | Few long conversations |
| Mixed (chat + agents) | 8192 | 64 | ~700K | Balanced |
| Many short agents | 2048 | 256 | ~700K | Max throughput (1,812 tok/s) |
| Single-stream quality | 262144 | 1 | ~375K | Max context window |
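The KV-cache-token figures can be sanity-checked with a rough per-token cost. A sketch that deliberately ignores the 1024-token sliding window and K=V sharing, both of which reduce the real cost, so the true capacity is somewhat higher than this estimate:

```python
# Rough per-token KV cache cost for this model with an FP8 KV cache.
n_layers = 30
kv_heads = 8
head_dim = 256
fp8_bytes = 1

# Two planes (K and V) per layer; sliding-window layers and K=V sharing
# on the full-attention layers reduce this in practice.
bytes_per_token = 2 * n_layers * kv_heads * head_dim * fp8_bytes
print(f"~{bytes_per_token / 1024:.0f} KiB per cached token")

def tokens_for_budget(budget_gib: float) -> int:
    """How many KV cache tokens fit in a given memory budget (GiB)."""
    return int(budget_gib * 1024**3 // bytes_per_token)

print(f"{tokens_for_budget(80):,} tokens in an 80 GiB KV budget")
```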

Thinking / Reasoning

Gemma 4 uses internal <think> blocks. Current vLLM (0.19.1) has a known issue (#38855) where --reasoning-parser gemma4 strips thinking tokens but may return empty content when the model doesn't close its thinking block within max_tokens.

Recommended: Omit --reasoning-parser gemma4 and let thinking tokens appear inline in content. Your gateway or client can strip <think>...</think> blocks client-side if needed.
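Client-side stripping is a small regex. A sketch that also drops an unclosed trailing block, which is the failure mode behind issue #38855 when the model hits max_tokens before emitting the closing tag:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from model output."""
    # Closed blocks first, then any unclosed trailing one
    # (the model may hit max_tokens before emitting </think>).
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_thinking("<think>chain of thought here</think>The answer is 4."))
```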

To disable thinking per-request:

```json
{"chat_template_kwargs": {"enable_thinking": false}}
```

Generation Config

| Token ID | Token | Purpose |
|---|---|---|
| 1 | <eos> | End of sequence |
| 106 | <turn\|> | End of turn |
| 50 | <\|tool_response> | Tool response delimiter |

Base Model

Quantized from TrevorJS/gemma-4-26B-A4B-it-uncensored, an abliterated (uncensored) version of Google's Gemma 4 26B-A4B-it. Abliteration removes safety refusals while preserving model capabilities and quality.

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 15.3 GB | NVFP4 quantized weights (47,648 keys, single file) |
| config.json | 5 KB | Model + quantization configuration |
| tokenizer.json | 31 MB | Tokenizer (262,144 vocab) |
| tokenizer_config.json | 3 KB | Tokenizer settings + special tokens |
| generation_config.json | 203 B | Generation defaults + EOS tokens |
| chat_template.jinja | 12 KB | Gemma 4 chat template (tool calling + thinking support) |
| preprocessor_config.json | 371 B | Image preprocessor config |
| processor_config.json | 1.6 KB | Multimodal processor config (image + audio + video) |
| recipe.yaml | 237 B | llmcompressor quantization recipe |
| gemma4_patched.py | 63 KB | Patched vLLM model file for compressed-tensors NVFP4 |
| README.md | - | This file |

Building from Source

If you're not on a DGX Spark (SM 12.1), compile vLLM from source for your GPU architecture. See eugr/spark-vllm-docker for the recommended build system, or build manually:

```shell
export TORCH_CUDA_ARCH_LIST="your_sm_version"  # e.g., 8.9 for RTX 4090, 12.0 for B200
git clone https://github.com/vllm-project/vllm.git && cd vllm
pip install -e . --no-build-isolation
pip install "transformers>=5.5.0"
```

Limitations

  • NVFP4 scale mismatch warning: vLLM may warn about different global scales for fused parallel layers (q/k/v projections). This is inherent to compressed-tensors per-tensor quantization and has minimal accuracy impact.
  • Vision: Vision encoder weights are BF16 (not quantized). End-to-end vision + NVFP4 language model works but is less extensively tested than text-only.
  • Thinking tokens: Without --reasoning-parser, thinking blocks appear inline in content. This is by design (see Thinking / Reasoning section).

Disclaimer, Liability Waiver, and Assumption of Risk

THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, the associated container image (ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4), or any derivative works thereof, you expressly acknowledge and agree to the following:

Assumption of Risk

Uncensored language models present materially elevated risks compared to safety-aligned models, including but not limited to: generation of harmful, misleading, illegal, or objectionable content; susceptibility to adversarial misuse; potential for facilitating activities that violate applicable laws or regulations; and amplified risk in automated or agentic pipelines where outputs may be executed without human review.

These tools are powerful and serve a multitude of legitimate and essential purposes -- including security research, red-teaming, content analysis, creative work, and applications where safety filters interfere with valid use cases. However, the absence of safety guardrails demands a correspondingly higher standard of care from the operator. You must implement your own safeguards, content filtering, access controls, and monitoring appropriate to your use case and jurisdiction.

Limitation of Liability

The authors, contributors, and distributors of this model and container image ("Providers") are not responsible or liable, directly or indirectly, for any actions taken, content generated, damages incurred, or legal consequences arising from the use or misuse of these materials.

User Responsibility

You, the user, assume full and sole responsibility and liability for all outputs generated by the model under your operation, ensuring compliance with all applicable laws, and implementing appropriate access controls and human oversight.

Acceptance

By downloading or using any component of this release you indicate your acceptance of these terms and your assumption of all associated risks and liabilities. If you do not agree, do not download or use these materials.

License

This model inherits the Gemma license from Google.
