Gemma 4 26B-A4B-it Uncensored NVFP4
NVFP4-quantized version of TrevorJS/gemma-4-26B-A4B-it-uncensored (an abliterated Gemma 4 26B MoE), optimized for deployment on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell-architecture GPUs.
Now running on native Blackwell FP4 tensor cores (FLASHINFER_CUTLASS + VLLM_CUTLASS) instead of the Marlin W4A16 software fallback. Peak aggregate throughput: 1,812 tok/s at 256 concurrent requests.
Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 (Mixture of Experts) |
| Total Parameters | 26B |
| Active Parameters | ~4B per token (top-8 of 128 experts) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 per MoE layer, top-8 routing |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| K=V Sharing | attention_k_eq_v=true on full-attention layers |
| Vision Encoder | 27-layer ViT (1152 hidden, BF16) |
| Audio Encoder | Gemma4 audio (BF16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (compressed-tensors format) |
| Quantized Model Size | ~15.3 GB (single safetensors file) |
| VRAM (loaded) | ~16.25 GB |
Quantization Details
- Method: llmcompressor NVFP4 quantization
- Format: `compressed-tensors` with `nvfp4-pack-quantized` layout
- Quantized layers: All language model attention projections (q/k/v/o), dense MLP layers, and MoE expert weights (46,080 expert tensors across 128 experts x 30 layers)
- BF16 layers (intentionally not quantized): Vision tower (27 ViT layers), audio encoder, vision embedding projection, MoE router projections, layer norms, embeddings
- Tensor naming: `weight_packed` (uint8), `weight_scale` (FP8 e4m3fn per-block), `weight_global_scale` + `input_global_scale` (FP32 per-tensor)
- Calibration: Performed on NVIDIA H200 NVL (RunPod)
- Total safetensors keys: 47,648
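The 46,080 expert-tensor figure above follows directly from the layer counts. A sanity-check sketch, assuming three projections per expert (a typical gated-MLP layout; the exact projection names are not stated in this card):

```python
# Sanity-check the expert tensor count quoted above.
LAYERS = 30        # MoE layers
EXPERTS = 128      # experts per layer
PROJECTIONS = 3    # e.g. gate/up/down per expert -- an assumption
# Each quantized weight ships as 4 tensors in nvfp4-pack-quantized:
# weight_packed, weight_scale, weight_global_scale, input_global_scale
TENSORS_PER_WEIGHT = 4

expert_tensors = LAYERS * EXPERTS * PROJECTIONS * TENSORS_PER_WEIGHT
print(expert_tensors)  # 46080, matching the figure above
```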
Performance Benchmarks
All benchmarks performed on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) using the pre-built container image ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest with native Blackwell FP4 tensor-core kernels (FlashInfer CUTLASS for linear, VLLM_CUTLASS for MoE).
Throughput Scaling (Native FP4, max-model-len=2048)
Each concurrency level sends N simultaneous chat completion requests (mixed prompts: code, math, QA, creative), each generating up to 150 tokens with streaming enabled. Zero errors across all concurrency levels.
| Concurrent Requests | Aggregate tok/s | Per-Request tok/s | Min Per-Request tok/s | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|
| 1 | 35.1 | 35.1 | 35.1 | 69ms | 69ms | 69ms |
| 2 | 51.5 | 40.0 | 31.1 | 75ms | 75ms | 75ms |
| 4 | 97.5 | 33.0 | 22.8 | 75ms | 119ms | 119ms |
| 8 | 93.8 | 12.8 | 2.1 | 3,617ms | 3,618ms | 3,618ms |
| 16 | 271.3 | 21.4 | 15.7 | 170ms | 171ms | 171ms |
| 32 | 432.7 | 17.2 | 13.1 | 228ms | 228ms | 228ms |
| 64 | 725.6 | 14.2 | 9.2 | 399ms | 401ms | 402ms |
| 128 | 1,161.1 | 11.1 | 6.7 | 627ms | 631ms | 632ms |
| 256 | 1,811.8 | 8.6 | 4.4 | 1,052ms | 1,061ms | 1,063ms |
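The per-request columns in the table reduce to a few timestamps per streamed response. A minimal sketch of the metric math (function and variable names are illustrative, not taken from the benchmark harness):

```python
def request_metrics(t_send: float, token_times: list[float]) -> dict:
    """Compute TTFT and decode throughput from one streamed response.

    t_send: wall-clock time the request was sent.
    token_times: wall-clock arrival time of each streamed token.
    """
    ttft = token_times[0] - t_send          # time to first token
    # Decode rate measured over the generation window (first to last token).
    duration = token_times[-1] - token_times[0]
    n = len(token_times)
    tok_per_s = (n - 1) / duration if duration > 0 else float("inf")
    return {"ttft_s": ttft, "tok_per_s": tok_per_s}

# Example: 150 tokens, first after 69 ms, then one every ~28.5 ms
times = [0.069 + i * 0.0285 for i in range(150)]
m = request_metrics(0.0, times)
print(round(m["ttft_s"] * 1000), round(m["tok_per_s"], 1))  # 69 35.1
```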
Key Performance Metrics
| Metric | Value |
|---|---|
| Single-request decode | 35.1 tok/s (streaming, mixed prompts) |
| Peak aggregate throughput | 1,812 tok/s @ 256 concurrent |
| Peak server-reported generation | 1,848 tok/s (vLLM engine stats) |
| Model load time | ~118 seconds (with FP4 autotune) |
| Model memory footprint | 16.25 GB |
| KV cache capacity | 703,824 tokens (FP8) |
| GEMM backend | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | VLLM_CUTLASS (native FP4 MoE) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| FP4 AutoTune | Enabled (FlashInfer kernel auto-profiling at startup) |
| Prefix cache hit rate | ~72% (sustained, mixed workload) |
| CUDA graph sizes | 1-512 (covers up to 256 concurrent sequences) |
Backend Upgrade Impact: Marlin (Software) vs Native FP4 (Hardware)
These benchmarks were originally run with the Marlin W4A16 software fallback. After switching to native Blackwell FP4 tensor cores (auto-selected by the eugr-nightly vLLM image), every runtime metric improved, at the cost of a slower autotuned startup:
| Metric | Marlin W4A16 (old) | Native FP4 CUTLASS (new) | Improvement |
|---|---|---|---|
| Peak aggregate throughput | 1,430 tok/s @ 128 | 1,812 tok/s @ 256 | +27% |
| Peak server-reported gen | — | 1,848 tok/s | — |
| Max concurrency tested | 128 | 256 | 2x |
| GEMM backend | Marlin (software dequant) | FLASHINFER_CUTLASS (hw tensor cores) | Hardware |
| MoE backend | VLLM_CUTLASS | VLLM_CUTLASS | Same |
| KV cache capacity | 375K tokens | 703K tokens | 1.9x |
| Model load time | ~90s | ~118s (includes FP4 autotune) | Slower startup, faster inference |
| FP4 GEMM AutoTune | N/A | Enabled (kernel auto-profiling) | Better kernel selection |
Important: Do not set `VLLM_NVFP4_GEMM_BACKEND=marlin` on images with native FP4 support. It forces the slower software path. Let vLLM auto-detect.
Scaling Analysis
| Concurrency | Efficiency vs 1-req | Throughput Gain |
|---|---|---|
| 1 | 100% | 1.0x |
| 4 | 69% | 2.8x |
| 16 | 48% | 7.7x |
| 64 | 32% | 20.7x |
| 128 | 26% | 33.1x |
| 256 | 20% | 51.6x |
Aggregate throughput scales 51.6x from 1 to 256 concurrent requests, demonstrating excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 35 tok/s (1-req) to 8.6 tok/s (256-req) — still very usable for agentic workloads with many short-lived subagents.
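The efficiency column is simply aggregate throughput divided by what N perfectly independent single-stream requests would achieve (N x 35.1 tok/s). A sketch:

```python
SINGLE_STREAM = 35.1  # tok/s at concurrency 1 (from the table above)

def batching_efficiency(concurrency: int, aggregate_tok_s: float) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    ideal = concurrency * SINGLE_STREAM
    return aggregate_tok_s / ideal

# Reproduce a few rows of the scaling table.
for n, agg in [(4, 97.5), (16, 271.3), (64, 725.6), (256, 1811.8)]:
    print(n, f"{batching_efficiency(n, agg):.0%}", f"{agg / SINGLE_STREAM:.1f}x")
```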
Why MoE is Fast on DGX Spark
The GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces bandwidth demand per token:
| Model Type | Params Read/Token | Bandwidth Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |
With only ~4B parameters active per token (top-8 of 128 experts), this MoE model reads ~2.8 GB per token vs ~13.5 GB for an equivalently quantized dense model. The remaining bandwidth headroom enables efficient batching across concurrent requests.
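The table's bandwidth figures are just bytes read per token times the decode rate, assuming weight streaming dominates (KV-cache and activation traffic ignored):

```python
GB10_BW_GBS = 273.0  # GB/s unified memory bandwidth (from above)

def bw_required(params_read_gb: float, tok_per_s: float) -> float:
    """Memory bandwidth (GB/s) needed to stream weights at a given decode rate."""
    return params_read_gb * tok_per_s

# Reproduce the table rows at 50 tok/s.
for name, gb in [("dense 27B BF16", 54.0), ("dense 27B NVFP4", 13.5),
                 ("MoE top-8 NVFP4", 2.8)]:
    need = bw_required(gb, 50.0)
    fits = "fits" if need <= GB10_BW_GBS else "exceeds GB10"
    print(f"{name}: {need:.0f} GB/s ({fits})")
```

The MoE row lands at 140 GB/s, about 51% of the GB10's bandwidth, which is where the remaining batching headroom comes from.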
Native FP4 vs Marlin Backend Comparison
The eugr-nightly vLLM image (built from eugr/spark-vllm-docker) includes sm_120-compiled NVFP4 kernels from FlashInfer. Auto-selection picks native FP4 when available:
| Backend | Type | Peak Aggregate tok/s | Notes |
|---|---|---|---|
| FLASHINFER_CUTLASS | Native FP4 tensor cores | 1,812 | Auto-selected; includes FP4 GEMM autotune |
| Marlin W4A16 | Software dequant | ~1,430 | Fallback; set VLLM_NVFP4_GEMM_BACKEND=marlin to force |
Requirements
Hardware
- Minimum: Any NVIDIA GPU with >= 20 GB VRAM (weights are ~16.25 GB)
- Recommended: NVIDIA DGX Spark (GB10), RTX 5090, or any Blackwell/Ada GPU
- Tested on: NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory)
Software
- vLLM >= 0.19.1 compiled for your GPU architecture
- transformers >= 5.5.0 (Gemma 4 support requires transformers v5+)
- PyTorch >= 2.11 with CUDA 13.0+
- FlashInfer with sm_120 FP4 kernels (included in eugr-nightly image)
DGX Spark users: Use the eugr/spark-vllm-docker build system with the `--tf5` flag to get a vLLM image compiled for SM 12.1 with transformers v5 support and native FP4 kernels.
vLLM Patched Model File
This model uses compressed-tensors NVFP4 format (from llmcompressor), which requires a patched gemma4.py model file for vLLM's weight loader. The patch handles:
- Per-expert tensor path remapping: compressed-tensors names experts as `layers.X.experts.{id}.{proj}.{suffix}` -- the patch adds the missing `.moe.` segment that vLLM's FusedMoE expects
- NVFP4 suffix handling: Maps compressed-tensors suffixes to vLLM's FusedMoE weight_loader format
- K=V sharing: Duplicates k_proj as v_proj for full_attention layers when `attention_k_eq_v=true`
The patched file is included in this repo as gemma4_patched.py. Mount it into your vLLM container:
```yaml
volumes:
  - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
```
Serving with vLLM
Pre-built Docker Image (DGX Spark / Blackwell SM 12.1)
```bash
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
```
Image contents:
- vLLM 0.19.1rc1.dev110 compiled for SM 12.1 (Blackwell GB10)
- PyTorch 2.12.0.dev + CUDA 13.0
- transformers 5.5.x
- FlashInfer with native FP4 sm_120 kernels (FLASHINFER_CUTLASS, VLLM_CUTLASS, etc.)
- 7 NVFP4 backends: FLASHINFER_CUTLASS, VLLM_CUTLASS, FLASHINFER_TRTLLM, FLASHINFER_CUDNN, FBGEMM, MARLIN, EMULATION
Docker Compose (Recommended)
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
    container_name: vllm-gemma4
    restart: unless-stopped
    network_mode: host
    environment:
      # Native FP4 kernels auto-select. DO NOT set VLLM_NVFP4_GEMM_BACKEND=marlin.
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    volumes:
      - ./model:/models/gemma4-uncensored
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/gemma4-uncensored \
          --served-model-name gemma4-26b-uncensored \
          --host 0.0.0.0 --port 8000 \
          --tensor-parallel-size 1 \
          --dtype auto \
          --quantization compressed-tensors \
          --load-format safetensors \
          --max-model-len 65536 \
          --max-num-seqs 128 \
          --max-num-batched-tokens 16384 \
          --gpu-memory-utilization 0.85 \
          --kv-cache-dtype fp8 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4
```
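Once the server is up, it speaks the standard OpenAI-compatible API on port 8000. A minimal chat completion request body (the model name must match `--served-model-name` above; the prompt is illustrative):

```python
import json

# Payload for the vLLM OpenAI-compatible endpoint:
#   POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "gemma4-26b-uncensored",  # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
    "max_tokens": 150,
    "stream": True,  # streaming, as used in the benchmarks above
}
body = json.dumps(payload)
print(body)
```

Send it with any HTTP client, e.g. `curl -H 'Content-Type: application/json' -d @body.json http://localhost:8000/v1/chat/completions`.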
Key Flags
| Flag | Purpose |
|---|---|
| `--quantization compressed-tensors` | Required for this model's NVFP4 format |
| `--kv-cache-dtype fp8` | FP8 KV cache saves memory, enabling longer contexts or more concurrent seqs |
| `--enable-chunked-prefill` | Chunked prefill for large prompts without blocking decode |
| `--enable-auto-tool-choice --tool-call-parser gemma4` | Native Gemma 4 tool/function calling |
| `--enable-prefix-caching` | Big win for agent workloads (shared system prompts) |
| `--load-format safetensors` | Required for single-file safetensors |
| `--max-num-seqs N` | Tune for your workload: 8 for long-context, 256 for many short agents |
Scaling Your Config
| Workload | max-model-len | max-num-seqs | KV cache tokens | Best for |
|---|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 8 | ~375K | Few long conversations |
| Mixed (chat + agents) | 8192 | 64 | ~700K | Balanced |
| Many short agents | 2048 | 256 | ~700K | Max throughput (1,812 tok/s) |
| Single-stream quality | 262144 | 1 | ~375K | Max context window |
Thinking / Reasoning
Gemma 4 uses internal `<think>` blocks. Current vLLM (0.19.1) has a known issue (#38855) where `--reasoning-parser gemma4` strips thinking tokens but may return empty `content` when the model doesn't close its thinking block within `max_tokens`.
Recommended: Omit `--reasoning-parser gemma4` and let thinking tokens appear inline in `content`. Your gateway or client can strip `<think>...</think>` blocks client-side if needed.
To disable thinking per-request:
```json
{"chat_template_kwargs": {"enable_thinking": false}}
```
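If you follow the recommendation above and leave thinking inline, stripping it client-side is a one-liner. A sketch (it removes only closed `<think>` blocks; an unclosed block is left untouched):

```python
import re

# Match a closed <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(content: str) -> str:
    """Remove closed <think>...</think> blocks from a completion."""
    return THINK_RE.sub("", content)

print(strip_think("<think>reason step by step...</think>The answer is 4."))
# -> The answer is 4.
```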
Generation Config
| Token ID | Token | Purpose |
|---|---|---|
| 1 | `<eos>` | End of sequence |
| 106 | `<turn\|>` | End of turn |
| 50 | `<\|tool_response>` | Tool response delimiter |
Base Model
Quantized from TrevorJS/gemma-4-26B-A4B-it-uncensored, an abliterated (uncensored) version of Google's Gemma 4 26B-A4B-it. Abliteration removes safety refusals while preserving model capabilities and quality.
Files
| File | Size | Description |
|---|---|---|
| `model.safetensors` | 15.3 GB | NVFP4 quantized weights (47,648 keys, single file) |
| `config.json` | 5 KB | Model + quantization configuration |
| `tokenizer.json` | 31 MB | Tokenizer (262,144 vocab) |
| `tokenizer_config.json` | 3 KB | Tokenizer settings + special tokens |
| `generation_config.json` | 203 B | Generation defaults + EOS tokens |
| `chat_template.jinja` | 12 KB | Gemma 4 chat template (tool calling + thinking support) |
| `preprocessor_config.json` | 371 B | Image preprocessor config |
| `processor_config.json` | 1.6 KB | Multimodal processor config (image + audio + video) |
| `recipe.yaml` | 237 B | llmcompressor quantization recipe |
| `gemma4_patched.py` | 63 KB | Patched vLLM model file for compressed-tensors NVFP4 |
| `README.md` | — | This file |
Building from Source
If you're not on a DGX Spark (SM 12.1), compile vLLM from source for your GPU architecture. See eugr/spark-vllm-docker for the recommended build system, or build manually:
```bash
export TORCH_CUDA_ARCH_LIST="your_sm_version"  # e.g., 8.9 for RTX 4090, 12.0 for B200
git clone https://github.com/vllm-project/vllm.git && cd vllm
pip install -e . --no-build-isolation
pip install "transformers>=5.5.0"
```
Limitations
- NVFP4 scale mismatch warning: vLLM may warn about different global scales for fused parallel layers (q/k/v projections). This is inherent to compressed-tensors per-tensor quantization and has minimal accuracy impact.
- Vision: Vision encoder weights are BF16 (not quantized). End-to-end vision + NVFP4 language model works but is less extensively tested than text-only.
- Thinking tokens: Without `--reasoning-parser`, thinking blocks appear inline in `content`. This is by design (see Thinking / Reasoning section).
Disclaimer, Liability Waiver, and Assumption of Risk
THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, the associated container image (ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4), or any derivative works thereof, you expressly acknowledge and agree to the following:
Assumption of Risk
Uncensored language models present materially elevated risks compared to safety-aligned models, including but not limited to: generation of harmful, misleading, illegal, or objectionable content; susceptibility to adversarial misuse; potential for facilitating activities that violate applicable laws or regulations; and amplified risk in automated or agentic pipelines where outputs may be executed without human review.
These tools are powerful and serve a multitude of legitimate and essential purposes -- including security research, red-teaming, content analysis, creative work, and applications where safety filters interfere with valid use cases. However, the absence of safety guardrails demands a correspondingly higher standard of care from the operator. You must implement your own safeguards, content filtering, access controls, and monitoring appropriate to your use case and jurisdiction.
Limitation of Liability
The authors, contributors, and distributors of this model and container image ("Providers") are not responsible or liable, directly or indirectly, for any actions taken, content generated, damages incurred, or legal consequences arising from the use or misuse of these materials.
User Responsibility
You, the user, assume full and sole responsibility and liability for all outputs generated by the model under your operation, ensuring compliance with all applicable laws, and implementing appropriate access controls and human oversight.
Acceptance
By downloading or using any component of this release you indicate your acceptance of these terms and your assumption of all associated risks and liabilities. If you do not agree, do not download or use these materials.
License
This model inherits the Gemma license from Google.