Gemma 4 26B-A4B-it Uncensored NVFP4

NVFP4-quantized version of TrevorJS/gemma-4-26B-A4B-it-uncensored (an abliterated Gemma 4 26B MoE), optimized for deployment on NVIDIA DGX Spark (GB10, SM 12.1) and other Blackwell-architecture GPUs.

Now running on native Blackwell FP4 tensor cores (FLASHINFER_CUTLASS for GEMM, VLLM_CUTLASS for MoE) instead of the Marlin W4A16 software fallback. Peak aggregate throughput: 1,812 tok/s at 256 concurrent requests.

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (Mixture of Experts) |
| Total Parameters | 26B |
| Active Parameters | ~4B per token (top-8 of 128 experts) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Experts | 128 per MoE layer, top-8 routing |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 2816 |
| Attention Heads | 16 (8 KV heads), head_dim=256, global_head_dim=512 |
| K=V Sharing | attention_k_eq_v=true on full-attention layers |
| Vision Encoder | 27-layer ViT (1152 hidden, BF16) |
| Audio Encoder | Gemma4 audio (BF16) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 (compressed-tensors format) |
| Quantized Model Size | ~15.3 GB (single safetensors file) |
| VRAM (loaded) | ~16.25 GB |

Quantization Details

  • Method: llmcompressor NVFP4 quantization
  • Format: compressed-tensors with nvfp4-pack-quantized layout
  • Quantized layers: All language model attention projections (q/k/v/o), dense MLP layers, and MoE expert weights (46,080 expert tensors across 128 experts x 30 layers)
  • BF16 layers (intentionally not quantized): Vision tower (27 ViT layers), audio encoder, vision embedding projection, MoE router projections, layer norms, embeddings
  • Tensor naming: weight_packed (uint8), weight_scale (FP8 e4m3fn per-block), weight_global_scale + input_global_scale (FP32 per-tensor)
  • Calibration: Performed on NVIDIA H200 NVL (RunPod)
  • Total safetensors keys: 47,648

Performance Benchmarks

All benchmarks performed on NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory) using the pre-built container image ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest with native Blackwell FP4 tensor-core kernels (FlashInfer CUTLASS for linear, VLLM_CUTLASS for MoE).

Throughput Scaling (Native FP4, max-model-len=2048)

Each concurrency level sends N simultaneous chat completion requests (mixed prompts: code, math, QA, creative), each generating up to 150 tokens with streaming enabled. Zero errors across all concurrency levels.

| Concurrent Requests | Aggregate tok/s | Per-Request tok/s | Min Per-Request tok/s | TTFT p50 | TTFT p95 | TTFT max |
|---|---|---|---|---|---|---|
| 1 | 35.1 | 35.1 | 35.1 | 69ms | 69ms | 69ms |
| 2 | 51.5 | 40.0 | 31.1 | 75ms | 75ms | 75ms |
| 4 | 97.5 | 33.0 | 22.8 | 75ms | 119ms | 119ms |
| 8 | 93.8 | 12.8 | 2.1 | 3,617ms | 3,618ms | 3,618ms |
| 16 | 271.3 | 21.4 | 15.7 | 170ms | 171ms | 171ms |
| 32 | 432.7 | 17.2 | 13.1 | 228ms | 228ms | 228ms |
| 64 | 725.6 | 14.2 | 9.2 | 399ms | 401ms | 402ms |
| 128 | 1,161.1 | 11.1 | 6.7 | 627ms | 631ms | 632ms |
| 256 | 1,811.8 | 8.6 | 4.4 | 1,052ms | 1,061ms | 1,063ms |

Key Performance Metrics

| Metric | Value |
|---|---|
| Single-request decode | 35.1 tok/s (streaming, mixed prompts) |
| Peak aggregate throughput | 1,812 tok/s @ 256 concurrent |
| Peak server-reported generation | 1,848 tok/s (vLLM engine stats) |
| Model load time | ~118 seconds (with FP4 autotune) |
| Model memory footprint | 16.25 GB |
| KV cache capacity | 703,824 tokens (FP8) |
| GEMM backend | FLASHINFER_CUTLASS (native Blackwell FP4 tensor cores) |
| MoE backend | VLLM_CUTLASS (native FP4 MoE) |
| Attention backend | TRITON_ATTN (heterogeneous head dims require Triton) |
| FP4 AutoTune | Enabled (FlashInfer kernel auto-profiling at startup) |
| Prefix cache hit rate | ~72% (sustained, mixed workload) |
| CUDA graph sizes | 1-512 (covers up to 256 concurrent sequences) |

Backend Upgrade Impact: Marlin (Software) vs Native FP4 (Hardware)

These benchmarks were originally run with the Marlin W4A16 software fallback. After switching to native Blackwell FP4 tensor cores (auto-selected by the eugr-nightly vLLM image), every throughput and capacity metric improved, at the cost of a slightly longer startup:

| Metric | Marlin W4A16 (old) | Native FP4 CUTLASS (new) | Improvement |
|---|---|---|---|
| Peak aggregate throughput | 1,430 tok/s @ 128 | 1,812 tok/s @ 256 | +27% |
| Peak server-reported gen | - | 1,848 tok/s | - |
| Max concurrency tested | 128 | 256 | 2x |
| GEMM backend | Marlin (software dequant) | FLASHINFER_CUTLASS (hw tensor cores) | Hardware |
| MoE backend | VLLM_CUTLASS | VLLM_CUTLASS | Same |
| KV cache capacity | 375K tokens | 703K tokens | 1.9x |
| Model load time | ~90s | ~118s (includes FP4 autotune) | Slower startup, faster inference |
| FP4 GEMM AutoTune | N/A | Enabled (kernel auto-profiling) | Better kernel selection |

Important: Do not set VLLM_NVFP4_GEMM_BACKEND=marlin on images with native FP4 support. It forces the slower software path. Let vLLM auto-detect.

Scaling Analysis

| Concurrency | Efficiency vs 1-req | Throughput Gain |
|---|---|---|
| 1 | 100% | 1.0x |
| 4 | 69% | 2.8x |
| 16 | 48% | 7.7x |
| 64 | 32% | 20.7x |
| 128 | 26% | 33.1x |
| 256 | 20% | 51.6x |

Aggregate throughput scales 51.6x from 1 to 256 concurrent requests, demonstrating excellent batching efficiency from the MoE architecture. Per-request throughput degrades gracefully from 35 tok/s (1-req) to 8.6 tok/s (256-req) — still very usable for agentic workloads with many short-lived subagents.
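The efficiency and gain columns follow directly from the throughput table. A short sketch of the arithmetic (baseline and aggregate figures are the measured values from the table above; with this definition the 256-request efficiency rounds to roughly 20%):

```python
# Derive scaling efficiency and throughput gain from measured aggregate tok/s.
single_req_tok_s = 35.1  # single-request decode baseline

# concurrency -> measured aggregate tok/s (from the throughput table)
aggregate = {4: 97.5, 16: 271.3, 64: 725.6, 128: 1161.1, 256: 1811.8}

for n, agg in aggregate.items():
    gain = agg / single_req_tok_s              # speedup vs. one request
    efficiency = agg / (n * single_req_tok_s)  # fraction of ideal linear scaling
    print(f"{n:>3} concurrent: {efficiency:>4.0%} efficiency, {gain:.1f}x gain")
```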

Why MoE is Fast on DGX Spark

The GB10's 273 GB/s memory bandwidth is the bottleneck for LLM decode. MoE dramatically reduces bandwidth demand per token:

| Model Type | Params Read/Token | Bandwidth Required @ 50 tok/s | Fits GB10? |
|---|---|---|---|
| Dense 27B (BF16) | ~54 GB | 2,700 GB/s | No |
| Dense 27B (NVFP4) | ~13.5 GB | 675 GB/s | No |
| MoE 26B top-8/128 (NVFP4) | ~2.8 GB | 140 GB/s | Yes (51% BW) |

With only ~4B parameters active per token (top-8 of 128 experts), this MoE model reads ~2.8 GB per token vs ~13.5 GB for an equivalently quantized dense model. The remaining bandwidth headroom enables efficient batching across concurrent requests.
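The bandwidth figures above are simple arithmetic: bytes read per token times target decode rate. A back-of-envelope sketch (the ~2.8 GB/token MoE figure is taken from the table; it includes attention, scales, and shared weights beyond the raw active-expert bytes):

```python
# Decode bandwidth = bytes read per token x target tokens/s.
GB = 1e9
target_tok_s = 50

dense_bf16 = 27e9 * 2.0   # 27B params x 2 bytes (BF16)  = 54 GB/token
dense_nvfp4 = 27e9 * 0.5  # 27B params x 0.5 bytes (FP4) = 13.5 GB/token
moe_nvfp4 = 2.8 * GB      # ~4B active params + scales/shared weights (table figure)

for name, bytes_per_tok in [("dense BF16", dense_bf16),
                            ("dense NVFP4", dense_nvfp4),
                            ("MoE NVFP4", moe_nvfp4)]:
    bw = bytes_per_tok * target_tok_s / GB
    print(f"{name}: {bytes_per_tok / GB:.1f} GB/token -> {bw:.0f} GB/s @ {target_tok_s} tok/s")
```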

Native FP4 vs Marlin Backend Comparison

The eugr-nightly vLLM image (built from eugr/spark-vllm-docker) includes sm_120-compiled NVFP4 kernels from FlashInfer. Auto-selection picks native FP4 when available:

| Backend | Type | Peak Aggregate tok/s | Notes |
|---|---|---|---|
| FLASHINFER_CUTLASS | Native FP4 tensor cores | 1,812 | Auto-selected; includes FP4 GEMM autotune |
| Marlin W4A16 | Software dequant | ~1,430 | Fallback; set VLLM_NVFP4_GEMM_BACKEND=marlin to force |

Do not set VLLM_NVFP4_GEMM_BACKEND=marlin — it forces the slower software path. Let vLLM auto-detect the native kernels.

Requirements

Hardware

  • Minimum: Any NVIDIA GPU with >= 20 GB VRAM (weights are ~16.25 GB)
  • Recommended: NVIDIA DGX Spark (GB10), RTX 5090, or any Blackwell/Ada GPU
  • Tested on: NVIDIA DGX Spark (GB10, SM 12.1, 128 GB unified memory)

Software

  • vLLM >= 0.19.1 compiled for your GPU architecture
  • transformers >= 5.5.0 (Gemma 4 support requires transformers v5+)
  • PyTorch >= 2.11 with CUDA 13.0+
  • FlashInfer with sm_120 FP4 kernels (included in eugr-nightly image)

DGX Spark users: Use the eugr/spark-vllm-docker build system with the --tf5 flag to get a vLLM image compiled for SM 12.1 with transformers v5 support and native FP4 kernels.

vLLM Patched Model File

This model uses compressed-tensors NVFP4 format (from llmcompressor), which requires a patched gemma4.py model file for vLLM's weight loader. The patch handles:

  1. Per-expert tensor path remapping: Compressed-tensors names experts as layers.X.experts.{id}.{proj}.{suffix} -- the patch adds the missing .moe. segment that vLLM's FusedMoE expects
  2. NVFP4 suffix handling: Maps compressed-tensors suffixes to vLLM's FusedMoE weight_loader format
  3. K=V sharing: Duplicates k_proj as v_proj for full_attention layers when attention_k_eq_v=true
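The first patch behavior is essentially string rewriting on checkpoint tensor names. A minimal sketch of the idea; the exact target layout is an assumption for illustration, so consult gemma4_patched.py for the real mapping:

```python
import re

def remap_expert_name(name: str) -> str:
    """Insert the '.moe.' segment that vLLM's FusedMoE loader expects.

    compressed-tensors: model.layers.3.experts.17.gate_proj.weight_packed
    target (assumed):   model.layers.3.moe.experts.17.gate_proj.weight_packed
    """
    return re.sub(r"(\.layers\.\d+)\.experts\.", r"\1.moe.experts.", name)

print(remap_expert_name("model.layers.3.experts.17.gate_proj.weight_packed"))
```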

The patched file is included in this repo as gemma4_patched.py. Mount it into your vLLM container:

```yaml
volumes:
  - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
```

Serving with vLLM

Pre-built Docker Image (DGX Spark / Blackwell SM 12.1)

```shell
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
```

Image contents:

  • vLLM 0.19.1rc1.dev110 compiled for SM 12.1 (Blackwell GB10)
  • PyTorch 2.12.0.dev + CUDA 13.0
  • transformers 5.5.x
  • FlashInfer with native FP4 sm_120 kernels (FLASHINFER_CUTLASS, VLLM_CUTLASS, etc.)
  • 7 NVFP4 backends: FLASHINFER_CUTLASS, VLLM_CUTLASS, FLASHINFER_TRTLLM, FLASHINFER_CUDNN, FBGEMM, MARLIN, EMULATION

Docker Compose (Recommended)

```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4:latest
    container_name: vllm-gemma4
    restart: unless-stopped
    network_mode: host
    environment:
      # Native FP4 kernels auto-select. DO NOT set VLLM_NVFP4_GEMM_BACKEND=marlin.
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    volumes:
      - ./model:/models/gemma4-uncensored
      - ./gemma4_patched.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4.py
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/gemma4-uncensored \
          --served-model-name gemma4-26b-uncensored \
          --host 0.0.0.0 --port 8000 \
          --tensor-parallel-size 1 \
          --dtype auto \
          --quantization compressed-tensors \
          --load-format safetensors \
          --max-model-len 65536 \
          --max-num-seqs 128 \
          --max-num-batched-tokens 16384 \
          --gpu-memory-utilization 0.85 \
          --kv-cache-dtype fp8 \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --trust-remote-code \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4
```

Key Flags

| Flag | Purpose |
|---|---|
| --quantization compressed-tensors | Required for this model's NVFP4 format |
| --kv-cache-dtype fp8 | FP8 KV cache saves memory, enables longer contexts or more concurrent seqs |
| --enable-chunked-prefill | Chunked prefill for large prompts without blocking decode |
| --enable-auto-tool-choice --tool-call-parser gemma4 | Native Gemma 4 tool/function calling |
| --enable-prefix-caching | Big win for agent workloads (shared system prompts) |
| --load-format safetensors | Required for single-file safetensors |
| --max-num-seqs N | Tune for your workload: 8 for long-context, 256 for many short agents |
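Once the server is up, requests go to the standard OpenAI-compatible endpoint. A sketch of the payload shape (the served-model-name and port come from the compose file above; only the payload is built here, sending it requires a running server):

```python
import json

# Chat completion payload for the OpenAI-compatible endpoint.
# POST this to http://localhost:8000/v1/chat/completions
payload = {
    "model": "gemma4-26b-uncensored",
    "messages": [{"role": "user", "content": "Write a haiku about FP4."}],
    "max_tokens": 150,
    "stream": True,
    # Optional: disable thinking for this request (see Thinking / Reasoning)
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload, indent=2))
```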

Scaling Your Config

| Workload | max-model-len | max-num-seqs | KV cache tokens | Best for |
|---|---|---|---|---|
| Long-context (RAG, docs) | 65536 | 8 | ~375K | Few long conversations |
| Mixed (chat + agents) | 8192 | 64 | ~700K | Balanced |
| Many short agents | 2048 | 256 | ~700K | Max throughput (1,812 tok/s) |
| Single-stream quality | 262144 | 1 | ~375K | Max context window |
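The KV-cache-token figures can be sanity-checked with a rough per-token cost. A sketch that deliberately ignores the 1024-token sliding window and K=V sharing, both of which reduce the real cost, so the true capacity is somewhat higher than this estimate:

```python
# Rough per-token KV cache cost for this model with an FP8 KV cache.
n_layers = 30
kv_heads = 8
head_dim = 256
fp8_bytes = 1

# Two planes (K and V) per layer; sliding-window layers and K=V sharing
# on the full-attention layers reduce this in practice.
bytes_per_token = 2 * n_layers * kv_heads * head_dim * fp8_bytes
print(f"~{bytes_per_token / 1024:.0f} KiB per cached token")

def tokens_for_budget(budget_gib: float) -> int:
    """How many KV cache tokens fit in a given memory budget (GiB)."""
    return int(budget_gib * 1024**3 // bytes_per_token)

print(f"{tokens_for_budget(80):,} tokens in an 80 GiB KV budget")
```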

Thinking / Reasoning

Gemma 4 uses internal <think> blocks. Current vLLM (0.19.1) has a known issue (#38855) where --reasoning-parser gemma4 strips thinking tokens but may return empty content when the model doesn't close its thinking block within max_tokens.

Recommended: Omit --reasoning-parser gemma4 and let thinking tokens appear inline in content. Your gateway or client can strip <think>...</think> blocks client-side if needed.
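Client-side stripping is a small regex. A sketch that also drops an unclosed trailing block, which is the failure mode behind issue #38855 when the model hits max_tokens before emitting the closing tag:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from model output."""
    # Closed blocks first, then any unclosed trailing one
    # (the model may hit max_tokens before emitting </think>).
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    text = re.sub(r"<think>.*\Z", "", text, flags=re.DOTALL)
    return text.strip()

print(strip_thinking("<think>chain of thought here</think>The answer is 4."))
```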

To disable thinking per-request:

```json
{"chat_template_kwargs": {"enable_thinking": false}}
```

Generation Config

| Token ID | Token | Purpose |
|---|---|---|
| 1 | <eos> | End of sequence |
| 106 | <turn\|> | End of turn |
| 50 | <\|tool_response> | Tool response delimiter |

Base Model

Quantized from TrevorJS/gemma-4-26B-A4B-it-uncensored, an abliterated (uncensored) version of Google's Gemma 4 26B-A4B-it. Abliteration removes safety refusals while preserving model capabilities and quality.

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 15.3 GB | NVFP4 quantized weights (47,648 keys, single file) |
| config.json | 5 KB | Model + quantization configuration |
| tokenizer.json | 31 MB | Tokenizer (262,144 vocab) |
| tokenizer_config.json | 3 KB | Tokenizer settings + special tokens |
| generation_config.json | 203 B | Generation defaults + EOS tokens |
| chat_template.jinja | 12 KB | Gemma 4 chat template (tool calling + thinking support) |
| preprocessor_config.json | 371 B | Image preprocessor config |
| processor_config.json | 1.6 KB | Multimodal processor config (image + audio + video) |
| recipe.yaml | 237 B | llmcompressor quantization recipe |
| gemma4_patched.py | 63 KB | Patched vLLM model file for compressed-tensors NVFP4 |
| README.md | - | This file |

Building from Source

If you're not on a DGX Spark (SM 12.1), compile vLLM from source for your GPU architecture. See eugr/spark-vllm-docker for the recommended build system, or build manually:

```shell
export TORCH_CUDA_ARCH_LIST="your_sm_version"  # e.g., 8.9 for RTX 4090, 12.0 for B200
git clone https://github.com/vllm-project/vllm.git && cd vllm
pip install -e . --no-build-isolation
pip install "transformers>=5.5.0"
```

Limitations

  • NVFP4 scale mismatch warning: vLLM may warn about different global scales for fused parallel layers (q/k/v projections). This is inherent to compressed-tensors per-tensor quantization and has minimal accuracy impact.
  • Vision: Vision encoder weights are BF16 (not quantized). End-to-end vision + NVFP4 language model works but is less extensively tested than text-only.
  • Thinking tokens: Without --reasoning-parser, thinking blocks appear inline in content. This is by design (see Thinking / Reasoning section).

Disclaimer, Liability Waiver, and Assumption of Risk

THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, the associated container image (ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4), or any derivative works thereof, you expressly acknowledge and agree to the following:

Assumption of Risk

Uncensored language models present materially elevated risks compared to safety-aligned models, including but not limited to: generation of harmful, misleading, illegal, or objectionable content; susceptibility to adversarial misuse; potential for facilitating activities that violate applicable laws or regulations; and amplified risk in automated or agentic pipelines where outputs may be executed without human review.

These tools are powerful and serve a multitude of legitimate and essential purposes -- including security research, red-teaming, content analysis, creative work, and applications where safety filters interfere with valid use cases. However, the absence of safety guardrails demands a correspondingly higher standard of care from the operator. You must implement your own safeguards, content filtering, access controls, and monitoring appropriate to your use case and jurisdiction.

Limitation of Liability

The authors, contributors, and distributors of this model and container image ("Providers") are not responsible or liable, directly or indirectly, for any actions taken, content generated, damages incurred, or legal consequences arising from the use or misuse of these materials.

User Responsibility

You, the user, assume full and sole responsibility and liability for all outputs generated by the model under your operation, ensuring compliance with all applicable laws, and implementing appropriate access controls and human oversight.

Acceptance

By downloading or using any component of this release you indicate your acceptance of these terms and your assumption of all associated risks and liabilities. If you do not agree, do not download or use these materials.

License

This model inherits the Gemma license from Google.
