Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 AWQ_FULL

NVFP4-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking, an abliterated/uncensored Gemma 4 31B dense model with thinking capabilities.

Quantized using NVIDIA ModelOpt 0.42.0 with AWQ_FULL (exhaustive grid search + clipping optimization) on a native B200 GPU for maximum fidelity at 4-bit precision.

SVDQuant variant also available: AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant — same model with SVD decomposition for potentially higher quality.

What Makes This Model Special

This model was quantized using the most thorough NVFP4 quantization pipeline available:

  1. AWQ_FULL — Exhaustive grid search with alpha_step=0.1 (10 scaling factors per layer), plus a second awq_clip pass that optimizes clipping ratios. This is the most thorough AWQ variant, selecting the best per-channel scaling on its search grid rather than estimating it heuristically. (~75 min vs ~11 min for AWQ_LITE)

  2. Full NVFP4 Quantization — All attention projections (Q/K/V/O) AND all MLP layers (gate/up/down) quantized to FP4. No layers left at higher precision (except vision, embeddings, norms, and lm_head).

  3. Native B200 Calibration — Calibrated on NVIDIA B200 (Blackwell SM 12.0) with native FP4 hardware instructions, producing hardware-accurate scale factors.

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (Dense, 31B parameters) |
| Layers | 60 (50 sliding-window + 10 full-attention) |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 5376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt format) |
| Model Size | ~20.5 GB |
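As a sanity check, the ~20.5 GB figure follows from the parameter count and the NVFP4 storage layout (4-bit weights plus one FP8 scale per 16-element block). This back-of-envelope sketch uses the numbers from the table above; the split between BF16 embeddings and FP4-quantized parameters is a rough assumption (tied embeddings, everything else quantized).

```python
# Back-of-envelope size check using the figures from the table above.
# The embedding/quantized split is an assumption for illustration.

total_params = 31.3e9
vocab, hidden = 262_144, 5_376

embed_params = vocab * hidden            # kept in BF16 (2 bytes each)
quant_params = total_params - embed_params

fp4_bytes = quant_params * 0.5           # 4 bits per weight
scale_bytes = quant_params / 16          # one FP8 scale per 16-element block
embed_bytes = embed_params * 2

total_gb = (fp4_bytes + scale_bytes + embed_bytes) / 1e9
print(f"~{total_gb:.1f} GB")             # within ~1 GB of the reported 20.45 GB
```

The small remainder is per-channel pre-scales, input scales, and other metadata not counted here.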

Quantization Pipeline

Three quantization variants were produced and benchmarked on B200:

| Variant | Algorithm | Calibration | Size | Time | Notes |
|---|---|---|---|---|---|
| AWQ_LITE | NVFP4_AWQ_LITE_CFG | 512 samples | 20.45 GB | 10.6 min | Single-pass heuristic |
| AWQ_FULL ⬅️ | NVFP4_AWQ_FULL_CFG | 2048 samples | 20.45 GB | 74.4 min | Exhaustive grid search + clipping |
| SVDQuant | NVFP4_SVDQUANT_DEFAULT_CFG | 2048 samples | 20.94 GB | 69.1 min | SVD decomposition + low-rank residual |

This repo contains the AWQ_FULL variant, which provides the best balance of quality and throughput.

Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
    |
    v
[NVFP4 AWQ_FULL on B200]
    - ModelOpt 0.42.0 with NVFP4_AWQ_FULL_CFG
    - alpha_step=0.1 (10 scaling factors per layer)
    - awq_clip clipping ratio optimization
    - 2048 calibration samples (CNN DailyMail)
    - Native Blackwell FP4 hardware calibration (SM 12.0)
    - Excluded: vision tower, embed_vision, multi_modal_projector
    |
    v
Gemma-4-31B-DECKARD-HERETIC-NVFP4 (~20.5 GB)

Advanced Techniques

AWQ_FULL vs AWQ_LITE

Standard AWQ_LITE uses a single-pass heuristic for channel scaling. AWQ_FULL instead performs an exhaustive grid search with alpha_step=0.1 (10 scaling factors per layer), plus a second awq_clip pass that optimizes clipping ratios. This selects the best per-channel scaling on the search grid at the cost of longer quantization time (~75 min vs ~11 min on B200). The output format and size are identical to AWQ_LITE; only quality improves.
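The shape of that alpha sweep can be sketched as follows. This is a toy illustration only, using synthetic data and plain symmetric fake quantization as a stand-in for NVFP4; it is not ModelOpt's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                                    # toy weight matrix
X = rng.normal(size=(256, 64)) * rng.uniform(0.1, 10, size=64)   # skewed activations

def fake_quant(w, n_bits=4):
    """Symmetric per-tensor fake quantization (stand-in for NVFP4)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

act_mag = np.abs(X).mean(axis=0)       # per-input-channel activation magnitude

best_alpha, best_err = None, float("inf")
for step in range(10):                 # alpha_step=0.1 -> 10 candidate alphas
    alpha = step / 10
    s = act_mag ** alpha               # AWQ per-channel scaling candidate
    s = s / s.mean()                   # normalize so scales stay near 1
    Wq = fake_quant(W * s) / s         # scale up, quantize, fold scale back
    err = np.mean((X @ Wq.T - X @ W.T) ** 2)   # output reconstruction error
    if err < best_err:
        best_alpha, best_err = alpha, err

print(best_alpha)
```

By construction the grid includes alpha=0 (no scaling), so the selected point is never worse than unscaled quantization; the awq_clip pass would then tune clipping thresholds on top of this.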

NVFP4 Weight Format

Each quantized layer stores:

  • weight (uint8) — packed FP4 E2M1 pairs (16-element blocks)
  • weight_scale (float8_e4m3fn) — per-block scale (1 per 16 elements)
  • weight_scale_2 (float32) — per-tensor global scale
  • pre_quant_scale (bfloat16) — AWQ per-channel pre-scaling factors
  • input_scale (float32) — static activation scale from calibration
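A minimal sketch of how the weight tensors combine at dequantization time. The E2M1 value table is fixed by the FP4 format; the nibble packing order here is an assumption for illustration, and the activation-side tensors (pre_quant_scale, input_scale) are applied at runtime rather than in this sketch.

```python
import numpy as np

# The 16 representable FP4 E2M1 values, indexed by 4-bit code
# (sign in the high bit). This value set is fixed by the E2M1 format.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_LUT = np.concatenate([E2M1, -E2M1])   # codes 0-7 positive, 8-15 negative

def dequantize_block(packed, weight_scale, weight_scale_2):
    """Dequantize one 16-element NVFP4 block.

    packed         : 8 uint8 bytes, two FP4 codes per byte
                     (low nibble first -- packing order is an assumption)
    weight_scale   : per-block FP8-e4m3 scale, given here as a float
    weight_scale_2 : per-tensor float32 global scale
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(16, dtype=np.uint8)
    codes[0::2], codes[1::2] = lo, hi
    return FP4_LUT[codes] * weight_scale * weight_scale_2

# Round-trip a block: code 1 -> 0.5, code 9 -> -0.5, etc.
packed = np.array([0x91] * 8, dtype=np.uint8)   # low nibble 1, high nibble 9
block = dequantize_block(packed, weight_scale=2.0, weight_scale_2=0.25)
print(block[:2])   # [0.25, -0.25]
```

The two-level scaling (FP8 per block, float32 per tensor) is what lets 16-element blocks track local dynamic range without a full-precision scale per block.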

Native B200 Calibration

Quantized on NVIDIA B200 with native FP4 hardware instructions (SM 12.0). The AWQ calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors than calibrating on non-FP4 hardware.

Quick Start (DGX Spark)

1. Pull the container

docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest

2. Download the model

pip install -U huggingface-hub

huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b

3. Launch

docker run -d --name vllm-deckard --gpus all --ipc host --network host \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TORCH_MATMUL_PRECISION=high \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v ~/models/deckard-31b:/models/deckard \
  ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest \
  bash -c "vllm serve /models/deckard \
    --served-model-name deckard-31b \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype auto \
    --max-model-len 65536 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 --reasoning-parser gemma4"

Startup takes ~5 minutes (weight loading + torch.compile + CUDA graph capture + FP4 GEMM autotuning). The server is ready when you see Application startup complete.

4. Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

The API is fully OpenAI-compatible — use it with any OpenAI SDK, LangChain, or other client at http://<your-ip>:8000/v1.
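The same request via the official OpenAI Python SDK. The base URL and model name below match the serve command above; the API key is a placeholder, assuming the server was started without --api-key (vLLM does not require one by default).

```python
# Assumes the vLLM server from step 3 is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deckard-31b",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```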

Docker Compose (DGX Spark)

services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype auto \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Key Deployment Flags

| Flag | Purpose |
|---|---|
| --quantization modelopt | Required — tells vLLM to use ModelOpt NVFP4 format |
| --kv-cache-dtype auto | BF16 KV cache on DGX Spark (use fp8 on B200 for 2x compression) |
| --max-model-len 65536 | 64K context — conservative for DGX Spark. Model supports up to 256K; increase with fewer concurrent sequences |
| --max-num-seqs 4 | Concurrent sequences. vLLM pre-allocates KV cache, so balance with context length |
| --reasoning-parser gemma4 | Extracts <think> blocks for thinking/reasoning display |
| --tool-call-parser gemma4 | Enables native function calling |
| --enable-chunked-prefill | Processes long prompts in chunks to avoid OOM |
| --enable-prefix-caching | Caches common prompt prefixes for faster responses |
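When trading --max-model-len against --max-num-seqs, a rough KV-cache estimate helps. This sketch uses the layer split and window size from the Model Details table, but assumes head_dim=256 for the KV heads of every layer (the full-attention layers may use the larger global head_dim), so treat the result as an approximate lower bound.

```python
# Rough KV-cache sizing for the --max-model-len / --max-num-seqs trade-off.
# Layer counts and window size come from the Model Details table; uniform
# head_dim=256 for KV is an assumption, so this is only an estimate.

def kv_cache_gb(seq_len, num_seqs, kv_heads=16, head_dim=256,
                full_layers=10, sliding_layers=50, window=1024,
                bytes_per_elem=2):          # 2 = BF16; use 1 for fp8
    per_tok = 2 * kv_heads * head_dim * bytes_per_elem      # K and V
    full = full_layers * seq_len * per_tok                  # full-attention layers
    sliding = sliding_layers * min(seq_len, window) * per_tok  # window-capped
    return num_seqs * (full + sliding) / 1e9

print(f"{kv_cache_gb(65536, 4):.1f} GB")   # -> 46.3 GB (the defaults used above)
print(f"{kv_cache_gb(262144, 1):.1f} GB")  # -> 43.8 GB (one sequence, full context)
```

The sliding-window layers cap out at 1024 cached tokens per sequence, which is why long contexts are dominated by the 10 full-attention layers, and why --kv-cache-dtype fp8 halves the footprint.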

The container auto-selects FlashInfer CUTLASS for native FP4 GEMM on DGX Spark. No need to set VLLM_NVFP4_GEMM_BACKEND. torch.compile and CUDA graphs are enabled by default for maximum throughput.

Performance Expectations

DGX Spark Estimates

| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |

Dense vs MoE Comparison

| Metric | This Model (31B Dense) | Gemma 4 26B-A4B MoE |
|---|---|---|
| Active params/token | 31.3B | ~4B |
| NVFP4 model size | 20.45 GB | 15.3 GB |
| Expected tok/s (DGX Spark) | ~12-14 | ~43-50 |
| Quality | Higher (full dense) | Good (MoE routing) |
| Best for | Quality-critical tasks | Speed, concurrency |

Speculative Decoding with EAGLE Drafter

This model supports EAGLE-based speculative decoding using the DECKARD E4B drafter (9.6 GB NVFP4). Three patches to vLLM 0.19.1 are required — see the GitHub repo for patched files and full documentation.

Speculative Decoding Performance (DGX Spark)

Benchmarked with E4B drafter, 5 speculative tokens, 300 max tokens per request:

| Concurrent | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4s |
| 2 | 21.7 | 10.8 | 27.7s |
| 4 | 42.7 | 10.7 | 28.1s |
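The pattern above (aggregate throughput scaling with concurrency while per-request speed stays roughly flat) is typical of speculative decoding, where per-request gains are bounded by the drafter's acceptance rate. A quick way to reason about that ceiling, with the per-token acceptance probability a as an assumed illustrative value rather than a measured one:

```python
# Expected tokens generated per target-model step with k speculative
# tokens and per-token acceptance probability a (standard geometric
# expectation; the target model always contributes at least 1 token).

def expected_tokens_per_step(a, k):
    if a == 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"a={a}: {expected_tokens_per_step(a, 5):.2f} tokens/step")
```

With num_speculative_tokens=5 the theoretical maximum is 6 tokens per step; realistic acceptance rates put the expected gain well below that, consistent with the modest per-request speedups in the table.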

Quick Start with Drafter

Add --speculative-config to your vLLM serve command:

vllm serve /models/deckard \
  --served-model-name deckard-31b \
  --quantization modelopt \
  --dtype auto --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 --reasoning-parser gemma4 \
  --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'

Requires three patched files mounted into the container. See GitHub repo for details.

Required Patches

  1. eagle_patched.py — Removes multimodal spec decode guard, adds Gemma4 model whitelist, supports multi-group KV cache (heterogeneous head_dim=256/512)
  2. serving_chat_patched.py — Fixes non-streaming reasoning parser (<|channel> tokens stripped by skip_special_tokens=True)
  3. modelopt_patched.py — NVFP4 AWQ support + FP8 NaN scrubbing

License

This model inherits the Gemma license from the base model.

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.
