Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 AWQ_FULL

NVFP4-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking, an abliterated/uncensored Gemma 4 31B dense model with thinking capabilities.

Quantized using NVIDIA ModelOpt 0.42.0 with AWQ_FULL (exhaustive grid search + clipping optimization) on a native B200 GPU for maximum fidelity at 4-bit precision.

SVDQuant variant also available: AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant — same model with SVD decomposition for potentially higher quality.

What Makes This Model Special

This model was quantized using the most thorough NVFP4 quantization pipeline available:

  1. AWQ_FULL — Exhaustive grid search with alpha_step=0.1 (10 scaling factors per layer), plus a second awq_clip pass that optimizes clipping ratios. This is the most thorough AWQ variant, selecting the best per-channel scaling on its search grid rather than estimating it heuristically. (~75 min vs ~11 min for AWQ_LITE)

  2. Full NVFP4 Quantization — All attention projections (Q/K/V/O) AND all MLP layers (gate/up/down) quantized to FP4. No layers left at higher precision (except vision, embeddings, norms, and lm_head).

  3. Native B200 Calibration — Calibrated on NVIDIA B200 (Blackwell SM 12.0) with native FP4 hardware instructions, producing hardware-accurate scale factors.

Model Details

| Property | Value |
|---|---|
| Architecture | Gemma 4 (Dense, 31B parameters) |
| Layers | 60 (50 sliding-window + 10 full-attention) |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 5376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt format) |
| Model Size | ~20.5 GB |
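As a sanity check, the ~20.5 GB figure follows from the parameter count and the NVFP4 storage layout (4-bit weights plus one FP8 scale per 16-element block). This back-of-envelope sketch uses the numbers from the table above; the split between BF16 embeddings and FP4-quantized parameters is a rough assumption (tied embeddings, everything else quantized).

```python
# Back-of-envelope size check using the figures from the table above.
# The embedding/quantized split is an assumption for illustration.

total_params = 31.3e9
vocab, hidden = 262_144, 5_376

embed_params = vocab * hidden            # kept in BF16 (2 bytes each)
quant_params = total_params - embed_params

fp4_bytes = quant_params * 0.5           # 4 bits per weight
scale_bytes = quant_params / 16          # one FP8 scale per 16-element block
embed_bytes = embed_params * 2

total_gb = (fp4_bytes + scale_bytes + embed_bytes) / 1e9
print(f"~{total_gb:.1f} GB")             # within ~1 GB of the reported 20.45 GB
```

The small remainder is per-channel pre-scales, input scales, and other metadata not counted here.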

Quantization Pipeline

Three quantization variants were produced and benchmarked on B200:

| Variant | Algorithm | Calibration | Size | Time | Notes |
|---|---|---|---|---|---|
| AWQ_LITE | NVFP4_AWQ_LITE_CFG | 512 samples | 20.45 GB | 10.6 min | Single-pass heuristic |
| AWQ_FULL ⬅️ | NVFP4_AWQ_FULL_CFG | 2048 samples | 20.45 GB | 74.4 min | Exhaustive grid search + clipping |
| SVDQuant | NVFP4_SVDQUANT_DEFAULT_CFG | 2048 samples | 20.94 GB | 69.1 min | SVD decomposition + low-rank residual |

This repo contains the AWQ_FULL variant, which provides the best balance of quality and throughput.

Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
    |
    v
[NVFP4 AWQ_FULL on B200]
    - ModelOpt 0.42.0 with NVFP4_AWQ_FULL_CFG
    - alpha_step=0.1 (10 scaling factors per layer)
    - awq_clip clipping ratio optimization
    - 2048 calibration samples (CNN DailyMail)
    - Native Blackwell FP4 hardware calibration (SM 12.0)
    - Excluded: vision tower, embed_vision, multi_modal_projector
    |
    v
Gemma-4-31B-DECKARD-HERETIC-NVFP4 (~20.5 GB)

Advanced Techniques

AWQ_FULL vs AWQ_LITE

Standard AWQ_LITE uses a single-pass heuristic for channel scaling. AWQ_FULL instead performs an exhaustive grid search with alpha_step=0.1 (10 scaling factors per layer), plus a second awq_clip pass that optimizes clipping ratios. This selects the best per-channel scaling on the search grid at the cost of longer quantization time (~75 min vs ~11 min on B200). The output format and size are identical to AWQ_LITE; only quality improves.
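The shape of that alpha sweep can be sketched as follows. This is a toy illustration only, using synthetic data and plain symmetric fake quantization as a stand-in for NVFP4; it is not ModelOpt's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                                    # toy weight matrix
X = rng.normal(size=(256, 64)) * rng.uniform(0.1, 10, size=64)   # skewed activations

def fake_quant(w, n_bits=4):
    """Symmetric per-tensor fake quantization (stand-in for NVFP4)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

act_mag = np.abs(X).mean(axis=0)       # per-input-channel activation magnitude

best_alpha, best_err = None, float("inf")
for step in range(10):                 # alpha_step=0.1 -> 10 candidate alphas
    alpha = step / 10
    s = act_mag ** alpha               # AWQ per-channel scaling candidate
    s = s / s.mean()                   # normalize so scales stay near 1
    Wq = fake_quant(W * s) / s         # scale up, quantize, fold scale back
    err = np.mean((X @ Wq.T - X @ W.T) ** 2)   # output reconstruction error
    if err < best_err:
        best_alpha, best_err = alpha, err

print(best_alpha)
```

By construction the grid includes alpha=0 (no scaling), so the selected point is never worse than unscaled quantization; the awq_clip pass would then tune clipping thresholds on top of this.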

NVFP4 Weight Format

Each quantized layer stores:

  • weight (uint8) — packed FP4 E2M1 pairs (16-element blocks)
  • weight_scale (float8_e4m3fn) — per-block scale (1 per 16 elements)
  • weight_scale_2 (float32) — per-tensor global scale
  • pre_quant_scale (bfloat16) — AWQ per-channel pre-scaling factors
  • input_scale (float32) — static activation scale from calibration
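A minimal sketch of how the weight tensors combine at dequantization time. The E2M1 value table is fixed by the FP4 format; the nibble packing order here is an assumption for illustration, and the activation-side tensors (pre_quant_scale, input_scale) are applied at runtime rather than in this sketch.

```python
import numpy as np

# The 16 representable FP4 E2M1 values, indexed by 4-bit code
# (sign in the high bit). This value set is fixed by the E2M1 format.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_LUT = np.concatenate([E2M1, -E2M1])   # codes 0-7 positive, 8-15 negative

def dequantize_block(packed, weight_scale, weight_scale_2):
    """Dequantize one 16-element NVFP4 block.

    packed         : 8 uint8 bytes, two FP4 codes per byte
                     (low nibble first -- packing order is an assumption)
    weight_scale   : per-block FP8-e4m3 scale, given here as a float
    weight_scale_2 : per-tensor float32 global scale
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(16, dtype=np.uint8)
    codes[0::2], codes[1::2] = lo, hi
    return FP4_LUT[codes] * weight_scale * weight_scale_2

# Round-trip a block: code 1 -> 0.5, code 9 -> -0.5, etc.
packed = np.array([0x91] * 8, dtype=np.uint8)   # low nibble 1, high nibble 9
block = dequantize_block(packed, weight_scale=2.0, weight_scale_2=0.25)
print(block[:2])   # [0.25, -0.25]
```

The two-level scaling (FP8 per block, float32 per tensor) is what lets 16-element blocks track local dynamic range without a full-precision scale per block.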

Native B200 Calibration

Quantized on NVIDIA B200 with native FP4 hardware instructions (SM 12.0). The AWQ calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors than calibrating on non-FP4 hardware.

Quick Start (DGX Spark)

1. Pull the container

docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest

2. Download the model

pip install -U huggingface-hub

huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b

3. Launch

docker run -d --name vllm-deckard --gpus all --ipc host --network host \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TORCH_MATMUL_PRECISION=high \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v ~/models/deckard-31b:/models/deckard \
  ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest \
  bash -c "vllm serve /models/deckard \
    --served-model-name deckard-31b \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype auto \
    --max-model-len 65536 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 --reasoning-parser gemma4"

Startup takes ~5 minutes (weight loading + torch.compile + CUDA graph capture + FP4 GEMM autotuning). The server is ready when you see Application startup complete.

4. Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

The API is fully OpenAI-compatible — use it with any OpenAI SDK, LangChain, or other client at http://<your-ip>:8000/v1.
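The same request via the official OpenAI Python SDK. The base URL and model name below match the serve command above; the API key is a placeholder, assuming the server was started without --api-key (vLLM does not require one by default).

```python
# Assumes the vLLM server from step 3 is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deckard-31b",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```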

Docker Compose (DGX Spark)

services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype auto \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Key Deployment Flags

| Flag | Purpose |
|---|---|
| --quantization modelopt | Required — tells vLLM to use ModelOpt NVFP4 format |
| --kv-cache-dtype auto | BF16 KV cache on DGX Spark (use fp8 on B200 for 2x compression) |
| --max-model-len 65536 | 64K context — conservative for DGX Spark. Model supports up to 256K; increase with fewer concurrent sequences |
| --max-num-seqs 4 | Concurrent sequences. vLLM pre-allocates KV cache, so balance with context length |
| --reasoning-parser gemma4 | Extracts <think> blocks for thinking/reasoning display |
| --tool-call-parser gemma4 | Enables native function calling |
| --enable-chunked-prefill | Processes long prompts in chunks to avoid OOM |
| --enable-prefix-caching | Caches common prompt prefixes for faster responses |
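When trading --max-model-len against --max-num-seqs, a rough KV-cache estimate helps. This sketch uses the layer split and window size from the Model Details table, but assumes head_dim=256 for the KV heads of every layer (the full-attention layers may use the larger global head_dim), so treat the result as an approximate lower bound.

```python
# Rough KV-cache sizing for the --max-model-len / --max-num-seqs trade-off.
# Layer counts and window size come from the Model Details table; uniform
# head_dim=256 for KV is an assumption, so this is only an estimate.

def kv_cache_gb(seq_len, num_seqs, kv_heads=16, head_dim=256,
                full_layers=10, sliding_layers=50, window=1024,
                bytes_per_elem=2):          # 2 = BF16; use 1 for fp8
    per_tok = 2 * kv_heads * head_dim * bytes_per_elem      # K and V
    full = full_layers * seq_len * per_tok                  # full-attention layers
    sliding = sliding_layers * min(seq_len, window) * per_tok  # window-capped
    return num_seqs * (full + sliding) / 1e9

print(f"{kv_cache_gb(65536, 4):.1f} GB")   # -> 46.3 GB (the defaults used above)
print(f"{kv_cache_gb(262144, 1):.1f} GB")  # -> 43.8 GB (one sequence, full context)
```

The sliding-window layers cap out at 1024 cached tokens per sequence, which is why long contexts are dominated by the 10 full-attention layers, and why --kv-cache-dtype fp8 halves the footprint.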

The container auto-selects FlashInfer CUTLASS for native FP4 GEMM on DGX Spark. No need to set VLLM_NVFP4_GEMM_BACKEND. torch.compile and CUDA graphs are enabled by default for maximum throughput.

Performance Expectations

DGX Spark Estimates

| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |

Dense vs MoE Comparison

| Metric | This Model (31B Dense) | Gemma 4 26B-A4B MoE |
|---|---|---|
| Active params/token | 31.3B | ~4B |
| NVFP4 model size | 20.45 GB | 15.3 GB |
| Expected tok/s (DGX Spark) | ~12-14 | ~43-50 |
| Quality | Higher (full dense) | Good (MoE routing) |
| Best for | Quality-critical tasks | Speed, concurrency |

Speculative Decoding with EAGLE Drafter

This model supports EAGLE-based speculative decoding using the DECKARD E4B drafter (9.6 GB NVFP4). Three patches to vLLM 0.19.1 are required — see the GitHub repo for patched files and full documentation.

Speculative Decoding Performance (DGX Spark)

Benchmarked with E4B drafter, 5 speculative tokens, 300 max tokens per request:

| Concurrent | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4s |
| 2 | 21.7 | 10.8 | 27.7s |
| 4 | 42.7 | 10.7 | 28.1s |
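The pattern above (aggregate throughput scaling with concurrency while per-request speed stays roughly flat) is typical of speculative decoding, where per-request gains are bounded by the drafter's acceptance rate. A quick way to reason about that ceiling, with the per-token acceptance probability a as an assumed illustrative value rather than a measured one:

```python
# Expected tokens generated per target-model step with k speculative
# tokens and per-token acceptance probability a (standard geometric
# expectation; the target model always contributes at least 1 token).

def expected_tokens_per_step(a, k):
    if a == 1.0:
        return k + 1.0
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.7, 0.8):
    print(f"a={a}: {expected_tokens_per_step(a, 5):.2f} tokens/step")
```

With num_speculative_tokens=5 the theoretical maximum is 6 tokens per step; realistic acceptance rates put the expected gain well below that, consistent with the modest per-request speedups in the table.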

Quick Start with Drafter

Add --speculative-config to your vLLM serve command:

vllm serve /models/deckard \
  --served-model-name deckard-31b \
  --quantization modelopt \
  --dtype auto --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 --reasoning-parser gemma4 \
  --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'

Requires three patched files mounted into the container. See GitHub repo for details.

Required Patches

  1. eagle_patched.py — Removes multimodal spec decode guard, adds Gemma4 model whitelist, supports multi-group KV cache (heterogeneous head_dim=256/512)
  2. serving_chat_patched.py — Fixes non-streaming reasoning parser (<|channel> tokens stripped by skip_special_tokens=True)
  3. modelopt_patched.py — NVFP4 AWQ support + FP8 NaN scrubbing

License

This model inherits the Gemma license from the base model.

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.
