# Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 AWQ_FULL
NVFP4-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking, an abliterated/uncensored Gemma 4 31B dense model with thinking capabilities.
Quantized using NVIDIA ModelOpt 0.42.0 with AWQ_FULL (exhaustive grid search + clipping optimization) on a native B200 GPU for maximum fidelity at 4-bit precision.
SVDQuant variant also available: AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant — same model with SVD decomposition for potentially higher quality.
## What Makes This Model Special
This model was quantized using the most thorough NVFP4 quantization pipeline available:
- **AWQ_FULL** — exhaustive grid search with `alpha_step=0.1` across 10 scaling factors per layer, plus a second `awq_clip` pass that optimizes clipping ratios. This is the most thorough AWQ variant, finding the best per-channel scaling within the search grid (~75 min vs ~11 min for AWQ_LITE).
- **Full NVFP4 Quantization** — all attention projections (Q/K/V/O) and all MLP layers (gate/up/down) quantized to FP4. No layers are left at higher precision, except the vision tower, embeddings, norms, and lm_head.
- **Native B200 Calibration** — calibrated on NVIDIA B200 (Blackwell SM 12.0) with native FP4 hardware instructions, producing hardware-accurate scale factors.
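To make the grid search concrete, here is a minimal stand-alone sketch of the AWQ idea: sweep a per-channel scaling exponent `alpha` over [0, 1], fake-quantize the scaled weights, and keep the `alpha` with the lowest reconstruction error. This is an illustration of the technique only, not ModelOpt's implementation; the uniform `fake_quant` is a stand-in for FP4 E2M1 rounding, and all names here are hypothetical.

```python
import random

random.seed(0)

def fake_quant(w, n_levels=16):
    """Symmetric uniform fake-quantization (stand-in for FP4 E2M1 rounding)."""
    scale = max(abs(x) for x in w) / (n_levels / 2 - 1) or 1.0
    return [round(x / scale) * scale for x in w]

def awq_grid_search(weights, act_mag, alpha_step=0.1):
    """Sweep alpha over [0, 1]; scale each channel by act_mag**alpha before
    quantization, undo it afterwards, and keep the alpha with lowest error."""
    best_alpha, best_err = 0.0, float("inf")
    steps = int(round(1 / alpha_step))
    for i in range(steps + 1):
        alpha = i * alpha_step
        scaled = [w * (m ** alpha) for w, m in zip(weights, act_mag)]
        deq = fake_quant(scaled)
        # Error measured back in the original weight space.
        err = sum((d / (m ** alpha) - w) ** 2
                  for d, w, m in zip(deq, weights, act_mag))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err

weights = [random.gauss(0, 1) for _ in range(256)]
act_mag = [abs(random.gauss(0, 2)) + 0.1 for _ in range(256)]
alpha, err = awq_grid_search(weights, act_mag)
print(alpha, err)
```

By construction the search can never do worse than plain quantization, since `alpha = 0` (no rescaling) is one of the candidates.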
## Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 (Dense, 31B parameters) |
| Layers | 60 (50 sliding-window + 10 full-attention) |
| Sliding Window | 1024 tokens |
| Max Context | 262,144 tokens |
| Hidden Size | 5376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads), head_dim=256, global_head_dim=512 |
| Vision Encoder | 27-layer ViT (1152 hidden) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt format) |
| Model Size | ~20.5 GB |
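The ~20.5 GB figure is consistent with a back-of-the-envelope estimate from NVFP4's storage layout: ~0.5 bytes per quantized weight plus one FP8 scale per 16-element block, with the embedding table and vision tower kept in BF16. The parameter splits below are rough guesses inferred from the config (the vision-tower count especially), so treat this as a sanity check, not an exact accounting:

```python
# Rough size estimate for the NVFP4 checkpoint (assumed parameter splits).
total_params  = 31.3e9
embed_params  = 262_144 * 5376      # token embeddings, kept in BF16
vision_params = 0.43e9              # ~27-layer ViT @ 1152 hidden, BF16 (guess)

fp4_bytes_per_weight = 0.5          # two E2M1 values packed per uint8
scale_overhead       = 1 / 16       # one FP8 scale per 16-element block

quantized = total_params - embed_params - vision_params
size_bytes = (quantized * (fp4_bytes_per_weight + scale_overhead)
              + (embed_params + vision_params) * 2)  # BF16 = 2 bytes/param
print(f"{size_bytes / 1e9:.1f} GB")  # lands near the reported ~20.5 GB
```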
## Quantization Pipeline
Three quantization variants were produced and benchmarked on B200:
| Variant | Algorithm | Calibration | Size | Time | Notes |
|---|---|---|---|---|---|
| AWQ_LITE | `NVFP4_AWQ_LITE_CFG` | 512 samples | 20.45 GB | 10.6 min | Single-pass heuristic |
| **AWQ_FULL** ⬅️ | `NVFP4_AWQ_FULL_CFG` | 2048 samples | 20.45 GB | 74.4 min | Exhaustive grid search + clipping |
| SVDQuant | `NVFP4_SVDQUANT_DEFAULT_CFG` | 2048 samples | 20.94 GB | 69.1 min | SVD decomposition + low-rank residual |
This repo contains the AWQ_FULL variant, which provides the best balance of quality and throughput.
```
Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
        |
        v
[NVFP4 AWQ_FULL on B200]
  - ModelOpt 0.42.0 with NVFP4_AWQ_FULL_CFG
  - alpha_step=0.1 (10 scaling factors per layer)
  - awq_clip clipping ratio optimization
  - 2048 calibration samples (CNN DailyMail)
  - Native Blackwell FP4 hardware calibration (SM 12.0)
  - Excluded: vision tower, embed_vision, multi_modal_projector
        |
        v
Gemma-4-31B-DECKARD-HERETIC-NVFP4 (~20.5 GB)
```
## Advanced Techniques
### AWQ_FULL vs AWQ_LITE
Standard AWQ_LITE uses a single-pass heuristic for channel scaling. AWQ_FULL performs an exhaustive grid search with `alpha_step=0.1` across 10 scaling factors per layer, plus a second `awq_clip` pass that optimizes clipping ratios. This finds the best per-channel scaling within the search grid, at the cost of longer quantization time (~75 min vs ~11 min on B200). Both variants produce the same output format and size; AWQ_FULL simply searches harder, so its quality is equal or better.
### NVFP4 Weight Format
Each quantized layer stores:
- `weight` (uint8) — packed FP4 E2M1 pairs (16-element blocks)
- `weight_scale` (float8_e4m3fn) — per-block scale (1 per 16 elements)
- `weight_scale_2` (float32) — per-tensor global scale
- `pre_quant_scale` (bfloat16) — AWQ per-channel pre-scaling factors
- `input_scale` (float32) — static activation scale from calibration
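To make the layout concrete, here is a stand-alone sketch of dequantizing one 16-element block. It uses a plain-Python E2M1 lookup table and plain floats instead of ModelOpt's kernels, assumes low-nibble-first packing (the real kernels may differ), and ignores `pre_quant_scale`/`input_scale`, which apply on the activation side:

```python
# All 16 values representable in FP4 E2M1 (sign bit, 2-bit exponent, 1-bit mantissa).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1 = E2M1 + [-v for v in E2M1]            # codes 8..15 are the negatives

def dequant_block(packed, block_scale, global_scale):
    """packed: 8 uint8 values -> 16 FP4 codes -> 16 floats.
    block_scale plays the role of weight_scale, global_scale of weight_scale_2."""
    out = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):  # assumed: low nibble first
            out.append(E2M1[code] * block_scale * global_scale)
    return out

# Example: every byte packs codes (2, 15), i.e. the pair (1.0, -6.0).
packed = [0x2 | (0xF << 4)] * 8
vals = dequant_block(packed, block_scale=0.25, global_scale=2.0)
print(vals[:2])   # [0.5, -3.0]
```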
### Native B200 Calibration
Quantized on NVIDIA B200 with native FP4 hardware instructions (SM 12.0). The AWQ calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors than calibrating on non-FP4 hardware.
## Quick Start (DGX Spark)
1. Pull the container
```bash
docker pull ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
```
2. Download the model
```bash
pip install -U huggingface-hub
huggingface-cli download AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 \
  --local-dir ~/models/deckard-31b
```
3. Launch
```bash
docker run -d --name vllm-deckard --gpus all --ipc host --network host \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TORCH_MATMUL_PRECISION=high \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v ~/models/deckard-31b:/models/deckard \
  ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest \
  bash -c "vllm serve /models/deckard \
    --served-model-name deckard-31b \
    --quantization modelopt \
    --dtype auto --kv-cache-dtype auto \
    --max-model-len 65536 --max-num-seqs 4 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --enable-chunked-prefill --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 --reasoning-parser gemma4"
```
Startup takes ~5 minutes (weight loading + torch.compile + CUDA graph capture + FP4 GEMM autotuning). The server is ready when you see `Application startup complete.`
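Rather than tailing logs for the startup message, a script can poll vLLM's standard `/health` endpoint, which answers HTTP 200 once the engine is up. A minimal stdlib-only helper (the URL and timeouts are just the defaults used in this guide):

```python
import time
import urllib.request
import urllib.error

def wait_for_vllm(url="http://localhost:8000/health", timeout_s=600, poll_s=5):
    """Block until the vLLM server answers /health, or raise on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass                       # server not up yet; keep polling
        time.sleep(poll_s)
    raise TimeoutError(f"vLLM not healthy after {timeout_s}s: {url}")

# Usage after `docker run`:
#   wait_for_vllm(); print("server ready")
```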
4. Test
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deckard-31b",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'
```
The API is fully OpenAI-compatible — use it with any OpenAI SDK, LangChain, or other client at http://<your-ip>:8000/v1.
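For scripting without an SDK, the same endpoint can be driven with nothing but the Python standard library. The payload mirrors the curl call above; the network call is wrapped so the snippet degrades gracefully when no server is reachable:

```python
import json
import urllib.request

payload = {
    "model": "deckard-31b",
    "messages": [{"role": "user",
                  "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
except OSError as exc:   # connection refused, timeout, etc.
    print(f"server unreachable: {exc}")
```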
## Docker Compose (DGX Spark)
```yaml
services:
  vllm:
    image: ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4-awq:latest
    container_name: vllm-deckard-31b
    restart: unless-stopped
    network_mode: host
    volumes:
      - ~/models/deckard-31b:/models/deckard
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - bash
      - -c
      - |
        exec vllm serve /models/deckard \
          --served-model-name deckard-31b \
          --quantization modelopt \
          --dtype auto \
          --kv-cache-dtype auto \
          --max-model-len 65536 \
          --max-num-seqs 4 \
          --gpu-memory-utilization 0.85 \
          --trust-remote-code \
          --enable-chunked-prefill \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser gemma4 \
          --reasoning-parser gemma4
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
## Key Deployment Flags
| Flag | Purpose |
|---|---|
| `--quantization modelopt` | **Required** — tells vLLM to use ModelOpt NVFP4 format |
| `--kv-cache-dtype auto` | BF16 KV cache on DGX Spark (use `fp8` on B200 for 2x compression) |
| `--max-model-len 65536` | 64K context, conservative for DGX Spark. Model supports up to 256K; increase with fewer concurrent sequences |
| `--max-num-seqs 4` | Concurrent sequences. vLLM pre-allocates KV cache, so balance with context length |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks for thinking/reasoning display |
| `--tool-call-parser gemma4` | Enables native function calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks to avoid OOM |
| `--enable-prefix-caching` | Caches common prompt prefixes for faster responses |
The container auto-selects FlashInfer CUTLASS for native FP4 GEMM on DGX Spark. No need to set VLLM_NVFP4_GEMM_BACKEND. torch.compile and CUDA graphs are enabled by default for maximum throughput.
## Performance Expectations
### DGX Spark Estimates
| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |
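These estimates follow from a simple memory-bandwidth roofline: single-stream decoding must stream every weight once per generated token, so throughput is bounded by bandwidth divided by model size. Assuming DGX Spark's published ~273 GB/s LPDDR5x bandwidth (effective bandwidth in practice is lower):

```python
def decode_tok_s(model_gb, bandwidth_gb_s=273, efficiency=1.0):
    """Upper bound on single-stream decode throughput: every weight is
    read from memory once per generated token."""
    return bandwidth_gb_s * efficiency / model_gb

print(f"BF16  : {decode_tok_s(62):.1f} tok/s ceiling")     # ~4.4
print(f"NVFP4 : {decode_tok_s(20.45):.1f} tok/s ceiling")  # ~13.3
```

The NVFP4 ceiling of ~13 tok/s sits inside the ~12-14 tok/s range above, which is what you would expect from a bandwidth-bound workload.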
### Dense vs MoE Comparison
| Metric | This Model (31B Dense) | Gemma 4 26B-A4B MoE |
|---|---|---|
| Active params/token | 31.3B | ~4B |
| NVFP4 model size | 20.45 GB | 15.3 GB |
| Expected tok/s (DGX Spark) | ~12-14 | ~43-50 |
| Quality | Higher (full dense) | Good (MoE routing) |
| Best for | Quality-critical tasks | Speed, concurrency |
## Speculative Decoding with EAGLE Drafter
This model supports EAGLE-based speculative decoding using the DECKARD E4B drafter (9.6 GB NVFP4). Three patches to vLLM 0.19.1 are required — see the GitHub repo for patched files and full documentation.
### Speculative Decoding Performance (DGX Spark)
Benchmarked with E4B drafter, 5 speculative tokens, 300 max tokens per request:
| Concurrent | Aggregate tok/s | Per-Request tok/s | Avg Latency (300 tok) |
|---|---|---|---|
| 1 | 7.6 | 8.9 | 39.4s |
| 2 | 21.7 | 10.8 | 27.7s |
| 4 | 42.7 | 10.7 | 28.1s |
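These numbers depend heavily on how often the target model accepts the drafter's tokens. Under the standard speculative-decoding model (i.i.d. per-token acceptance probability `a`, `k` draft tokens per step), each target forward pass emits (1 - a^(k+1)) / (1 - a) tokens in expectation. A quick calculator; the acceptance rates below are illustrative, not measured for the E4B drafter:

```python
def expected_tokens_per_step(accept_rate, k=5):
    """Expected tokens emitted per target forward pass with k draft tokens,
    assuming i.i.d. per-token acceptance (geometric-series closed form)."""
    a = accept_rate
    if a == 1.0:
        return k + 1          # every draft token plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.5, 0.7, 0.9):     # illustrative acceptance rates
    print(f"a={a}: {expected_tokens_per_step(a):.2f} tokens/step")
```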
### Quick Start with Drafter
Add `--speculative-config` to your vLLM serve command:
```bash
vllm serve /models/deckard \
  --served-model-name deckard-31b \
  --quantization modelopt \
  --dtype auto --kv-cache-dtype fp8 \
  --max-model-len 131072 --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 --reasoning-parser gemma4 \
  --speculative-config '{"method":"draft_model","model":"/models/e4b-drafter","num_speculative_tokens":5,"quantization":"modelopt"}'
```
Requires three patched files mounted into the container. See GitHub repo for details.
### Required Patches
- `eagle_patched.py` — removes the multimodal spec-decode guard, adds a Gemma4 model whitelist, supports multi-group KV cache (heterogeneous head_dim=256/512)
- `serving_chat_patched.py` — fixes the non-streaming reasoning parser (`<|channel>` tokens stripped by `skip_special_tokens=True`)
- `modelopt_patched.py` — NVFP4 AWQ support + FP8 NaN scrubbing
## Related Models
- **GitHub repo:** AEON-7/Gemma-4-31B-DECKARD-HERETIC-Uncensored-NVFP4 — deployment docs, Docker Compose, speculative decoding patches
- **EAGLE E4B Drafter:** AEON-7/Gemma-4-E4B-DECKARD-HERETIC-Uncensored-NVFP4 — 9.6 GB EAGLE drafter for speculative decoding | GitHub
- **SVDQuant variant:** AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant — SVD decomposition for potentially higher quality
- **Gemma 4 MoE NVFP4:** AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 — MoE variant, faster throughput
- **Docker container:** ghcr.io/aeon-7/vllm-spark-gemma4-nvfp4
- **Base model:** DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking
## License
This model inherits the Gemma license from the base model.
## Legal Disclaimer
THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.
Model tree for AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4
Base model
google/gemma-4-31B-it