Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored GPTQ Int4

GPTQ INT4 quantization of DavidAU/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking.

Original Model

  • Base Architecture: Qwen3.5 dense (40B parameters, 96 layers)
  • Expanded from: Qwen3.5-27B (64 layers → 96 layers for enhanced reasoning)
  • Hybrid attention: Linear attention (Gated DeltaNet) + full attention layers
  • Fine-tuned on: Claude 4.6 Opus Deckard-Heretic uncensored thinking data
  • Features: Deep reasoning, thinking mode, tool calling support, uncensored
  • Original Size: ~80 GB (BF16)

Quantization Details

  • Method: GPTQ via GPTQModel v5.8.0
  • Settings: Matching Qwen official GPTQ-Int4 recipe
    • Bits: 4
    • Group size: 128
    • Symmetric: True
    • Desc act: False
    • True sequential: True
    • Damp percent: 0.01
  • Calibration: 256 samples from allenai/c4
  • Dynamic exclusions (BF16): Matching Qwen official mixed-precision strategy — only MLP layers quantized to Int4:
    • lm_head — output head (BF16)
    • model.language_model.embed_tokens — input embeddings (BF16)
    • .*attn.* — all attention layers, both linear and full (BF16)
    • .*mtp.* — multi-token prediction layers (BF16)
    • .*visual.* — vision encoder modules (BF16)
  • Quantized on: NVIDIA A100 80GB PCIe (RunPod)
  • Quantized model size: 38 GB (10 safetensors shards)
  • Quantization time: ~38 minutes on A100 80GB
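The dynamic-exclusion patterns above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the GPTQModel API: the regexes are copied from this card, the module names are hypothetical examples in the Qwen3.5 weight-key layout, and `re.fullmatch` stands in for however the quantizer actually applies its exclusion list.

```python
import re

# BF16 exclusion patterns from the recipe above: any module matching one
# of these regexes keeps full precision; everything else (the MLP
# projections) is quantized to Int4.
BF16_PATTERNS = [
    r"lm_head",
    r"model\.language_model\.embed_tokens",
    r".*attn.*",
    r".*mtp.*",
    r".*visual.*",
]

def kept_in_bf16(module_name: str) -> bool:
    """True if the module is excluded from Int4 quantization."""
    return any(re.fullmatch(p, module_name) for p in BF16_PATTERNS)

# Hypothetical weight keys, for illustration only:
assert kept_in_bf16("language_model.model.layers.0.self_attn.q_proj")      # attention -> BF16
assert not kept_in_bf16("language_model.model.layers.0.mlp.down_proj")     # MLP -> Int4
```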

Config Format

This model uses the nested Qwen3.5 config format (matching official Qwen models):

  • Top-level: model_type: "qwen3_5", architectures: ["Qwen3_5ForConditionalGeneration"]
  • Inner: text_config with model_type: "qwen3_5_text"
  • Weight keys use language_model.model.layers.* prefix (Qwen3.5 standard)
  • Includes preprocessor_config.json for compatibility

Compatible with vLLM and SGLang out of the box.
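The nesting described above can be pictured as a minimal config skeleton. The field values below are illustrative, copied from this card; the real `config.json` carries many more keys.

```python
import json

# Minimal sketch of the nested Qwen3.5 config layout (illustrative only).
config = {
    "model_type": "qwen3_5",
    "architectures": ["Qwen3_5ForConditionalGeneration"],
    "text_config": {
        "model_type": "qwen3_5_text",
        "num_hidden_layers": 96,
    },
}

# Loaders that only read the top-level model_type see "qwen3_5";
# the text backbone's own type lives one level down in text_config.
assert config["model_type"] == "qwen3_5"
assert config["text_config"]["model_type"] == "qwen3_5_text"

print(json.dumps(config, indent=2))
```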

Serving

vLLM (tested and recommended)

Tested on 4x RTX 3060 (12GB each, TP=4) with vLLM 0.18.0:

vllm serve raydelossantos/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-GPTQ-Int4 \
    --quantization gptq \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --max-model-len 4096 \
    --enforce-eager \
    --trust-remote-code \
    --served-model-name qwen3.5-40b-claude \
    --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice


Tested package versions (working as of 2026-03-20):

Package          Version  Notes
vllm             0.18.0   Stable release
transformers     5.3.0    Required for qwen3_5 model_type support
torch            2.10.0   CUDA 12.8
huggingface_hub  1.7.2
flash-attn       2.8.3    Pre-built for cu128/torch2.10/sm80_86_90

Important notes:

  • --dtype float16 is required (GPTQ Exllama kernel needs FP16, not BF16)
  • --enforce-eager recommended for stability on consumer GPUs (disables CUDA graphs)
  • --quantization gptq forces the slower but more compatible GPTQ kernel. Omit to use gptq_marlin for faster inference (vLLM auto-detects)
  • Reasoning output uses <think>...</think> tags (qwen3 parser)
  • Tool calls use Qwen3 XML format (--tool-call-parser qwen3_xml)
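When the server-side qwen3 reasoning parser is enabled, vLLM already separates reasoning from the answer for you. If you are consuming raw (unparsed) output instead, the `<think>...</think>` convention noted above can be split client-side with a small helper; this is a sketch of the convention, not vLLM's parser.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer) using the
    <think>...</think> tags described above. Returns empty reasoning
    if no think block is present."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

raw = "<think>The user asked for 2+2, which is 4.</think>4"
reasoning, answer = split_reasoning(raw)
# reasoning -> "The user asked for 2+2, which is 4.", answer -> "4"
```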

Note on model size: This quant is ~38 GB (vs ~23 GB for the 4.5 Opus variant) because attention layers are kept in BF16 following the Qwen official recipe. This preserves attention quality at the cost of higher VRAM. On 4x RTX 3060 (48 GB), context length may need to be reduced compared to the fully-quantized version.

SGLang

python -m sglang.launch_server \
    --model-path raydelossantos/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-GPTQ-Int4 \
    --quantization gptq \
    --tp 4 \
    --dtype float16 \
    --context-length 8192 \
    --trust-remote-code

Note: SGLang requires transformers==4.57.1 for compatibility with SGLang 0.5.9. The model_type may need patching from qwen3_5 to match SGLang's internal config.

Hardware Requirements

Setup               VRAM   Context  Notes
4x RTX 3060 (TP=4)  48 GB  2-4K     Tight — model weights ~9.5 GiB/GPU
4x RTX 3090 (TP=4)  96 GB  32K+     Comfortable
1x A6000 48GB       48 GB  8K       Single GPU
1x A100 80GB        80 GB  64K+     Best single-GPU option

System RAM: 32+ GB recommended (16 GB + 32 GB swap works with vLLM)
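The "Tight" entry in the table follows from simple arithmetic: under tensor parallelism the quantized weights are sharded roughly evenly across ranks, which is where the ~9.5 GB-per-GPU figure comes from.

```python
# Back-of-envelope per-GPU weight footprint under tensor parallelism.
model_size_gb = 38   # quantized checkpoint size from this card
tp = 4               # tensor-parallel degree (4x RTX 3060)

per_gpu_gb = model_size_gb / tp
print(f"~{per_gpu_gb:.1f} GB weights per GPU")   # ~9.5 GB

# On a 12 GB card that leaves only ~2.5 GB for KV cache, activations,
# and CUDA/runtime overhead -- hence the 2-4K context ceiling above.
headroom_gb = 12 - per_gpu_gb
```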

Model Architecture

  • Type: Qwen3.5 dense (not MoE)
  • Parameters: 40B
  • Layers: 96 (expanded from 27B/64 layers)
  • Attention: Hybrid — 72 linear attention (Gated DeltaNet) + 24 full attention (3:1 ratio)
  • Attention heads: 24 (4 KV heads, GQA) — TP must divide both (TP=1,2,4)
  • Head dim: 256
  • Vocabulary: 248,320 tokens
  • Context: Up to 262K tokens (model native), limited by available KV cache memory
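A rough per-token KV-cache estimate follows from the numbers above, under the assumption that only the 24 full-attention layers store per-token KV (the Gated DeltaNet layers keep a fixed-size recurrent state instead; how a given serving stack accounts for that state is engine-specific).

```python
# Per-token KV-cache cost for the 24 full-attention layers,
# assuming an FP16 cache and GQA with 4 KV heads of dim 256.
full_attn_layers = 24
kv_heads = 4
head_dim = 256
bytes_per_elem = 2   # FP16
kv_factor = 2        # one K and one V tensor per layer

bytes_per_token = full_attn_layers * kv_heads * head_dim * bytes_per_elem * kv_factor
print(bytes_per_token)   # 98304 bytes = 96 KiB per token

# A 32K-token context then needs about 32768 * 96 KiB = 3 GiB of KV
# cache (before fragmentation), split across TP ranks.
```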

SHA256 Checksums

2adbaba0282af81fc3dfdc49e8a4077439c72f7a2d2768003c9430ca390e8579  model-00001-of-00010.safetensors
794ce95b633de3e2aa9761254568f639f99e243ab30dcc406b0f0efd01b174c4  model-00002-of-00010.safetensors
8dc6e58d8c27b7469ba8caf7213f52bdcdde0ad7634f725d29c934367d7ea434  model-00003-of-00010.safetensors
3164ac58be9facf281bf935ae48ca4bde48e7acb948fb38b8f0e70a7c3d1a1ef  model-00004-of-00010.safetensors
9906539a4b5093fb5ba7dcc869fbcfbfd54f654d9d4c9816bdbbb61c01ae1409  model-00005-of-00010.safetensors
c1da55b1fe156e2d7210fac427756b1b4dde2a0619c1aef1916a57bbf8602917  model-00006-of-00010.safetensors
c9de76d065c2d13133986875b648ae44ecc680b7b69d64f7eaa8bd4c5acf6594  model-00007-of-00010.safetensors
3a4310b5cc02c1d77503b12d4f37d7b3404422cc3007492ccbcdc1380f2d205c  model-00008-of-00010.safetensors
61d27d6255105e34c4ea6a0f0ae6be8c2c3210ee2764f0058112c7d050bc9fdd  model-00009-of-00010.safetensors
1e89195bf5c4551e62f65d1c4e301d630013e910920326f4716638c27c5e2c54  model-00010-of-00010.safetensors
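The listing above uses the standard `sha256sum` format (hex digest, whitespace, filename). On systems without `sha256sum`, a small Python helper can verify downloaded shards against it; this is a generic sketch, not part of any tooling shipped with the model.

```python
import hashlib
from pathlib import Path

def verify_checksums(checksum_file: str, root: str = ".") -> bool:
    """Verify files under `root` against a sha256sum-style listing
    like the one above. Prints any mismatches; returns True if all
    files match."""
    ok = True
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(None, 1)
        digest = hashlib.sha256(Path(root, name.strip()).read_bytes()).hexdigest()
        if digest != expected:
            print(f"FAILED: {name.strip()}")
            ok = False
    return ok
```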
