# Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored GPTQ Int4

GPTQ INT4 quantization of DavidAU/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking.
## Original Model
- Base Architecture: Qwen3.5 dense (40B parameters, 96 layers)
- Expanded from: Qwen3.5-27B (64 layers → 96 layers for enhanced reasoning)
- Hybrid attention: Linear attention (Gated DeltaNet) + full attention layers
- Fine-tuned on: Claude 4.6 Opus Deckard-Heretic uncensored thinking data
- Features: Deep reasoning, thinking mode, tool calling support, uncensored
- Original Size: ~80 GB (BF16)
## Quantization Details
- Method: GPTQ via GPTQModel v5.8.0
- Settings: Matching Qwen official GPTQ-Int4 recipe
- Bits: 4
- Group size: 128
- Symmetric: True
- Desc act: False
- True sequential: True
- Damp percent: 0.01
- Calibration: 256 samples from allenai/c4
- Dynamic exclusions (BF16): Matching Qwen official mixed-precision strategy — only MLP layers quantized to Int4:
  - `lm_head` — output head (BF16)
  - `model.language_model.embed_tokens` — input embeddings (BF16)
  - `.*attn.*` — all attention layers, both linear and full (BF16)
  - `.*mtp.*` — multi-token prediction layers (BF16)
  - `.*visual.*` — vision encoder modules (BF16)
- Quantized on: NVIDIA A100 80GB PCIe (RunPod)
- Quantized model size: 38 GB (10 safetensors shards)
- Quantization time: ~38 minutes on A100 80GB
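The settings above correspond roughly to a GPTQModel quantization script like the sketch below. This is illustrative only: the field names follow GPTQModel's `QuantizeConfig`, but the exact `dynamic` exclusion syntax varies across GPTQModel versions, so treat it as an assumption and check the version you run.

```python
# Illustrative sketch of the recipe above (GPTQModel-style API; not verbatim).
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    sym=True,
    desc_act=False,
    true_sequential=True,
    damp_percent=0.01,
    # Assumed syntax: "-:" prefixed regexes skip quantization, keeping BF16.
    dynamic={
        r"-:lm_head": {},
        r"-:model\.language_model\.embed_tokens": {},
        r"-:.*attn.*": {},
        r"-:.*mtp.*": {},
        r"-:.*visual.*": {},
    },
)

calibration_dataset = ...  # placeholder: 256 samples drawn from allenai/c4

model = GPTQModel.load(
    "DavidAU/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking",
    quant_config,
)
model.quantize(calibration_dataset)
model.save("./quantized-output")
```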
## Config Format

This model uses the nested Qwen3.5 config format (matching official Qwen models):
- Top-level: `model_type: "qwen3_5"`, `architectures: ["Qwen3_5ForConditionalGeneration"]`
- Inner: `text_config` with `model_type: "qwen3_5_text"`
- Weight keys use the `language_model.model.layers.*` prefix (Qwen3.5 standard)
- Includes `preprocessor_config.json` for compatibility
Compatible with vLLM and SGLang out of the box.
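The nesting matters for any downstream code that reads `config.json` directly: text-model settings live one level down under `text_config`, not at the top level. A minimal sketch with stand-in values:

```python
# Stand-in for the nested config.json described above (only the relevant keys).
config = {
    "model_type": "qwen3_5",
    "architectures": ["Qwen3_5ForConditionalGeneration"],
    "text_config": {
        "model_type": "qwen3_5_text",
    },
}

# Loaders must look one level down for the text-model settings:
inner = config["text_config"]
print(inner["model_type"])  # qwen3_5_text
```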
## Serving

### vLLM (tested and recommended)
Tested on 4x RTX 3060 (12GB each, TP=4) with vLLM 0.18.0:

```bash
vllm serve raydelossantos/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-GPTQ-Int4 \
  --quantization gptq \
  --tensor-parallel-size 4 \
  --dtype float16 \
  --max-model-len 4096 \
  --enforce-eager \
  --trust-remote-code \
  --served-model-name qwen3.5-40b-claude \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice
```
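Once up, the server exposes vLLM's OpenAI-compatible API (default port 8000). A minimal request body, assuming the `--served-model-name` above:

```python
import json

# Request body for POST http://<host>:8000/v1/chat/completions
# (vLLM's OpenAI-compatible endpoint; model name matches --served-model-name).
payload = {
    "model": "qwen3.5-40b-claude",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
print(json.dumps(payload))
```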
Tested package versions (working as of 2026-03-20):
| Package | Version | Notes |
|---|---|---|
| vllm | 0.18.0 | Stable release |
| transformers | 5.3.0 | Required for qwen3_5 model_type support |
| torch | 2.10.0 | CUDA 12.8 |
| huggingface_hub | 1.7.2 | |
| flash-attn | 2.8.3 | Pre-built for cu128/torch2.10/sm80_86_90 |
Important notes:
- `--dtype float16` is required (the GPTQ Exllama kernel needs FP16, not BF16)
- `--enforce-eager` is recommended for stability on consumer GPUs (disables CUDA graphs)
- `--quantization gptq` forces the slower but more compatible GPTQ kernel. Omit it to use `gptq_marlin` for faster inference (vLLM auto-detects)
- Reasoning output uses `<think>...</think>` tags (qwen3 parser)
- Tool calls use the Qwen3 XML format (`--tool-call-parser qwen3_xml`)
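If you consume raw completions without `--reasoning-parser` (or from a client that ignores the parsed reasoning field), the `<think>...</think>` block can be split out manually. A minimal sketch of the tag format described above:

```python
import re

def split_reasoning(text: str):
    """Split Qwen3-style <think>...</think> reasoning from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = (text[: m.start()] + text[m.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```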
Note on model size: This quant is ~38 GB (vs ~23 GB for the 4.5 Opus variant) because attention layers are kept in BF16 following the Qwen official recipe. This preserves attention quality at the cost of higher VRAM. On 4x RTX 3060 (48 GB), context length may need to be reduced compared to the fully-quantized version.
### SGLang

```bash
python -m sglang.launch_server \
  --model-path raydelossantos/Qwen3.5-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-GPTQ-Int4 \
  --quantization gptq \
  --tp 4 \
  --dtype float16 \
  --context-length 8192 \
  --trust-remote-code
```
Note: SGLang requires `transformers==4.57.1` for compatibility with SGLang 0.5.9. The `model_type` may need patching from `qwen3_5` to match SGLang's internal config.
## Hardware Requirements
| Setup | VRAM | Context | Notes |
|---|---|---|---|
| 4x RTX 3060 (TP=4) | 48 GB | 2-4K | Tight — model weights ~9.5 GiB/GPU |
| 4x RTX 3090 (TP=4) | 96 GB | 32K+ | Comfortable |
| 1x A6000 48GB | 48 GB | 8K | Single GPU |
| 1x A100 80GB | 80 GB | 64K+ | Best single-GPU option |
System RAM: 32+ GB recommended (16 GB + 32 GB swap works with vLLM)
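The per-GPU weight numbers in the table follow from dividing the ~38 GB quantized size across tensor-parallel ranks; KV cache, activations, and CUDA context come on top of this:

```python
# Back-of-envelope per-GPU weight footprint under tensor parallelism.
weights_gb = 38  # quantized model size from above
for tp in (1, 2, 4):
    print(f"TP={tp}: ~{weights_gb / tp:.1f} GB of weights per GPU")
# TP=4 gives ~9.5 GB/GPU, matching the tight fit on 12 GB RTX 3060s.
```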
## Model Architecture
- Type: Qwen3.5 dense (not MoE)
- Parameters: 40B
- Layers: 96 (expanded from 27B/64 layers)
- Attention: Hybrid — 72 linear attention (Gated DeltaNet) + 24 full attention (3:1 ratio)
- Attention heads: 24 (4 KV heads, GQA) — TP must divide both (TP=1,2,4)
- Head dim: 256
- Vocabulary: 248,320 tokens
- Context: Up to 262K tokens (model native), limited by available KV cache memory
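Because only the 24 full-attention layers grow a per-token KV cache (the Gated DeltaNet layers keep a fixed-size recurrent state), the cache cost of a given context length can be estimated as follows. This is a sketch under the assumption of an FP16, unquantized KV cache:

```python
# Rough FP16 KV-cache cost, counting only the 24 full-attention layers.
full_attn_layers = 24
kv_heads = 4        # GQA
head_dim = 256
bytes_fp16 = 2
per_token = 2 * full_attn_layers * kv_heads * head_dim * bytes_fp16  # K and V
print(per_token)                 # 98304 bytes (~96 KiB) per token
print(per_token * 4096 / 2**20)  # 384.0 MiB for a 4K context
```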
## SHA256 Checksums

```text
2adbaba0282af81fc3dfdc49e8a4077439c72f7a2d2768003c9430ca390e8579 model-00001-of-00010.safetensors
794ce95b633de3e2aa9761254568f639f99e243ab30dcc406b0f0efd01b174c4 model-00002-of-00010.safetensors
8dc6e58d8c27b7469ba8caf7213f52bdcdde0ad7634f725d29c934367d7ea434 model-00003-of-00010.safetensors
3164ac58be9facf281bf935ae48ca4bde48e7acb948fb38b8f0e70a7c3d1a1ef model-00004-of-00010.safetensors
9906539a4b5093fb5ba7dcc869fbcfbfd54f654d9d4c9816bdbbb61c01ae1409 model-00005-of-00010.safetensors
c1da55b1fe156e2d7210fac427756b1b4dde2a0619c1aef1916a57bbf8602917 model-00006-of-00010.safetensors
c9de76d065c2d13133986875b648ae44ecc680b7b69d64f7eaa8bd4c5acf6594 model-00007-of-00010.safetensors
3a4310b5cc02c1d77503b12d4f37d7b3404422cc3007492ccbcdc1380f2d205c model-00008-of-00010.safetensors
61d27d6255105e34c4ea6a0f0ae6be8c2c3210ee2764f0058112c7d050bc9fdd model-00009-of-00010.safetensors
1e89195bf5c4551e62f65d1c4e301d630013e910920326f4716638c27c5e2c54 model-00010-of-00010.safetensors
```
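To verify downloaded shards, `sha256sum -c` works if you save the list above to a file. For multi-gigabyte shards, a streaming check avoids loading whole files into memory; a minimal sketch:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (shards are multi-GB)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

# Usage: compare against a published digest, e.g.
# sha256_of("model-00001-of-00010.safetensors") == "2adbaba0282af81f..."
```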
## Acknowledgments
- Original model by DavidAU
- Base architecture by Qwen Team
- Quantization recipe based on Qwen official GPTQ methodology
- Quantized using GPTQModel v5.8.0 by ModelCloud
- Infrastructure: RunPod (A100 80GB PCIe)