Qwen3.5-35B-A3B GPTQ 8-bit

GPTQ 8-bit quantization of Qwen/Qwen3.5-35B-A3B, a 35B-parameter Mixture-of-Experts (MoE) multimodal model with 3B activated parameters per token.

Includes the full vision encoder and the MTP (Multi-Token Prediction) module, enabling image understanding and speculative decoding.

Model Overview

  • Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
  • Total parameters: ~35B
  • Activated parameters: ~3B per token (8 of 256 experts selected per token)
  • Layers: 40 (30 linear attention + 10 full attention, repeating 3:1 pattern)
  • Experts: 256 per layer + 1 shared expert per layer
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
  • MTP module: 1-layer speculative decoding head, BF16
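The 8-of-256 routing above can be pictured with a minimal top-k gating sketch (illustrative only; the actual router uses learned gating weights and Qwen3.5's own normalization):

```python
import numpy as np

NUM_EXPERTS = 256   # routed experts per MoE layer
TOP_K = 8           # experts activated per token

def route(router_logits: np.ndarray):
    """Pick the top-k experts per token and softmax their scores."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    topk_idx = np.argpartition(router_logits, -TOP_K, axis=-1)[..., -TOP_K:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    # Softmax over just the selected experts gives the mixing weights.
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    return topk_idx, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, NUM_EXPERTS))   # router logits for 4 tokens
idx, weights = route(logits)
print(idx.shape, weights.shape)  # (4, 8) (4, 8)
```

Each token thus touches only 8 expert MLPs plus the shared expert, which is why only ~3B of the 35B parameters are active per token.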

Quantization Details

All 30,720 MoE expert modules (256 experts x 3 projections x 40 layers) are quantized to INT8 using GPTQ. Non-expert modules (including the full vision encoder and MTP module) remain at BF16/FP16 for quality preservation.

| Component | Precision | Notes |
|---|---|---|
| MoE experts (`gate_proj`, `up_proj`, `down_proj`) | INT8 (GPTQ) | 30,720 modules quantized |
| Full attention (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | FP16 | Every 4th layer |
| Linear attention (`in_proj_qkv`, `in_proj_z`, `out_proj`) | FP16 | Full precision |
| Shared experts | FP16 | Full precision |
| Vision encoder (`model.visual.*`) | BF16 | 333 tensors, full precision |
| MTP module (`mtp.*`) | BF16 | 785 tensors, full precision |
| Embeddings, LM head, norms | FP16 | Full precision |

GPTQ configuration:

  • Bits: 8
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • Failsafe: RTN for poorly-calibrated rare experts (1,344 of 30,720 modules, ~4.4%)
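For reference, the settings above correspond to a quantization-config block like the following (a sketch using common GPTQ checkpoint field names, not a verbatim copy of this repository's config.json):

```python
import json

# Sketch of the quantization settings listed above, expressed in the
# field names commonly used in GPTQ checkpoint configs. Illustrative only.
quantization_config = {
    "quant_method": "gptq",
    "bits": 8,
    "group_size": 32,
    "sym": True,           # symmetric quantization
    "desc_act": False,     # no activation-order reordering
    "true_sequential": True,
}
print(json.dumps(quantization_config, indent=2))
```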

Calibration

  • Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
  • Samples: 2,048
  • Quantizer: GPTQModel v5.7.1

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | 67 GB | - |
| GPTQ 8-bit | 40 GB | 1.7x |
| GPTQ 4-bit | 25 GB | 2.7x |
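The compression factors follow directly from the file sizes:

```python
# Compression ratio relative to the BF16 original.
sizes_gb = {"BF16": 67, "GPTQ 8-bit": 40, "GPTQ 4-bit": 25}

for name, size in sizes_gb.items():
    ratio = sizes_gb["BF16"] / size
    print(f"{name}: {size} GB -> {ratio:.1f}x")
# BF16: 67 GB -> 1.0x
# GPTQ 8-bit: 40 GB -> 1.7x
# GPTQ 4-bit: 25 GB -> 2.7x
```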

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 6.0695 | - |
| GPTQ 8-bit | 6.0748 | +0.09% |
| GPTQ 4-bit | 6.1260 | +0.93% |
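The stride-512 evaluation slides a 2048-token window across the text so that each token is scored exactly once; a minimal sketch of the window placement (window logic only — the actual harness also runs the model and masks the overlapping prefix out of the loss):

```python
# Sliding-window placement for perplexity evaluation with
# seq_len=2048 and stride=512. Illustrative, not the full harness.
SEQ_LEN, STRIDE = 2048, 512

def windows(n_tokens: int):
    """Yield (start, end, n_scored) for each evaluation window."""
    prev_end = 0
    for start in range(0, n_tokens, STRIDE):
        end = min(start + SEQ_LEN, n_tokens)
        n_scored = end - prev_end   # only tokens not already scored count
        yield start, end, n_scored
        prev_end = end
        if end == n_tokens:
            break

total = sum(n for _, _, n in windows(10_000))
print(total)  # 10000 -- every token is scored exactly once
```

The overlap between consecutive windows gives each scored token up to 1,536 tokens of left context, which is why strided perplexity is lower than evaluating disjoint 2048-token chunks.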

Usage

vLLM (Recommended for Serving)

```shell
vllm serve btbtyler09/Qwen3.5-35B-A3B-GPTQ-8bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 256000 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'
```
| Parameter | Description |
|---|---|
| `--gpu-memory-utilization 0.95` | Use 95% of GPU VRAM for weights + KV cache |
| `--max-model-len 256000` | Near-full context window (model maximum is 262,144 tokens) |
| `--tensor-parallel-size 4` | Shard across 4 GPUs (adjust to your setup) |
| `--reasoning-parser qwen3` | Enable thinking/reasoning token parsing |
| `--enable-auto-tool-choice --tool-call-parser qwen3_coder` | Enable tool/function calling |
| `--dtype float16` | Run in FP16 (required for ROCm GPTQ kernels) |
| `--skip-mm-profiling` | Skip multimodal memory profiling at startup |
| `--limit-mm-per-prompt '{"image": 2}'` | Allow up to 2 images per request |

vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in Qwen3_5MoeTextConfig where ignore_keys_at_rope_validation is defined as a list instead of a set, causing a TypeError during config parsing. Apply this fix before serving:

```shell
python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"
```

Vision Example (via OpenAI API)

```python
import base64, requests

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.5-35B-A3B-GPTQ-8bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])
```

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's Qwen3_5MoeGPTQ class expects the text-only weight prefix (model.layers.*) and does not support the multimodal architecture (model.language_model.layers.*). The transformers GPTQ path delegates to optimum, which does not handle the fused-expert architecture. Use vLLM for inference.
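The prefix mismatch amounts to a state-dict key rename; a hypothetical compatibility shim would have to rewrite keys like this (illustrative only — even with renamed keys, loading still fails on the fused-expert layout, so vLLM remains the supported path):

```python
# Hypothetical key remap between the multimodal checkpoint layout
# ("model.language_model.layers.*") and the text-only layout
# ("model.layers.*") that GPTQModel's Qwen3_5MoeGPTQ class expects.
def remap_key(key: str) -> str:
    prefix = "model.language_model.layers."
    if key.startswith(prefix):
        return "model.layers." + key[len(prefix):]
    return key   # vision / MTP / other keys pass through unchanged

print(remap_key("model.language_model.layers.0.mlp.experts.3.gate_proj.qweight"))
# model.layers.0.mlp.experts.3.gate_proj.qweight
```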

Technical Notes

Qwen3.5-35B-A3B stores MoE expert weights as fused 3D nn.Parameter tensors rather than individual nn.Linear modules. During quantization, GPTQModel's MODULE_CONVERTER_MAP converts these to individual quantizable nn.Linear layers. This same conversion must also run during model loading for the quantized kernels to be applied correctly.
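Conceptually, the conversion splits each fused 3D parameter of shape [num_experts, out_features, in_features] into one 2D weight per expert (toy sizes below; GPTQModel's actual converter additionally wraps each slice in an nn.Linear module):

```python
import numpy as np

# Toy-sized sketch of the fused-expert split. The real tensors are
# [256, out_features, in_features] per projection per layer; the logic
# is the same: one 2D weight matrix per expert, each independently
# quantizable as an ordinary linear layer.
num_experts, out_features, in_features = 4, 8, 16   # toy sizes
rng = np.random.default_rng(0)
fused = rng.normal(size=(num_experts, out_features, in_features)).astype(np.float32)

per_expert = [fused[i] for i in range(num_experts)]  # one [out, in] matrix each
print(len(per_expert), per_expert[0].shape)  # 4 (8, 16)
```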

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text model's MoE expert weights are quantized.

Credits

  • Base Model: Qwen - Qwen3.5-35B-A3B
  • Quantization: GPTQ via GPTQModel v5.7.1
  • Expert Converter: convert_qwen3_5_moe_expert_converter for fused 3D expert weights
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
