Qwen3.5-35B-A3B GPTQ 8-bit

GPTQ 8-bit quantization of Qwen/Qwen3.5-35B-A3B, a 35B-parameter Mixture-of-Experts (MoE) multimodal model with 3B activated parameters per token.

Includes the full vision encoder and the MTP (Multi-Token Prediction) module, enabling image understanding and speculative decoding.

Model Overview

  • Architecture: Qwen3_5MoeForConditionalGeneration (multimodal: text + vision)
  • Total parameters: ~35B
  • Activated parameters: ~3B per token (8 of 256 experts selected per token)
  • Layers: 40 (30 linear attention + 10 full attention, repeating 3:1 pattern)
  • Experts: 256 per layer + 1 shared expert per layer
  • Context length: 262,144 tokens
  • Vision encoder: 27-block ViT (1152 hidden, 16x16 patches), BF16
  • MTP module: 1-layer speculative decoding head, BF16
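The 8-of-256 routing above can be pictured with a minimal top-k gating sketch (illustrative only; the actual router uses learned gating weights and Qwen3.5's own normalization):

```python
import numpy as np

NUM_EXPERTS = 256   # routed experts per MoE layer
TOP_K = 8           # experts activated per token

def route(router_logits: np.ndarray):
    """Pick the top-k experts per token and softmax their scores."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    topk_idx = np.argpartition(router_logits, -TOP_K, axis=-1)[..., -TOP_K:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    # Softmax over just the selected experts gives the mixing weights.
    w = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    return topk_idx, weights

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, NUM_EXPERTS))   # router logits for 4 tokens
idx, weights = route(logits)
print(idx.shape, weights.shape)  # (4, 8) (4, 8)
```

Each token thus touches only 8 expert MLPs plus the shared expert, which is why only ~3B of the 35B parameters are active per token.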

Quantization Details

All 30,720 MoE expert modules (256 experts x 3 projections x 40 layers) are quantized to INT8 using GPTQ. Non-expert modules (including the full vision encoder and MTP module) remain at BF16/FP16 for quality preservation.

| Component | Precision | Notes |
|---|---|---|
| MoE experts (`gate_proj`, `up_proj`, `down_proj`) | INT8 (GPTQ) | 30,720 modules quantized |
| Full attention (`q_proj`, `k_proj`, `v_proj`, `o_proj`) | FP16 | Every 4th layer |
| Linear attention (`in_proj_qkv`, `in_proj_z`, `out_proj`) | FP16 | Full precision |
| Shared experts | FP16 | Full precision |
| Vision encoder (`model.visual.*`) | BF16 | 333 tensors, full precision |
| MTP module (`mtp.*`) | BF16 | 785 tensors, full precision |
| Embeddings, LM head, norms | FP16 | Full precision |

GPTQ configuration:

  • Bits: 8
  • Group size: 32
  • Symmetric: Yes
  • desc_act: No
  • true_sequential: Yes
  • act_group_aware: Yes
  • Failsafe: RTN for poorly-calibrated rare experts (1,344 of 30,720 modules, ~4.4%)
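For reference, the settings above correspond to a quantization-config block like the following (a sketch using common GPTQ checkpoint field names, not a verbatim copy of this repository's config.json):

```python
import json

# Sketch of the quantization settings listed above, expressed in the
# field names commonly used in GPTQ checkpoint configs. Illustrative only.
quantization_config = {
    "quant_method": "gptq",
    "bits": 8,
    "group_size": 32,
    "sym": True,           # symmetric quantization
    "desc_act": False,     # no activation-order reordering
    "true_sequential": True,
}
print(json.dumps(quantization_config, indent=2))
```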

Calibration

  • Dataset: Mixed - evol-codealpaca-v1 (code) + C4 (general text)
  • Samples: 2,048
  • Quantizer: GPTQModel v5.7.1

Model Size

| Version | Size | Compression |
|---|---|---|
| BF16 (original) | 67 GB | - |
| GPTQ 8-bit | 40 GB | 1.7x |
| GPTQ 4-bit | 25 GB | 2.7x |
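The compression factors follow directly from the file sizes:

```python
# Compression ratio relative to the BF16 original.
sizes_gb = {"BF16": 67, "GPTQ 8-bit": 40, "GPTQ 4-bit": 25}

for name, size in sizes_gb.items():
    ratio = sizes_gb["BF16"] / size
    print(f"{name}: {size} GB -> {ratio:.1f}x")
# BF16: 67 GB -> 1.0x
# GPTQ 8-bit: 40 GB -> 1.7x
# GPTQ 4-bit: 25 GB -> 2.7x
```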

Perplexity

Evaluated on wikitext-2-raw-v1 (test set), seq_len=2048, stride=512:

| Model | Perplexity | Degradation |
|---|---|---|
| BF16 (original) | 6.0695 | - |
| GPTQ 8-bit | 6.0748 | +0.09% |
| GPTQ 4-bit | 6.1260 | +0.93% |
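The stride-512 evaluation slides a 2048-token window across the text so that each token is scored exactly once; a minimal sketch of the window placement (window logic only — the actual harness also runs the model and masks the overlapping prefix out of the loss):

```python
# Sliding-window placement for perplexity evaluation with
# seq_len=2048 and stride=512. Illustrative, not the full harness.
SEQ_LEN, STRIDE = 2048, 512

def windows(n_tokens: int):
    """Yield (start, end, n_scored) for each evaluation window."""
    prev_end = 0
    for start in range(0, n_tokens, STRIDE):
        end = min(start + SEQ_LEN, n_tokens)
        n_scored = end - prev_end   # only tokens not already scored count
        yield start, end, n_scored
        prev_end = end
        if end == n_tokens:
            break

total = sum(n for _, _, n in windows(10_000))
print(total)  # 10000 -- every token is scored exactly once
```

The overlap between consecutive windows gives each scored token up to 1,536 tokens of left context, which is why strided perplexity is lower than evaluating disjoint 2048-token chunks.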

Usage

vLLM (Recommended for Serving)

```shell
vllm serve btbtyler09/Qwen3.5-35B-A3B-GPTQ-8bit \
  --gpu-memory-utilization 0.95 \
  --max-model-len 256000 \
  --tensor-parallel-size 4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --dtype float16 \
  --skip-mm-profiling \
  --limit-mm-per-prompt '{"image": 2}'
```
| Parameter | Description |
|---|---|
| `--gpu-memory-utilization 0.95` | Use 95% of GPU VRAM for weights + KV cache |
| `--max-model-len 256000` | Near-full context window (model maximum is 262,144 tokens) |
| `--tensor-parallel-size 4` | Shard across 4 GPUs (adjust to your setup) |
| `--reasoning-parser qwen3` | Enable thinking/reasoning token parsing |
| `--enable-auto-tool-choice --tool-call-parser qwen3_coder` | Enable tool/function calling |
| `--dtype float16` | Run in FP16 (required for ROCm GPTQ kernels) |
| `--skip-mm-profiling` | Skip multimodal memory profiling at startup |
| `--limit-mm-per-prompt '{"image": 2}'` | Allow up to 2 images per request |

vLLM bug workaround: vLLM versions up to at least 0.15.2 have a bug in Qwen3_5MoeTextConfig where ignore_keys_at_rope_validation is defined as a list instead of a set, causing a TypeError during config parsing. Apply this fix before serving:

```shell
python3 -c "
for f in [
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5_moe.py',
    '/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/configs/qwen3_5.py',
]:
    t = open(f).read()
    t = t.replace(
        'ignore_keys_at_rope_validation\"] = [\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        ]',
        'ignore_keys_at_rope_validation\"] = {\n            \"mrope_section\",\n            \"mrope_interleaved\",\n        }')
    open(f,'w').write(t)
    print('Patched', f)
"
```

Vision Example (via OpenAI API)

```python
import base64, requests

with open("image.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "btbtyler09/Qwen3.5-35B-A3B-GPTQ-8bit",
    "messages": [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        {"type": "text", "text": "Describe what you see in this image."},
    ]}],
    "max_tokens": 1024,
})
print(response.json()["choices"][0]["message"]["content"])
```

GPTQModel / transformers

Note: Neither GPTQModel nor transformers can currently load this model directly. GPTQModel's Qwen3_5MoeGPTQ class expects the text-only weight prefix (model.layers.*) and does not support the multimodal architecture (model.language_model.layers.*). The transformers GPTQ path delegates to optimum, which does not handle the fused-expert architecture. Use vLLM for inference.
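The prefix mismatch amounts to a state-dict key rename; a hypothetical compatibility shim would have to rewrite keys like this (illustrative only — even with renamed keys, loading still fails on the fused-expert layout, so vLLM remains the supported path):

```python
# Hypothetical key remap between the multimodal checkpoint layout
# ("model.language_model.layers.*") and the text-only layout
# ("model.layers.*") that GPTQModel's Qwen3_5MoeGPTQ class expects.
def remap_key(key: str) -> str:
    prefix = "model.language_model.layers."
    if key.startswith(prefix):
        return "model.layers." + key[len(prefix):]
    return key   # vision / MTP / other keys pass through unchanged

print(remap_key("model.language_model.layers.0.mlp.experts.3.gate_proj.qweight"))
# model.layers.0.mlp.experts.3.gate_proj.qweight
```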

Technical Notes

Qwen3.5-35B-A3B stores MoE expert weights as fused 3D nn.Parameter tensors rather than individual nn.Linear modules. During quantization, GPTQModel's MODULE_CONVERTER_MAP converts these to individual quantizable nn.Linear layers. This same conversion must also run during model loading for the quantized kernels to be applied correctly.
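Conceptually, the conversion splits each fused 3D parameter of shape [num_experts, out_features, in_features] into one 2D weight per expert (toy sizes below; GPTQModel's actual converter additionally wraps each slice in an nn.Linear module):

```python
import numpy as np

# Toy-sized sketch of the fused-expert split. The real tensors are
# [256, out_features, in_features] per projection per layer; the logic
# is the same: one 2D weight matrix per expert, each independently
# quantizable as an ordinary linear layer.
num_experts, out_features, in_features = 4, 8, 16   # toy sizes
rng = np.random.default_rng(0)
fused = rng.normal(size=(num_experts, out_features, in_features)).astype(np.float32)

per_expert = [fused[i] for i in range(num_experts)]  # one [out, in] matrix each
print(len(per_expert), per_expert[0].shape)  # 4 (8, 16)
```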

The vision encoder (27-block ViT) and MTP speculative decoding module are preserved at full BF16 precision from the original model. Only the text model's MoE expert weights are quantized.

Credits

  • Base Model: Qwen - Qwen3.5-35B-A3B
  • Quantization: GPTQ via GPTQModel v5.7.1
  • Expert Converter: convert_qwen3_5_moe_expert_converter for fused 3D expert weights
  • Quantized by: btbtyler09

License

This model inherits the Apache 2.0 license from the base model.
