Qwen3.6-35B-A3B-Quark-W8A8-INT8

W8A8 INT8 quantized version of Qwen/Qwen3.6-35B-A3B produced with AMD Quark.

Model Details

| Field | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3_5MoeForConditionalGeneration (multimodal: ViT vision + text MoE + MTP head) |
| Parameters | 35B total / 3B activated per token (256 experts, top-8) + 27-block ViT (BF16) |
| Quantization | W8A8 INT8 — per-channel weight + per-token dynamic activation |
| Quantizer | AMD Quark 0.11.1 (pack_method='order', weight_format='real_quantized') |
| Model Size | ~35 GB (7 shards of ~5 GB) |
| Original Size | ~67 GB (BF16, 26 shards) |
| Compression | ~1.93× size reduction |

Quantization Scheme

| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Language attention (q/k/v/o_proj, linear_attn.*) | INT8 | per-channel weight (axis=0) | weight static |
| Language MoE experts (256 × gate/up/down_proj × 40) | INT8 | per-channel weight (axis=0) | weight static |
| shared_expert (gate/up/down_proj) | INT8 | per-channel weight (axis=0) | weight static |
| All activations above | INT8 | per-token (axis=1) | dynamic |
| lm_head | BF16 | | unquantized |
| embed_tokens | BF16 | | unquantized |
| MoE router (mlp.gate) — top-k gate | BF16 | | unquantized |
| shared_expert_gate | BF16 | | unquantized |
| visual.* (27-block ViT + merger) | BF16 | | unquantized |
| MTP head | BF16 | | unquantized |

Note: MoE experts are stored as 256 per-expert nn.Linear triplets (gate_proj/up_proj/down_proj) instead of the upstream fused gate_up_proj tensor. This is required so that Quark observers can attach to each expert as a standard nn.Linear, and the key layout matches vLLM's FusedMoE.make_expert_params_mapping exactly — no loader-side change needed.
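
For clarity, the per-channel-weight / per-token-activation scheme amounts to the following reference numerics. This is a plain-PyTorch emulation for illustration only, not the fused INT8 kernel vLLM actually runs:

```python
# Reference emulation of W8A8: symmetric per-channel INT8 weights (scale shape [out])
# and per-token dynamically quantized INT8 activations (one scale per token).
import torch

def w8a8_linear(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # x: [tokens, in] BF16/FP32 activations; w_int8: [out, in] INT8; w_scale: [out]
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0  # per-token scale
    x_int8 = torch.clamp((x / x_scale).round(), -128, 127)                # dynamic activation quant
    acc = x_int8.float() @ w_int8.float().t()                             # integer matmul emulated in fp32
    return acc * x_scale * w_scale                                        # dequant: per-token × per-channel
```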

Accuracy

Evaluated on the full GSM8K test split (1,319 questions), served under vLLM via /v1/chat/completions with chat_template_kwargs.enable_thinking=false, temperature=0, concurrency=16, max_tokens=1024.

| Model | Accuracy | Correct |
|---|---|---|
| Qwen/Qwen3.6-35B-A3B (BF16 baseline) | 95.91 % | 1265 / 1319 |
| This model (Quark W8A8 INT8) | 95.91 % | 1265 / 1319 |

Δ vs BF16 = 0.00 pp. The sets of correctly answered questions agree on 1250 of the 1280 questions that either model got right (Jaccard = 0.9766); each side wins 15 problems the other loses, so there is no systematic regression.

Both runs were done on a single AMD MI355X (288 GB HBM3e) at gpu_memory_utilization=0.55 (BF16) / 0.85 (INT8), max_model_len=4096.
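
For reference, the evaluation loop amounts to roughly the following sketch. The exact harness is not part of this repo; the request parameters mirror the ones listed above, while the answer-extraction regex and the dataset loading are assumptions about how the final number was scored:

```python
# Hedged sketch of the GSM8K scoring described above (sequential for brevity;
# the reported run used concurrency=16). `dataset` can be loaded e.g. via
# datasets.load_dataset("gsm8k", "main", split="test").
import re
import requests

URL = "http://localhost:8000/v1/chat/completions"

def last_number(text: str):
    """Return the last number in `text`, thousands separators stripped."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(dataset, model="Qwen3.6-35B-A3B-W8A8"):
    correct = 0
    for ex in dataset:
        reply = requests.post(URL, json={
            "model": model,
            "messages": [{"role": "user", "content": ex["question"]}],
            "max_tokens": 1024, "temperature": 0,
            "chat_template_kwargs": {"enable_thinking": False},
        }).json()["choices"][0]["message"]["content"]
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")  # GSM8K gold answer follows "####"
        pred = last_number(reply)
        correct += pred is not None and float(pred) == float(gold)
    return correct / len(dataset)
```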

Performance

Measured on a single AMD Radeon 8060S APU (gfx1151, "Strix Halo") with 128 GB LPDDR5X-8000 unified memory, container kyuz0/vllm-therock-gfx1151:stable (vLLM 0.19.2rc1.dev113+g6aa057c9d, transformers 5.5.4), TP=1, KV cache BF16 (gfx1151 has no INT8 matrix core).

Long context — input=4000 / output=200, num_prompts = C * 3

Served with --max-model-len 4096 --gpu-memory-utilization 0.85. The BF16 baseline is the upstream Qwen3.6-35B-A3B (~67 GB of weights).

| Concurrency | BF16 req/s | BF16 out tok/s | Quark W8A8 req/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|---|---|
| 1 | 0.044 | 8.83 | 0.060 | 12.02 | +36% |
| 5 | 0.093 | 18.58 | 0.142 | 28.31 | +52% |
| 10 | 0.128 | 25.58 | 0.186 | 37.30 | +46% |
| 20 | 0.163 | 32.53 | 0.240 | 47.98 | +48% |

Short context — input=512 / output=128, --ignore-eos, bs = num_prompts

Typical chat / decode-bound workload:

| Batch size | BF16 out tok/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|
| 1 | 13.36 | 17.43 | +30% |
| 8 | 36.47 | 64.91 | +78% |
| 16 | 61.16 | 92.04 | +50% |

Takeaways

  • Quark W8A8 beats BF16 at every concurrency we measured on gfx1151, by +30–78 %. The gfx1151 APU has no INT8 matrix core, so the gain comes from the ~2× smaller weight footprint cutting memory-bandwidth pressure (LPDDR5X is the dominant bottleneck on Strix Halo); see the back-of-envelope sketch after this list.
  • Decode-bound / short-context is where W8A8 shines the most: at 512 in / 128 out, bs=8 → +78 %. Prefill-heavy long contexts still benefit, just less dramatically.
  • Fits in unified memory with headroom: the packed INT8 model is ~35 GB vs ~67 GB BF16, so KV cache and weights no longer compete on a 128 GB Strix Halo box (the BF16 build hit a scheduler regression around C=100 where TTFT blew up to ~187 s — W8A8 avoids that class of pressure entirely).
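
To make the bandwidth argument concrete, here is a rough decode-side ceiling. The inputs are assumptions (≈3 B activated parameters per token, ≈256 GB/s peak bandwidth for a 256-bit LPDDR5X-8000 bus, no overlap or overhead modeled), so treat it as a sanity check rather than a prediction:

```python
# Back-of-envelope single-stream decode ceilings under a pure weight-traffic model.
# Real throughput is far below these because attention, KV traffic and CPU/launch
# overhead are ignored; the point is only that halving weight bytes moves the ceiling.
active_params = 3e9      # ~3B activated parameters per token (A3B), assumed
peak_bw = 256e9          # ~256 GB/s peak LPDDR5X-8000 on a 256-bit bus, assumed

bytes_per_token_bf16 = active_params * 2   # 2 bytes/param
bytes_per_token_int8 = active_params * 1   # 1 byte/param

print(peak_bw / bytes_per_token_bf16)  # ~43 tok/s ceiling for BF16
print(peak_bw / bytes_per_token_int8)  # ~85 tok/s ceiling for INT8
```

The measured bs=1 numbers (13.36 vs 17.43 tok/s) sit well below both ceilings, which is why the observed bs=1 gain is +30% rather than the theoretical 2×; at larger batch sizes the weight traffic is amortized differently and the gap widens.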

How to Use

With vLLM (Recommended)

```bash
vllm serve /path/to/Qwen3.6-35B-A3B-Quark-W8A8-INT8 \
    --served-model-name Qwen3.6-35B-A3B-W8A8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --port 8000
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3.6-35B-A3B-W8A8",
    "messages": [{"role":"user","content":"Solve: 16 - 3 - 4 = ?"}],
    "max_tokens": 256, "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

  • vLLM ≥ 0.19.2rc1 with the qwen3_5_moe registration is required.
  • The Qwen3.6 default chat template wraps the response in <think>...</think>; pass enable_thinking=false if you want the short form.
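
The same request can be made from Python with the OpenAI client (a minimal sketch, assuming the vLLM OpenAI-compatible endpoint started above; chat_template_kwargs is passed through extra_body):

```python
# Minimal OpenAI-client sketch against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B-W8A8",
    messages=[{"role": "user", "content": "Solve: 16 - 3 - 4 = ?"}],
    max_tokens=256,
    temperature=0.7,
    # vLLM forwards extra_body fields such as chat_template_kwargs to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```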

Hardware Requirements

  • Minimum VRAM: ~40 GB free for model weights + KV cache, i.e. a single MI300X / MI355X / H100-80G / A100-80G.
  • Can fit on a consumer-class 48 GB card (e.g. W7900D) at max_model_len ≤ 4096, whereas the BF16 original (~67 GB of weights) cannot.

Quantization Details

Excluded layers (kept BF16)

  • lm_head
  • model.language_model.layers.*.mlp.shared_expert_gate (40 × single-output gate)
  • model.visual.pos_embed, model.visual.blocks.*.attn.{qkv,proj}, model.visual.blocks.*.mlp.linear_fc{1,2}, model.visual.merger.linear_fc{1,2} (full 27-block ViT + merger)
  • model.embed_tokens (not an nn.Linear; naturally not touched)
  • MoE top-k router mlp.gate — kept BF16 via the custom MoE rewrite (see below)
  • MTP head — kept BF16

Pre-quantization rewrite

The upstream Qwen3_5MoeExperts module stores 256 experts as a single fused 3-D tensor (gate_up_proj: [E, 2·I, H], down_proj: [E, H, I]). Before quantization this is split in-place into ModuleList[256] of three nn.Linears per expert, following the SwiGLU chunk(2, dim=-1) semantics (front half = gate, back half = up). This makes every expert visible to Quark as a standard nn.Linear, and the resulting key layout is bit-compatible with vLLM's fused MoE loader.
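
A hedged sketch of that split is shown below. The shapes follow the text ([E, 2·I, H] and [E, H, I]); the function name and the way the result is attached back onto the upstream Qwen3_5MoeExperts module are illustrative assumptions, not the exact script used here:

```python
# Sketch: split fused MoE tensors into per-expert gate/up/down nn.Linear triplets.
import torch
from torch import nn

def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """gate_up_proj: [E, 2*I, H], down_proj: [E, H, I] -> ModuleList of E expert modules."""
    num_experts, two_inter, hidden = gate_up_proj.shape
    inter = two_inter // 2
    experts = nn.ModuleList()
    for e in range(num_experts):
        expert = nn.Module()
        expert.gate_proj = nn.Linear(hidden, inter, bias=False)
        expert.up_proj = nn.Linear(hidden, inter, bias=False)
        expert.down_proj = nn.Linear(inter, hidden, bias=False)
        # chunk(2, dim=-1) semantics on the fused output: front half = gate, back half = up
        expert.gate_proj.weight.data.copy_(gate_up_proj[e, :inter, :])
        expert.up_proj.weight.data.copy_(gate_up_proj[e, inter:, :])
        expert.down_proj.weight.data.copy_(down_proj[e])  # [H, I] matches nn.Linear(I, H).weight
        experts.append(expert)
    return experts
```

The resulting keys (experts.{e}.gate_proj.weight and friends) are what lets Quark attach a per-Linear observer to each expert and what vLLM's fused-MoE loader expects.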

Post-export rename

Quark's native custom_mode='quark' export emits *_quantizer.scale / *_quantizer.zero_point keys. The published shards here have already been converted to the vLLM/HF-compatible layout:

  • *_quantizer.scale → *_scale
  • *_quantizer.zero_point → dropped (symmetric quant)
  • weight_scale squeezed from [out, 1] to [out]
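
A minimal sketch of what the rename_keys.py post-processor does, assuming the key patterns above (the script itself is the authoritative version; sharded-safetensors bookkeeping is omitted):

```python
# Hedged sketch of the Quark -> vLLM/HF key conversion described above.
import torch

def rename_quark_keys(state_dict: dict) -> dict:
    out = {}
    for key, tensor in state_dict.items():
        if key.endswith("_quantizer.zero_point"):
            continue  # symmetric quant: zero points are all zero, drop them
        if key.endswith("_quantizer.scale"):
            key = key.replace("_quantizer.scale", "_scale")  # e.g. weight_quantizer.scale -> weight_scale
            if key.endswith("weight_scale") and tensor.dim() == 2 and tensor.shape[1] == 1:
                tensor = tensor.squeeze(1)  # [out, 1] -> [out]
        out[key] = tensor
    return out
```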

Reproduce

Core Quark config fragment:

```python
from quark.torch.quantization.config.config import (
    QTensorConfig, QuantizationConfig, Config, Dtype,
)
from quark.torch.quantization.config.type import (
    RoundType, ScaleType, QSchemeType,
)
from quark.torch.quantization.observer import PerChannelMinMaxObserver

# INT8 weights: per-channel (axis=0), symmetric, static
weight = QTensorConfig(
    dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
    symmetric=True, is_dynamic=False,
    qscheme=QSchemeType.per_channel, ch_axis=0,
    round_method=RoundType.round, scale_type=ScaleType.float,
)
# INT8 activations: per-token (axis=1), symmetric, dynamic
act = QTensorConfig(
    dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
    symmetric=True, is_dynamic=True,
    qscheme=QSchemeType.per_channel, ch_axis=1,
    round_method=RoundType.round, scale_type=ScaleType.float,
)
cfg = Config(
    global_quant_config=QuantizationConfig(weight=weight, input_tensors=act),
    exclude=[
        "lm_head",
        "*mlp.gate",              # MoE router
        "*shared_expert_gate",    # per-layer gate
        "*visual*",               # vision tower + merger
        "mtp*",                   # MTP head
    ],
)
```

Export with pack_method='order', weight_format='real_quantized', custom_mode='quark', then run the rename_keys.py post-processor.
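
After export and rename, the resulting shard layout can be sanity-checked with a few lines (a sketch; adjust the path, and note it assumes a weight and its scale land in the same shard):

```python
# Verify: every weight_scale is squeezed to 1-D and sits next to an INT8 weight.
import glob
import torch
from safetensors.torch import load_file

for shard in sorted(glob.glob("/path/to/Qwen3.6-35B-A3B-Quark-W8A8-INT8/model-*.safetensors")):
    tensors = load_file(shard)
    for name, t in tensors.items():
        if name.endswith("weight_scale"):
            w = tensors.get(name.replace("weight_scale", "weight"))
            assert t.dim() == 1, (name, t.shape)                        # squeezed to [out]
            assert w is None or w.dtype == torch.int8, (name, w.dtype)  # paired INT8 weight
```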

Citation

```bibtex
@misc{qwen35moe,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team, Alibaba Cloud},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
```

License

This model is released under the Apache License, Version 2.0, following the upstream Qwen/Qwen3.6-35B-A3B.

  • Modified files (the INT8-quantized model-*.safetensors and the quantization_config block in config.json) are described in NOTICE.
  • A copy of the Apache-2.0 license is provided in LICENSE.

Original weights © 2025–2026 Qwen Team, Alibaba Cloud. Quantization is a derivative work distributed under Apache-2.0; no warranty of any kind is provided.
