Instructions for using nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8 with libraries, inference providers, and local apps.
- Libraries
- Transformers
How to use nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8")
model = AutoModelForImageTextToText.from_pretrained("nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Local Apps
- vLLM
How to use nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker

```shell
docker model run hf.co/nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8
```
- SGLang
How to use nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```

- Docker Model Runner
How to use nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8 with Docker Model Runner:
```shell
docker model run hf.co/nameistoken/Qwen3.6-35B-A3B-Quark-W8A8-INT8
```
Qwen3.6-35B-A3B-Quark-W8A8-INT8
W8A8 INT8 quantized version of Qwen/Qwen3.6-35B-A3B produced with AMD Quark.
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3_5MoeForConditionalGeneration (multimodal: ViT vision + text MoE + MTP head) |
| Parameters | 35B total / 3B activated per token (256 experts, top-8) + 27-block ViT (BF16) |
| Quantization | W8A8 INT8 — per-channel weight + per-token dynamic activation |
| Quantizer | AMD Quark 0.11.1 (pack_method='order', weight_format='real_quantized') |
| Model Size | ~35 GB (7 shards of ~5 GB) |
| Original Size | ~67 GB (BF16, 26 shards) |
| Compression | ~1.93× size reduction |
Quantization Scheme
| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Language attention (`q/k/v/o_proj`, `linear_attn.*`) | INT8 | per-channel weight (axis=0) | weight static |
| Language MoE experts (256 × `gate/up/down_proj` × 40) | INT8 | per-channel weight (axis=0) | weight static |
| `shared_expert` (`gate/up/down_proj`) | INT8 | per-channel weight (axis=0) | weight static |
| All activations above | INT8 | per-token (axis=1) | dynamic |
| `lm_head` | BF16 | — | unquantized |
| `embed_tokens` | BF16 | — | unquantized |
| MoE router (`mlp.gate`) — top-k gate | BF16 | — | unquantized |
| `shared_expert_gate` | BF16 | — | unquantized |
| `visual.*` (27-block ViT + merger) | BF16 | — | unquantized |
| MTP head | BF16 | — | unquantized |
Note: MoE experts are stored as 256 per-expert `nn.Linear` triplets (`gate_proj`/`up_proj`/`down_proj`) instead of the upstream fused `gate_up_proj` tensor. This is required so that Quark observers can attach to each expert as a standard `nn.Linear`, and the key layout matches vLLM's `FusedMoE.make_expert_params_mapping` exactly, so no loader-side change is needed.
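For intuition, here is a minimal PyTorch sketch of the scheme above (symmetric per-channel INT8 weights, symmetric per-token dynamic INT8 activations). It is illustrative only and is not Quark's or vLLM's actual kernel path:

```python
import torch

def w8a8_linear(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Toy W8A8 INT8 linear layer matching the scheme above (not Quark's code)."""
    # Per-channel weight quantization: one scale per output channel (axis=0).
    w_scale = weight.abs().amax(dim=1, keepdim=True) / 127.0            # [out, 1]
    w_int8 = (weight / w_scale).round().clamp(-128, 127).to(torch.int8)

    # Per-token dynamic activation quantization: one scale per token, at runtime.
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0                # [tokens, 1]
    x_int8 = (x / x_scale).round().clamp(-128, 127).to(torch.int8)

    # INT8 matmul, then dequantize with the product of the two scales.
    acc = x_int8.to(torch.float32) @ w_int8.to(torch.float32).t()       # [tokens, out]
    return acc * x_scale * w_scale.t()

# Usage: the error against the unquantized reference stays small on random data.
w, x = torch.randn(64, 128), torch.randn(4, 128)
print((w8a8_linear(w, x) - x @ w.t()).abs().max())
```

Because the activation scales are computed per token at runtime, only the per-channel `weight_scale` tensors need to be stored in the shards (see Post-export rename below).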
Accuracy
Evaluated on the full GSM8K test split (1319 questions), served under vLLM via `/v1/chat/completions` with `chat_template_kwargs.enable_thinking=false`, `temperature=0`, concurrency=16, `max_tokens=1024`.
| Model | Accuracy | Correct |
|---|---|---|
| Qwen/Qwen3.6-35B-A3B (BF16 baseline) | 95.91 % | 1265 / 1319 |
| This model (Quark W8A8 INT8) | 95.91 % | 1265 / 1319 |
Δ vs BF16 = 0.00 pp. The two result sets overlap on 1250 / 1280 questions (Jaccard = 0.9766); each side wins 15 problems the other loses — no systematic regression.
Both runs were done on a single AMD MI355X (288 GB HBM3e) at gpu_memory_utilization=0.55 (BF16) / 0.85 (INT8), max_model_len=4096.
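The overlap statistics quoted above follow directly from the raw counts; a quick arithmetic check:

```python
# Re-deriving the accuracy and overlap figures quoted above.
total = 1319
bf16_correct = int8_correct = 1265     # correct answers per model
both_correct = 1250                    # questions both models get right
union = bf16_correct + int8_correct - both_correct   # 1280 questions solved by at least one model

print(f"accuracy      : {bf16_correct / total:.2%}")       # 95.91%
print(f"Jaccard       : {both_correct / union:.4f}")       # 0.9766
print(f"wins per side : {bf16_correct - both_correct}")    # 15
```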
Performance
Measured on a single AMD Radeon 8060S APU (gfx1151, "Strix Halo") with 128 GB LPDDR5X-8000 unified memory, container kyuz0/vllm-therock-gfx1151:stable (vLLM 0.19.2rc1.dev113+g6aa057c9d, transformers 5.5.4), TP=1, KV cache BF16 (gfx1151 has no INT8 matrix core).
Long context — input=4000 / output=200, num_prompts = C * 3
--max-model-len 4096 --gpu-memory-utilization 0.85. BF16 baseline is the upstream Qwen3.6-35B-A3B (~67 GB weights).
| Concurrency | BF16 req/s | BF16 out tok/s | Quark W8A8 req/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|---|---|
| 1 | 0.044 | 8.83 | 0.060 | 12.02 | +36% |
| 5 | 0.093 | 18.58 | 0.142 | 28.31 | +52% |
| 10 | 0.128 | 25.58 | 0.186 | 37.30 | +46% |
| 20 | 0.163 | 32.53 | 0.240 | 47.98 | +48% |
Short context — input=512 / output=128, --ignore-eos, bs = num_prompts
Typical chat / decode-bound workload:
| Batch size | BF16 out tok/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|
| 1 | 13.36 | 17.43 | +30% |
| 8 | 36.47 | 64.91 | +78% |
| 16 | 61.16 | 92.04 | +50% |
Takeaways
- Quark W8A8 beats BF16 at every concurrency we measured on gfx1151, by +30–78 %. The gfx1151 APU has no INT8 matrix core, so the gain comes from the ~2× smaller weight footprint cutting memory-bandwidth pressure (LPDDR5X is the dominant bottleneck on Strix Halo); a toy bandwidth model is sketched after this list.
- Decode-bound / short-context is where W8A8 shines the most: at 512 in / 128 out, bs=8 → +78 %. Prefill-heavy long contexts still benefit, just less dramatically.
- Fits in unified memory with headroom: the packed INT8 model is ~35 GB vs ~67 GB BF16, so KV cache and weights no longer compete on a 128 GB Strix Halo box (the BF16 build hit a scheduler regression around C=100 where TTFT blew up to ~187 s — W8A8 avoids that class of pressure entirely).
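To make the bandwidth argument concrete, here is a toy decode-bound model. The sustained-bandwidth number is a placeholder (not a measured figure for this APU), and the model ignores KV-cache traffic, attention compute, and scheduling overhead, so it only gives loose upper bounds:

```python
# Toy roofline for batch-1 decode on a bandwidth-bound MoE (illustrative only).
activated_params = 3e9        # ~3B parameters activated per token (see Model Details)
bandwidth_gb_s = 200          # placeholder sustained memory bandwidth, GB/s

for fmt, bytes_per_param in (("BF16", 2), ("INT8", 1)):
    bytes_per_token = activated_params * bytes_per_param
    upper_bound = bandwidth_gb_s * 1e9 / bytes_per_token
    print(f"{fmt}: <= {upper_bound:.0f} tok/s (weights-streaming bound)")
```

Halving the bytes per parameter roughly doubles the weights-streaming ceiling; the measured gains land below 2×, presumably because prefill, KV-cache traffic, and compute costs are unchanged.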
How to Use
With vLLM (Recommended)
vllm serve /path/to/Qwen3.6-35B-A3B-Quark-W8A8-INT8 \
--served-model-name Qwen3.6-35B-A3B-W8A8 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--port 8000
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "Qwen3.6-35B-A3B-W8A8",
"messages": [{"role":"user","content":"Solve: 16 - 3 - 4 = ?"}],
"max_tokens": 256, "temperature": 0.7,
"chat_template_kwargs": {"enable_thinking": false}
}'
- vLLM ≥ `0.19.2rc1` with the `qwen3_5_moe` registration is required.
- The Qwen3.6 default chat template wraps the response in `<think>...</think>`; pass `enable_thinking=false` if you want the short form (see the Python client sketch below).
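The same request can also be sent from Python with the OpenAI client (a minimal sketch, assuming the server above is running and the `openai` package is installed); vLLM-specific fields such as `chat_template_kwargs` go through `extra_body`:

```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B-W8A8",
    messages=[{"role": "user", "content": "Solve: 16 - 3 - 4 = ?"}],
    max_tokens=256,
    temperature=0.7,
    # Disable the <think>...</think> block, as with the curl example above.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```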
Hardware Requirements
- Minimum VRAM: ~40 GB free for model weights + KV cache, i.e. a single MI300X / MI355X / H100-80G / A100-80G.
- Can fit on a consumer-class 48 GB card (e.g. W7900D) at `max_model_len` ≤ 4096, whereas the BF16 original (~68 GB of weights) cannot.
Quantization Details
Excluded layers (kept BF16)
- `lm_head`
- `model.language_model.layers.*.mlp.shared_expert_gate` (40 × single-output gate)
- `model.visual.pos_embed`, `model.visual.blocks.*.attn.{qkv,proj}`, `model.visual.blocks.*.mlp.linear_fc{1,2}`, `model.visual.merger.linear_fc{1,2}` (full 27-block ViT + merger)
- `model.embed_tokens` (not an `nn.Linear`; naturally not touched)
- MoE top-k router `mlp.gate` — kept BF16 via the custom MoE rewrite (see below)
- MTP head — kept BF16
Pre-quantization rewrite
The upstream `Qwen3_5MoeExperts` module stores 256 experts as a single fused 3-D tensor (`gate_up_proj: [E, 2·I, H]`, `down_proj: [E, H, I]`). Before quantization this is split in place into a `ModuleList` of 256 experts, each holding three `nn.Linear`s, following the SwiGLU `chunk(2, dim=-1)` semantics (front half = gate, back half = up). This makes every expert visible to Quark as a standard `nn.Linear`, and the resulting key layout is bit-compatible with vLLM's fused MoE loader.
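A minimal sketch of that split, assuming fused tensors with the shapes named above (the real rewrite patches the `Qwen3_5MoeExperts` module in place and preserves the original parameter dtypes and devices):

```python
import torch
import torch.nn as nn

def unfuse_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """Split fused expert tensors [E, 2*I, H] / [E, H, I] into per-expert
    gate_proj / up_proj / down_proj nn.Linear modules (illustrative sketch)."""
    E, two_i, H = gate_up_proj.shape
    I = two_i // 2
    experts = nn.ModuleList()
    for e in range(E):
        expert = nn.Module()
        expert.gate_proj = nn.Linear(H, I, bias=False)
        expert.up_proj = nn.Linear(H, I, bias=False)
        expert.down_proj = nn.Linear(I, H, bias=False)
        # SwiGLU chunk(2, dim=-1): front half of the fused projection is the
        # gate, back half is the up projection.
        expert.gate_proj.weight.data.copy_(gate_up_proj[e, :I, :])
        expert.up_proj.weight.data.copy_(gate_up_proj[e, I:, :])
        expert.down_proj.weight.data.copy_(down_proj[e])
        experts.append(expert)
    return experts
```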
Post-export rename
Quark's native `custom_mode='quark'` export emits `*_quantizer.scale` / `*_quantizer.zero_point` keys. The published shards here have already been converted to the vLLM/HF-compatible layout:
- `*_quantizer.scale` → `*_scale`
- `*_quantizer.zero_point` → dropped (symmetric quant)
- `weight_scale` squeezed from `[out, 1]` to `[out]`
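The conversion amounts to a straightforward key/tensor rewrite over the state dict; a hypothetical sketch (the actual `rename_keys.py` is not reproduced here):

```python
import torch

def rename_quark_keys(state_dict: dict) -> dict:
    """Convert Quark's custom_mode='quark' key layout into the vLLM/HF-style
    layout described above (hypothetical sketch, not the shipped script)."""
    out = {}
    for key, tensor in state_dict.items():
        if key.endswith("_quantizer.zero_point"):
            continue  # symmetric quantization: zero points are all zero, drop them
        if key.endswith("_quantizer.scale"):
            # foo_quantizer.scale -> foo_scale
            key = key[: -len("_quantizer.scale")] + "_scale"
            if key.endswith("weight_scale") and tensor.dim() == 2:
                tensor = tensor.squeeze(-1)  # [out, 1] -> [out]
        out[key] = tensor
    return out
```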
Reproduce
Core Quark config fragment:
from quark.torch.quantization.config.config import (
QTensorConfig, QuantizationConfig, Config, Dtype,
)
from quark.torch.quantization.config.type import (
RoundType, ScaleType, QSchemeType,
)
from quark.torch.quantization.observer import PerChannelMinMaxObserver
weight = QTensorConfig(
dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
symmetric=True, is_dynamic=False,
qscheme=QSchemeType.per_channel, ch_axis=0,
round_method=RoundType.round, scale_type=ScaleType.float,
)
act = QTensorConfig(
dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
symmetric=True, is_dynamic=True,
qscheme=QSchemeType.per_channel, ch_axis=1,
round_method=RoundType.round, scale_type=ScaleType.float,
)
cfg = Config(
global_quant_config=QuantizationConfig(weight=weight, input_tensors=act),
exclude=[
"lm_head",
"*mlp.gate", # MoE router
"*shared_expert_gate", # per-layer gate
"*visual*", # vision tower + merger
"mtp*", # MTP head
],
)
Export with `pack_method='order'`, `weight_format='real_quantized'`, `custom_mode='quark'`, then run the `rename_keys.py` post-processor.
Citation
@misc{qwen35moe,
title = {Qwen3.6-35B-A3B},
author = {Qwen Team, Alibaba Cloud},
year = {2026},
url = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
License
This model is released under the Apache License, Version 2.0, following the upstream
Qwen/Qwen3.6-35B-A3B.
- Modified files (the INT8-quantized `model-*.safetensors` and the `quantization_config` block in `config.json`) are described in `NOTICE`.
- A copy of the Apache-2.0 license is provided in `LICENSE`.
Original weights © 2025–2026 Qwen Team, Alibaba Cloud. Quantization is a derivative work distributed under Apache-2.0; no warranty of any kind is provided.