Qwen3.6-35B-A3B-Quark-W8A8-INT8

W8A8 INT8 quantized version of Qwen/Qwen3.6-35B-A3B produced with AMD Quark.

Model Details

| Field | Value |
|---|---|
| Base Model | Qwen/Qwen3.6-35B-A3B |
| Architecture | Qwen3_5MoeForConditionalGeneration (multimodal: ViT vision + text MoE + MTP head) |
| Parameters | 35B total / 3B activated per token (256 experts, top-8) + 27-block ViT (BF16) |
| Quantization | W8A8 INT8 — per-channel weight + per-token dynamic activation |
| Quantizer | AMD Quark 0.11.1 (pack_method='order', weight_format='real_quantized') |
| Model Size | ~35 GB (7 shards of ~5 GB) |
| Original Size | ~67 GB (BF16, 26 shards) |
| Compression | ~1.93× size reduction |

Quantization Scheme

| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Language attention (q/k/v/o_proj, linear_attn.*) | INT8 | per-channel weight (axis=0) | weight static |
| Language MoE experts (256 × gate/up/down_proj × 40) | INT8 | per-channel weight (axis=0) | weight static |
| shared_expert (gate/up/down_proj) | INT8 | per-channel weight (axis=0) | weight static |
| All activations above | INT8 | per-token (axis=1) | dynamic |
| lm_head | BF16 | | unquantized |
| embed_tokens | BF16 | | unquantized |
| MoE router (mlp.gate) — top-k gate | BF16 | | unquantized |
| shared_expert_gate | BF16 | | unquantized |
| visual.* (27-block ViT + merger) | BF16 | | unquantized |
| MTP head | BF16 | | unquantized |

Note: MoE experts are stored as 256 per-expert nn.Linear triplets (gate_proj/up_proj/down_proj) instead of the upstream fused gate_up_proj tensor. This is required so that Quark observers can attach to each expert as a standard nn.Linear, and the key layout matches vLLM's FusedMoE.make_expert_params_mapping exactly — no loader-side change needed.
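
For clarity, the per-channel-weight / per-token-activation scheme amounts to the following reference numerics. This is a plain-PyTorch emulation for illustration only, not the fused INT8 kernel vLLM actually runs:

```python
# Reference emulation of W8A8: symmetric per-channel INT8 weights (scale shape [out])
# and per-token dynamically quantized INT8 activations (one scale per token).
import torch

def w8a8_linear(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    # x: [tokens, in] BF16/FP32 activations; w_int8: [out, in] INT8; w_scale: [out]
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0  # per-token scale
    x_int8 = torch.clamp((x / x_scale).round(), -128, 127)                # dynamic activation quant
    acc = x_int8.float() @ w_int8.float().t()                             # integer matmul emulated in fp32
    return acc * x_scale * w_scale                                        # dequant: per-token × per-channel
```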

Accuracy

Evaluated on the full GSM8K test split (1,319 questions), served under vLLM via /v1/chat/completions with chat_template_kwargs.enable_thinking=false, temperature=0, concurrency=16, max_tokens=1024.

| Model | Accuracy | Correct |
|---|---|---|
| Qwen/Qwen3.6-35B-A3B (BF16 baseline) | 95.91 % | 1265 / 1319 |
| This model (Quark W8A8 INT8) | 95.91 % | 1265 / 1319 |

Δ vs BF16 = 0.00 pp. The sets of correctly answered questions agree on 1250 of the 1280 questions that either model got right (Jaccard = 0.9766); each side wins 15 problems the other loses, so there is no systematic regression.

Both runs were done on a single AMD MI355X (288 GB HBM3e) at gpu_memory_utilization=0.55 (BF16) / 0.85 (INT8), max_model_len=4096.
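
For reference, the evaluation loop amounts to roughly the following sketch. The exact harness is not part of this repo; the request parameters mirror the ones listed above, while the answer-extraction regex and the dataset loading are assumptions about how the final number was scored:

```python
# Hedged sketch of the GSM8K scoring described above (sequential for brevity;
# the reported run used concurrency=16). `dataset` can be loaded e.g. via
# datasets.load_dataset("gsm8k", "main", split="test").
import re
import requests

URL = "http://localhost:8000/v1/chat/completions"

def last_number(text: str):
    """Return the last number in `text`, thousands separators stripped."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def gsm8k_accuracy(dataset, model="Qwen3.6-35B-A3B-W8A8"):
    correct = 0
    for ex in dataset:
        reply = requests.post(URL, json={
            "model": model,
            "messages": [{"role": "user", "content": ex["question"]}],
            "max_tokens": 1024, "temperature": 0,
            "chat_template_kwargs": {"enable_thinking": False},
        }).json()["choices"][0]["message"]["content"]
        gold = ex["answer"].split("####")[-1].strip().replace(",", "")  # GSM8K gold answer follows "####"
        pred = last_number(reply)
        correct += pred is not None and float(pred) == float(gold)
    return correct / len(dataset)
```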

Performance

Measured on a single AMD Radeon 8060S APU (gfx1151, "Strix Halo") with 128 GB LPDDR5X-8000 unified memory, container kyuz0/vllm-therock-gfx1151:stable (vLLM 0.19.2rc1.dev113+g6aa057c9d, transformers 5.5.4), TP=1, KV cache BF16 (gfx1151 has no INT8 matrix core).

Long context — input=4000 / output=200, num_prompts = C * 3

Served with --max-model-len 4096 --gpu-memory-utilization 0.85. The BF16 baseline is the upstream Qwen3.6-35B-A3B (~67 GB of weights).

| Concurrency | BF16 req/s | BF16 out tok/s | Quark W8A8 req/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|---|---|
| 1 | 0.044 | 8.83 | 0.060 | 12.02 | +36% |
| 5 | 0.093 | 18.58 | 0.142 | 28.31 | +52% |
| 10 | 0.128 | 25.58 | 0.186 | 37.30 | +46% |
| 20 | 0.163 | 32.53 | 0.240 | 47.98 | +48% |

Short context — input=512 / output=128, --ignore-eos, bs = num_prompts

Typical chat / decode-bound workload:

| Batch size | BF16 out tok/s | Quark W8A8 out tok/s | W8A8 / BF16 |
|---|---|---|---|
| 1 | 13.36 | 17.43 | +30% |
| 8 | 36.47 | 64.91 | +78% |
| 16 | 61.16 | 92.04 | +50% |

Takeaways

  • Quark W8A8 beats BF16 at every concurrency we measured on gfx1151, by +30–78 %. The gfx1151 APU has no INT8 matrix core, so the gain comes from the ~2× smaller weight footprint cutting memory-bandwidth pressure (LPDDR5X is the dominant bottleneck on Strix Halo); see the back-of-envelope sketch after this list.
  • Decode-bound / short-context is where W8A8 shines the most: at 512 in / 128 out, bs=8 → +78 %. Prefill-heavy long contexts still benefit, just less dramatically.
  • Fits in unified memory with headroom: the packed INT8 model is ~35 GB vs ~67 GB BF16, so KV cache and weights no longer compete on a 128 GB Strix Halo box (the BF16 build hit a scheduler regression around C=100 where TTFT blew up to ~187 s — W8A8 avoids that class of pressure entirely).
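
To make the bandwidth argument concrete, here is a rough decode-side ceiling. The inputs are assumptions (≈3 B activated parameters per token, ≈256 GB/s peak bandwidth for a 256-bit LPDDR5X-8000 bus, no overlap or overhead modeled), so treat it as a sanity check rather than a prediction:

```python
# Back-of-envelope single-stream decode ceilings under a pure weight-traffic model.
# Real throughput is far below these because attention, KV traffic and CPU/launch
# overhead are ignored; the point is only that halving weight bytes moves the ceiling.
active_params = 3e9      # ~3B activated parameters per token (A3B), assumed
peak_bw = 256e9          # ~256 GB/s peak LPDDR5X-8000 on a 256-bit bus, assumed

bytes_per_token_bf16 = active_params * 2   # 2 bytes/param
bytes_per_token_int8 = active_params * 1   # 1 byte/param

print(peak_bw / bytes_per_token_bf16)  # ~43 tok/s ceiling for BF16
print(peak_bw / bytes_per_token_int8)  # ~85 tok/s ceiling for INT8
```

The measured bs=1 numbers (13.36 vs 17.43 tok/s) sit well below both ceilings, which is why the observed bs=1 gain is +30% rather than the theoretical 2×; at larger batch sizes the weight traffic is amortized differently and the gap widens.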

How to Use

With vLLM (Recommended)

```bash
vllm serve /path/to/Qwen3.6-35B-A3B-Quark-W8A8-INT8 \
    --served-model-name Qwen3.6-35B-A3B-W8A8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85 \
    --trust-remote-code \
    --port 8000
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen3.6-35B-A3B-W8A8",
    "messages": [{"role":"user","content":"Solve: 16 - 3 - 4 = ?"}],
    "max_tokens": 256, "temperature": 0.7,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
```

  • vLLM ≥ 0.19.2rc1 with the qwen3_5_moe registration is required.
  • The Qwen3.6 default chat template wraps the response in <think>...</think>; pass enable_thinking=false if you want the short form.
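
The same request can be made from Python with the OpenAI client (a minimal sketch, assuming the vLLM OpenAI-compatible endpoint started above; chat_template_kwargs is passed through extra_body):

```python
# Minimal OpenAI-client sketch against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="Qwen3.6-35B-A3B-W8A8",
    messages=[{"role": "user", "content": "Solve: 16 - 3 - 4 = ?"}],
    max_tokens=256,
    temperature=0.7,
    # vLLM forwards extra_body fields such as chat_template_kwargs to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```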

Hardware Requirements

  • Minimum VRAM: ~40 GB free for model weights + KV cache, i.e. a single MI300X / MI355X / H100-80G / A100-80G.
  • Can fit on a consumer-class 48 GB card (e.g. W7900D) at max_model_len ≤ 4096, whereas the BF16 original (~67 GB of weights) cannot.

Quantization Details

Excluded layers (kept BF16)

  • lm_head
  • model.language_model.layers.*.mlp.shared_expert_gate (40 × single-output gate)
  • model.visual.pos_embed, model.visual.blocks.*.attn.{qkv,proj}, model.visual.blocks.*.mlp.linear_fc{1,2}, model.visual.merger.linear_fc{1,2} (full 27-block ViT + merger)
  • model.embed_tokens (not an nn.Linear; naturally not touched)
  • MoE top-k router mlp.gate — kept BF16 via the custom MoE rewrite (see below)
  • MTP head — kept BF16

Pre-quantization rewrite

The upstream Qwen3_5MoeExperts module stores 256 experts as a single fused 3-D tensor (gate_up_proj: [E, 2·I, H], down_proj: [E, H, I]). Before quantization this is split in-place into ModuleList[256] of three nn.Linears per expert, following the SwiGLU chunk(2, dim=-1) semantics (front half = gate, back half = up). This makes every expert visible to Quark as a standard nn.Linear, and the resulting key layout is bit-compatible with vLLM's fused MoE loader.
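
A hedged sketch of that split is shown below. The shapes follow the text ([E, 2·I, H] and [E, H, I]); the function name and the way the result is attached back onto the upstream Qwen3_5MoeExperts module are illustrative assumptions, not the exact script used here:

```python
# Sketch: split fused MoE tensors into per-expert gate/up/down nn.Linear triplets.
import torch
from torch import nn

def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """gate_up_proj: [E, 2*I, H], down_proj: [E, H, I] -> ModuleList of E expert modules."""
    num_experts, two_inter, hidden = gate_up_proj.shape
    inter = two_inter // 2
    experts = nn.ModuleList()
    for e in range(num_experts):
        expert = nn.Module()
        expert.gate_proj = nn.Linear(hidden, inter, bias=False)
        expert.up_proj = nn.Linear(hidden, inter, bias=False)
        expert.down_proj = nn.Linear(inter, hidden, bias=False)
        # chunk(2, dim=-1) semantics on the fused output: front half = gate, back half = up
        expert.gate_proj.weight.data.copy_(gate_up_proj[e, :inter, :])
        expert.up_proj.weight.data.copy_(gate_up_proj[e, inter:, :])
        expert.down_proj.weight.data.copy_(down_proj[e])  # [H, I] matches nn.Linear(I, H).weight
        experts.append(expert)
    return experts
```

The resulting keys (experts.{e}.gate_proj.weight and friends) are what lets Quark attach a per-Linear observer to each expert and what vLLM's fused-MoE loader expects.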

Post-export rename

Quark's native custom_mode='quark' export emits *_quantizer.scale / *_quantizer.zero_point keys. The published shards here have already been converted to the vLLM/HF-compatible layout:

  • *_quantizer.scale → *_scale
  • *_quantizer.zero_point → dropped (symmetric quant)
  • weight_scale squeezed from [out, 1] to [out]
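
A minimal sketch of what the rename_keys.py post-processor does, assuming the key patterns above (the script itself is the authoritative version; sharded-safetensors bookkeeping is omitted):

```python
# Hedged sketch of the Quark -> vLLM/HF key conversion described above.
import torch

def rename_quark_keys(state_dict: dict) -> dict:
    out = {}
    for key, tensor in state_dict.items():
        if key.endswith("_quantizer.zero_point"):
            continue  # symmetric quant: zero points are all zero, drop them
        if key.endswith("_quantizer.scale"):
            key = key.replace("_quantizer.scale", "_scale")  # e.g. weight_quantizer.scale -> weight_scale
            if key.endswith("weight_scale") and tensor.dim() == 2 and tensor.shape[1] == 1:
                tensor = tensor.squeeze(1)  # [out, 1] -> [out]
        out[key] = tensor
    return out
```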

Reproduce

Core Quark config fragment:

```python
from quark.torch.quantization.config.config import (
    QTensorConfig, QuantizationConfig, Config, Dtype,
)
from quark.torch.quantization.config.type import (
    RoundType, ScaleType, QSchemeType,
)
from quark.torch.quantization.observer import PerChannelMinMaxObserver

# INT8 weights: per-channel (axis=0), symmetric, static
weight = QTensorConfig(
    dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
    symmetric=True, is_dynamic=False,
    qscheme=QSchemeType.per_channel, ch_axis=0,
    round_method=RoundType.round, scale_type=ScaleType.float,
)
# INT8 activations: per-token (axis=1), symmetric, dynamic
act = QTensorConfig(
    dtype=Dtype.int8, observer_cls=PerChannelMinMaxObserver,
    symmetric=True, is_dynamic=True,
    qscheme=QSchemeType.per_channel, ch_axis=1,
    round_method=RoundType.round, scale_type=ScaleType.float,
)
cfg = Config(
    global_quant_config=QuantizationConfig(weight=weight, input_tensors=act),
    exclude=[
        "lm_head",
        "*mlp.gate",              # MoE router
        "*shared_expert_gate",    # per-layer gate
        "*visual*",               # vision tower + merger
        "mtp*",                   # MTP head
    ],
)
```

Export with pack_method='order', weight_format='real_quantized', custom_mode='quark', then run the rename_keys.py post-processor.
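
After export and rename, the resulting shard layout can be sanity-checked with a few lines (a sketch; adjust the path, and note it assumes a weight and its scale land in the same shard):

```python
# Verify: every weight_scale is squeezed to 1-D and sits next to an INT8 weight.
import glob
import torch
from safetensors.torch import load_file

for shard in sorted(glob.glob("/path/to/Qwen3.6-35B-A3B-Quark-W8A8-INT8/model-*.safetensors")):
    tensors = load_file(shard)
    for name, t in tensors.items():
        if name.endswith("weight_scale"):
            w = tensors.get(name.replace("weight_scale", "weight"))
            assert t.dim() == 1, (name, t.shape)                        # squeezed to [out]
            assert w is None or w.dtype == torch.int8, (name, w.dtype)  # paired INT8 weight
```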

Citation

```bibtex
@misc{qwen35moe,
  title  = {Qwen3.6-35B-A3B},
  author = {Qwen Team, Alibaba Cloud},
  year   = {2026},
  url    = {https://huggingface.co/Qwen/Qwen3.6-35B-A3B}
}
```

License

This model is released under the Apache License, Version 2.0, following the upstream Qwen/Qwen3.6-35B-A3B.

  • Modified files (the INT8-quantized model-*.safetensors and the quantization_config block in config.json) are described in NOTICE.
  • A copy of the Apache-2.0 license is provided in LICENSE.

Original weights © 2025–2026 Qwen Team, Alibaba Cloud. Quantization is a derivative work distributed under Apache-2.0; no warranty of any kind is provided.
