⚑ Qwen3.6-35B-A3B — HLWQ Q5

Hadamard-Lloyd Weight Quantization of Qwen/Qwen3.6-35B-A3B

🔬 First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model

📊 Compression Pipeline


| Metric | Value |
|---|---|
| 🎯 Weight bits | 5 (HLWQ Q5 — Lloyd-Max + Hadamard) |
| 📦 polar_state | 21.55 GB (6 shards, 62,190 keys) |
| 🔒 Coverage | 95.8% of 35.11B params (33.62B quantized) |
| ⏱️ Quantization time | 60s (PQ5) + 65s (CT INT4) |
| 🏗️ Architecture | 40L hybrid (30 GDN + 10 Full Attention) |
| 🧩 Experts | 256/layer, 8 routed + 1 shared |
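
The shard and key counts above can be double-checked directly against the Hub repository. The snippet below is a minimal sketch; it assumes the polar_state is stored as `.safetensors` shards in this repo, and counting keys this way downloads all ~21.55 GB:

```python
from huggingface_hub import HfApi, hf_hub_download
from safetensors import safe_open

repo_id = "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5"

# List repository files and keep only the safetensors shards.
files = HfApi().list_repo_files(repo_id)
shards = [f for f in files if f.endswith(".safetensors")]
print(f"shards: {len(shards)}")  # expected: 6

# Count keys across all shards (downloads the full polar_state on first run).
total_keys = 0
for shard in shards:
    path = hf_hub_download(repo_id, filename=shard)
    with safe_open(path, framework="pt") as f:
        total_keys += len(f.keys())
print(f"keys: {total_keys}")  # expected: 62,190
```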

🏎️ Speed Benchmarks


Measured on an RTX PRO 6000 Blackwell (96 GB). The FP16 KV-cache numbers use the optimized model.generate() path; the Q3/Q2 KV-cache numbers use a manual generation loop with PolarQuantKVCache.
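
The FP16 KV-cache throughput can be reproduced with a simple timing loop around model.generate(). The sketch below is generic (the model id, prompt, and token budget are placeholders) and does not cover the PolarQuantKVCache path, which uses the manual loop from the polarquant tooling:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whichever variant is being benchmarked.
model_id = "Qwen/Qwen3.6-35B-A3B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tok("Explain Hadamard-Lloyd weight quantization.", return_tensors="pt").to(model.device)

# Warm-up, then a timed greedy decode of a fixed token budget.
model.generate(**inputs, max_new_tokens=16, do_sample=False)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / dt:.1f} tok/s (FP16 KV cache)")
```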

🧬 Architecture


| Component | Spec |
|---|---|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |

📋 Quantization Coverage


| Component | Count | Status |
|---|---|---|
| MoE expert slices | 20,480 | ✅ HLWQ Q5 |
| Attention projections | 130 | ✅ HLWQ Q5 |
| Shared expert MLPs | 120 | ✅ HLWQ Q5 |
| Norms, layernorms | — | ⬜ BF16 |
| MoE routers | 40 | ⬜ BF16 |
| GDN gates (in_proj_a/b) | 60 | ⬜ BF16 (critical) |
| Vision encoder | 27 layers | ⬜ BF16 |
| MTP layer | 1 | ⬜ BF16 |
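
In practice the split above comes down to a name/shape filter over the checkpoint's tensors. The snippet below is only an illustrative sketch: the substring patterns are assumptions about this model's parameter names, not the exact rules used by the polarquant tooling.

```python
# Illustrative name-based filter: decide which tensors get HLWQ Q5
# and which stay in BF16. The substrings are assumed names, not the
# exact polarquant rules for this checkpoint.
KEEP_BF16 = (
    "norm",                     # norms / layernorms
    "router",                   # MoE routers
    "in_proj_a", "in_proj_b",   # GDN gates (quality-critical)
    "visual", "vision",         # vision encoder
    "mtp",                      # multi-token-prediction layer
)

def should_quantize(name: str, tensor) -> bool:
    """Quantize 2-D weight matrices unless they match a keep-in-BF16 pattern."""
    if tensor.ndim != 2:
        return False
    return not any(pat in name for pat in KEEP_BF16)
```

Anything the filter skips is stored alongside the quantized tensors in its original BF16 form.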

🚀 Deployment

For inference, use the CT INT4 version:

👉 caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4

GPU Compatibility

Install the expert-offload vLLM fork and launch the server, adjusting --moe-expert-cache-size per the table below:

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only --enforce-eager --moe-expert-cache-size 8
```

| GPU | Expert Cache | VRAM |
|---|---|---|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |
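
Once the server is running it exposes vLLM's standard OpenAI-compatible API, so any OpenAI client can talk to it. A minimal example, assuming the default localhost:8000 endpoint and no API key:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4",
    messages=[{"role": "user", "content": "Summarize HLWQ quantization in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```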

🔬 Method

HLWQ (Hadamard-Lloyd Weight Quantization):

```
Weight Matrix W (out × in)
    │
    ▼
[1] Block reshape → (out, n_blocks, 128)
    │
    ▼
[2] Per-block L2 normalize → norms saved
    │
    ▼
[3] Walsh-Hadamard rotation: blocks @ H128 × √128
    │   (uniform information distribution)
    │
    ▼
[4] Lloyd-Max 5-bit quantization (32 centroids, N(0,1))
    │   (optimal MSE for Gaussian values)
    │
    ▼
[5] 5-bit pack: 8 codes → 5 bytes
    │
    ▼
polar_state: __packed, __norms, __meta
```
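
Read end to end, the pipeline maps onto a few dozen lines of NumPy/SciPy. The sketch below is an illustrative re-implementation from this description only, not the polarquant source: function names such as quantize_hlwq are hypothetical, and the exact scaling in step [3] and the packing layout in step [5] are assumptions consistent with the diagram.

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import norm

BLOCK = 128
BITS = 5
LEVELS = 1 << BITS  # 32 centroids

def lloyd_max_centroids(levels=LEVELS, iters=50):
    """Lloyd-Max codebook for a standard normal N(0,1) source."""
    c = norm.ppf((np.arange(levels) + 0.5) / levels)   # quantile initialization
    for _ in range(iters):
        b = (c[:-1] + c[1:]) / 2                        # decision boundaries
        lo = np.concatenate(([-np.inf], b))
        hi = np.concatenate((b, [np.inf]))
        mass = norm.cdf(hi) - norm.cdf(lo)
        # Conditional mean of N(0,1) on each cell: (phi(lo) - phi(hi)) / mass
        c = (norm.pdf(lo) - norm.pdf(hi)) / np.maximum(mass, 1e-12)
    return c

def quantize_hlwq(W, centroids):
    """Hypothetical HLWQ Q5 encoder: packed 5-bit codes, per-block norms, shape."""
    out, inp = W.shape
    blocks = W.reshape(out, inp // BLOCK, BLOCK)               # [1] block reshape
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    unit = blocks / np.maximum(norms, 1e-12)                   # [2] per-block L2 normalize
    H = hadamard(BLOCK) / np.sqrt(BLOCK)                       # orthonormal H128
    rotated = unit @ H * np.sqrt(BLOCK)                        # [3] rotate; entries ~ N(0,1)
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(boundaries, rotated).astype(np.uint8)  # [4] nearest centroid
    # [5] pack 8 x 5-bit codes into 5 bytes via plain bit expansion
    bits = np.unpackbits(codes.reshape(-1, 1), axis=1, bitorder="little")[:, :BITS]
    packed = np.packbits(bits.reshape(-1), bitorder="little")
    return packed, norms.squeeze(-1), W.shape

def dequantize_hlwq(packed, norms, shape, centroids):
    """Inverse path: unpack codes, look up centroids, undo rotation, rescale."""
    out, inp = shape
    n = out * (inp // BLOCK) * BLOCK
    bits = np.unpackbits(packed, bitorder="little")[: n * BITS].reshape(n, BITS)
    codes = np.packbits(np.pad(bits, ((0, 0), (0, 8 - BITS))), axis=1, bitorder="little").ravel()
    rotated = centroids[codes].reshape(out, inp // BLOCK, BLOCK)
    H = hadamard(BLOCK) / np.sqrt(BLOCK)
    unit = (rotated / np.sqrt(BLOCK)) @ H.T                    # undo rotation (H is orthonormal)
    return (unit * norms[..., None]).reshape(out, inp)

# Quick round-trip check on a random matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)).astype(np.float32)
cb = lloyd_max_centroids()
packed, norms, shape = quantize_hlwq(W, cb)
W_hat = dequantize_hlwq(packed, norms, shape, cb)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The Lloyd-Max codebook only needs to be computed once: every block is normalized and Hadamard-rotated toward an approximately standard-normal distribution before quantization, so a single N(0,1) codebook serves all blocks.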

📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

🔗 Links

| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🚀 CT INT4 | Qwen3.6-35B-A3B-HLWQ-CT-INT4 |
| 🏠 Base model | Qwen/Qwen3.6-35B-A3B |