PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Paper • arXiv:2603.29078 • Published
Hadamard-Lloyd Weight Quantization of Qwen/Qwen3.6-35B-A3B
🔬 First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model
| Metric | Value |
|---|---|
| 🎯 Weight bits | 5 (HLWQ Q5: Lloyd-Max + Hadamard) |
| 📦 polar_state | 21.55 GB (6 shards, 62,190 keys) |
| 🔢 Coverage | 95.8% of 35.11B params (33.62B quantized) |
| ⏱️ Quantization time | 60 s (PQ5) + 65 s (CT INT4) |
| 🏗️ Architecture | 40 layers, hybrid (30 GDN + 10 full attention) |
| 🧩 Experts | 256/layer, 8 routed + 1 shared |
Benchmark hardware: RTX PRO 6000 Blackwell (96 GB). The FP16 KV baseline uses the optimized `model.generate()`; the Q3/Q2 KV configurations use a manual generation loop with `PolarQuantKVCache`.
| Component | Spec |
|---|---|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |
| Component | Count | Status |
|---|---|---|
| MoE expert slices | 20,480 | ✅ HLWQ Q5 |
| Attention projections | 130 | ✅ HLWQ Q5 |
| Shared expert MLPs | 120 | ✅ HLWQ Q5 |
| Norms, layernorms | – | ⬜ BF16 |
| MoE routers | 40 | ⬜ BF16 |
| GDN gates (in_proj_a/b) | 60 | ⬜ BF16 (critical) |
| Vision encoder | 27 layers | ⬜ BF16 |
| MTP layer | 1 | ⬜ BF16 |
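The keep/quantize split in the table above amounts to a name filter over the model's weights. A minimal sketch, with the caveat that the regex patterns below are illustrative guesses at Qwen3-style module names, not the actual HLWQ configuration:

```python
import re

# Hypothetical name patterns for modules kept in BF16, mirroring the table
# above: norms, MoE routers, GDN gates (in_proj_a/b), vision encoder, MTP.
KEEP_BF16 = [
    r"norm",            # layernorms / RMSNorms
    r"\.gate\b",        # MoE router (often named `gate` in Qwen-style MoE)
    r"in_proj_[ab]",    # GDN gates -- critical, never quantized
    r"\bvisual\b",      # vision encoder
    r"\bmtp\b",         # multi-token-prediction layer
]

def should_quantize(param_name: str) -> bool:
    """True if this weight goes through HLWQ Q5, False if it stays BF16."""
    return not any(re.search(p, param_name) for p in KEEP_BF16)

# Expert projections pass the filter; the router and norms do not.
print(should_quantize("model.layers.0.mlp.experts.3.gate_proj.weight"))  # True
print(should_quantize("model.layers.0.mlp.gate.weight"))                 # False
print(should_quantize("model.layers.0.input_layernorm.weight"))          # False
```

Note that `\.gate\b` catches the router module `mlp.gate` but not `gate_proj` inside an expert, since `_` is a word character and blocks the boundary match.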
For inference, use the CT INT4 version:
caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4
```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only --enforce-eager --moe-expert-cache-size 8
```
| GPU | Expert Cache | VRAM |
|---|---|---|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |
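A back-of-envelope check on the expert-cache numbers, assuming the shapes from the spec table (hidden 2048, expert intermediate 512, 256 experts, 40 layers) and three INT4 projection matrices per expert. This counts expert weights only; the VRAM totals in the table above also include dense weights, KV cache, and activations:

```python
# Rough expert-weight VRAM estimate; shapes taken from the spec table,
# the 3-matrix (gate/up/down) SwiGLU-style expert MLP is an assumption.
HIDDEN, INTER = 2048, 512
EXPERTS, LAYERS = 256, 40
INT4_BYTES = 0.5                                  # bytes per parameter at 4 bits

params_per_expert = 3 * HIDDEN * INTER            # 3,145,728 params
mb_per_expert = params_per_expert * INT4_BYTES / 1e6

def expert_cache_gb(cached_per_layer: int) -> float:
    """VRAM for expert weights alone, with `cached_per_layer` experts resident."""
    return LAYERS * cached_per_layer * mb_per_expert / 1e3

print(f"{mb_per_expert:.2f} MB per expert")        # ~1.57 MB
print(f"{expert_cache_gb(EXPERTS):.1f} GB all-in") # ~16.1 GB of expert weights
print(f"{expert_cache_gb(4):.2f} GB at cache=4")   # ~0.25 GB of expert weights
```

The ~16 GB of resident expert weights is consistent with the ~20 GB all-in figure once dense weights are added; with a small cache, almost all expert weight stays off-GPU.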
HLWQ (Hadamard-Lloyd Weight Quantization):

```
Weight Matrix W (out × in)
        │
        ▼
[1] Block reshape → (out, n_blocks, 128)
        │
        ▼
[2] Per-block L2 normalize → norms saved
        │
        ▼
[3] Walsh-Hadamard rotation: blocks @ H128 / √128
        │   (spreads information uniformly; entries rescaled by √128 ≈ N(0,1))
        │
        ▼
[4] Lloyd-Max 5-bit quantization (32 centroids fit to N(0,1))
        │   (minimum-MSE quantizer for Gaussian values)
        │
        ▼
[5] 5-bit pack: 8 codes → 5 bytes
        │
        ▼
polar_state: __packed, __norms, __meta
```
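The five steps above can be sketched end to end. This is a reconstruction from the diagram, not the `polarquant` package code; in particular, the Lloyd-Max codebook here is fit by 1-D k-means on Gaussian samples rather than loaded from the package:

```python
import numpy as np

BLOCK, BITS = 128, 5
K = 1 << BITS                                       # 32 centroids

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of the n x n Walsh-Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def lloyd_max_codebook(k=K, iters=30, n=200_000, seed=0) -> np.ndarray:
    """Approximate Lloyd-Max centroids for N(0,1), via 1-D k-means on samples."""
    x = np.sort(np.random.default_rng(seed).standard_normal(n))
    c = np.quantile(x, (np.arange(k) + 0.5) / k)        # quantile init
    for _ in range(iters):
        idx = np.searchsorted((c[:-1] + c[1:]) / 2, x)  # nearest-centroid bins
        c = np.array([x[idx == j].mean() for j in range(k)])
    return c

def hlwq_quantize(W: np.ndarray):
    out_dim, in_dim = W.shape
    assert in_dim % BLOCK == 0
    blocks = W.reshape(out_dim, -1, BLOCK)                  # [1] block reshape
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)  # [2] norms saved
    H = hadamard(BLOCK) / np.sqrt(BLOCK)                    # orthonormal rotation
    z = ((blocks / norms) @ H) * np.sqrt(BLOCK)             # [3] entries ~ N(0,1)
    cb = lloyd_max_codebook()
    codes = np.abs(z[..., None] - cb).argmin(-1).astype(np.uint8)  # [4] encode
    return codes, norms, cb

def pack5(codes: np.ndarray) -> np.ndarray:
    """[5] 8 x 5-bit codes -> 5 bytes (little-endian bit packing)."""
    flat = codes.reshape(-1, 8).astype(np.uint64)
    word = np.zeros(len(flat), dtype=np.uint64)
    for i in range(8):
        word |= flat[:, i] << np.uint64(5 * i)
    return word.view(np.uint8).reshape(-1, 8)[:, :5].copy()

def unpack5(packed: np.ndarray) -> np.ndarray:
    buf = np.zeros((len(packed), 8), dtype=np.uint8)
    buf[:, :5] = packed
    word = np.ascontiguousarray(buf).view(np.uint64).ravel()
    cols = [(word >> np.uint64(5 * i)) & np.uint64(31) for i in range(8)]
    return np.stack(cols, axis=1).reshape(-1).astype(np.uint8)

def hlwq_dequantize(codes, norms, cb) -> np.ndarray:
    """Invert steps [4]-[1]: decode, un-rotate (H orthogonal: inverse = H.T), rescale."""
    H = hadamard(BLOCK) / np.sqrt(BLOCK)
    blocks = ((cb[codes] / np.sqrt(BLOCK)) @ H.T) * norms
    return blocks.reshape(blocks.shape[0], -1)
```

On Gaussian test weights the quantize/dequantize round trip lands at a relative Frobenius error of roughly 5%, in line with the distortion of an optimal 32-level Gaussian quantizer; the per-block norms are stored in FP and reconstructed exactly.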
```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```
| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🔗 CT INT4 | Qwen3.6-35B-A3B-HLWQ-CT-INT4 |
| 🔗 Base model | Qwen/Qwen3.6-35B-A3B |