⚑ Qwen3.6-35B-A3B — HLWQ Q5

Hadamard-Lloyd Weight Quantization of Qwen/Qwen3.6-35B-A3B

🔬 First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model

📊 Compression Pipeline


| Metric | Value |
|---|---|
| 🎯 Weight bits | 5 (HLWQ Q5 — Lloyd-Max + Hadamard) |
| 📦 polar_state | 21.55 GB (6 shards, 62,190 keys) |
| 🔒 Coverage | 95.8% of 35.11B params (33.62B quantized) |
| ⏱️ Quantization time | 60s (PQ5) + 65s (CT INT4) |
| 🏗️ Architecture | 40L hybrid (30 GDN + 10 Full Attention) |
| 🧩 Experts | 256/layer, 8 routed + 1 shared |
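
The shard and key counts above can be double-checked directly against the Hub repository. The snippet below is a minimal sketch; it assumes the polar_state is stored as `.safetensors` shards in this repo, and counting keys this way downloads all ~21.55 GB:

```python
from huggingface_hub import HfApi, hf_hub_download
from safetensors import safe_open

repo_id = "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5"

# List repository files and keep only the safetensors shards.
files = HfApi().list_repo_files(repo_id)
shards = [f for f in files if f.endswith(".safetensors")]
print(f"shards: {len(shards)}")  # expected: 6

# Count keys across all shards (downloads the full polar_state on first run).
total_keys = 0
for shard in shards:
    path = hf_hub_download(repo_id, filename=shard)
    with safe_open(path, framework="pt") as f:
        total_keys += len(f.keys())
print(f"keys: {total_keys}")  # expected: 62,190
```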

🏎️ Speed Benchmarks


Measured on an RTX PRO 6000 Blackwell (96 GB). The FP16 KV-cache numbers use the optimized model.generate() path; the Q3/Q2 KV-cache numbers use a manual generation loop with PolarQuantKVCache.
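
The FP16 KV-cache throughput can be reproduced with a simple timing loop around model.generate(). The sketch below is generic (the model id, prompt, and token budget are placeholders) and does not cover the PolarQuantKVCache path, which uses the manual loop from the polarquant tooling:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute whichever variant is being benchmarked.
model_id = "Qwen/Qwen3.6-35B-A3B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tok("Explain Hadamard-Lloyd weight quantization.", return_tensors="pt").to(model.device)

# Warm-up, then a timed greedy decode of a fixed token budget.
model.generate(**inputs, max_new_tokens=16, do_sample=False)
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / dt:.1f} tok/s (FP16 KV cache)")
```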

🧬 Architecture


| Component | Spec |
|---|---|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |

📋 Quantization Coverage


| Component | Count | Status |
|---|---|---|
| MoE expert slices | 20,480 | ✅ HLWQ Q5 |
| Attention projections | 130 | ✅ HLWQ Q5 |
| Shared expert MLPs | 120 | ✅ HLWQ Q5 |
| Norms, layernorms | — | ⬜ BF16 |
| MoE routers | 40 | ⬜ BF16 |
| GDN gates (in_proj_a/b) | 60 | ⬜ BF16 (critical) |
| Vision encoder | 27 layers | ⬜ BF16 |
| MTP layer | 1 | ⬜ BF16 |
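
In practice the split above comes down to a name/shape filter over the checkpoint's tensors. The snippet below is only an illustrative sketch: the substring patterns are assumptions about this model's parameter names, not the exact rules used by the polarquant tooling.

```python
# Illustrative name-based filter: decide which tensors get HLWQ Q5
# and which stay in BF16. The substrings are assumed names, not the
# exact polarquant rules for this checkpoint.
KEEP_BF16 = (
    "norm",                     # norms / layernorms
    "router",                   # MoE routers
    "in_proj_a", "in_proj_b",   # GDN gates (quality-critical)
    "visual", "vision",         # vision encoder
    "mtp",                      # multi-token-prediction layer
)

def should_quantize(name: str, tensor) -> bool:
    """Quantize 2-D weight matrices unless they match a keep-in-BF16 pattern."""
    if tensor.ndim != 2:
        return False
    return not any(pat in name for pat in KEEP_BF16)
```

Anything the filter skips is stored alongside the quantized tensors in its original BF16 form.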

🚀 Deployment

For inference, use the CT INT4 version:

👉 caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4

GPU Compatibility

Install the expert-offload vLLM fork and launch the server, adjusting --moe-expert-cache-size per the table below:

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only --enforce-eager --moe-expert-cache-size 8
```

| GPU | Expert Cache | VRAM |
|---|---|---|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |
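
Once the server is running it exposes vLLM's standard OpenAI-compatible API, so any OpenAI client can talk to it. A minimal example, assuming the default localhost:8000 endpoint and no API key:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4",
    messages=[{"role": "user", "content": "Summarize HLWQ quantization in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```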

🔬 Method

HLWQ (Hadamard-Lloyd Weight Quantization):

```
Weight Matrix W (out × in)
    │
    ▼
[1] Block reshape → (out, n_blocks, 128)
    │
    ▼
[2] Per-block L2 normalize → norms saved
    │
    ▼
[3] Walsh-Hadamard rotation: blocks @ H128 × √128
    │   (uniform information distribution)
    │
    ▼
[4] Lloyd-Max 5-bit quantization (32 centroids, N(0,1))
    │   (optimal MSE for Gaussian values)
    │
    ▼
[5] 5-bit pack: 8 codes → 5 bytes
    │
    ▼
polar_state: __packed, __norms, __meta
```
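
Read end to end, the pipeline maps onto a few dozen lines of NumPy/SciPy. The sketch below is an illustrative re-implementation from this description only, not the polarquant source: function names such as quantize_hlwq are hypothetical, and the exact scaling in step [3] and the packing layout in step [5] are assumptions consistent with the diagram.

```python
import numpy as np
from scipy.linalg import hadamard
from scipy.stats import norm

BLOCK = 128
BITS = 5
LEVELS = 1 << BITS  # 32 centroids

def lloyd_max_centroids(levels=LEVELS, iters=50):
    """Lloyd-Max codebook for a standard normal N(0,1) source."""
    c = norm.ppf((np.arange(levels) + 0.5) / levels)   # quantile initialization
    for _ in range(iters):
        b = (c[:-1] + c[1:]) / 2                        # decision boundaries
        lo = np.concatenate(([-np.inf], b))
        hi = np.concatenate((b, [np.inf]))
        mass = norm.cdf(hi) - norm.cdf(lo)
        # Conditional mean of N(0,1) on each cell: (phi(lo) - phi(hi)) / mass
        c = (norm.pdf(lo) - norm.pdf(hi)) / np.maximum(mass, 1e-12)
    return c

def quantize_hlwq(W, centroids):
    """Hypothetical HLWQ Q5 encoder: packed 5-bit codes, per-block norms, shape."""
    out, inp = W.shape
    blocks = W.reshape(out, inp // BLOCK, BLOCK)               # [1] block reshape
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    unit = blocks / np.maximum(norms, 1e-12)                   # [2] per-block L2 normalize
    H = hadamard(BLOCK) / np.sqrt(BLOCK)                       # orthonormal H128
    rotated = unit @ H * np.sqrt(BLOCK)                        # [3] rotate; entries ~ N(0,1)
    boundaries = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(boundaries, rotated).astype(np.uint8)  # [4] nearest centroid
    # [5] pack 8 x 5-bit codes into 5 bytes via plain bit expansion
    bits = np.unpackbits(codes.reshape(-1, 1), axis=1, bitorder="little")[:, :BITS]
    packed = np.packbits(bits.reshape(-1), bitorder="little")
    return packed, norms.squeeze(-1), W.shape

def dequantize_hlwq(packed, norms, shape, centroids):
    """Inverse path: unpack codes, look up centroids, undo rotation, rescale."""
    out, inp = shape
    n = out * (inp // BLOCK) * BLOCK
    bits = np.unpackbits(packed, bitorder="little")[: n * BITS].reshape(n, BITS)
    codes = np.packbits(np.pad(bits, ((0, 0), (0, 8 - BITS))), axis=1, bitorder="little").ravel()
    rotated = centroids[codes].reshape(out, inp // BLOCK, BLOCK)
    H = hadamard(BLOCK) / np.sqrt(BLOCK)
    unit = (rotated / np.sqrt(BLOCK)) @ H.T                    # undo rotation (H is orthonormal)
    return (unit * norms[..., None]).reshape(out, inp)

# Quick round-trip check on a random matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)).astype(np.float32)
cb = lloyd_max_centroids()
packed, norms, shape = quantize_hlwq(W, cb)
W_hat = dequantize_hlwq(packed, norms, shape, cb)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The Lloyd-Max codebook only needs to be computed once: every block is normalized and Hadamard-rotated toward an approximately standard-normal distribution before quantization, so a single N(0,1) codebook serves all blocks.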

📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

🔗 Links

| Resource | Link |
|---|---|
| 📄 Paper | arXiv:2603.29078 |
| 🔧 Code | GitHub |
| 📦 PyPI | pip install polarquant |
| 🚀 CT INT4 | Qwen3.6-35B-A3B-HLWQ-CT-INT4 |
| 🏠 Base model | Qwen/Qwen3.6-35B-A3B |