---
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- hlwq
- quantized
- moe
- polarengine
- qwen3.6
- hybrid-attention
- gated-deltanet
library_name: transformers
pipeline_tag: image-text-to-text
---

# ⚡ Qwen3.6-35B-A3B — HLWQ Q5

**Hadamard-Lloyd Weight Quantization** of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

> 🔬 First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model

## 📊 Compression Pipeline

![Compression Pipeline](assets/compression_pipeline.png)

| Metric | Value |
|--------|-------|
| 🎯 Weight bits | 5 (HLWQ Q5 — Lloyd-Max + Hadamard) |
| 📦 polar_state | **21.55 GB** (6 shards, 62,190 keys) |
| 🔢 Coverage | **95.8%** of 35.11B params (33.62B quantized) |
| ⏱️ Quantization time | **60 s** (PQ5) + **65 s** (CT INT4) |
| 🏗️ Architecture | 40 layers, hybrid (30 GDN + 10 full attention) |
| 🧩 Experts | 256/layer, 8 routed + 1 shared |

## 🏎️ Speed Benchmarks

![Speed Benchmark](assets/speed_benchmark.png)

*RTX PRO 6000 Blackwell (96 GB). FP16 KV uses the optimized `model.generate()`; Q3/Q2 KV use a manual generation loop with `PolarQuantKVCache`.*

## 🧬 Architecture

![Architecture](assets/architecture.png)

| Component | Spec |
|-----------|------|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |

## 📋 Quantization Coverage

![Quantization Coverage](assets/quantization_coverage.png)

| Component | Count | Status |
|-----------|-------|--------|
| MoE expert slices | 20,480 | ✅ HLWQ Q5 |
| Attention projections | 130 | ✅ HLWQ Q5 |
| Shared expert MLPs | 120 | ✅ HLWQ Q5 |
| Norms, layernorms | — | ⬜ BF16 |
| MoE routers | 40 | ⬜ BF16 |
| GDN gates (in_proj_a/b) | 60 | ⬜ BF16 (critical) |
| Vision encoder | 27 layers | ⬜ BF16 |
| MTP layer | 1 | ⬜ BF16 |

## 🚀 Deployment

**For inference, use the CT INT4 version:** 👉
[caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4](https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4)

![GPU Compatibility](assets/gpu_compatibility.png)

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only --enforce-eager --moe-expert-cache-size 8
```

| GPU | Expert Cache | VRAM |
|-----|--------------|------|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |

## 🔬 Method

HLWQ (Hadamard-Lloyd Weight Quantization):

```
Weight Matrix W (out × in)
      │
      ▼
[1] Block reshape → (out, n_blocks, 128)
      │
      ▼
[2] Per-block L2 normalize → norms saved
      │
      ▼
[3] Walsh-Hadamard rotation: blocks @ H128 × √128
      │ (uniform information distribution)
      │
      ▼
[4] Lloyd-Max 5-bit quantization (32 centroids, N(0,1))
      │ (optimal MSE for Gaussian values)
      │
      ▼
[5] 5-bit pack: 8 codes → 5 bytes
      │
      ▼
polar_state: __packed, __norms, __meta
```

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|----------|------|
| 📄 Paper | [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) |
| 🔧 Code | [GitHub](https://github.com/caiovicentino/eoq-quantization) |
| 📦 PyPI | [`pip install polarquant`](https://pypi.org/project/polarquant/) |
| 🚀 CT INT4 | [Qwen3.6-35B-A3B-HLWQ-CT-INT4](https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4) |
| 🏠 Base model | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
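## 🧪 Method Sketch

The five HLWQ stages in the Method diagram can be sketched in a few lines of NumPy. This is an illustrative toy, not the released `polarquant` implementation: it assumes a block size of 8 instead of 128, approximates the Lloyd-Max codebook with k-means on Gaussian samples, and every function name below is hypothetical.

```python
import numpy as np

BLOCK = 8  # toy block size; the actual pipeline uses 128

def hadamard(n):
    """Sylvester construction of an n x n ±1 Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def lloyd_max_centroids(levels=32, iters=30, samples=50_000, seed=0):
    """Approximate Lloyd-Max codebook for N(0,1): k-means on Gaussian samples."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(samples))
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c

def pack5(codes):
    """[5] Pack 5-bit codes (values 0..31): 8 codes -> 5 bytes."""
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 3:]
    return np.packbits(bits.reshape(-1))

def unpack5(packed):
    bits = np.unpackbits(packed).reshape(-1, 5)
    return bits @ np.array([16, 8, 4, 2, 1])

def hlwq_quantize(W, centroids):
    blocks = W.reshape(W.shape[0], -1, BLOCK)       # [1] block reshape
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    blocks = blocks / norms                         # [2] per-block L2 normalize
    rotated = blocks @ hadamard(BLOCK)              # [3] rotation: unit-norm blocks
    #   give roughly unit-variance entries after a ±1 Hadamard multiply
    codes = np.argmin(np.abs(rotated[..., None] - centroids), axis=-1)  # [4]
    return pack5(codes.reshape(-1)), norms

def hlwq_dequantize(packed, norms, centroids, shape):
    vals = centroids[unpack5(packed)].reshape(shape[0], -1, BLOCK)
    blocks = vals @ hadamard(BLOCK).T / BLOCK       # inverse rotation (H @ H.T = n*I)
    return (blocks * norms).reshape(shape)

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 32))
cb = lloyd_max_centroids()
packed, norms = hlwq_quantize(W, cb)
W_hat = hlwq_dequantize(packed, norms, cb, W.shape)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"packed: {packed.nbytes} bytes for {W.size} weights "
      f"({8 * packed.nbytes / W.size:.1f} bits/weight), rel. error {err:.3f}")
```

Step [2] is what lets a single N(0,1) codebook be shared across every block: after normalization and rotation the entries of each block are roughly standard Gaussian, so the per-block scale lives entirely in the saved `__norms` tensor.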