---
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
tags:
- hlwq
- quantized
- moe
- polarengine
- qwen3.6
- hybrid-attention
- gated-deltanet
library_name: transformers
pipeline_tag: image-text-to-text
---

# ⚡ Qwen3.6-35B-A3B — HLWQ Q5

**Hadamard-Lloyd Weight Quantization** of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

> 🔬 First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model

## 📊 Compression Pipeline

![Compression Pipeline](assets/compression_pipeline.png)

| Metric | Value |
|--------|-------|
| 🎯 Weight bits | 5 (HLWQ Q5 — Lloyd-Max + Hadamard) |
| 📦 polar_state | **21.55 GB** (6 shards, 62,190 keys) |
| 🔢 Coverage | **95.8%** of 35.11B params (33.62B quantized) |
| ⏱️ Quantization time | **60 s** (PQ5) + **65 s** (CT INT4) |
| 🏗️ Architecture | 40 layers, hybrid (30 GDN + 10 full attention) |
| 🧩 Experts | 256/layer, 8 routed + 1 shared |

## 🏎️ Speed Benchmarks

![Speed Benchmark](assets/speed_benchmark.png)

*RTX PRO 6000 Blackwell (96 GB). FP16 KV uses the optimized `model.generate()`; Q3/Q2 KV use a manual generation loop with `PolarQuantKVCache`.*

## 🧬 Architecture

![Architecture](assets/architecture.png)

| Component | Spec |
|-----------|------|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |

## 📋 Quantization Coverage

![Quantization Coverage](assets/quantization_coverage.png)

| Component | Count | Status |
|-----------|-------|--------|
| MoE expert slices | 20,480 | ✅ HLWQ Q5 |
| Attention projections | 130 | ✅ HLWQ Q5 |
| Shared expert MLPs | 120 | ✅ HLWQ Q5 |
| Norms, layernorms | — | ⬜ BF16 |
| MoE routers | 40 | ⬜ BF16 |
| GDN gates (in_proj_a/b) | 60 | ⬜ BF16 (critical) |
| Vision encoder | 27 layers | ⬜ BF16 |
| MTP layer | 1 | ⬜ BF16 |

## 🚀 Deployment

**For inference, use the CT INT4 version:** 👉
[caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4](https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4)

![GPU Compatibility](assets/gpu_compatibility.png)

```bash
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
  --language-model-only --enforce-eager --moe-expert-cache-size 8
```

| GPU | Expert Cache | VRAM |
|-----|--------------|------|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |

## 🔬 Method

HLWQ (Hadamard-Lloyd Weight Quantization):

```
Weight Matrix W (out × in)
      │
      ▼
[1] Block reshape → (out, n_blocks, 128)
      │
      ▼
[2] Per-block L2 normalize → norms saved
      │
      ▼
[3] Walsh-Hadamard rotation: blocks @ H128 × √128
      │ (uniform information distribution)
      │
      ▼
[4] Lloyd-Max 5-bit quantization (32 centroids, N(0,1))
      │ (optimal MSE for Gaussian values)
      │
      ▼
[5] 5-bit pack: 8 codes → 5 bytes
      │
      ▼
polar_state: __packed, __norms, __meta
```

## 📖 Citation

```bibtex
@misc{hlwq2026,
  title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
  author={Caio Vicentino},
  year={2026},
  url={https://arxiv.org/abs/2603.29078}
}
```

## 🔗 Links

| Resource | Link |
|----------|------|
| 📄 Paper | [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) |
| 🔧 Code | [GitHub](https://github.com/caiovicentino/eoq-quantization) |
| 📦 PyPI | [`pip install polarquant`](https://pypi.org/project/polarquant/) |
| 🚀 CT INT4 | [Qwen3.6-35B-A3B-HLWQ-CT-INT4](https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4) |
| 🏠 Base model | [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) |
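## 🧪 Method Sketch

The five HLWQ stages in the Method diagram can be sketched in a few lines of NumPy. This is an illustrative toy, not the released `polarquant` implementation: it assumes a block size of 8 instead of 128, approximates the Lloyd-Max codebook with k-means on Gaussian samples, and every function name below is hypothetical.

```python
import numpy as np

BLOCK = 8  # toy block size; the actual pipeline uses 128

def hadamard(n):
    """Sylvester construction of an n x n ±1 Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def lloyd_max_centroids(levels=32, iters=30, samples=50_000, seed=0):
    """Approximate Lloyd-Max codebook for N(0,1): k-means on Gaussian samples."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(samples))
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c

def pack5(codes):
    """[5] Pack 5-bit codes (values 0..31): 8 codes -> 5 bytes."""
    bits = np.unpackbits(codes.astype(np.uint8)[:, None], axis=1)[:, 3:]
    return np.packbits(bits.reshape(-1))

def unpack5(packed):
    bits = np.unpackbits(packed).reshape(-1, 5)
    return bits @ np.array([16, 8, 4, 2, 1])

def hlwq_quantize(W, centroids):
    blocks = W.reshape(W.shape[0], -1, BLOCK)       # [1] block reshape
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    blocks = blocks / norms                         # [2] per-block L2 normalize
    rotated = blocks @ hadamard(BLOCK)              # [3] rotation: unit-norm blocks
    #   give roughly unit-variance entries after a ±1 Hadamard multiply
    codes = np.argmin(np.abs(rotated[..., None] - centroids), axis=-1)  # [4]
    return pack5(codes.reshape(-1)), norms

def hlwq_dequantize(packed, norms, centroids, shape):
    vals = centroids[unpack5(packed)].reshape(shape[0], -1, BLOCK)
    blocks = vals @ hadamard(BLOCK).T / BLOCK       # inverse rotation (H @ H.T = n*I)
    return (blocks * norms).reshape(shape)

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 32))
cb = lloyd_max_centroids()
packed, norms = hlwq_quantize(W, cb)
W_hat = hlwq_dequantize(packed, norms, cb, W.shape)
err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"packed: {packed.nbytes} bytes for {W.size} weights "
      f"({8 * packed.nbytes / W.size:.1f} bits/weight), rel. error {err:.3f}")
```

Step [2] is what lets a single N(0,1) codebook be shared across every block: after normalization and rotation the entries of each block are roughly standard Gaussian, so the per-block scale lives entirely in the saved `__norms` tensor.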