HLWQ rebrand: title, tags, notice, self-links
README.md
---
license: apache-2.0
tags:
- hlwq
- gemma4
- claude-opus
- distill
---

> [!IMPORTANT]
> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
>
> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
>
> Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
>
> Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).

# 🧠 Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision

**Claude Opus distilled Gemma 4 31B + Vision** on consumer GPUs.
Download: **21.8 GB** (vs 62.5 GB BF16 → 2.9x compression)

| Component | Method | Result |
|---|---|---|
| **Text weights** | HLWQ Q5 + torchao INT4 | 21.8 GB |
| **Vision encoder** | BF16 (full quality) | included |
| **KV Cache** | HLWQ Q3 (5.3x) | longer context |
| **Reasoning** | Claude Opus 4.6 distilled | high-effort |
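The headline figures above can be sanity-checked with one-line arithmetic. The 62.5 GB BF16 baseline and 21.8 GB download come from this card; everything else is derived, so treat this as a back-of-envelope check, not a measurement:

```python
# Sanity-check the compression headline: 62.5 GB BF16 -> 21.8 GB download.
bf16_gb = 62.5
quant_gb = 21.8

compression = bf16_gb / quant_gb        # overall compression ratio
avg_bits = 16 * quant_gb / bf16_gb      # effective bits per parameter

print(f"compression: {compression:.1f}x")          # ~2.9x, matching the card
print(f"average width: {avg_bits:.1f} bits/param") # ~5.6 (Q5 text weights + BF16 vision)
```

The ~5.6 effective bits per parameter is consistent with mostly-Q5 text weights plus the unquantized BF16 vision encoder.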
## 🎯 Key Results
| Method | Bits | Compression | Max Context (4GB) |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| HLWQ Q4 | 4 | 4.0x | 17K |
| **HLWQ Q3** | **3** | **5.3x** | **22K** |
| HLWQ Q2 | 2 | 8.0x | 35K |
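The compression column is simply the 16-bit baseline divided by the quantized bit-width, and max context scales roughly in proportion. A quick idealized sketch (it ignores per-group scale overhead, so it slightly undershoots the measured context figures in the table):

```python
# Idealized KV cache arithmetic; the table's context figures come from
# actual measurement and differ slightly from this back-of-envelope math.
def kv_compression(nbits: int, baseline_bits: int = 16) -> float:
    """Compression ratio of an nbits KV cache vs. a 16-bit cache."""
    return baseline_bits / nbits

def approx_max_context(nbits: int, fp16_context_k: int = 4) -> float:
    """Approximate max context (in K tokens) given 4K fits at FP16."""
    return fp16_context_k * kv_compression(nbits)

for nbits in (4, 3, 2):
    print(f"Q{nbits}: {kv_compression(nbits):.1f}x -> ~{approx_max_context(nbits):.0f}K tokens")
```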
## 🔧 Technical Details
```bibtex
@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```
### Load & Generate (1 line!)
```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

### With KV Cache Compression (5.3x more context)
```python
model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory → fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
### Benchmark
```bash
polarquant bench caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --ppl --chart
```

### Gradio Demo
```bash
polarquant demo caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --share
```
## 📦 Method: HLWQ

**Hadamard Rotation + Lloyd-Max Optimal Centroids**

Unlike GGUF (uniform quantization), HLWQ places quantization levels where weight density is highest: the Hadamard rotation makes the weight distribution approximately Gaussian, and Lloyd-Max centroids are then the minimum-MSE quantization levels for that distribution.
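The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's implementation: `hadamard`, `lloyd_max`, and the 1024-sample Gaussian "weight" vector below are stand-ins chosen for clarity.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, nbits: int, iters: int = 50) -> np.ndarray:
    """Fit 2**nbits scalar centroids minimizing MSE (1-D Lloyd iteration)."""
    levels = np.quantile(x, np.linspace(0.01, 0.99, 2 ** nbits))  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()  # move centroid to cell mean
    return levels

def quantize(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Snap each value to its nearest centroid."""
    return levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)      # toy stand-in for one weight column
H = hadamard(1024)

w_rot = H @ w                      # rotate into ~Gaussian coordinates
levels = lloyd_max(w_rot, nbits=5) # 32 MSE-optimal levels for this data
w_hat = H.T @ quantize(w_rot, levels)  # quantize, rotate back

cos = w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat))
print(f"cos_sim after 5-bit Lloyd-Max round-trip: {cos:.4f}")
```

Because the rotation is orthonormal, quantization error introduced in the rotated basis maps back to the original weights without amplification.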
```
HLWQ Q5 (cos_sim > 0.996)  >  GGUF Q5_K_M (~0.99) at same size
```
## 🔗 Links

- 📄 [Paper → arXiv:2603.29078](https://arxiv.org/abs/2603.29078)
- 💻 [GitHub → HLWQ-Engine](https://github.com/caiovicentino/polarengine-vllm)
- 📦 [PyPI → `pip install polarquant`](https://pypi.org/project/polarquant/)