HLWQ rebrand: title, tags, notice, self-links
README.md
---
license: apache-2.0
tags:
- hlwq
- gemma4
- claude-opus
- distill
---

> [!IMPORTANT]
> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
>
> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
>
> Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
>
> Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).

# 🧠 Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision

**Claude Opus distilled Gemma 4 31B + Vision** on consumer GPUs.
Download: **21.8 GB** (vs 62.5 GB BF16 → 2.9x compression)

| Component | Method | Result |
|---|---|---|
| **Text weights** | HLWQ Q5 + torchao INT4 | 21.8 GB |
| **Vision encoder** | BF16 (full quality) | included |
| **KV Cache** | HLWQ Q3 (5.3x) | longer context |
| **Reasoning** | Claude Opus 4.6 distilled | high-effort |
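The headline figures above can be sanity-checked with one-line arithmetic. The 62.5 GB BF16 baseline and 21.8 GB download come from this card; everything else is derived, so treat this as a back-of-envelope check, not a measurement:

```python
# Sanity-check the compression headline: 62.5 GB BF16 -> 21.8 GB download.
bf16_gb = 62.5
quant_gb = 21.8

compression = bf16_gb / quant_gb        # overall compression ratio
avg_bits = 16 * quant_gb / bf16_gb      # effective bits per parameter

print(f"compression: {compression:.1f}x")          # ~2.9x, matching the card
print(f"average width: {avg_bits:.1f} bits/param") # ~5.6 (Q5 text weights + BF16 vision)
```

The ~5.6 effective bits per parameter is consistent with mostly-Q5 text weights plus the unquantized BF16 vision encoder.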
## 🎯 Key Results
| Method | Bits | Compression | Max Context (4GB) |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| HLWQ Q4 | 4 | 4.0x | 17K |
| **HLWQ Q3** | **3** | **5.3x** | **22K** |
| HLWQ Q2 | 2 | 8.0x | 35K |
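The compression column is simply the 16-bit baseline divided by the quantized bit-width, and max context scales roughly in proportion. A quick idealized sketch (it ignores per-group scale overhead, so it slightly undershoots the measured context figures in the table):

```python
# Idealized KV cache arithmetic; the table's context figures come from
# actual measurement and differ slightly from this back-of-envelope math.
def kv_compression(nbits: int, baseline_bits: int = 16) -> float:
    """Compression ratio of an nbits KV cache vs. a 16-bit cache."""
    return baseline_bits / nbits

def approx_max_context(nbits: int, fp16_context_k: int = 4) -> float:
    """Approximate max context (in K tokens) given 4K fits at FP16."""
    return fp16_context_k * kv_compression(nbits)

for nbits in (4, 3, 2):
    print(f"Q{nbits}: {kv_compression(nbits):.1f}x -> ~{approx_max_context(nbits):.0f}K tokens")
```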
## 🔧 Technical Details
```bibtex
@article{polarquant2025,
  title={HLWQ: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```
### Load & Generate (1 line!)
```python
from polarengine_vllm import HLWQModel

model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision")
print(model.generate("Hello, how are you?", max_new_tokens=100))
```

### With KV Cache Compression (5.3x more context)
```python
model = HLWQModel.from_pretrained("caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision", kv_cache_nbits=3)
# KV cache now uses 5.3x less memory → fit longer conversations!
print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
```
### Benchmark
```bash
polarquant bench caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --ppl --chart
```

### Gradio Demo
```bash
polarquant demo caiovicentino1/Gemma-4-31B-Claude-Opus-HLWQ-Q5-Vision --share
```
## 📦 Method: HLWQ

**Hadamard Rotation + Lloyd-Max Optimal Centroids**

Unlike GGUF (uniform quantization), HLWQ places quantization levels where weight density is highest: the Hadamard rotation makes the weight distribution approximately Gaussian, and Lloyd-Max centroids are then the minimum-MSE quantization levels for that distribution.
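The rotate-then-quantize idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the library's implementation: `hadamard`, `lloyd_max`, and the 1024-sample Gaussian "weight" vector below are stand-ins chosen for clarity.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix via Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, nbits: int, iters: int = 50) -> np.ndarray:
    """Fit 2**nbits scalar centroids minimizing MSE (1-D Lloyd iteration)."""
    levels = np.quantile(x, np.linspace(0.01, 0.99, 2 ** nbits))  # quantile init
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = x[idx == k].mean()  # move centroid to cell mean
    return levels

def quantize(x: np.ndarray, levels: np.ndarray) -> np.ndarray:
    """Snap each value to its nearest centroid."""
    return levels[np.abs(x[:, None] - levels[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)      # toy stand-in for one weight column
H = hadamard(1024)

w_rot = H @ w                      # rotate into ~Gaussian coordinates
levels = lloyd_max(w_rot, nbits=5) # 32 MSE-optimal levels for this data
w_hat = H.T @ quantize(w_rot, levels)  # quantize, rotate back

cos = w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat))
print(f"cos_sim after 5-bit Lloyd-Max round-trip: {cos:.4f}")
```

Because the rotation is orthonormal, quantization error introduced in the rotated basis maps back to the original weights without amplification.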
```
HLWQ Q5 (cos_sim > 0.996)  >  GGUF Q5_K_M (~0.99) at same size
```
## 🔗 Links

- 📄 [Paper → arXiv:2603.29078](https://arxiv.org/abs/2603.29078)
- 💻 [GitHub → HLWQ-Engine](https://github.com/caiovicentino/polarengine-vllm)
- 📦 [PyPI → `pip install polarquant`](https://pypi.org/project/polarquant/)