HLWQ rebrand: title, tags, notice, self-links
README.md CHANGED
````diff
@@ -8,7 +8,6 @@ language:
 - ja
 tags:
 - hlwq
-- polarquant
 - quantized
 - compressed-tensors
 - int4
````
````diff
@@ -20,25 +19,25 @@ library_name: transformers
 ---
 
 > [!IMPORTANT]
-> **Naming notice (2026-04-10).** The "
+> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
 >
-> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named
+> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
 >
 > Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
 >
 > Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).
 
-# Qwen3.5-9B-Claude-Opus –
+# Qwen3.5-9B-Claude-Opus – HLWQ INT4
 
 **Native vLLM. Marlin kernel. Zero plugin.**
 
-
+HLWQ Q5 preprocessing produces **better INT4 weights** than direct quantization – stored in CompressedTensors format for native vLLM inference.
 
 ## Quick Start – vLLM (one command)
 
 ```bash
 pip install vllm
-vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-
+vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5 --language-model-only --enforce-eager
 ```
 
 That's it. No plugin, no `pip install polarquant`, no custom code.
````
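Once the quick-start server is up, vLLM exposes an OpenAI-compatible HTTP API, by default on port 8000. A minimal smoke test using only the standard library; the port is vLLM's default and the request body follows the OpenAI completions schema (a sketch, not part of this repository):

```python
import json
import urllib.request

# Request body in the OpenAI completions format that `vllm serve` accepts.
payload = {
    "model": "caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5",
    "prompt": "Hello!",
    "max_tokens": 32,
    "temperature": 0.0,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # vLLM's default bind address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:  # only reachable while `vllm serve` is running
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp)["choices"][0]["text"])
except OSError as err:
    print(f"server not reachable: {err}")
```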
````diff
@@ -60,8 +59,8 @@ pip install polarquant
 import polarengine_vllm # auto-registers with transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-
-tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-
+model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", device_map="auto", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", trust_remote_code=True)
 
 inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
 out = model.generate(**inputs, max_new_tokens=100)
````
````diff
@@ -79,11 +78,11 @@ print(tokenizer.decode(out[0], skip_special_tokens=True))
 | RTX 4090 | 24 GB | YES | ~40 |
 | A100 | 80 GB | YES | ~168 |
 
-## Why
+## Why HLWQ INT4 is Better
 
 Standard INT4 (GPTQ/AWQ) quantizes weights directly – outliers cause errors.
 
-PolarQuant adds a **preprocessing step**:
+HLWQ adds a **preprocessing step**:
 
 1. **Hadamard rotation** – distributes weight energy uniformly (eliminates outliers)
 2. **Lloyd-Max Q5** – MSE-optimal quantization for the resulting Gaussian distribution
````
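The two preprocessing steps described in the diff above can be illustrated end-to-end on a toy weight matrix with an injected outlier row. The Sylvester Hadamard construction, the plain Lloyd iteration, and the per-tensor absmax INT4 baseline below are minimal stand-ins for illustration, not the repository's actual quantizer:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x, levels=16, iters=30):
    """Fit an MSE-optimal scalar codebook via Lloyd's algorithm (1-D k-means)."""
    code = np.quantile(x, np.linspace(0.005, 0.995, levels))  # quantile init
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - code[None, :]), axis=1)  # assign
        for k in range(levels):                                      # centroid update
            if np.any(idx == k):
                code[k] = x[idx == k].mean()
    return code

def apply_codebook(x, code):
    """Replace each value by its nearest codeword."""
    return code[np.argmin(np.abs(x[..., None] - code), axis=-1)]

rng = np.random.default_rng(0)
n = 256
W = rng.normal(size=(n, n))
W[0] *= 40.0                                  # one pathological outlier row

# Baseline: direct per-tensor absmax round-to-nearest INT4 (16 symmetric levels).
scale = np.abs(W).max() / 7.0
W_int4 = np.clip(np.round(W / scale), -8, 7) * scale

# HLWQ-style: 1) rotate to spread outlier energy, 2) Lloyd-Max codebook, rotate back.
H = hadamard(n)
R = H @ W                                     # rotated weights are near-Gaussian
code = lloyd_max(R.ravel(), levels=16)
W_hlwq = H.T @ apply_codebook(R, code)        # H is orthogonal, so H.T == inv(H)

print("direct INT4 MSE: ", float(np.mean((W - W_int4) ** 2)))
print("rotate+Lloyd MSE:", float(np.mean((W - W_hlwq) ** 2)))
```

Because the rotation is orthogonal, the reconstruction error in the weight domain equals the quantization error in the rotated domain, which is what makes quantizing after the rotation worthwhile.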
````diff
@@ -92,7 +91,7 @@ PolarQuant adds a **preprocessing step**:
 | Method | PPL (lower = better) |
 |--------|---------------------|
 | BF16 baseline | 6.37 |
-| **
+| **HLWQ – INT4** | **6.56** |
 | Direct INT4 | 6.68 |
 
 **Same speed as GPTQ/AWQ, better quality.**
````
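For reading the table: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better and small differences compound over every generated token. As a reference formula in code (the NLL values below are placeholders, not measured data from this model):

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (NLLs in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A token stream whose mean NLL is ln(6.37) has perplexity 6.37,
# matching how the BF16 baseline row of the table is computed.
print(round(perplexity([math.log(6.37)] * 4), 2))
```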