caiovicentino1 committed
Commit 1651a28 · verified · 1 parent: 8ee301d

HLWQ rebrand: title, tags, notice, self-links

Files changed (1)
  1. README.md +10 -11
README.md CHANGED
@@ -8,7 +8,6 @@ language:
  - ja
  tags:
  - hlwq
- - polarquant
  - quantized
  - compressed-tensors
  - int4
@@ -20,25 +19,25 @@ library_name: transformers
  ---

  > [!IMPORTANT]
  > **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
  >
  > The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
  >
  > Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
  >
  > Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).

- # Qwen3.5-9B-Claude-Opus — PolarQuant INT4
+ # Qwen3.5-9B-Claude-Opus — HLWQ INT4

  **Native vLLM. Marlin kernel. Zero plugin.**

- PolarQuant Q5 preprocessing produces **better INT4 weights** than direct quantization — stored in CompressedTensors format for native vLLM inference.
+ HLWQ Q5 preprocessing produces **better INT4 weights** than direct quantization — stored in CompressedTensors format for native vLLM inference.

  ## Quick Start — vLLM (one command)

  ```bash
  pip install vllm
- vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-PolarQuant-Q5 --language-model-only --enforce-eager
+ vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5 --language-model-only --enforce-eager
  ```

  That's it. No plugin, no `pip install polarquant`, no custom code.
@@ -60,8 +59,8 @@ pip install polarquant
  import polarengine_vllm  # auto-registers with transformers
  from transformers import AutoModelForCausalLM, AutoTokenizer

- model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-PolarQuant-Q5", device_map="auto", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-PolarQuant-Q5", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", device_map="auto", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", trust_remote_code=True)

  inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
  out = model.generate(**inputs, max_new_tokens=100)
@@ -79,11 +78,11 @@ print(tokenizer.decode(out[0], skip_special_tokens=True))
  | RTX 4090 | 24 GB | YES | ~40 |
  | A100 | 80 GB | YES | ~168 |

- ## Why PolarQuant INT4 is Better
+ ## Why HLWQ INT4 is Better

  Standard INT4 (GPTQ/AWQ) quantizes weights directly — outliers cause errors.

- PolarQuant adds a **preprocessing step**:
+ HLWQ adds a **preprocessing step**:

  1. **Hadamard rotation** — distributes weight energy uniformly (eliminates outliers)
  2. **Lloyd-Max Q5** — MSE-optimal quantization for the resulting Gaussian distribution
@@ -92,7 +91,7 @@ PolarQuant adds a **preprocessing step**:
  | Method | PPL (lower = better) |
  |--------|---------------------|
  | BF16 baseline | 6.37 |
- | **PolarQuant → INT4** | **6.56** |
+ | **HLWQ → INT4** | **6.56** |
  | Direct INT4 | 6.68 |

  **Same speed as GPTQ/AWQ, better quality.**
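The README's two-step HLWQ preprocessing (Hadamard rotation, then Lloyd-Max codebook) can be sketched in NumPy. This is an illustrative toy, not the repository's actual quantization code: the names `hadamard` and `lloyd_max` are invented for the example, "Q5" is taken to mean a 32-level (5-bit) codebook, and the weight matrix is random with one planted outlier.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix of size n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so the rotation is exactly invertible

def lloyd_max(x, n_levels=32, iters=25):
    """1-D Lloyd-Max quantizer: alternate nearest-level assignment and centroid
    update (k-means on scalars), converging to an MSE-optimal codebook."""
    codebook = np.quantile(x, np.linspace(0.005, 0.995, n_levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = x[idx == k]
            if members.size:
                codebook[k] = members.mean()
    idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[0, 0] = 25.0  # outlier that would dominate a direct scalar quantizer

H = hadamard(64)
W_rot = H @ W                                  # step 1: energy spread uniformly
codebook, idx = lloyd_max(W_rot.ravel())       # step 2: 32-level (Q5) codebook
W_hat = H.T @ codebook[idx].reshape(W.shape)   # dequantize, undo the rotation

rel_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because the rotation is orthonormal, quantization error in the rotated domain maps back with the same norm; the gain comes from the rotated weights being near-Gaussian, which is exactly the distribution Lloyd-Max handles optimally.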