HLWQ rebrand: title, tags, notice, self-links
README.md CHANGED
````diff
@@ -8,7 +8,6 @@ language:
 - ja
 tags:
 - hlwq
-- polarquant
 - quantized
 - compressed-tensors
 - int4
````
````diff
@@ -20,25 +19,25 @@ library_name: transformers
 ---
 
 > [!IMPORTANT]
-> **Naming notice (2026-04-10).** The "
+> **Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to **HLWQ (Hadamard-Lloyd Weight Quantization)**. The change is only the name; the algorithm and the weights in this repository are unchanged.
 >
-> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named
+> The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant ([Han et al., arXiv:2502.02617, 2025](https://arxiv.org/abs/2502.02617)). HLWQ addresses **weight** quantization with a **deterministic Walsh-Hadamard rotation** and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses **KV cache** quantization with a **random polar rotation**. The two methods are technically distinct.
 >
 > Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
 >
 > Reference paper for this technique: [arXiv:2603.29078](https://arxiv.org/abs/2603.29078) (v2 in preparation; v1 still uses the old name).
 
-# Qwen3.5-9B-Claude-Opus –
+# Qwen3.5-9B-Claude-Opus – HLWQ INT4
 
 **Native vLLM. Marlin kernel. Zero plugin.**
 
-
+HLWQ Q5 preprocessing produces **better INT4 weights** than direct quantization – stored in CompressedTensors format for native vLLM inference.
 
 ## Quick Start – vLLM (one command)
 
 ```bash
 pip install vllm
-vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-
+vllm serve caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5 --language-model-only --enforce-eager
 ```
 
 That's it. No plugin, no `pip install polarquant`, no custom code.
````
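Once the quick-start server is up, vLLM exposes an OpenAI-compatible HTTP API, by default on port 8000. A minimal smoke test using only the standard library; the port is vLLM's default and the request body follows the OpenAI completions schema (a sketch, not part of this repository):

```python
import json
import urllib.request

# Request body in the OpenAI completions format that `vllm serve` accepts.
payload = {
    "model": "caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5",
    "prompt": "Hello!",
    "max_tokens": 32,
    "temperature": 0.0,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # vLLM's default bind address
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:  # only reachable while `vllm serve` is running
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp)["choices"][0]["text"])
except OSError as err:
    print(f"server not reachable: {err}")
```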
````diff
@@ -60,8 +59,8 @@ pip install polarquant
 import polarengine_vllm # auto-registers with transformers
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-
-tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-
+model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", device_map="auto", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-Claude-Opus-HLWQ-Q5", trust_remote_code=True)
 
 inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
 out = model.generate(**inputs, max_new_tokens=100)
````
````diff
@@ -79,11 +78,11 @@ print(tokenizer.decode(out[0], skip_special_tokens=True))
 | RTX 4090 | 24 GB | YES | ~40 |
 | A100 | 80 GB | YES | ~168 |
 
-## Why
+## Why HLWQ INT4 is Better
 
 Standard INT4 (GPTQ/AWQ) quantizes weights directly – outliers cause errors.
 
-PolarQuant adds a **preprocessing step**:
+HLWQ adds a **preprocessing step**:
 
 1. **Hadamard rotation** – distributes weight energy uniformly (eliminates outliers)
 2. **Lloyd-Max Q5** – MSE-optimal quantization for the resulting Gaussian distribution
````
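The two preprocessing steps described in the diff above can be illustrated end-to-end on a toy weight matrix with an injected outlier row. The Sylvester Hadamard construction, the plain Lloyd iteration, and the per-tensor absmax INT4 baseline below are minimal stand-ins for illustration, not the repository's actual quantizer:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x, levels=16, iters=30):
    """Fit an MSE-optimal scalar codebook via Lloyd's algorithm (1-D k-means)."""
    code = np.quantile(x, np.linspace(0.005, 0.995, levels))  # quantile init
    for _ in range(iters):
        idx = np.argmin(np.abs(x[:, None] - code[None, :]), axis=1)  # assign
        for k in range(levels):                                      # centroid update
            if np.any(idx == k):
                code[k] = x[idx == k].mean()
    return code

def apply_codebook(x, code):
    """Replace each value by its nearest codeword."""
    return code[np.argmin(np.abs(x[..., None] - code), axis=-1)]

rng = np.random.default_rng(0)
n = 256
W = rng.normal(size=(n, n))
W[0] *= 40.0                                  # one pathological outlier row

# Baseline: direct per-tensor absmax round-to-nearest INT4 (16 symmetric levels).
scale = np.abs(W).max() / 7.0
W_int4 = np.clip(np.round(W / scale), -8, 7) * scale

# HLWQ-style: 1) rotate to spread outlier energy, 2) Lloyd-Max codebook, rotate back.
H = hadamard(n)
R = H @ W                                     # rotated weights are near-Gaussian
code = lloyd_max(R.ravel(), levels=16)
W_hlwq = H.T @ apply_codebook(R, code)        # H is orthogonal, so H.T == inv(H)

print("direct INT4 MSE: ", float(np.mean((W - W_int4) ** 2)))
print("rotate+Lloyd MSE:", float(np.mean((W - W_hlwq) ** 2)))
```

Because the rotation is orthogonal, the reconstruction error in the weight domain equals the quantization error in the rotated domain, which is what makes quantizing after the rotation worthwhile.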
````diff
@@ -92,7 +91,7 @@ PolarQuant adds a **preprocessing step**:
 | Method | PPL (lower = better) |
 |--------|---------------------|
 | BF16 baseline | 6.37 |
-| **
+| **HLWQ – INT4** | **6.56** |
 | Direct INT4 | 6.68 |
 
 **Same speed as GPTQ/AWQ, better quality.**
````
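For reading the table: perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better and small differences compound over every generated token. As a reference formula in code (the NLL values below are placeholders, not measured data from this model):

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (NLLs in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A token stream whose mean NLL is ln(6.37) has perplexity 6.37,
# matching how the BF16 baseline row of the table is computed.
print(round(perplexity([math.log(6.37)] * 4), 2))
```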