---
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
  - hlwq
  - quantized
  - gptq
  - int4
  - polarquant
  - vllm
  - marlin
pipeline_tag: text-generation
model-index:
  - name: Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ
    results:
      - task:
          type: text-generation
          name: Code Generation
        dataset:
          name: HumanEval
          type: openai_humaneval
        metrics:
          - name: pass@1 (thinking)
            type: pass@1
            value: 78.66
          - name: pass@1 (standard)
            type: pass@1
            value: 55.49
---

**Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name is changing; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
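
For intuition about the HLWQ side of that distinction, here is a minimal toy sketch (our own illustration under stated assumptions, not the shipped quantizer; all function names are ours): a deterministic, orthonormal Walsh-Hadamard rotation spreads weight outliers, then a Lloyd-Max scalar codebook (1-D k-means) is fitted to the rotated weights.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction of an orthonormal n x n Walsh-Hadamard matrix (n a power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_1d(x: np.ndarray, levels: int = 16, iters: int = 50) -> np.ndarray:
    # Lloyd-Max: alternate nearest-level assignment and centroid update (1-D k-means).
    c = np.quantile(x, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c[idx]  # each value snapped to its nearest codebook level

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
H = hadamard(64)
W_rot = H @ W                                             # deterministic rotation
W_q = lloyd_max_1d(W_rot.ravel()).reshape(W_rot.shape)    # 16 levels = 4 bits
W_hat = H.T @ W_q                                         # rotate back after dequantization
err = np.abs(W - W_hat).mean()
```

Because the rotation is orthonormal, quantization error introduced in the rotated space maps back isometrically, which is what makes the rotate-quantize-unrotate round trip well behaved.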

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

# 🧊 Qwopus3.5-27B-v3 PolarQuant v7 GPTQ

**🎯 27B Reasoning Model in 19 GB**

| Metric | INT4 (ours) | BF16 (base) | Delta |
|---|---|---|---|
| 🎯 HumanEval (thinking) | 78.66% | 97.56% (Jackrong) | -18.9 pp |
| 🎯 HumanEval (standard) | 55.49% | not measured | n/a |
| 📦 Download | 19.2 GB | 54.7 GB | -65% |
| ⚡ BPW | 4.475 | 16 | 3.6x smaller |
| 🚀 Kernel | Marlin | n/a | Native vLLM |
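
The deltas above can be reproduced from the raw numbers in the table (a quick sanity check, using only figures already reported here):

```python
bf16_gb, int4_gb = 54.7, 19.2      # download sizes from the table
bf16_bpw, int4_bpw = 16.0, 4.475   # bits per weight

download_cut = 1.0 - int4_gb / bf16_gb   # ~0.65, the "-65%" row
bpw_ratio = bf16_bpw / int4_bpw          # ~3.58x, rounded to "3.6x smaller"
thinking_gap = 97.56 - 78.66             # 18.90 percentage points
print(f"{download_cut:.0%}, {bpw_ratio:.2f}x, {thinking_gap:.2f} pp")
```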

**Note on the thinking-mode gap:** the 18.9 pp delta from BF16 (97.56% → 78.66%) is a real quality impact of INT4 quantization on chain-of-thought code generation at 27B scale. Users who need maximum thinking-mode quality should consider the BF16 base model from Jackrong.

## 📊 Benchmarks

*(Figures: HumanEval Thinking Mode; 9B vs 27B Comparison; Quality vs Size)*

## 🔬 About This Model

This is the 27B reasoning model from the Qwopus3.5 series, quantized with our proven PolarQuant v7 config. The model uses `<think>` tags for chain-of-thought reasoning before answering.
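
Downstream code typically strips the chain-of-thought block before using the answer. A minimal sketch (the exact tag format comes from the base model's chat template; this regex and helper name are our assumptions):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    # Separate the <think>...</think> block from the final answer.
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    thought = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return thought, answer

thought, answer = split_thinking(
    "<think>Sorting needs a comparator.</think>\ndef sort_list(xs): return sorted(xs)"
)
```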

It uses the same GPTQ gs64 + FOEM config as our 9B v7 release, which narrowly edged out its BF16 baseline on HumanEval standard mode (67.07% vs 66.87%). At 27B thinking-mode scale, however, there is a measurable gap from the BF16 baseline (see results below).

## 🚀 Quick Start

### vLLM (recommended)

```python
from vllm import LLM, SamplingParams

model = LLM(
    "caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ",
    trust_remote_code=True,
    language_model_only=True,
    gpu_memory_utilization=0.75,
)

output = model.generate(
    ["Write a Python function to sort a list:"],
    SamplingParams(max_tokens=4096, temperature=0.0),
)
print(output[0].outputs[0].text)
```

### vLLM Server

```bash
vllm serve caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ \
    --trust-remote-code --language-model-only \
    --gpu-memory-utilization 0.75 --max-model-len 16384
```

## 🔧 Quantization Config

```python
# Same config as our 9B v7: bits=4, gs=64, FOEM
from gptqmodel import GPTQModel
from gptqmodel.quantization import QuantizeConfig
from gptqmodel.quantization.config import FOEMConfig

quantize_config = QuantizeConfig(
    bits=4,
    group_size=64,
    sym=True,
    desc_act=True,
    foem=FOEMConfig(alpha=0.25, beta=0.2, device="auto"),
)
```
- Quantizer: GPTQModel v6.0.3
- Calibration: 512 samples from neuralmagic/LLM_compression_calibration
- Kernel: Marlin (native vLLM, zero overhead)
- Time: 33 min on RTX PRO 6000 Blackwell (102 GB)
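
For intuition, the `bits=4` / `group_size=64` / `sym=True` setting quantizes each run of 64 weights with one shared scale and signed 4-bit levels in [-8, 7]. The following is a simplified round-to-nearest sketch of that grouping scheme only (GPTQ itself does error-compensated column-by-column rounding on top of this, and FOEM refines it further):

```python
import numpy as np

def quant_sym_int4_grouped(w: np.ndarray, group_size: int = 64):
    # w: 1-D weight row whose length is a multiple of group_size.
    g = w.reshape(-1, group_size)
    # Symmetric quantization: one scale per group, mapping max|w| to level 7.
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quant_sym_int4_grouped(w)
w_hat = dequant(q, scale)
```

Smaller groups mean more scales (more bits-per-weight overhead) but tighter per-group error, which is the trade-off the gs64 setting balances.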

## 📈 HumanEval Results

| Mode | Score | Method |
|---|---|---|
| Thinking (chat template) | 78.66% | 129/164, automated `exec()` |
| Standard (lm-eval) | 55.49% | `lm_eval --tasks humaneval` |
| BF16 Thinking (Jackrong) | 97.56% | Reported by base model author |
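
The "automated `exec()`" checking mentioned above works roughly like this (a toy sketch of the harness idea with a made-up problem; the real harness runs the 164 HumanEval problems, ideally in a sandboxed process):

```python
def passes(candidate_src: str, test_src: str) -> bool:
    # Execute the model's completion, then the problem's unit tests, in one namespace.
    ns: dict = {}
    try:
        exec(candidate_src, ns)
        exec(test_src, ns)
        return True
    except Exception:
        return False

# Toy stand-in for one task: a correct and an incorrect generated solution.
candidate = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
n_passed = sum(passes(c, tests) for c in [candidate, wrong])  # 1 of 2 passes
```

A score like 129/164 is then simply `n_passed / n_problems` at one sample per problem (pass@1, greedy decoding).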

**About the 18.9 pp gap from BF16 thinking**

The drop from 97.56% to 78.66% reflects a real quality cost of INT4 quantization on chain-of-thought code generation at 27B scale. Thinking mode is more sensitive to weight precision than single-shot generation because small numerical errors can compound across reasoning steps.

Part of the gap may also come from differences in evaluation harness (we use automated exec()-based checking), but we have not isolated this component through direct comparison.

**Recommendation:** if you need maximum thinking-mode quality, use the BF16 base model. If you need fast, low-VRAM vLLM serving and can tolerate this gap, v7-GPTQ is the right pick.

## 📖 Technical Details

| Parameter | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-27B-v3 |
| Architecture | Qwen3.5 (48 linear_attn + 16 full_attn) |
| Hidden Size | 5120 |
| Layers | 64 |
| Bits | 4 |
| Group Size | 64 |
| FOEM | alpha=0.25, beta=0.2 |
| BPW | 4.475 |
| Format | GPTQ v1 (Marlin compatible) |

## 🔗 Links

## 📖 Citation

```bibtex
@article{vicentino2026polarquant,
    title={PolarQuant: Polar Coordinate Quantization for Efficient LLM Inference},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.29078},
    year={2026}
}
```

πŸ™ Acknowledgements

- Jackrong for the Qwopus3.5-27B-v3 base model and HumanEval methodology
- GPTQModel team for the FOEM implementation
- vLLM team for Marlin kernel support