---
license: apache-2.0
base_model: Jackrong/Qwopus3.5-27B-v3
tags:
  - hlwq
  - quantized
  - gptq
  - int4
  - polarquant
  - vllm
  - marlin
pipeline_tag: text-generation
model-index:
  - name: Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ
    results:
      - task:
          type: text-generation
          name: Code Generation
        dataset:
          name: HumanEval
          type: openai_humaneval
        metrics:
          - name: pass@1 (thinking)
            type: pass@1
            value: 78.66
          - name: pass@1 (standard)
            type: pass@1
            value: 55.49
---

**Naming notice (2026-04-10).** The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name is changing; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
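
For intuition about the HLWQ side of that distinction, here is a minimal toy sketch (our own illustration under stated assumptions, not the shipped quantizer; all function names are ours): a deterministic, orthonormal Walsh-Hadamard rotation spreads weight outliers, then a Lloyd-Max scalar codebook (1-D k-means) is fitted to the rotated weights.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction of an orthonormal n x n Walsh-Hadamard matrix (n a power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max_1d(x: np.ndarray, levels: int = 16, iters: int = 50) -> np.ndarray:
    # Lloyd-Max: alternate nearest-level assignment and centroid update (1-D k-means).
    c = np.quantile(x, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = x[idx == k].mean()
    return c[idx]  # each value snapped to its nearest codebook level

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
H = hadamard(64)
W_rot = H @ W                                             # deterministic rotation
W_q = lloyd_max_1d(W_rot.ravel()).reshape(W_rot.shape)    # 16 levels = 4 bits
W_hat = H.T @ W_q                                         # rotate back after dequantization
err = np.abs(W - W_hat).mean()
```

Because the rotation is orthonormal, quantization error introduced in the rotated space maps back isometrically, which is what makes the rotate-quantize-unrotate round trip well behaved.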

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

# 🧊 Qwopus3.5-27B-v3 PolarQuant v7 GPTQ

**🎯 27B Reasoning Model in 19 GB**

| Metric | INT4 (ours) | BF16 (base) | Delta |
|---|---|---|---|
| 🎯 HumanEval (thinking) | 78.66% | 97.56% (Jackrong) | -18.9 pp |
| 🎯 HumanEval (standard) | 55.49% | not measured | n/a |
| 📦 Download | 19.2 GB | 54.7 GB | -65% |
| ⚡ BPW | 4.475 | 16 | 3.6x smaller |
| 🚀 Kernel | Marlin | n/a | Native vLLM |
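
The deltas above can be reproduced from the raw numbers in the table (a quick sanity check, using only figures already reported here):

```python
bf16_gb, int4_gb = 54.7, 19.2      # download sizes from the table
bf16_bpw, int4_bpw = 16.0, 4.475   # bits per weight

download_cut = 1.0 - int4_gb / bf16_gb   # ~0.65, the "-65%" row
bpw_ratio = bf16_bpw / int4_bpw          # ~3.58x, rounded to "3.6x smaller"
thinking_gap = 97.56 - 78.66             # 18.90 percentage points
print(f"{download_cut:.0%}, {bpw_ratio:.2f}x, {thinking_gap:.2f} pp")
```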

**Note on the thinking-mode gap:** the 18.9 pp delta from BF16 (97.56% → 78.66%) is a real quality impact of INT4 quantization on chain-of-thought code generation at 27B scale. Users who need maximum thinking-mode quality should consider the BF16 base model from Jackrong.

## 📊 Benchmarks

*(Figures: HumanEval Thinking Mode; 9B vs 27B Comparison; Quality vs Size)*

## 🔬 About This Model

This is the 27B reasoning model from the Qwopus3.5 series, quantized with our proven PolarQuant v7 config. The model uses `<think>` tags for chain-of-thought reasoning before answering.
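
Downstream code typically strips the chain-of-thought block before using the answer. A minimal sketch (the exact tag format comes from the base model's chat template; this regex and helper name are our assumptions):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    # Separate the <think>...</think> block from the final answer.
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    thought = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return thought, answer

thought, answer = split_thinking(
    "<think>Sorting needs a comparator.</think>\ndef sort_list(xs): return sorted(xs)"
)
```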

It uses the same GPTQ gs64 + FOEM config as our 9B v7 release, which narrowly edged out its BF16 baseline on HumanEval standard mode (67.07% vs 66.87%). At 27B thinking-mode scale, however, there is a measurable gap from the BF16 baseline (see results below).

## 🚀 Quick Start

### vLLM (recommended)

```python
from vllm import LLM, SamplingParams

model = LLM(
    "caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ",
    trust_remote_code=True,
    language_model_only=True,
    gpu_memory_utilization=0.75,
)

output = model.generate(
    ["Write a Python function to sort a list:"],
    SamplingParams(max_tokens=4096, temperature=0.0),
)
print(output[0].outputs[0].text)
```

### vLLM Server

```bash
vllm serve caiovicentino1/Qwopus3.5-27B-v3-PolarQuant-v7-GPTQ \
    --trust-remote-code --language-model-only \
    --gpu-memory-utilization 0.75 --max-model-len 16384
```

## 🔧 Quantization Config

```python
# Same config as our 9B v7: bits=4, gs=64, FOEM
from gptqmodel import GPTQModel
from gptqmodel.quantization import QuantizeConfig
from gptqmodel.quantization.config import FOEMConfig

quantize_config = QuantizeConfig(
    bits=4,
    group_size=64,
    sym=True,
    desc_act=True,
    foem=FOEMConfig(alpha=0.25, beta=0.2, device="auto"),
)
```
- Quantizer: GPTQModel v6.0.3
- Calibration: 512 samples from neuralmagic/LLM_compression_calibration
- Kernel: Marlin (native vLLM, zero overhead)
- Time: 33 min on RTX PRO 6000 Blackwell (102 GB)
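
For intuition, the `bits=4` / `group_size=64` / `sym=True` setting quantizes each run of 64 weights with one shared scale and signed 4-bit levels in [-8, 7]. The following is a simplified round-to-nearest sketch of that grouping scheme only (GPTQ itself does error-compensated column-by-column rounding on top of this, and FOEM refines it further):

```python
import numpy as np

def quant_sym_int4_grouped(w: np.ndarray, group_size: int = 64):
    # w: 1-D weight row whose length is a multiple of group_size.
    g = w.reshape(-1, group_size)
    # Symmetric quantization: one scale per group, mapping max|w| to level 7.
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale = quant_sym_int4_grouped(w)
w_hat = dequant(q, scale)
```

Smaller groups mean more scales (more bits-per-weight overhead) but tighter per-group error, which is the trade-off the gs64 setting balances.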

## 📈 HumanEval Results

| Mode | Score | Method |
|---|---|---|
| Thinking (chat template) | 78.66% | 129/164, automated `exec()` |
| Standard (lm-eval) | 55.49% | `lm_eval --tasks humaneval` |
| BF16 Thinking (Jackrong) | 97.56% | Reported by base model author |
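
The "automated `exec()`" checking mentioned above works roughly like this (a toy sketch of the harness idea with a made-up problem; the real harness runs the 164 HumanEval problems, ideally in a sandboxed process):

```python
def passes(candidate_src: str, test_src: str) -> bool:
    # Execute the model's completion, then the problem's unit tests, in one namespace.
    ns: dict = {}
    try:
        exec(candidate_src, ns)
        exec(test_src, ns)
        return True
    except Exception:
        return False

# Toy stand-in for one task: a correct and an incorrect generated solution.
candidate = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
n_passed = sum(passes(c, tests) for c in [candidate, wrong])  # 1 of 2 passes
```

A score like 129/164 is then simply `n_passed / n_problems` at one sample per problem (pass@1, greedy decoding).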

**About the 18.9 pp gap from BF16 thinking**

The drop from 97.56% to 78.66% reflects a real quality cost of INT4 quantization on chain-of-thought code generation at 27B scale. Thinking mode is more sensitive to weight precision than single-shot generation because small numerical errors can compound across reasoning steps.

Part of the gap may also come from differences in evaluation harness (we use automated exec()-based checking), but we have not isolated this component through direct comparison.

**Recommendation:** if you need maximum thinking-mode quality, use the BF16 base model. If you need fast, low-VRAM vLLM serving and can tolerate this gap, v7-GPTQ is the right pick.

## 📖 Technical Details

| Parameter | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-27B-v3 |
| Architecture | Qwen3.5 (48 linear_attn + 16 full_attn) |
| Hidden Size | 5120 |
| Layers | 64 |
| Bits | 4 |
| Group Size | 64 |
| FOEM | alpha=0.25, beta=0.2 |
| BPW | 4.475 |
| Format | GPTQ v1 (Marlin compatible) |

## 🔗 Links

## 📖 Citation

```bibtex
@article{vicentino2026polarquant,
    title={PolarQuant: Polar Coordinate Quantization for Efficient LLM Inference},
    author={Vicentino, Caio},
    journal={arXiv preprint arXiv:2603.29078},
    year={2026}
}
```

πŸ™ Acknowledgements

- Jackrong for the Qwopus3.5-27B-v3 base model and HumanEval methodology
- GPTQModel team for the FOEM implementation
- vLLM team for Marlin kernel support