# Qwen2.5-Coder-3B-HXQ
1.6x smaller. Fits a 4 GB GPU. Code-specialized.
Qwen2.5-Coder-3B-Instruct compressed from 6.2 GB (BF16) to 3.84 GB with only a +1.92% perplexity increase. Runs on a Quadro T2000 at 3.6 tok/s. No calibration data. Just `pip install` and `from_pretrained()`.
## Install and Run

```shell
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/qwen2.5-coder-3b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/qwen2.5-coder-3b-helix")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it. `import helix_substrate` registers the quantizer; `from_pretrained()` handles the rest automatically.
## Benchmark

| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 6.2 GB | 3.84 GB |
| Perplexity (WikiText-2) | 6.113 | 6.230 (+1.92%) |
| Compression ratio | — | 1.6x |
| Compressed modules | — | 252 HelixLinear layers |
| Architecture | Qwen2 (36 layers, GQA, 2 KV heads) | unchanged |
Eval: WikiText-2 test split, 2048 tokens, stride 512.
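For reference, the strided protocol scores each token exactly once: slide a 2048-token window forward in steps of 512 and take the loss only over the tokens not yet covered. A minimal sketch of that window arithmetic (the function name and the model-loading glue, which is omitted, are illustrative, not the library's API):

```python
def stride_windows(seq_len, max_len=2048, stride=512):
    """Yield (begin, end, n_target) spans for strided perplexity eval.

    Each window covers tokens [begin, end); only the last n_target tokens
    of the window contribute to the loss, so every token is scored once.
    """
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        windows.append((begin, end, end - prev_end))  # score the new tail
        prev_end = end
        if end == seq_len:
            break
    return windows
```

Perplexity is then `exp` of the token-weighted mean loss across all windows.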
## GPU Validation
Tested on a Quadro T2000 (4 GB VRAM):
| Phase | Result |
|---|---|
| Model load | 1,254 MB VRAM |
| Peak VRAM | 1,350 MB (33% of 4 GB) |
| Headroom | 2,738 MB available |
| GPU PPL | 6.230 (matches CPU) |
| Code generation | 3.6 tok/s, coherent output |
The dense FP32 model (12.3 GB) does not fit on this card. The compressed model runs with room to spare.
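The arithmetic behind that is back-of-envelope: raw weight footprint is parameter count times bytes per parameter. A quick check, assuming roughly 3.09B parameters for the Qwen2.5-3B family (an assumption; the card's own figures are 12.3 GB FP32 and 6.2 GB BF16):

```python
def footprint_gb(n_params, bytes_per_param):
    """Decimal GB occupied by the raw weights alone (no activations/KV cache)."""
    return n_params * bytes_per_param / 1e9

fp32_gb = footprint_gb(3.09e9, 4)  # ~12.4 GB: far over a 4 GB card
bf16_gb = footprint_gb(3.09e9, 2)  # ~6.2 GB: still over
```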
## Good to Know

- GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- Not fine-tunable — compressed weights are read-only (`is_trainable = False`).
- Requires `helix-substrate` — the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- Tied embeddings — `lm_head` shares `embed_tokens`, stored at full precision.
## What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable — `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required — unlike GPTQ/AWQ, codebooks are fit from the weights alone
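A minimal numpy sketch of that executable form, assuming scalar codebook entries and omitting the sidecar corrections (the real `HelixLinear` is a torch module; all names here are illustrative, not the library's API):

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f = 8, 16

# Compressed representation: 256 float32 codebook entries + uint8 indices.
codebook = rng.standard_normal(256).astype(np.float32)
indices = rng.integers(0, 256, size=(out_f, in_f), dtype=np.uint8)

def helix_linear(x, codebook, indices):
    # Gather codebook entries at the uint8 indices to materialize the weight
    # view, then do an ordinary matmul; no decompressed copy is stored.
    w = codebook[indices]  # (out_f, in_f) float32
    return w @ x

x = rng.standard_normal(in_f).astype(np.float32)
y = helix_linear(x, codebook, indices)
```

The storage win comes from the index matrix: one byte per weight plus a tiny shared codebook, versus two bytes per weight for BF16.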
## How It Works

- `import helix_substrate` registers the `hxq` quantizer with HuggingFace
- `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
- The quantizer replaces 252 `nn.Linear` modules with `HelixLinear` shells before weight loading
- Safetensors populates the codebook, indices, and sidecar buffers directly
- The model runs in compressed form — no decompression needed
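The dispatch hinges on a single field in `config.json`. A hypothetical fragment showing what the loader looks for (the field names follow the card's description; the full schema is an assumption):

```python
import json

# Hypothetical config.json fragment for an HXQ-compressed checkpoint.
raw = """
{
  "model_type": "qwen2",
  "quantization_config": {"quant_method": "hxq"}
}
"""
config = json.loads(raw)
method = config["quantization_config"]["quant_method"]
# A loader seeing "hxq" would dispatch to the registered quantizer.
```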
## Compression Receipt

```
Compressed tensors: 252
Exact tensors:      144 (norms, embeddings, tied lm_head)
Total keys:         1,176
Output size:        3,837 MB
Weight ratio:       1.6x
PPL delta:          +1.92% (6.230 vs 6.113 dense)
Eval:               WikiText-2 test, 2048 tokens, stride=512
```
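The headline numbers in the receipt can be cross-checked with two lines of arithmetic:

```python
# Size ratio: dense BF16 checkpoint vs compressed output.
dense_gb, hxq_gb = 6.2, 3.84
ratio = dense_gb / hxq_gb  # ~1.6x

# Relative perplexity increase of the compressed model.
ppl_dense, ppl_hxq = 6.113, 6.230
delta_pct = (ppl_hxq - ppl_dense) / ppl_dense * 100  # ~+1.9%
```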
## Companion Models
Same codec, same pip install, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |
## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```
## License
Apache 2.0 (inherited from Qwen/Qwen2.5-Coder-3B-Instruct).