Qwen2.5-Coder-3B-HXQ

1.6x smaller. Fits a 4 GB GPU. Code-specialized.

Qwen2.5-Coder-3B-Instruct compressed from 6.2 GB (BF16) to 3.84 GB with only a 1.92% perplexity increase. Runs on a Quadro T2000 at 3.6 tok/s. No calibration data needed: just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/qwen2.5-coder-3b-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/qwen2.5-coder-3b-helix")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. Importing helix_substrate registers the quantizer, and from_pretrained() handles the rest automatically.

Benchmark

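Metric                  | Dense (BF16)                       | HXQ
------------------------|------------------------------------|------------------------
Size                    | 6.2 GB                             | 3.84 GB
Perplexity (WikiText-2) | 6.113                              | 6.230 (+1.92%)
Compression ratio       | -                                  | 1.6x
Compressed modules      | -                                  | 252 HelixLinear layers
Architecture            | Qwen2 (36 layers, GQA, 2 KV heads) | unchanged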

Eval: WikiText-2 test split, 2048 tokens, stride 512.
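The evaluation script itself is not shown here. As a rough sketch of what a "2048 tokens, stride 512" evaluation usually means, the snippet below follows the common sliding-window perplexity recipe: each window spans up to 2048 tokens, windows advance by 512, and only tokens not already scored by an earlier window count toward the loss. Reading 2048 as the window length is an assumption, and the helper names sliding_windows and perplexity are hypothetical.

```python
import math

def sliding_windows(seq_len, max_length=2048, stride=512):
    """Yield (begin, end, n_scored) for each evaluation window."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Only tokens that no earlier window has scored count toward the loss.
        n_scored = end - prev_end
        yield begin, end, n_scored
        prev_end = end
        if end == seq_len:
            break

def perplexity(window_nlls):
    """Combine (mean_nll, n_scored) pairs from each window into one perplexity."""
    total = sum(nll * n for nll, n in window_nlls)
    count = sum(n for _, n in window_nlls)
    return math.exp(total / count)

# Toy run on a 4096-token stream; a real run would get mean_nll from the model.
windows = list(sliding_windows(4096))
toy_ppl = perplexity([(math.log(6.230), n) for _, _, n in windows])
```

In a real run, mean_nll for each window would come from the model's loss with the already-scored prefix masked out of the labels.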

GPU Validation

Tested on a Quadro T2000 (4 GB VRAM):

Phase           | Result
----------------|---------------------------
Model load      | 1,254 MB VRAM
Peak VRAM       | 1,350 MB (33% of 4 GB)
Headroom        | 2,738 MB available
GPU PPL         | 6.230 (matches CPU)
Code generation | 3.6 tok/s, coherent output

The dense FP32 model (12.3 GB) does not fit on this card. The compressed model runs with room to spare.

Good to Know

  • GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
  • Not fine-tunable — compressed weights are read-only (is_trainable = False).
  • Requires helix-substrate — the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • Tied embeddings — lm_head shares embed_tokens, stored at full precision.

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable — HelixLinear performs codebook[indices] @ x directly, no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required — unlike GPTQ/AWQ, codebooks are fit from the weights alone
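To make the codebook + index + sidecar layout concrete, here is a minimal pure-Python sketch. It assumes each codebook entry is a single float and that the codebook is fit with plain 1-D k-means on the weights; the actual HelixCode fitting procedure and the rule for choosing sidecar entries are not described in this card, so fit_codebook, compress, and vq_matvec are hypothetical names rather than the helix-substrate API.

```python
import bisect
import random

def nearest(centroids, v):
    """Index of the closest entry in a sorted codebook."""
    i = bisect.bisect_left(centroids, v)
    if i == 0:
        return 0
    if i == len(centroids):
        return i - 1
    return i if centroids[i] - v < v - centroids[i - 1] else i - 1

def fit_codebook(values, k=256, iters=8):
    """Fit a scalar codebook from the weights alone (1-D Lloyd's algorithm)."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(centroids, v)
            sums[j] += v
            counts[j] += 1
        # Empty clusters keep their previous centroid.
        centroids = sorted(
            sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)
        )
    return centroids

def compress(weight, k=256, n_outliers=4):
    """Return (codebook, indices, sidecar) for a 2-D list of weights."""
    codebook = fit_codebook([w for row in weight for w in row], k)
    indices = [[nearest(codebook, w) for w in row] for row in weight]
    # Assumed sidecar rule: keep exact values for the worst-reconstructed entries.
    errors = sorted(
        ((abs(w - codebook[indices[r][c]]), r, c)
         for r, row in enumerate(weight) for c, w in enumerate(row)),
        reverse=True,
    )
    sidecar = {(r, c): weight[r][c] for _, r, c in errors[:n_outliers]}
    return codebook, indices, sidecar

def vq_matvec(codebook, indices, sidecar, x):
    """Compute y = W_hat @ x by looking weights up on the fly."""
    return [
        sum(sidecar.get((r, c), codebook[idx]) * x[c] for c, idx in enumerate(row))
        for r, row in enumerate(indices)
    ]

random.seed(0)
W = [[random.gauss(0.0, 0.02) for _ in range(64)] for _ in range(32)]
x = [random.gauss(0.0, 1.0) for _ in range(64)]
codebook, indices, sidecar = compress(W)
dense = [sum(w * xi for w, xi in zip(row, x)) for row in W]
approx = vq_matvec(codebook, indices, sidecar, x)
max_err = max(abs(a - b) for a, b in zip(dense, approx))
```

Every index fits in one byte because the codebook has 256 entries, which is where the roughly 2x saving over BF16 on each compressed matrix comes from; a production kernel would use tensor gathers rather than Python loops.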

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 252 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form — no decompression needed
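Step 2 can be checked by hand before loading a checkpoint. A minimal sketch, assuming only that config.json carries quantization_config.quant_method as described above; the sample JSON is a trimmed-down, hypothetical file, and detect_quant_method is an illustrative helper, not part of helix-substrate or transformers.

```python
import json

# Hypothetical, trimmed-down config.json; a real file has many more fields.
SAMPLE_CONFIG = """
{
  "model_type": "qwen2",
  "quantization_config": {"quant_method": "hxq"}
}
"""

def detect_quant_method(config_text):
    """Return the quant_method string from a config.json payload, or None."""
    config = json.loads(config_text)
    # A missing or null quantization_config means an unquantized checkpoint.
    return (config.get("quantization_config") or {}).get("quant_method")

method = detect_quant_method(SAMPLE_CONFIG)
if method == "hxq":
    print("HXQ checkpoint: import helix_substrate before calling from_pretrained()")
```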

Compression Receipt

Compressed tensors:  252
Exact tensors:       144  (norms, embeddings, tied lm_head)
Total keys:          1,176
Output size:         3,837 MB
Weight ratio:        1.6x
PPL delta:           +1.92% (6.230 vs 6.113 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
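The headline numbers in the receipt can be reproduced with a few lines of arithmetic. This sketch uses only the rounded figures quoted in this card; the list of seven projection names reflects the standard Qwen2 decoder layer and is used here just to show where 252 comes from.

```python
dense_gb, hxq_gb = 6.2, 3.84
dense_ppl, hxq_ppl = 6.113, 6.230

# Size ratio and relative perplexity change from the rounded figures.
ratio = dense_gb / hxq_gb
ppl_delta_pct = (hxq_ppl - dense_ppl) / dense_ppl * 100

# Each Qwen2 decoder layer has seven linear projections.
projections = ["q_proj", "k_proj", "v_proj", "o_proj",
               "gate_proj", "up_proj", "down_proj"]
compressed_modules = 36 * len(projections)

print(f"ratio {ratio:.1f}x, PPL delta +{ppl_delta_pct:.2f}%, "
      f"modules {compressed_modules}")
```

The rounded perplexities give about +1.91%; the reported +1.92% presumably comes from the unrounded values.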

Companion Models

Same codec, same pip install, multiple architectures:

Model                             | Architecture                | Ratio | PPL Delta
----------------------------------|-----------------------------|-------|----------
qwen2.5-14b-instruct-helix        | Transformer                 | 3.4x  | pending
qwen2.5-7b-instruct-helix         | Transformer                 | 2.2x  | +6.34%
qwen2.5-3b-instruct-helix         | Transformer                 | 1.6x  | +0.69%
qwen2.5-coder-1.5b-instruct-helix | Transformer (code)          | 2.4x  | +1.63%
tinyllama-1.1b-helix              | Transformer                 | 4.0x  | +0.78%
zamba2-2.7b-instruct-helix        | Hybrid (Mamba2+Transformer) | 1.8x  | +6.59%
zamba2-1.2b-helix                 | Hybrid (Mamba2+Transformer) | 1.7x  | +2.90%
mamba2-1.3b-helix                 | Pure SSM (Mamba2)           | 2.1x  | +8.0%
mamba-130m-helix                  | Pure SSM                    | 3.8x  | +18.4%

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Qwen/Qwen2.5-Coder-3B-Instruct).
