lew96123/qwen3.5-0.8b-custom-packed-4bit

This is a 4-bit TurboQuant quantized version of Qwen/Qwen3.5-0.8B, created by lew96123.

Paper reference

TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate
Amir Zandieh, Nima Daliri, Navid Hadian, Vahab Mirrokni
arXiv:2504.19874v1, 2025
https://arxiv.org/abs/2504.19874

What this repo contains

| File | Description |
| --- | --- |
| turboquant_weights.safetensors | Quantized weights (TurboQuant format) |
| quant_manifest.json | Quantization metadata (per-parameter specs) |
| load_quantized.py | Self-contained loader (no dependencies beyond torch, transformers, and safetensors) |
| eval_summary.json | Evaluation results |
| sample_generations.json | Sample text generations for qualitative comparison |

Algorithm: TurboQuant_MSE (Algorithm 1)

The quantization method implements Algorithm 1 from the paper (Section 3.1):

Core idea

Each weight row is treated as a d-dimensional vector. The key insight is that after applying a random rotation, each coordinate of a unit-norm vector follows a Beta distribution on [-1, 1], enabling optimal scalar quantization via Lloyd-Max codebook construction.
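This claim is easy to sanity-check numerically. The sketch below (illustrative only, not code from this repo) rotates a unit vector by a random orthogonal matrix and checks that the coordinates are centered at zero with average squared value exactly 1/d, consistent with the variance of the density stated in Lemma 1:

```python
import torch

torch.manual_seed(0)
d = 512
x = torch.randn(d)
x = x / x.norm()                  # unit-norm input vector

# Random rotation Pi: orthogonal factor of a Gaussian matrix
Pi, _ = torch.linalg.qr(torch.randn(d, d))
y = Pi @ x                        # rotated coordinates

# Rotation preserves the norm, so the squared coordinates average 1/d,
# matching the variance of the Beta-type density in Lemma 1.
print(y.mean().item())            # ≈ 0
print((y ** 2).mean().item())     # = 1/d up to float error
```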

Steps

  1. Norm separation: Store ||x|| in float16; work with the unit vector x / ||x||

  2. Random rotation (Algorithm 1, line 1): Generate rotation matrix Pi via QR decomposition of a Gaussian random matrix. Compute y = Pi * x

  3. Coordinate distribution (Lemma 1): Each coordinate y_j follows the density:

    f_X(t) = Gamma(d/2) / (sqrt(pi) * Gamma((d-1)/2)) * (1 - t^2)^((d-3)/2)

  4. Lloyd-Max codebook (Eq 4, Section 3.1): Find centroids c_1, ..., c_{2^b} minimizing:

    argmin_{c_1, ..., c_{2^b}} E[min_k |X - c_k|^2]

    Solved via iterative weighted k-means on the discretized Beta density.

  5. Quantize: For each rotated coordinate y_j, find nearest centroid index

  6. Dequantize: Look up centroids, rotate back with Pi^T, rescale by stored norm
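The six steps above can be sketched end to end as follows. This is an illustrative reimplementation under stated assumptions, not the repo's code: the function names (lloyd_max_codebook, quantize_mse, dequantize_mse) are stand-ins, and the grid size and iteration count are reduced for speed.

```python
import torch

def lloyd_max_codebook(d, bits, grid=4097, iters=50):
    """Lloyd-Max centroids for the sphere-coordinate density of Lemma 1."""
    t = torch.linspace(-1 + 1e-6, 1 - 1e-6, grid, dtype=torch.float64)
    w = (1 - t ** 2) ** ((d - 3) / 2)            # unnormalized density weights
    k = 2 ** bits
    # initialize centroids at density quantiles
    cdf = torch.cumsum(w, 0) / w.sum()
    q = torch.tensor([(i + 0.5) / k for i in range(k)], dtype=torch.float64)
    c = t[torch.searchsorted(cdf, q)].clone()
    for _ in range(iters):                       # weighted k-means iterations
        idx = (t[:, None] - c[None, :]).abs().argmin(1)
        for j in range(k):
            m = idx == j
            ws = w[m].sum()
            if ws > 0:
                c[j] = (w[m] * t[m]).sum() / ws  # density-weighted centroid
    return c.float()

def quantize_mse(x, Pi, codebook):
    """Norm separation + rotation + nearest-centroid lookup."""
    norm = x.norm()
    y = Pi @ (x / norm)
    idx = (y[:, None] - codebook[None, :]).abs().argmin(1)
    return idx, norm.half()

def dequantize_mse(idx, norm, Pi, codebook):
    """Centroid lookup + inverse rotation + rescale by the stored norm."""
    return norm.float() * (Pi.T @ codebook[idx])

torch.manual_seed(0)
d, bits = 256, 4
Pi, _ = torch.linalg.qr(torch.randn(d, d))       # random rotation
cb = lloyd_max_codebook(d, bits)
x = torch.randn(d)
idx, n = quantize_mse(x, Pi, cb)
x_hat = dequantize_mse(idx, n, Pi, cb)
rel_mse = ((x - x_hat).norm() / x.norm()) ** 2
print(rel_mse.item())                            # small relative MSE
```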

Theoretical guarantee (Theorem 1)

For b-bit quantization of d-dimensional vectors, the expected MSE satisfies:

E[||x - x_hat||^2] / ||x||^2 <= (1 + o(1)) * pi * e / (d * 2^(2b))

This matches the rate-distortion lower bound up to a constant factor.
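Plugging representative shapes into the bound makes the trade-off concrete. The arithmetic below is illustrative only (it drops the (1 + o(1)) factor and uses d = 1024, the hidden size stated further down):

```python
import math

def mse_bound(d, b):
    """Theorem 1 upper bound on relative MSE, ignoring the (1 + o(1)) factor."""
    return math.pi * math.e / (d * 2 ** (2 * b))

# Bound at each bit width in this repo's comparison, for d = 1024
for b in (8, 4, 2, 1):
    print(b, mse_bound(1024, b))
```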

Paper-to-code mapping

| Paper element | Reference | Code location |
| --- | --- | --- |
| Beta density (Lemma 1) | Section 2, Lemma 1 | _sphere_coordinate_density() in load_quantized.py |
| Lloyd-Max codebook (Eq 4) | Section 3.1, Eq 4 | _compute_codebook() in load_quantized.py |
| Random rotation Pi | Algorithm 1 | _cached_rotation_matrix() (QR of Gaussian) |
| Quant_mse | Algorithm 1 | Quantization stored in packed indices |
| Dequant_mse | Algorithm 1 | _dequantize_mse() in load_quantized.py |
| Norm separation | Section 3.1 | Norms stored as float16, vectors normalized |

Evaluation results

Evaluated on WikiText-2 (sliding-window perplexity, 297,053 tokens scored). All measurements taken on a single NVIDIA RTX 4060 (8 GB).

This variant (4-bit)

| Metric | Base (fp16) | 4-bit TurboQuant |
| --- | --- | --- |
| Perplexity | 18.2903 | 31.0520 (+69.8%) |
| Avg NLL | 2.906368 | 3.435662 |
| Cosine similarity | 1.000000 | 0.870710 |

All variants comparison

| Variant | Weight file | Compression | Perplexity | Cosine sim | PPL increase |
| --- | --- | --- | --- | --- | --- |
| Base (fp16) | 1.63 GB | 1.00x | 18.2903 | 1.000000 | -- |
| 8-bit TQ | 1.15 GB | 1.42x | 18.3844 | 0.999206 | +0.5% |
| 4-bit TQ | 0.90 GB | 1.80x | 31.0520 | 0.870710 | +69.8% |
| 2-bit TQ | 0.78 GB | 2.08x | 1887.3106 | 0.634794 | +10,218% |
| 1-bit TQ | 0.72 GB | 2.25x | 1722576.0 | -0.149249 | +9.4M% |

Memory breakdown: Qwen3.5-0.8B

Architecture: hidden_size=1024, 24 layers, 2 KV heads, head_dim=256, vocab=248320.

| Component | fp16 (base) | 8-bit TQ | 4-bit TQ | 2-bit TQ | 1-bit TQ |
| --- | --- | --- | --- | --- | --- |
| Weight file | 1.63 GB | 1.15 GB | 0.90 GB | 0.78 GB | 0.72 GB |
| KV cache (2K ctx) | 96 MB | 96 MB | 96 MB | 96 MB | 96 MB |
| KV cache (32K ctx) | 1.50 GB | 1.50 GB | 1.50 GB | 1.50 GB | 1.50 GB |
| Compression ratio | 1.00x | 1.42x | 1.80x | 2.08x | 2.25x |
| Perplexity | 18.29 | 18.38 | 31.05 | 1887 | 1722576 |

Usage

from huggingface_hub import snapshot_download

# Download the repo
repo_dir = snapshot_download("lew96123/qwen3.5-0.8b-custom-packed-4bit")

# Load and use
import sys
sys.path.insert(0, repo_dir)
from load_quantized import load_quantized_model, load_tokenizer

model, manifest = load_quantized_model(repo_dir, device="cuda")
tokenizer = load_tokenizer(repo_dir)

prompt = "Explain why quantization can reduce storage requirements."
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Important notes

  • This is not an official Qwen release. It is an independent quantization experiment.
  • The self-contained load_quantized.py re-implements the TurboQuant dequantization algorithm without requiring the full qwen_quant package.
  • Dequantization happens at load time: weights are fully reconstructed into float tensors. This trades load-time compute for standard inference speed.
  • The random rotation and codebook are deterministically generated from the stored seed, so dequantization is fully reproducible.
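Seed-deterministic rotations can be sketched as below. The exact seeding convention is an assumption here (the authoritative version lives in load_quantized.py); the point is that a seeded generator plus a sign-corrected QR yields a bit-identical rotation on every load:

```python
import torch

def rotation_from_seed(d, seed):
    """Deterministic orthogonal rotation: QR of a seeded Gaussian matrix."""
    g = torch.Generator().manual_seed(seed)
    q, r = torch.linalg.qr(torch.randn(d, d, generator=g))
    # Sign correction: flip columns so diag(R) > 0, making Q unique
    return q * torch.sign(torch.diagonal(r))

a = rotation_from_seed(64, 1234)
b = rotation_from_seed(64, 1234)
print(torch.equal(a, b))          # True: same seed reproduces the rotation
```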

Implementation notes

  • Codebook: Computed once per (dimension, bit_width) pair and cached via @lru_cache; each codebook uses an 8193-point density grid and 96 Lloyd-Max iterations.
  • Rotation: QR decomposition of Gaussian matrix with sign correction for uniqueness.
  • Packing: Indices are bit-packed into uint8 tensors for storage efficiency.
  • Norms: Stored in float16 (trade-off: ~0.1% precision loss vs 2x storage savings).
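The bit-packing note can be illustrated as follows. The nibble layout shown (two 4-bit indices per byte, low nibble first) is an assumption for illustration; the repo's actual layout is whatever load_quantized.py defines:

```python
import torch

def pack4(idx):
    """Pack 4-bit indices two-per-byte (assumed layout: low nibble first)."""
    idx = idx.to(torch.uint8)
    if idx.numel() % 2:                       # pad to an even count
        idx = torch.cat([idx, idx.new_zeros(1)])
    return idx[0::2] | (idx[1::2] << 4)

def unpack4(packed, n):
    """Recover the first n 4-bit indices from the packed bytes."""
    lo, hi = packed & 0x0F, packed >> 4
    return torch.stack([lo, hi], dim=1).flatten()[:n]

idx = torch.randint(0, 16, (7,))
packed = pack4(idx)
print(packed.numel())                         # 4 bytes hold 7 indices
restored = unpack4(packed, idx.numel())
print(torch.equal(restored, idx.to(torch.uint8)))   # True: lossless round trip
```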

Provenance

  • Base model: Qwen/Qwen3.5-0.8B
  • Base license: apache-2.0
  • Quantization method: turboquant_mse (Algorithm 1, arXiv:2504.19874v1)
  • Upload account: lew96123