lew96123/qwen3.5-0.8b-custom-packed-4bit

This is a 4-bit TurboQuant quantized version of Qwen/Qwen3.5-0.8B, created by lew96123.

Paper reference

TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate
Amir Zandieh, Nima Daliri, Navid Hadian, Vahab Mirrokni
arXiv:2504.19874v1, 2025
https://arxiv.org/abs/2504.19874

What this repo contains

| File | Description |
| --- | --- |
| turboquant_weights.safetensors | Quantized weights (TurboQuant format) |
| quant_manifest.json | Quantization metadata (per-parameter specs) |
| load_quantized.py | Self-contained loader (no dependencies beyond torch, transformers, and safetensors) |
| eval_summary.json | Evaluation results |
| sample_generations.json | Sample text generations for qualitative comparison |

Algorithm: TurboQuant_MSE (Algorithm 1)

The quantization method implements Algorithm 1 from the paper (Section 3.1):

Core idea

Each weight row is treated as a d-dimensional vector. The key insight is that after applying a random rotation, each coordinate of a unit-norm vector follows a Beta distribution on [-1, 1], enabling optimal scalar quantization via Lloyd-Max codebook construction.
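This claim is easy to sanity-check numerically. The sketch below (illustrative only, not code from this repo) rotates a unit vector by a random orthogonal matrix and checks that the coordinates are centered at zero with average squared value exactly 1/d, consistent with the variance of the density stated in Lemma 1:

```python
import torch

torch.manual_seed(0)
d = 512
x = torch.randn(d)
x = x / x.norm()                  # unit-norm input vector

# Random rotation Pi: orthogonal factor of a Gaussian matrix
Pi, _ = torch.linalg.qr(torch.randn(d, d))
y = Pi @ x                        # rotated coordinates

# Rotation preserves the norm, so the squared coordinates average 1/d,
# matching the variance of the Beta-type density in Lemma 1.
print(y.mean().item())            # ≈ 0
print((y ** 2).mean().item())     # = 1/d up to float error
```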

Steps

  1. Norm separation: Store ||x|| in float16; work with the unit vector x / ||x||

  2. Random rotation (Algorithm 1, line 1): Generate rotation matrix Pi via QR decomposition of a Gaussian random matrix. Compute y = Pi * x

  3. Coordinate distribution (Lemma 1): Each coordinate y_j follows the density:

    f_X(t) = Gamma(d/2) / (sqrt(pi) * Gamma((d-1)/2)) * (1 - t^2)^((d-3)/2)

  4. Lloyd-Max codebook (Eq 4, Section 3.1): Find centroids c_1, ..., c_{2^b} minimizing:

    argmin_{c_1, ..., c_{2^b}} E[min_k |X - c_k|^2]

    Solved via iterative weighted k-means on the discretized Beta density.

  5. Quantize: For each rotated coordinate y_j, find nearest centroid index

  6. Dequantize: Look up centroids, rotate back with Pi^T, rescale by stored norm
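The six steps above can be sketched end to end as follows. This is an illustrative reimplementation under stated assumptions, not the repo's code: the function names (lloyd_max_codebook, quantize_mse, dequantize_mse) are stand-ins, and the grid size and iteration count are reduced for speed.

```python
import torch

def lloyd_max_codebook(d, bits, grid=4097, iters=50):
    """Lloyd-Max centroids for the sphere-coordinate density of Lemma 1."""
    t = torch.linspace(-1 + 1e-6, 1 - 1e-6, grid, dtype=torch.float64)
    w = (1 - t ** 2) ** ((d - 3) / 2)            # unnormalized density weights
    k = 2 ** bits
    # initialize centroids at density quantiles
    cdf = torch.cumsum(w, 0) / w.sum()
    q = torch.tensor([(i + 0.5) / k for i in range(k)], dtype=torch.float64)
    c = t[torch.searchsorted(cdf, q)].clone()
    for _ in range(iters):                       # weighted k-means iterations
        idx = (t[:, None] - c[None, :]).abs().argmin(1)
        for j in range(k):
            m = idx == j
            ws = w[m].sum()
            if ws > 0:
                c[j] = (w[m] * t[m]).sum() / ws  # density-weighted centroid
    return c.float()

def quantize_mse(x, Pi, codebook):
    """Norm separation + rotation + nearest-centroid lookup."""
    norm = x.norm()
    y = Pi @ (x / norm)
    idx = (y[:, None] - codebook[None, :]).abs().argmin(1)
    return idx, norm.half()

def dequantize_mse(idx, norm, Pi, codebook):
    """Centroid lookup + inverse rotation + rescale by the stored norm."""
    return norm.float() * (Pi.T @ codebook[idx])

torch.manual_seed(0)
d, bits = 256, 4
Pi, _ = torch.linalg.qr(torch.randn(d, d))       # random rotation
cb = lloyd_max_codebook(d, bits)
x = torch.randn(d)
idx, n = quantize_mse(x, Pi, cb)
x_hat = dequantize_mse(idx, n, Pi, cb)
rel_mse = ((x - x_hat).norm() / x.norm()) ** 2
print(rel_mse.item())                            # small relative MSE
```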

Theoretical guarantee (Theorem 1)

For b-bit quantization of d-dimensional vectors, the expected MSE satisfies:

E[||x - x_hat||^2] / ||x||^2 <= (1 + o(1)) * pi * e / (d * 2^(2b))

This matches the rate-distortion lower bound up to a constant factor.
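Plugging representative shapes into the bound makes the trade-off concrete. The arithmetic below is illustrative only (it drops the (1 + o(1)) factor and uses d = 1024, the hidden size stated further down):

```python
import math

def mse_bound(d, b):
    """Theorem 1 upper bound on relative MSE, ignoring the (1 + o(1)) factor."""
    return math.pi * math.e / (d * 2 ** (2 * b))

# Bound at each bit width in this repo's comparison, for d = 1024
for b in (8, 4, 2, 1):
    print(b, mse_bound(1024, b))
```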

Paper-to-code mapping

| Paper element | Reference | Code location |
| --- | --- | --- |
| Beta density (Lemma 1) | Section 2, Lemma 1 | _sphere_coordinate_density() in load_quantized.py |
| Lloyd-Max codebook (Eq 4) | Section 3.1, Eq 4 | _compute_codebook() in load_quantized.py |
| Random rotation Pi | Algorithm 1 | _cached_rotation_matrix() (QR of Gaussian) |
| Quant_mse | Algorithm 1 | Quantization stored in packed indices |
| Dequant_mse | Algorithm 1 | _dequantize_mse() in load_quantized.py |
| Norm separation | Section 3.1 | Norms stored as float16, vectors normalized |

Evaluation results

Evaluated on WikiText-2 (sliding-window perplexity, 297,053 tokens scored). All measurements taken on a single NVIDIA RTX 4060 (8 GB).

This variant (4-bit)

| Metric | Base (fp16) | 4-bit TurboQuant |
| --- | --- | --- |
| Perplexity | 18.2903 | 31.0520 (+69.8%) |
| Avg NLL | 2.906368 | 3.435662 |
| Cosine similarity | 1.000000 | 0.870710 |

All variants comparison

| Variant | Weight file | Compression | Perplexity | Cosine sim | PPL increase |
| --- | --- | --- | --- | --- | --- |
| Base (fp16) | 1.63 GB | 1.00x | 18.2903 | 1.000000 | -- |
| 8-bit TQ | 1.15 GB | 1.42x | 18.3844 | 0.999206 | +0.5% |
| 4-bit TQ | 0.90 GB | 1.80x | 31.0520 | 0.870710 | +69.8% |
| 2-bit TQ | 0.78 GB | 2.08x | 1887.3106 | 0.634794 | +10,218% |
| 1-bit TQ | 0.72 GB | 2.25x | 1722576.0 | -0.149249 | +9.4M% |

Memory breakdown: Qwen3.5-0.8B

Architecture: hidden_size=1024, 24 layers, 2 KV heads, head_dim=256, vocab=248320.

| Component | fp16 (base) | 8-bit TQ | 4-bit TQ | 2-bit TQ | 1-bit TQ |
| --- | --- | --- | --- | --- | --- |
| Weight file | 1.63 GB | 1.15 GB | 0.90 GB | 0.78 GB | 0.72 GB |
| KV cache (2K ctx) | 96 MB | 96 MB | 96 MB | 96 MB | 96 MB |
| KV cache (32K ctx) | 1.50 GB | 1.50 GB | 1.50 GB | 1.50 GB | 1.50 GB |
| Compression ratio | 1.00x | 1.42x | 1.80x | 2.08x | 2.25x |
| Perplexity | 18.29 | 18.38 | 31.05 | 1887 | 1722576 |

Usage

from huggingface_hub import snapshot_download

# Download the repo
repo_dir = snapshot_download("lew96123/qwen3.5-0.8b-custom-packed-4bit")

# Load and use
import sys
sys.path.insert(0, repo_dir)
from load_quantized import load_quantized_model, load_tokenizer

model, manifest = load_quantized_model(repo_dir, device="cuda")
tokenizer = load_tokenizer(repo_dir)

prompt = "Explain why quantization can reduce storage requirements."
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)

batch = tokenizer(rendered, return_tensors="pt").to(model.device)
output = model.generate(**batch, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Important notes

  • This is not an official Qwen release. It is an independent quantization experiment.
  • The self-contained load_quantized.py re-implements the TurboQuant dequantization algorithm without requiring the full qwen_quant package.
  • Dequantization happens at load time: weights are fully reconstructed into float tensors. This trades load-time compute for standard inference speed.
  • The random rotation and codebook are deterministically generated from the stored seed, so dequantization is fully reproducible.
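Seed-deterministic rotations can be sketched as below. The exact seeding convention is an assumption here (the authoritative version lives in load_quantized.py); the point is that a seeded generator plus a sign-corrected QR yields a bit-identical rotation on every load:

```python
import torch

def rotation_from_seed(d, seed):
    """Deterministic orthogonal rotation: QR of a seeded Gaussian matrix."""
    g = torch.Generator().manual_seed(seed)
    q, r = torch.linalg.qr(torch.randn(d, d, generator=g))
    # Sign correction: flip columns so diag(R) > 0, making Q unique
    return q * torch.sign(torch.diagonal(r))

a = rotation_from_seed(64, 1234)
b = rotation_from_seed(64, 1234)
print(torch.equal(a, b))          # True: same seed reproduces the rotation
```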

Implementation notes

  • Codebook: Computed once per (dimension, bit_width) pair and cached via @lru_cache; each codebook uses an 8193-point density grid and 96 Lloyd-Max iterations.
  • Rotation: QR decomposition of Gaussian matrix with sign correction for uniqueness.
  • Packing: Indices are bit-packed into uint8 tensors for storage efficiency.
  • Norms: Stored in float16 (trade-off: ~0.1% precision loss vs 2x storage savings).
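The bit-packing note can be illustrated as follows. The nibble layout shown (two 4-bit indices per byte, low nibble first) is an assumption for illustration; the repo's actual layout is whatever load_quantized.py defines:

```python
import torch

def pack4(idx):
    """Pack 4-bit indices two-per-byte (assumed layout: low nibble first)."""
    idx = idx.to(torch.uint8)
    if idx.numel() % 2:                       # pad to an even count
        idx = torch.cat([idx, idx.new_zeros(1)])
    return idx[0::2] | (idx[1::2] << 4)

def unpack4(packed, n):
    """Recover the first n 4-bit indices from the packed bytes."""
    lo, hi = packed & 0x0F, packed >> 4
    return torch.stack([lo, hi], dim=1).flatten()[:n]

idx = torch.randint(0, 16, (7,))
packed = pack4(idx)
print(packed.numel())                         # 4 bytes hold 7 indices
restored = unpack4(packed, idx.numel())
print(torch.equal(restored, idx.to(torch.uint8)))   # True: lossless round trip
```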

Provenance

  • Base model: Qwen/Qwen3.5-0.8B
  • Base license: apache-2.0
  • Quantization method: turboquant_mse (Algorithm 1, arXiv:2504.19874v1)
  • Upload account: lew96123