Qwen3.5-27B-TurboQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using TurboQuant.

This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying TurboQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.

Overview

Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.

TurboQuant 2-bit compresses the KV cache by approximately 8x compared to FP16 (16 bits per stored value reduced to 2), dramatically reducing memory usage for long-context inference. At 2-bit precision this is aggressive quantization: expect some quality degradation compared to 4-bit, but it enables inference in memory-constrained environments where a 4-bit KV cache would not fit.

Specifications

Base model: Qwen/Qwen3.5-27B
Parameters: 27B
Architecture: Hybrid Transformer
Native context: 262,144 tokens
Thinking mode: Yes
KV cache method: TurboQuant 2-bit
KV cache compression: ~8x vs. FP16
Weights: Original (FP16/BF16, loaded separately)

Memory Estimates

Model weights (BF16): ~54 GB
KV cache at 128K context, 2-bit: ~1.6 GB
KV cache at 128K context, FP16 baseline: ~12.8 GB
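
These figures follow from the usual KV cache formula: bytes per token = 2 (K and V) x layers x KV heads x head dim x bytes per value. The sketch below reproduces the ballpark numbers; the layer and head dimensions are illustrative placeholders, not the published Qwen3.5-27B configuration, and the 2-bit figure ignores any quantization metadata overhead.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    # K and V tensors per layer, `bits` bits per stored value
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8

# Illustrative dimensions (assumed for the arithmetic, not the official config)
layers, kv_heads, head_dim = 48, 8, 64
ctx = 128 * 1024

print(f"FP16 KV cache @ 128K: {kv_cache_bytes(ctx, layers, kv_heads, head_dim, 16) / 1e9:.1f} GB")  # ~12.9 GB
print(f"2-bit KV cache @ 128K: {kv_cache_bytes(ctx, layers, kv_heads, head_dim, 2) / 1e9:.1f} GB")  # ~1.6 GB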

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "Qwen/Qwen3.5-27B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply 2-bit KV cache compression
cache = TurboQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
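
To sanity-check the savings on a single CUDA device, peak memory can be inspected around generation (illustrative only; with device_map="auto" spanning several GPUs, query each device index separately):

import torch

torch.cuda.reset_peak_memory_stats()
outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    past_key_values=TurboQuantCache(bits=2),  # fresh cache for this request
)
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")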

Quality Notes

  • 2-bit is aggressive quantization. It is best suited for memory-constrained scenarios (e.g., fitting long-context inference on a single GPU).
  • For higher quality with moderate compression, consider 4-bit KV cache variants (see the sketch after this list).
  • Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.
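
A minimal sketch of the 4-bit variant, assuming TurboQuantCache exposes the same bits argument used in the quickstart (not verified against the package):

from turboquant import TurboQuantCache

cache_4bit = TurboQuantCache(bits=4)  # assumed: 4-bit mode via the same constructor argument
outputs = model.generate(inputs, max_new_tokens=2048, past_key_values=cache_4bit)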
