# Qwen3.5-27B-TurboQuant-2bit

2-bit KV cache compression for Qwen/Qwen3.5-27B using TurboQuant.
This is a KV-cache-only repository. It contains no model weight files, only the configuration and model card for applying TurboQuant 2-bit KV cache quantization at runtime on the original Qwen3.5-27B weights.
## Overview
Qwen3.5-27B is a 27B-parameter hybrid transformer with 262K native context and built-in thinking mode (the model generates internal reasoning tokens before answering). Thinking mode makes KV cache compression especially valuable, since the reasoning chain can consume substantial cache memory.
TurboQuant 2-bit compresses the KV cache by approximately 8x compared to FP16, dramatically reducing memory usage for long-context inference. At 2-bit precision this is an aggressive quantization; expect some quality degradation compared to 4-bit, but it enables inference in memory-constrained environments where a 4-bit KV cache would not fit.
## Specifications
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Architecture | Hybrid Transformer |
| Native context | 262,144 tokens |
| Thinking mode | Yes |
| KV cache method | TurboQuant 2-bit |
| KV cache compression | ~8x vs FP16 |
| Weights | Original (FP16/BF16, loaded separately) |
## Memory Estimates
| Component | Estimate |
|---|---|
| Model weights (BF16) | ~54 GB |
| KV cache at 128K context (2-bit) | ~1.6 GB |
| KV cache at 128K context (FP16, baseline) | ~12.8 GB |
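The table values can be sanity-checked with simple arithmetic: the ~8x figure is just the FP16-to-2-bit bit-width ratio, setting aside the small metadata overhead (per-group scales and zero-points) that quantization schemes also store.

```python
# Back-of-envelope check of the memory table above.
fp16_bits_per_elem = 16
tq_bits_per_elem = 2
compression = fp16_bits_per_elem / tq_bits_per_elem   # 8.0

fp16_cache_gb_128k = 12.8   # FP16 baseline at 128K context (from the table)
tq_cache_gb_128k = fp16_cache_gb_128k / compression
print(tq_cache_gb_128k)     # -> 1.6
```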
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "Qwen/Qwen3.5-27B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Apply 2-bit KV cache compression
cache = TurboQuantCache(bits=2)

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=2048,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quality Notes
- 2-bit is aggressive quantization. It is best suited for memory-constrained scenarios (e.g., fitting long-context inference on a single GPU).
- For higher quality with moderate compression, consider 4-bit KV cache variants.
- Thinking mode reasoning quality may be more sensitive to cache quantization since the model relies on cached reasoning tokens for its final answer.
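This card does not describe TurboQuant's internals, but the quality gap between 2-bit and 4-bit can be illustrated generically. The sketch below uses a minimal symmetric absmax round-trip quantizer on Gaussian toy data; the quantizer design and all names are illustrative assumptions, not TurboQuant's actual method.

```python
import random

def fake_quant(x, bits):
    # Symmetric absmax uniform quantizer, round-trip (quantize then dequantize).
    # Generic scheme for illustration only, not TurboQuant's algorithm.
    levels = 2 ** (bits - 1) - 1                 # 1 positive level at 2-bit, 7 at 4-bit
    scale = max(abs(v) for v in x) / levels
    return [round(v / scale) * scale for v in x]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for one KV vector
mse = {bits: sum((a - b) ** 2 for a, b in zip(x, fake_quant(x, bits))) / len(x)
       for bits in (2, 4)}
print(mse)  # the 2-bit reconstruction error is far larger than the 4-bit one
```

With only one positive quantization level at 2-bit, most values collapse toward zero or the absmax, which is why the model card warns about quality degradation relative to 4-bit.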
## See Also
- majentik/Qwen3.5-27B-RotorQuant-2bit – RotorQuant 2-bit KV cache variant
- majentik/Qwen3.5-27B-TurboQuant-MLX-2bit – MLX 2-bit weights + TurboQuant KV cache
- majentik/Qwen3.5-27B-RotorQuant-MLX-2bit – MLX 2-bit weights + RotorQuant KV cache