---
license: apache-2.0
tags:
- polarquant
- gemma4
- claude-opus
- distill
- vision
- multimodal
- quantized
base_model: TeichAI/gemma-4-31B-it-Claude-Opus-Distill
pipeline_tag: image-text-to-text
arxiv: "2603.29078"
---

# Gemma-4-31B-Claude-Opus-PolarQuant-Q5-Vision

**Claude Opus distilled Gemma 4 31B + Vision** on consumer GPUs.

Download: **21.8 GB** (vs 62.5 GB BF16 → 2.9x compression)

| Component | Method | Result |
|---|---|---|
| **Text weights** | PolarQuant Q5 + torchao INT4 | 21.8 GB |
| **Vision encoder** | BF16 (full quality) | included |
| **KV cache** | PolarQuant Q3 (5.3x) | longer context |
| **Reasoning** | Claude Opus 4.6 distilled | high-effort |
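
The headline ratio is easy to sanity-check from the sizes quoted above (a quick sketch, not part of the released tooling):

```python
bf16_gb = 62.5    # original BF16 checkpoint size
quant_gb = 21.8   # PolarQuant Q5 + torchao INT4 download size

compression = bf16_gb / quant_gb
print(f"{compression:.1f}x compression")   # 2.9x compression

# resident VRAM (22.8 GB, see Key Results) vs. a 24 GB card
headroom_gb = 24.0 - 22.8                  # ~1.2 GB of headroom
```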

## Key Results

| Metric | Value |
|---|---|
| **VRAM** | 22.8 GB (streaming loader) |
| **Speed** | ~24.9 tok/s |
| **Download** | 21.8 GB |
| **Vision** | ✅ Golden Gate Bridge |
| **Compression** | 2.9x |
| **Quantized layers** | 602 |

## Charts

![Compression](compression_gemma4.png)
![VRAM](gemma4_vram.png)
![KV Cache](kv_cache_gemma4.png)
![GPU Fit](gpu_fit_gemma4.png)

## GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| **RTX 4090** | 24 GB | ✅ |
| **L4** | 24 GB | ✅ |
| **RTX 5090** | 32 GB | ✅ |
| **A100** | 40-80 GB | ✅ |

## Quick Start

```bash
pip install "polarquant[all]"
polarquant chat TeichAI/gemma-4-31B-it-Claude-Opus-Distill --vision
```

## KV Cache Compression

| Method | Bits | Compression | Max Context (4 GB) |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| PolarQuant Q4 | 4 | 4.0x | 17K |
| **PolarQuant Q3** | **3** | **5.3x** | **22K** |
| PolarQuant Q2 | 2 | 8.0x | 35K |
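
The context columns follow from per-token cache size. A back-of-the-envelope check using the architecture numbers from Technical Details (60 layers, 16 KV heads, head_dim 256; K and V stored per layer) — the table's figures are slightly more conservative, presumably leaving room for quantization scales and metadata:

```python
layers, kv_heads, head_dim = 60, 16, 256    # Gemma 4 31B, per Technical Details

# K and V per layer, 2 bytes per value at FP16
fp16_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # 983,040 B (~0.94 MB)
budget_bytes = 4 * 1024**3                                    # 4 GB cache budget

fp16_ctx = budget_bytes // fp16_bytes_per_token   # ~4.4K tokens -> the table's "4K"
q3_ctx = fp16_ctx * 16 // 3                       # ~23K tokens, near the table's 22K
```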

## Technical Details

- **Architecture**: Gemma 4 (60 layers, 32 attention heads, 16 KV heads, head_dim=256)
- **Hybrid attention**: sliding window (1024) + global attention
- **Weight quantization**: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- **KV cache**: Hadamard rotation (256x256) + Lloyd-Max Q3 + real bit-packing
- **Streaming loader**: per-module INT4 via an nn.Sequential wrapper → fits 24 GB GPUs
- **Base model**: [TeichAI/gemma-4-31B-it-Claude-Opus-Distill](https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill)
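
For intuition, here is a minimal NumPy sketch of the rotate-then-quantize idea — not PolarQuant's actual implementation. It builds an orthonormal Hadamard matrix via Sylvester's construction (so the rotation preserves norms and spreads outliers across a 128x128 block, matching the block size above), then fits a plain 1-D Lloyd-Max codebook at 5 bits:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n = power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lloyd_max(x: np.ndarray, bits: int = 5, iters: int = 30):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update."""
    levels = 2 ** bits
    centroids = np.quantile(x, np.linspace(0, 1, levels))  # quantile initialization
    for _ in range(iters):
        idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            if np.any(idx == k):
                centroids[k] = x[idx == k].mean()
    return centroids, idx

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128))            # stand-in for one weight block
H = hadamard(128)
w_rot = H @ w                                  # rotate: flattens outliers
codebook, idx = lloyd_max(w_rot.ravel(), bits=5)
w_hat = H.T @ codebook[idx].reshape(128, 128)  # dequantize, then undo the rotation
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
```

On Gaussian-like weights the 5-bit codebook keeps the relative reconstruction error at a few percent; the rotation is what makes a single shared codebook viable across the block.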

## Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

[Paper](https://arxiv.org/abs/2603.29078) · [GitHub](https://github.com/caiovicentino/polarengine-vllm) · [PyPI](https://pypi.org/project/polarquant/)