---
license: apache-2.0
tags:
- polarquant
- gemma4
- claude-opus
- distill
- vision
- multimodal
- quantized
base_model: TeichAI/gemma-4-31B-it-Claude-Opus-Distill
pipeline_tag: image-text-to-text
arxiv: "2603.29078"
---
# 🧊 Gemma-4-31B-Claude-Opus-PolarQuant-Q5-Vision

**Claude Opus-distilled Gemma 4 31B with vision**, quantized to fit consumer GPUs.

Download: **21.8 GB** (vs 62.5 GB BF16, a 2.9x compression)

| Component | Method | Result |
|---|---|---|
| **Text weights** | PolarQuant Q5 + torchao INT4 | 21.8 GB |
| **Vision encoder** | BF16 (full quality) | included |
| **KV cache** | PolarQuant Q3 (5.3x) | longer context |
| **Reasoning** | Claude Opus 4.6 distilled | high-effort |

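The headline numbers are internally consistent. A quick back-of-the-envelope check (figures taken from this card, not measured independently; GB is treated as 10^9 bytes):

```python
params_b = 31        # parameter count in billions
bf16_gb = 62.5       # full-precision checkpoint size from this card
download_gb = 21.8   # quantized download size from this card

ratio = bf16_gb / download_gb                 # compression ratio
bits_per_param = download_gb * 8 / params_b   # effective bits per text weight

print(f"compression: {ratio:.1f}x, ~{bits_per_param:.1f} bits/param")
```

The ~5.6 effective bits per parameter is what you would expect from a Q5 codebook plus per-block scales and the BF16 vision encoder.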
## 🎯 Key Results

| Metric | Value |
|---|---|
| **VRAM** | 22.8 GB (streaming loader) |
| **Speed** | ~24.9 tok/s |
| **Download** | 21.8 GB |
| **Vision** | ✅ (Golden Gate Bridge test image) |
| **Compression** | 2.9x |
| **Quantized layers** | 602 |

## 📊 Charts

![Compression](compression.png)
![VRAM](vram_breakdown.png)
![Family](family.png)
![Context](context.png)

## 🏆 GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| **RTX 4090** | 24 GB | ✅ |
| **L4** | 24 GB | ✅ |
| **RTX 5090** | 32 GB | ✅ |
| **A100** | 40-80 GB | ✅ |

## 🚀 Quick Start

```bash
pip install polarquant[all]
polarquant chat TeichAI/gemma-4-31B-it-Claude-Opus-Distill --vision
```

## 🔬 KV Cache Compression

| Method | Bits | Compression | Max context in a 4 GB KV budget |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| PolarQuant Q4 | 4 | 4.0x | 17K |
| **PolarQuant Q3** | **3** | **5.3x** | **22K** |
| PolarQuant Q2 | 2 | 8.0x | 35K |

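The context figures follow from the architecture listed under Technical Details (60 layers, 16 KV heads, head_dim 256, with K and V both cached at the stated bit-width). A minimal sketch of the arithmetic; small differences from the table come from rounding and from quantization metadata (scales) that the sketch ignores:

```python
def max_context_tokens(bits, budget_gb=4.0,
                       layers=60, kv_heads=16, head_dim=256):
    """Tokens that fit in a fixed KV-cache budget at a given bit-width."""
    # K and V each store layers * kv_heads * head_dim values per token
    bytes_per_token = 2 * layers * kv_heads * head_dim * bits / 8
    return int(budget_gb * 1024**3 / bytes_per_token)

for bits in (16, 4, 3, 2):
    print(f"{bits}-bit: ~{max_context_tokens(bits) // 1000}K tokens")
```

At FP16 that is roughly 0.94 MB of cache per token, which is why a 4 GB budget tops out around 4K context.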
## 🔧 Technical Details

- **Architecture**: Gemma 4 (60 layers, 32 attention heads, 16 KV heads, head_dim=256)
- **Hybrid attention**: sliding window (1024) + global attention
- **Weight quantization**: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- **KV cache**: Hadamard rotation (256x256) + Lloyd-Max Q3 + real bit-packing
- **Streaming loader**: per-module INT4 via an nn.Sequential wrapper; fits 24 GB GPUs
- **Base model**: [TeichAI/gemma-4-31B-it-Claude-Opus-Distill](https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill)

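The rotate-then-quantize idea can be illustrated in a few lines. This is only a sketch, not PolarQuant's implementation: it uses a Sylvester-constructed orthonormal Hadamard rotation with a plain uniform absmax quantizer standing in for the Lloyd-Max codebook, and skips bit-packing entirely. The point it demonstrates is that the rotation spreads outliers across each block, so a low-bit quantizer loses little energy:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_quantize_dequantize(w, bits=5, block=128):
    """Rotate 128-wide blocks, quantize uniformly, then undo the rotation."""
    H = hadamard(block) / np.sqrt(block)   # orthonormal: H @ H.T == I
    rows = w.reshape(-1, block) @ H        # rotation smooths out outliers
    qmax = 2 ** (bits - 1) - 1             # e.g. +/-15 levels for 5 bits
    scale = np.abs(rows).max(axis=1, keepdims=True) / qmax
    q = np.round(rows / scale).clip(-qmax - 1, qmax)
    return ((q * scale) @ H.T).reshape(w.shape)

w = np.random.default_rng(0).standard_normal(128 * 32)
w_hat = rotate_quantize_dequantize(w)
rel_err = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
print(f"relative error at 5 bits: {rel_err:.3f}")
```

Because the rotation is orthonormal it is exactly invertible, so the only error introduced is the quantization step itself.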
## 📖 Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

📄 [Paper](https://arxiv.org/abs/2603.29078) · 💻 [GitHub](https://github.com/caiovicentino/polarengine-vllm) · 📦 [PyPI](https://pypi.org/project/polarquant/)