---
license: apache-2.0
tags:
- polarquant
- gemma4
- claude-opus
- distill
- vision
- multimodal
- quantized
base_model: TeichAI/gemma-4-31B-it-Claude-Opus-Distill
pipeline_tag: image-text-to-text
arxiv: "2603.29078"
---
# 🧊 Gemma-4-31B-Claude-Opus-PolarQuant-Q5-Vision

**Claude Opus-distilled Gemma 4 31B with vision**, quantized to fit consumer GPUs.

Download: **21.8 GB** (vs 62.5 GB BF16, a 2.9x compression)

| Component | Method | Result |
|---|---|---|
| **Text weights** | PolarQuant Q5 + torchao INT4 | 21.8 GB |
| **Vision encoder** | BF16 (full quality) | included |
| **KV cache** | PolarQuant Q3 (5.3x) | longer context |
| **Reasoning** | Claude Opus 4.6 distilled | high-effort |

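The headline numbers are internally consistent. A quick back-of-the-envelope check (figures taken from this card, not measured independently; GB is treated as 10^9 bytes):

```python
params_b = 31        # parameter count in billions
bf16_gb = 62.5       # full-precision checkpoint size from this card
download_gb = 21.8   # quantized download size from this card

ratio = bf16_gb / download_gb                 # compression ratio
bits_per_param = download_gb * 8 / params_b   # effective bits per text weight

print(f"compression: {ratio:.1f}x, ~{bits_per_param:.1f} bits/param")
```

The ~5.6 effective bits per parameter is what you would expect from a Q5 codebook plus per-block scales and the BF16 vision encoder.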
## 🎯 Key Results

| Metric | Value |
|---|---|
| **VRAM** | 22.8 GB (streaming loader) |
| **Speed** | ~24.9 tok/s |
| **Download** | 21.8 GB |
| **Vision** | ✅ (Golden Gate Bridge test image) |
| **Compression** | 2.9x |
| **Quantized layers** | 602 |

## 📊 Charts

![Compression](compression.png)
![VRAM](vram_breakdown.png)
![Family](family.png)
![Context](context.png)

## 🏆 GPU Support

| GPU | VRAM | Fits? |
|---|---|---|
| **RTX 4090** | 24 GB | ✅ |
| **L4** | 24 GB | ✅ |
| **RTX 5090** | 32 GB | ✅ |
| **A100** | 40-80 GB | ✅ |

## 🚀 Quick Start

```bash
pip install polarquant[all]
polarquant chat TeichAI/gemma-4-31B-it-Claude-Opus-Distill --vision
```

## 🔬 KV Cache Compression

| Method | Bits | Compression | Max context in a 4 GB KV budget |
|---|---|---|---|
| FP16 | 16 | 1.0x | 4K |
| PolarQuant Q4 | 4 | 4.0x | 17K |
| **PolarQuant Q3** | **3** | **5.3x** | **22K** |
| PolarQuant Q2 | 2 | 8.0x | 35K |

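The context figures follow from the architecture listed under Technical Details (60 layers, 16 KV heads, head_dim 256, with K and V both cached at the stated bit-width). A minimal sketch of the arithmetic; small differences from the table come from rounding and from quantization metadata (scales) that the sketch ignores:

```python
def max_context_tokens(bits, budget_gb=4.0,
                       layers=60, kv_heads=16, head_dim=256):
    """Tokens that fit in a fixed KV-cache budget at a given bit-width."""
    # K and V each store layers * kv_heads * head_dim values per token
    bytes_per_token = 2 * layers * kv_heads * head_dim * bits / 8
    return int(budget_gb * 1024**3 / bytes_per_token)

for bits in (16, 4, 3, 2):
    print(f"{bits}-bit: ~{max_context_tokens(bits) // 1000}K tokens")
```

At FP16 that is roughly 0.94 MB of cache per token, which is why a 4 GB budget tops out around 4K context.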
## 🔧 Technical Details

- **Architecture**: Gemma 4 (60 layers, 32 attention heads, 16 KV heads, head_dim=256)
- **Hybrid attention**: sliding window (1024) + global attention
- **Weight quantization**: Hadamard rotation (128x128) + Lloyd-Max Q5 + torchao INT4
- **KV cache**: Hadamard rotation (256x256) + Lloyd-Max Q3 + real bit-packing
- **Streaming loader**: per-module INT4 via an nn.Sequential wrapper; fits 24 GB GPUs
- **Base model**: [TeichAI/gemma-4-31B-it-Claude-Opus-Distill](https://huggingface.co/TeichAI/gemma-4-31B-it-Claude-Opus-Distill)

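The rotate-then-quantize idea can be illustrated in a few lines. This is only a sketch, not PolarQuant's implementation: it uses a Sylvester-constructed orthonormal Hadamard rotation with a plain uniform absmax quantizer standing in for the Lloyd-Max codebook, and skips bit-packing entirely. The point it demonstrates is that the rotation spreads outliers across each block, so a low-bit quantizer loses little energy:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_quantize_dequantize(w, bits=5, block=128):
    """Rotate 128-wide blocks, quantize uniformly, then undo the rotation."""
    H = hadamard(block) / np.sqrt(block)   # orthonormal: H @ H.T == I
    rows = w.reshape(-1, block) @ H        # rotation smooths out outliers
    qmax = 2 ** (bits - 1) - 1             # e.g. +/-15 levels for 5 bits
    scale = np.abs(rows).max(axis=1, keepdims=True) / qmax
    q = np.round(rows / scale).clip(-qmax - 1, qmax)
    return ((q * scale) @ H.T).reshape(w.shape)

w = np.random.default_rng(0).standard_normal(128 * 32)
w_hat = rotate_quantize_dequantize(w)
rel_err = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
print(f"relative error at 5 bits: {rel_err:.3f}")
```

Because the rotation is orthonormal it is exactly invertible, so the only error introduced is the quantization step itself.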
## 📖 Citation

```bibtex
@article{polarquant2025,
  title={PolarQuant: Hadamard-Rotated Lloyd-Max Quantization for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2025}
}
```

📄 [Paper](https://arxiv.org/abs/2603.29078) · 💻 [GitHub](https://github.com/caiovicentino/polarengine-vllm) · 📦 [PyPI](https://pypi.org/project/polarquant/)