---
library_name: gguf
base_model: google/gemma-4-31B
tags:
- gguf
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- dense
- multimodal
- llama-cpp
- quantized
license: apache-2.0
---

# gemma-4-31B-RotorQuant-GGUF-Q4_K_M

GGUF Q4_K_M weight-quantized variant of [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) with **RotorQuant** KV cache compression for efficient inference with llama.cpp, Ollama, and LM Studio.

## Overview

This model combines two compression techniques:

- **GGUF Q4_K_M weight quantization** — reduces the model size from ~62 GB to ~18 GB
- **RotorQuant KV cache compression** — block-diagonal rotations (Clifford algebra) enable a 3-bit KV cache and 5.3x faster prefill

## Quickstart

### llama.cpp

```bash
llama-cli -m gemma-4-31B-RotorQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k planar3 --cache-type-v iso3 \
  -p "Explain quantum computing"
```

### Ollama

```bash
ollama run majentik/gemma-4-31B-RotorQuant-GGUF-Q4_K_M
```

### LM Studio

Download the GGUF file and load it in LM Studio. Enable RotorQuant KV cache in the advanced settings.

## Specifications

| Property | Value |
|----------|-------|
| Base Model | google/gemma-4-31B |
| Parameters | 31B dense |
| Weight Quantization | GGUF Q4_K_M |
| KV Cache | RotorQuant 3-bit (planar/iso) |
| File Size | ~18 GB |
| License | Apache 2.0 |
| Compatible | llama.cpp, Ollama, LM Studio, koboldcpp |

## What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) to compress the KV cache. When used with llama.cpp's `--cache-type-k planar3 --cache-type-v iso3` flags:

| Metric | RotorQuant | TurboQuant |
|--------|-----------|------------|
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Decode Speed | 119 tok/s | 93 tok/s |
| Perplexity | 6.91 | 7.07 |

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [Base model](https://huggingface.co/google/gemma-4-31B)
- [MLX variants](https://huggingface.co/majentik/gemma-4-31B-RotorQuant-MLX-4bit)
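## Appendix: rotate-then-quantize, sketched

The "What is RotorQuant?" section above describes block-diagonal rotations followed by 3-bit KV cache quantization. The sketch below is a minimal NumPy illustration of that general idea, not RotorQuant's actual implementation: the random rotation angles, 2x2 block size, per-tensor scale, and helper names (`block_diagonal_rotation`, `quantize_3bit`) are all illustrative assumptions. Because the rotation matrix is orthogonal, it can be undone exactly on read, and only the low-bit codes plus a scale need to be stored.

```python
import numpy as np

def block_diagonal_rotation(head_dim: int, seed: int = 0) -> np.ndarray:
    """Build a block-diagonal rotation from independent 2x2 plane rotors.

    Each 2x2 block is a rotation, so the full matrix is orthogonal:
    applying it before quantization preserves norms and inner products.
    """
    rng = np.random.default_rng(seed)
    R = np.zeros((head_dim, head_dim))
    for i in range(0, head_dim, 2):
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        R[i, i], R[i, i + 1] = c, -s
        R[i + 1, i], R[i + 1, i + 1] = s, c
    return R

def quantize_3bit(x: np.ndarray):
    """Symmetric per-tensor 3-bit quantization: 8 integer levels in [-4, 3]."""
    scale = np.abs(x).max() / 4.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float64) * scale

# Round-trip a toy "K cache": rotate, quantize to 3 bits, dequantize, rotate back.
rng = np.random.default_rng(42)
K = rng.normal(size=(16, 8))           # 16 cached keys, head_dim = 8
R = block_diagonal_rotation(8)
q, scale = quantize_3bit(K @ R)        # store only int8 codes plus one scale
K_hat = dequantize(q, scale) @ R.T     # undo the rotation on read
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative reconstruction error: {rel_err:.3f}")
```

A production scheme would choose rotations to decorrelate the cache (rather than sampling them at random) and quantize per block or per channel, but the store/read structure is the same: codes, scales, and a fixed rotation.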