# gemma-4-31B-RotorQuant-GGUF-Q4_K_M

A GGUF Q4_K_M weight-quantized variant of google/gemma-4-31B with RotorQuant KV cache compression, for efficient inference with llama.cpp, Ollama, and LM Studio.
## Overview
This model combines two compression techniques:
- GGUF Q4_K_M weight quantization – reduces model size from ~62 GB to ~18 GB
- RotorQuant KV cache compression – block-diagonal rotations (Clifford algebra) enable a 3-bit KV cache and 5.3x faster prefill
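As a rough illustration of what a 3-bit KV cache saves over an fp16 one (16 bits vs. 3 bits per value is a 16/3 ≈ 5.3x memory reduction), here is a back-of-the-envelope sizing. The layer and head dimensions below are hypothetical placeholders, not the actual gemma-4-31B configuration:

```python
# Back-of-the-envelope KV cache sizing.
# NOTE: n_layers, n_kv_heads, head_dim are HYPOTHETICAL placeholders,
# not the real gemma-4-31B configuration.
n_layers, n_kv_heads, head_dim = 48, 8, 128
ctx_len = 8192

# K and V each store n_layers * n_kv_heads * head_dim values per token.
elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len

fp16_gib = elems * 16 / 8 / 2**30  # 16 bits per value
q3_gib = elems * 3 / 8 / 2**30     # 3 bits per value
print(f"fp16 KV cache:  {fp16_gib:.2f} GiB")
print(f"3-bit KV cache: {q3_gib:.2f} GiB ({fp16_gib / q3_gib:.2f}x smaller)")
```

Whatever the true model dimensions, the ratio between the two cache sizes is fixed at 16/3 by the bit widths alone.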
## Quickstart

### llama.cpp

```shell
llama-cli -m gemma-4-31B-RotorQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k planar3 --cache-type-v iso3 \
  -p "Explain quantum computing"
```
### Ollama

```shell
ollama run majentik/gemma-4-31B-RotorQuant-GGUF-Q4_K_M
```
### LM Studio

Download the GGUF file and load it in LM Studio. Enable RotorQuant KV cache in the advanced settings.
## Specifications
| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B |
| Parameters | 31B dense |
| Weight Quantization | GGUF Q4_K_M |
| KV Cache | RotorQuant 3-bit (planar/iso) |
| File Size | ~18 GB |
| License | Apache 2.0 |
| Compatibility | llama.cpp, Ollama, LM Studio, koboldcpp |
## What is RotorQuant?

RotorQuant applies block-diagonal rotations (Clifford algebra) to compress the KV cache. Benchmarks when used with llama.cpp's `--cache-type-k planar3 --cache-type-v iso3` flags:
| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Decode Speed | 119 tok/s | 93 tok/s |
| Perplexity | 6.91 | 7.07 |
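RotorQuant's actual rotors are not published here, but the general rotate-then-quantize recipe can be sketched in a few lines. This is an illustrative toy: random orthogonal blocks stand in for whatever rotations RotorQuant learns, and plain uniform 3-bit quantization stands in for its codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_rotations(dim, block=8):
    """Random orthogonal blocks (stand-ins for RotorQuant's learned rotors)."""
    blocks = []
    for _ in range(dim // block):
        q, _ = np.linalg.qr(rng.normal(size=(block, block)))
        blocks.append(q)
    return blocks

def apply_blockdiag(blocks, x, transpose=False):
    """Multiply x by the block-diagonal rotation (transpose=True inverts it)."""
    parts = np.split(x, len(blocks))
    out = [(b.T if transpose else b) @ p for b, p in zip(blocks, parts)]
    return np.concatenate(out)

def quantize_3bit(x):
    """Uniform 3-bit quantization: 8 levels over the observed range."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 7
    codes = np.round((x - lo) / scale).astype(np.uint8)  # values 0..7
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

dim = 64
k = rng.normal(size=dim)            # one key vector from the KV cache
R = block_rotations(dim)
rot = apply_blockdiag(R, k)         # rotate into a quantization-friendly basis
codes, lo, scale = quantize_3bit(rot)
k_hat = apply_blockdiag(R, dequantize(codes, lo, scale), transpose=True)
print("max reconstruction error:", np.max(np.abs(k - k_hat)))
```

Because each block is orthogonal, the rotation is lossless and cheap to invert; all of the reconstruction error comes from the 3-bit quantization step.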