---
library_name: gguf
base_model: google/gemma-4-31B
tags:
- gguf
- rotorquant
- kv-cache-quantization
- gemma
- gemma4
- dense
- multimodal
- llama-cpp
- quantized
license: apache-2.0
---
# gemma-4-31B-RotorQuant-GGUF-Q4_K_M
GGUF Q4_K_M weight-quantized variant of [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) with **RotorQuant** KV cache compression for efficient inference with llama.cpp, Ollama, and LM Studio.
## Overview
This model combines two compression techniques:
- **GGUF Q4_K_M weight quantization** — reduces the on-disk size from ~62 GB to ~18 GB
- **RotorQuant KV cache compression** — block-diagonal rotations (Clifford algebra) enable a 3-bit KV cache with 5.3x faster prefill than TurboQuant (see the benchmarks below)
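The quoted sizes can be sanity-checked with a back-of-envelope calculation. This sketch assumes 16 bits/weight for the original model and ~4.85 bits/weight as the effective average for Q4_K_M (an assumption based on Q4_K_M mixing 4-bit and 6-bit blocks, not an official figure):

```python
# Back-of-envelope size estimate for a 31B-parameter model.
# The ~4.85 bits/weight average for Q4_K_M is an ASSUMPTION,
# not an official specification.
PARAMS = 31e9

def model_size_gb(bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (10^9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

fp16_gb = model_size_gb(16)      # ~62 GB, matching the card
q4_k_m_gb = model_size_gb(4.85)  # ~18.8 GB, close to the quoted ~18 GB
print(f"fp16: {fp16_gb:.0f} GB, Q4_K_M: {q4_k_m_gb:.1f} GB")
```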
## Quickstart
### llama.cpp
```bash
llama-cli -m gemma-4-31B-RotorQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k planar3 --cache-type-v iso3 \
  -p "Explain quantum computing"
```
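To see what a 3-bit KV cache buys you, here is a rough per-token footprint comparison against fp16. The layer count and KV head dimensions below are hypothetical placeholders (the card does not state the model's architecture); only the 16/3 ratio carries over regardless of the actual values:

```python
# Rough per-token KV cache footprint, 16-bit vs 3-bit entries.
# Layer count and KV dimensions are HYPOTHETICAL placeholders;
# the card does not state the model's actual architecture.
N_LAYERS = 48    # assumed
N_KV_HEADS = 8   # assumed (grouped-query attention)
HEAD_DIM = 128   # assumed

def kv_bytes_per_token(bits: float) -> float:
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bits / 8

fp16 = kv_bytes_per_token(16)
rotor = kv_bytes_per_token(3)
print(f"fp16: {fp16 / 1024:.0f} KiB/token, 3-bit: {rotor / 1024:.0f} KiB/token "
      f"({fp16 / rotor:.1f}x smaller)")
```

Whatever the real dimensions, moving from 16-bit to 3-bit entries shrinks the cache by 16/3 ≈ 5.3x, which directly extends how much context fits in a fixed memory budget.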
### Ollama
```bash
ollama run majentik/gemma-4-31B-RotorQuant-GGUF-Q4_K_M
```
### LM Studio
Download the GGUF file and load it in LM Studio, then enable the RotorQuant KV cache under advanced settings.
## Specifications
| Property | Value |
|----------|-------|
| Base Model | google/gemma-4-31B |
| Parameters | 31B dense |
| Weight Quantization | GGUF Q4_K_M |
| KV Cache | RotorQuant 3-bit (planar/iso) |
| File Size | ~18 GB |
| License | Apache 2.0 |
| Compatible With | llama.cpp, Ollama, LM Studio, koboldcpp |
## What is RotorQuant?
RotorQuant applies block-diagonal rotations (Clifford algebra rotors) to the KV cache before quantizing it to 3 bits. With llama.cpp's `--cache-type-k planar3 --cache-type-v iso3` flags, it compares to TurboQuant as follows:
| Metric | RotorQuant | TurboQuant |
|--------|-----------|-----------|
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Decode Speed | 119 tok/s | 93 tok/s |
| Perplexity | 6.91 | 7.07 |
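The rotate-then-quantize idea can be illustrated with the simplest possible rotor: the same 2x2 rotation applied to each consecutive pair of channels, forming a block-diagonal rotation. This is a conceptual sketch only; the rotation angle, 3-bit level grid, and scale choice below are illustrative assumptions, not RotorQuant's actual codebook:

```python
# Conceptual sketch: block-diagonal rotation followed by 3-bit
# quantization. All constants here are ILLUSTRATIVE, not the
# actual RotorQuant scheme.
import math

def rotate_pairs(vec, theta):
    """Apply the same 2x2 rotation to each consecutive channel pair
    (a block-diagonal rotation with 2x2 blocks)."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out

def quant3(vec):
    """Symmetric 3-bit quantization: integer levels in [-4, 3]."""
    scale = max(abs(v) for v in vec) / 3.5 or 1.0
    q = [max(-4, min(3, round(v / scale))) for v in vec]
    return q, scale

def dequant(q, scale):
    return [v * scale for v in q]

# Rotate, quantize, dequantize, then undo the rotation.
vec = [1.0, 0.2, -0.7, 0.3]
theta = math.pi / 4
q, scale = quant3(rotate_pairs(vec, theta))
rec = rotate_pairs(dequant(q, scale), -theta)
err = max(abs(a - b) for a, b in zip(vec, rec))
```

Because a rotation is orthogonal, it is exactly invertible at dequantization time, so the only information lost is the 3-bit rounding itself; a well-chosen rotation spreads outliers across channels so that rounding error is smaller.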
## See Also
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [Base model](https://huggingface.co/google/gemma-4-31B)
- [MLX variants](https://huggingface.co/majentik/gemma-4-31B-RotorQuant-MLX-4bit)