---
license: other
license_name: minimax-model-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
tags:
- rotorquant
- kv-cache-quantization
- minimax
- m2.7
- moe
- quantized
library_name: transformers
pipeline_tag: text-generation
language:
- en
inference: false
---

# MiniMax-M2.7-RotorQuant

**RotorQuant KV cache compression** for [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).

This is a **documentation repository** that explains how to combine MiniMax-M2.7's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |

Both can be combined for maximum efficiency.
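For intuition on why this matters at long context, here is a back-of-envelope sketch of KV cache size at different bit widths. The layer/head counts below are illustrative assumptions, not MiniMax-M2.7's actual configuration:

```python
# Rough KV cache size estimate. The config values here are
# placeholders for illustration, not MiniMax-M2.7's real geometry.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits):
    # 2x accounts for storing both keys and values per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

bf16 = kv_cache_bytes(62, 8, 128, 131072, bits=16)
q4 = kv_cache_bytes(62, 8, 128, 131072, bits=4)
print(f"bf16: {bf16 / 2**30:.2f} GiB, 4-bit: {q4 / 2**30:.2f} GiB")
# → bf16: 31.00 GiB, 4-bit: 7.75 GiB
```

At a 128K-token sequence the cache alone runs into tens of gigabytes at bf16, which is why 4x compression of the cache can matter more than further weight quantization.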
## Quickstart

### Option A — Python / transformers

Install the `rotorquant` package:

```bash
pip install rotorquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2.7", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-M2.7",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply RotorQuant to the KV cache
cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Option B — llama.cpp / LM Studio / Ollama (with fork)

RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:

- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

Once built:

```bash
llama-cli -m MiniMax-M2.7.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.

## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) |
| Architecture | Sparse MoE (256 experts, 8 active) |
| Parameters | ~456B total (MoE) |
| Context Length | 256K |
| BF16 Size | ~912 GB |
| Modalities | Text |
| License | other |

## What is RotorQuant?
[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, positioned as a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet), achieving O(d) complexity instead of O(d log d), and is fully parallelisable with no inter-element dependencies.

**Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on an RTX 5090; results vary by model and hardware):

- Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
- Decode: 119 tok/s (vs TurboQuant 93 tok/s)
- Perplexity: 6.91 (vs TurboQuant 7.07)
- Parameters: 4 per rotor (vs TurboQuant 16,384)

> Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on MiniMax-M2.7 will differ. Please open a discussion if you have independent results.

## Current Ecosystem Support

| Runtime | RotorQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | — |
| koboldcpp | ❌ Not supported | — |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=MiniMax-M2.7+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=MiniMax-M2.7+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)
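
The "What is RotorQuant?" section above describes independent per-pair 2D rotations. A minimal, hypothetical sketch of that block-diagonal idea in plain Python (for illustration only; this is not the `rotorquant` package's actual API, and the angles would come from a calibration step):

```python
import math

# Hypothetical sketch: rotate each consecutive pair of elements by its
# own angle. Each pair is independent, so the d/2 rotations can run in
# parallel -- this is the O(d) property with no inter-element dependencies.
def rotate_pairs(x, angles):
    assert len(x) == 2 * len(angles)
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * a - s * b, s * a + c * b]
    return out

def rotate_pairs_inverse(x, angles):
    # Rotations are orthogonal, so the inverse just negates the angles
    return rotate_pairs(x, [-t for t in angles])

angles = [math.pi / 4, math.pi / 3]
v = [1.0, 0.0, 0.5, -0.5]
w = rotate_pairs(v, angles)            # rotate before quantizing the cache
restored = rotate_pairs_inverse(w, angles)  # undo after dequantizing
```

Because each 2D rotation is orthogonal, the transform preserves vector norms, and with only one angle (here; the repository reports 4 parameters per rotor) per block the parameter count stays tiny compared to a full d-by-d rotation matrix.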