---
license: other
license_name: minimax-model-license
license_link: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
tags:
- rotorquant
- kv-cache-quantization
- minimax
- m2.7
- moe
- quantized
library_name: transformers
pipeline_tag: text-generation
---

# MiniMax-M2.7-RotorQuant

**RotorQuant KV cache compression** for [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7).

This is a **documentation repository** that explains how to combine MiniMax-M2.7's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or the llama.cpp fork.

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| Any host that runs the base model | Base model footprint, less the KV cache savings at runtime | RotorQuant is a runtime KV cache modifier; pair it with any weight variant |

## What is this?

KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

| Technique | Where it's applied | Savings |
|-----------|-------------------|---------|
| Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
| **RotorQuant KV cache** | At inference time | Reduces attention (KV cache) memory, which matters most at long context |

Both can be combined for maximum efficiency.
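
To see why the cache matters at long context, here is a rough size estimate. It is a sketch under stated assumptions: the layer count, KV head count, and head dimension are hypothetical placeholders (not MiniMax-M2.7's actual configuration), and the formula ignores the scales or other metadata that a real quantized cache also stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size  # 2 = K and V
    return values * bits_per_value / 8

# Hypothetical architecture, NOT MiniMax-M2.7's real configuration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

for label, bits in [("bf16", 16), ("4-bit", 4), ("3-bit", 3)]:
    gib = kv_cache_bytes(bits_per_value=bits, **cfg) / 2**30
    print(f"{label:>5}: {gib:.1f} GiB")
```

The estimate scales linearly with sequence length and bit width, so the cache only becomes significant at long context, while weight memory stays fixed.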

## Quickstart

### Option A: Python / transformers

Install the `rotorquant` package:

```bash
pip install rotorquant
```

Then use it with the base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from rotorquant import IsoQuantCache

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2.7", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-M2.7",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply RotorQuant to the KV cache
cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    past_key_values=cache,
    use_cache=True,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### Option B: llama.cpp / LM Studio / Ollama (with fork)

RotorQuant KV cache types (`planar3`, `iso3`) are **not** in upstream llama.cpp. They require a fork:
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)

Once built:

```bash
llama-cli -m MiniMax-M2.7.gguf \
  --cache-type-k iso3 --cache-type-v iso3 \
  -ngl 99 -fa \
  -p "Hello"
```

For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.

## Model Specifications

| Property | Value |
|----------|-------|
| Base Model | [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) |
| Architecture | Sparse MoE (256 experts, 8 active) |
| Parameters | ~456B total (MoE) |
| Context Length | 256K |
| BF16 Size | ~912 GB |
| Modalities | Text |
| License | [MiniMax Model License](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE) (`other`) |
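
The BF16 size follows directly from the parameter count at 2 bytes per weight. The short calculation below reproduces it and shows what the same count would occupy at lower bit widths. The 8-bit and 4-bit rows are plain arithmetic that ignores quantization metadata, not measured file sizes.

```python
def weight_gb(total_params, bits_per_weight):
    """Weight memory in decimal gigabytes, ignoring quantization metadata."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 456e9  # "~456B total" from the table above

for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB")
```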

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors, a faster, more parameter-efficient alternative to Google's TurboQuant. It uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet of elements), which cost O(d) instead of O(d log d) and are fully parallelisable because there are no inter-element dependencies.
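
The block-diagonal idea can be illustrated with a toy example. This is a minimal sketch, not RotorQuant's actual algorithm: it uses plain 2D rotations with arbitrary hand-picked angles instead of Clifford rotors, and a simple uniform quantizer. The helper names (`rotate_pairs`, `quantize`, `dequantize`) are hypothetical and are not part of the `rotorquant` package. It shows only that rotating each pair independently touches every element once (O(d)) and that the rotation can be undone exactly after dequantization.

```python
import math

def rotate_pairs(vec, angles, inverse=False):
    """Apply an independent 2D rotation to each consecutive pair of
    elements. Each element is touched once, so the cost is O(d)."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [c * x - s * y, s * x + c * y]
    return out

def quantize(vec, bits):
    """Uniform symmetric quantization: signed integer codes plus one scale."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in vec) / levels or 1.0  # avoid a zero scale
    return [round(v / scale) for v in vec], scale

def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]

key = [0.9, -0.1, 0.05, 0.4, -0.7, 0.2, 0.3, -0.3]  # toy 8-dim key vector
angles = [0.3, 1.1, -0.6, 2.0]                      # one arbitrary angle per pair

rotated = rotate_pairs(key, angles)
codes, scale = quantize(rotated, bits=4)
restored = rotate_pairs(dequantize(codes, scale), angles, inverse=True)

max_err = max(abs(a - b) for a, b in zip(key, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each pair is rotated on its own, the pairs can be processed in parallel, which is the property the paragraph above refers to as having no inter-element dependencies.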

**Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090):

- Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
- Decode: 119 tok/s (vs TurboQuant 93 tok/s)
- Perplexity: 6.91 (vs TurboQuant 7.07)
- Parameters: 4 per rotor (vs TurboQuant 16,384)

> These numbers were not measured on MiniMax-M2.7, and results vary by model and hardware. Please open a discussion if you have independent results.

## Current Ecosystem Support

| Runtime | RotorQuant Support | Notes |
|---------|----------------------|-------|
| Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
| llama.cpp upstream | ❌ Not merged | Use fork below |
| llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as an alternative |
| Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
| vLLM | ❌ Not supported | - |
| koboldcpp | ❌ Not supported | - |

## Pre-quantized weight variants

If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:

- [MLX (Apple Silicon)](https://huggingface.co/majentik?search=MiniMax-M2.7+MLX)
- [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=MiniMax-M2.7+GGUF)

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [Base model: MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7)