## How to use from vLLM

Install vLLM from pip and serve the model:

```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "krampenschiesser/MiniMax-M2.7-GGUF"
```

Call the server using curl (OpenAI-compatible API):

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "krampenschiesser/MiniMax-M2.7-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
```
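
A quick way to check that the server is up before sending chat requests is to list the models it exposes (part of the same OpenAI-compatible API); the repo id above should appear in the response:

```bash
# List the models served by vLLM; expect "krampenschiesser/MiniMax-M2.7-GGUF" in the output.
curl http://localhost:8000/v1/models
```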

## Use Docker

```bash
docker model run hf.co/krampenschiesser/MiniMax-M2.7-GGUF:NVFP4
```
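
If you prefer running the GGUF files directly with llama.cpp, here is a minimal sketch; it assumes a recent llama.cpp build with Hugging Face Hub download support via `-hf`, and the `:IQ4XS` quant tag is an assumption (pick whichever quant from the table below fits your memory):

```bash
# Download the chosen quant from the Hub and serve it with an OpenAI-compatible API on port 8080.
llama-server -hf krampenschiesser/MiniMax-M2.7-GGUF:IQ4XS --port 8080
```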

## Quick Links

These are some quants I use depending on memory availability. I also added NVFP4 in the hope that custom kernels will emerge in the future. I recommend the Q3K-IQ4XS and IQ4XS-Q5K quants.
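
To pull just one quant instead of the whole repository, a sketch using huggingface-cli (the `--include` glob is an assumption about the file naming; adjust it to the actual file names in the repo):

```bash
# Download only the NVFP4 files into a local folder (glob pattern is an assumption).
huggingface-cli download krampenschiesser/MiniMax-M2.7-GGUF \
  --include "*NVFP4*" --local-dir ./MiniMax-M2.7-GGUF
```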

## KLD

Due to hardware restrictions I have to use the Q8 version as the baseline for the KLD runs. However, it is quantized the same way as the original model, which also uses 8 bits for the expert weights, so the difference should not be big.

Sadly I am getting weird outputs (NaN floats from llama-perplexity) on some KLD runs, so take these numbers with a grain of salt (or rather, a salt lake).
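
The numbers below come from llama.cpp's llama-perplexity in KL-divergence mode. A sketch of the two-step workflow (model and file names are placeholders, and the calibration text is whatever corpus you run it over):

```bash
# 1) Run the Q8 baseline over a calibration text and save its logits.
llama-perplexity -m MiniMax-M2-Q8.gguf -f calibration.txt --kl-divergence-base logits-q8.kld

# 2) Compare a smaller quant against the saved baseline logits (reports Mean PPL, Mean KLD, Same top p).
llama-perplexity -m MiniMax-M2-IQ4XS.gguf --kl-divergence-base logits-q8.kld --kl-divergence
```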

| Provider | Quant      | Size (GB) | Mean PPL             | Mean KLD             | Same top p        |
|----------|------------|-----------|----------------------|----------------------|-------------------|
| KS       | Q8         |           | 7.0266 ± 0.05210     | baseline             | baseline          |
| KS       | IQ4XS-Q5K  | 135.5     |                      |                      | 90.720 ± 0.077 %  |
| KS       | IQ4XS      | 123.8     | 7.153799 ± 0.053213  | 0.086127 ± 0.001029  | 89.425 ± 0.082 %  |
| KS       | IQ4XS-Q4K  | 126.1     |                      |                      | 89.205 ± 0.083 %  |
| KS       | NVFP4      | 130.8     | 7.177182 ± 0.053324  | 0.105053 ± 0.001034  | 88.154 ± 0.086 %  |
| unsloth  | UD-Q4_K_XL | 141       |                      |                      | 86.990 ± 0.090 %  |
| KS       | Q3K-IQ4XS  | 108.6     | 7.297092 ± 0.054489  | 0.140361 ± 0.001216  | 86.387 ± 0.091 %  |