# MiniMax-M2.7-NVFP4-GB10
Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=2) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 130.6 GB on disk, down from the 230.2 GB official FP8 release.
## Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, 2 active per token) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 130.6 GB |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Other (inherited from MiniMaxAI/MiniMax-M2.7) |
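The on-disk size can be sanity-checked with back-of-envelope arithmetic (a sketch only; the exact breakdown depends on the NVFP4 block-scale layout and the BF16-preserved tensors):

```python
# Effective bits per parameter for this checkpoint, from the table above.
# Block scales and the BF16 ignore list explain why it lands above a flat 4 bits.
total_params = 230e9     # 230B total parameters
size_bytes = 130.6e9     # 130.6 GB on disk (decimal GB assumed)

bits_per_param = size_bytes * 8 / total_params
print(f"{bits_per_param:.2f} bits/param")  # ~4.54
```

For comparison, the 230.2 GB official FP8 release works out to ~8 bits/param, so this checkpoint is roughly a 1.76× reduction.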
## Quantization Details

- Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (`nvidia-modelopt`)
- Scheme: `mtq.NVFP4_DEFAULT_CFG` + GB10-tuned disable list applied post-calibration
- Calibration Dataset: HuggingFaceH4/ultrachat_200k (`train_sft` split)
- Calibration Samples: 64
- Max Sequence Length: 2048 tokens
- Preserved in BF16 (ignore list): `lm_head`, `*block_sparse_moe.gate` (the MoE router gate, not per-expert gates)
- Hardware Used: Hugging Face Jobs, 8× NVIDIA A100 80 GB
- Recipe script: `quantize-nvfp4-gb10.py` (env-var-configurable; see the file header for the MoE-expert amax gotcha and how to adapt it for other architectures)
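The router-gate distinction in the ignore list matters: the glob must match the per-layer MoE router but not the per-expert projections. A minimal check of the pattern semantics (module names below are illustrative, not the exact MiniMax module tree):

```python
import fnmatch

# Hypothetical module names in the style of a MoE decoder layer.
names = [
    "model.layers.0.block_sparse_moe.gate",          # router gate -> keep BF16
    "model.layers.0.block_sparse_moe.experts.3.w1",  # expert weight -> quantize
    "lm_head",                                       # output head -> keep BF16
]
ignore = ["lm_head", "*block_sparse_moe.gate"]

kept_bf16 = [n for n in names if any(fnmatch.fnmatch(n, pat) for pat in ignore)]
print(kept_bf16)
# ['model.layers.0.block_sparse_moe.gate', 'lm_head']
```

Note that `*block_sparse_moe.gate` ends at `.gate`, so expert sub-modules under `experts.*` never match and stay quantized.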
## Performance (2× NVIDIA DGX Spark, GB10, 128 GB each)

Benchmarked with `llama-benchy` 0.3.3, 3 runs per scenario, TP=2 over ConnectX-7 RoCE, 192K context window.

| PP (tokens) | TG (tokens) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---|---|---|---|---|
| 512 | 128 | 1,436 | 26.2 | 461 |
| 512 | 256 | 1,737 | 26.9 | 381 |
| 1024 | 128 | 2,878 | 26.9 | 443 |
| 1024 | 256 | 3,004 | 26.6 | 427 |
| 2048 | 128 | 3,637 | 26.3 | 663 |
| 2048 | 256 | 4,041 | 26.1 | 593 |
| 4096 | 128 | 3,967 | 25.7 | 1,170 |
| 4096 | 256 | 4,493 | 26.2 | 998 |
Decode throughput is stable at ~26 tok/s across all prompt lengths, characteristic of a dual-node tensor-parallel deployment where decode is bound by per-step inter-node communication rather than compute. Prefill scales cleanly with prompt length, since larger prefill batches amortize the fixed per-step overhead.
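A rough roofline check supports the communication-bound reading (assumptions: ~10B active parameters at the checkpoint's average density, weights sharded evenly across the two nodes and streamed once per decode step):

```python
# Upper bound on decode throughput if each node were limited only by
# streaming its half of the active-expert weights from unified memory.
active_params = 10e9                 # ~10B active parameters per token
bytes_per_param = 130.6e9 / 230e9    # avg checkpoint density, ~0.57 B/param
bandwidth = 221e9                    # GB10 memory bandwidth, 221 GB/s

per_node_bytes = active_params * bytes_per_param / 2  # TP=2 shards the weights
tok_s_bound = bandwidth / per_node_bytes
print(f"~{tok_s_bound:.0f} tok/s weight-streaming ceiling")  # ~78
```

The observed ~26 tok/s sits well under this ~78 tok/s ceiling, consistent with per-step inter-node all-reduce latency, not memory bandwidth, dominating decode.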
## Running on 2× DGX Spark (Tensor Parallel)

At 130.6 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It is intended to run with `--tensor-parallel-size 2` across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is `eugr/spark-vllm-docker`.
### Environment variables

These are the ones that matter for NVFP4 on GB10 and for NCCL over the ConnectX-7 link. Interface names (`enp1s0f0np0`, `rocep1s0f0`, `roceP2p1s0f0`) may differ on your hardware; verify with `ip -br link` and `ibdev2netdev`.
```bash
# NVFP4 kernel selection (validated on GB10, SM 12.1)
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
VLLM_USE_FLASHINFER_MOE_FP4=0
SAFETENSORS_FAST_GPU=1
OMP_NUM_THREADS=8
TORCHINDUCTOR_MAX_AUTOTUNE=0

# NCCL + Gloo + Ray: route everything over the RoCE interface
NCCL_SOCKET_IFNAME=enp1s0f0np0
GLOO_SOCKET_IFNAME=enp1s0f0np0
TP_SOCKET_IFNAME=enp1s0f0np0
UCX_NET_DEVICES=enp1s0f0np0
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0  # both halves for full 200 Gb/s
NCCL_IB_DISABLE=0
NCCL_IGNORE_CPU_AFFINITY=1

# Per-node
VLLM_HOST_IP=<this node's RoCE IP>
```
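If you drive both nodes from a launcher script rather than exporting by hand, the same settings can be applied programmatically (a sketch; the interface and HCA names are the placeholders from the block above and must match your hardware, and per-node values like `VLLM_HOST_IP` still need to be filled in separately):

```python
import os

# Node-independent settings from the shell block above.
gb10_env = {
    "VLLM_NVFP4_GEMM_BACKEND": "flashinfer-cutlass",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "SAFETENSORS_FAST_GPU": "1",
    "OMP_NUM_THREADS": "8",
    "TORCHINDUCTOR_MAX_AUTOTUNE": "0",
    "NCCL_SOCKET_IFNAME": "enp1s0f0np0",
    "GLOO_SOCKET_IFNAME": "enp1s0f0np0",
    "TP_SOCKET_IFNAME": "enp1s0f0np0",
    "UCX_NET_DEVICES": "enp1s0f0np0",
    "NCCL_IB_HCA": "rocep1s0f0,roceP2p1s0f0",  # both halves for full 200 Gb/s
    "NCCL_IB_DISABLE": "0",
    "NCCL_IGNORE_CPU_AFFINITY": "1",
}
os.environ.update(gb10_env)  # inherited by ray / vllm child processes
```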
### Ray cluster

On the head node:

```bash
ray start --head --port=6379 \
  --node-ip-address=<HEAD_ROCE_IP> \
  --disable-usage-stats
```

On the worker node:

```bash
ray start --address=<HEAD_ROCE_IP>:6379 \
  --node-ip-address=<WORKER_ROCE_IP> \
  --disable-usage-stats --block
```
### vLLM server (head node, after Ray reports 2 GPUs)

```bash
vllm serve /models/MiniMax-M2.7-NVFP4-GB10 \
  --host 0.0.0.0 --port 30000 \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --compilation-config '{"cudagraph_mode":"none","inductor_compile_config":{"combo_kernels":false,"benchmark_combo_kernel":false,"max_autotune":false,"max_autotune_gemm":false}}'
```
First boot loads the weights and runs JIT compilation; plan for 10–15 minutes before `/v1/models` responds.
## Test it

```bash
curl http://<HEAD_HOST>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512
  }'
```
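The same request from Python, using only the standard library (a sketch: building the request needs no server, but actually sending it assumes the vLLM endpoint above is up, so the send is left commented out):

```python
import json
import urllib.request

payload = {
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://<HEAD_HOST>:30000/v1/chat/completions",  # replace <HEAD_HOST>
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```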
## Gotchas

- `--gpu-memory-utilization 0.85` is the safe default for TP=2. Activation and inter-node comm buffers grow under tensor parallel; raise cautiously.
- Both RoCE halves must be in `NCCL_IB_HCA`, or you get 100 Gb/s instead of 200.
- `"cudagraph_mode": "none"` in `--compilation-config` is load-bearing for MiniMax MoE on GB10: CUDA graphs deadlock on this architecture. Leave `torch.compile` otherwise enabled.
- Do not pass `--enable-expert-parallel`: it causes uneven per-node memory on MoE and breaks under current kernels.
- If `NCCL_DEBUG=INFO` shows `NET/Socket` instead of `NET/IB/0` + `NET/IB/1`, NCCL fell back to TCP; recheck interface names and `NCCL_IB_HCA`.
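The transport check in the last bullet is easy to script against the server log (the sample log lines below are illustrative fragments, not verbatim NCCL output):

```python
import re

def nccl_transport(log_text: str) -> str:
    """Classify which network backend the NCCL INFO log reports."""
    if re.search(r"NET/IB", log_text):
        return "roce"          # NET/IB/0 (+ NET/IB/1): RoCE path is active
    if re.search(r"NET/Socket", log_text):
        return "tcp-fallback"  # misconfigured IFNAME/HCA: decode will crawl
    return "unknown"

print(nccl_transport("... NCCL INFO ... NET/IB/0 ..."))        # roce
print(nccl_transport("... NCCL INFO NET/Socket : Using ..."))  # tcp-fallback
```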
## Recommended Sampling Parameters

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```
## Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.
## Acknowledgments

- Base model by MiniMax
- Quantization tooling: NVIDIA TensorRT Model Optimizer
- GB10 quantization profile guidance: Scott Glover (scottgl)
- Multi-Spark runtime tuning: the `eugr/spark-vllm-docker` project and the NVIDIA Developer Forum community