# MiniMax-M2.7-NVFP4-GB10
Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=2) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 130.6 GB on disk, down from the 230.2 GB official FP8 release.
## Model Details

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.7 |
| Architecture | MiniMaxM2ForCausalLM (MoE, 256 experts, 2 active per token) |
| Total Parameters | 230B |
| Active Parameters | ~10B per token |
| Hidden Layers | 62 |
| Quantization | NVFP4 (4-bit floating point) with GB10-tuned ignore list |
| Format | compressed-tensors (safetensors) |
| Size on Disk | 130.6 GB |
| Deployment | 2× DGX Spark (does not fit in a single 128 GB Spark) |
| License | Other (inherited from MiniMaxAI/MiniMax-M2.7) |
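The on-disk size can be sanity-checked with back-of-envelope arithmetic (a sketch only; the exact breakdown depends on the NVFP4 block-scale layout and the BF16-preserved tensors):

```python
# Effective bits per parameter for this checkpoint, from the table above.
# Block scales and the BF16 ignore list explain why it lands above a flat 4 bits.
total_params = 230e9     # 230B total parameters
size_bytes = 130.6e9     # 130.6 GB on disk (decimal GB assumed)

bits_per_param = size_bytes * 8 / total_params
print(f"{bits_per_param:.2f} bits/param")  # ~4.54
```

For comparison, the 230.2 GB official FP8 release works out to ~8 bits/param, so this checkpoint is roughly a 1.76× reduction.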
## Quantization Details

- Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (`nvidia-modelopt`)
- Scheme: `mtq.NVFP4_DEFAULT_CFG` + GB10-tuned disable list applied post-calibration
- Calibration Dataset: HuggingFaceH4/ultrachat_200k (`train_sft` split)
- Calibration Samples: 64
- Max Sequence Length: 2048 tokens
- Preserved in BF16 (ignore list): `lm_head`, `*block_sparse_moe.gate` (the MoE router gate, not per-expert gates)
- Hardware Used: Hugging Face Jobs, 8× NVIDIA A100 80 GB
- Recipe script: `quantize-nvfp4-gb10.py` (env-var-configurable; see the file header for the MoE-expert amax gotcha and how to adapt it for other architectures)
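The router-gate distinction in the ignore list matters: the glob must match the per-layer MoE router but not the per-expert projections. A minimal check of the pattern semantics (module names below are illustrative, not the exact MiniMax module tree):

```python
import fnmatch

# Hypothetical module names in the style of a MoE decoder layer.
names = [
    "model.layers.0.block_sparse_moe.gate",          # router gate -> keep BF16
    "model.layers.0.block_sparse_moe.experts.3.w1",  # expert weight -> quantize
    "lm_head",                                       # output head -> keep BF16
]
ignore = ["lm_head", "*block_sparse_moe.gate"]

kept_bf16 = [n for n in names if any(fnmatch.fnmatch(n, pat) for pat in ignore)]
print(kept_bf16)
# ['model.layers.0.block_sparse_moe.gate', 'lm_head']
```

Note that `*block_sparse_moe.gate` ends at `.gate`, so expert sub-modules under `experts.*` never match and stay quantized.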
## Performance (2× NVIDIA DGX Spark, GB10, 128 GB each)

Benchmarked with `llama-benchy` 0.3.3, 3 runs per scenario, TP=2 over ConnectX-7 RoCE, 192K context window.

| PP (tokens) | TG (tokens) | Prefill (tok/s) | Decode (tok/s) | TTFT (ms) |
|---|---|---|---|---|
| 512 | 128 | 1,436 | 26.2 | 461 |
| 512 | 256 | 1,737 | 26.9 | 381 |
| 1024 | 128 | 2,878 | 26.9 | 443 |
| 1024 | 256 | 3,004 | 26.6 | 427 |
| 2048 | 128 | 3,637 | 26.3 | 663 |
| 2048 | 256 | 4,041 | 26.1 | 593 |
| 4096 | 128 | 3,967 | 25.7 | 1,170 |
| 4096 | 256 | 4,493 | 26.2 | 998 |
Decode throughput is stable at ~26 tok/s across all prompt lengths, characteristic of a dual-node tensor-parallel deployment where decode is bound by per-step inter-node communication rather than compute. Prefill scales cleanly with prompt length, since larger prefill batches amortize the fixed per-step overhead.
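A rough roofline check supports the communication-bound reading (assumptions: ~10B active parameters at the checkpoint's average density, weights sharded evenly across the two nodes and streamed once per decode step):

```python
# Upper bound on decode throughput if each node were limited only by
# streaming its half of the active-expert weights from unified memory.
active_params = 10e9                 # ~10B active parameters per token
bytes_per_param = 130.6e9 / 230e9    # avg checkpoint density, ~0.57 B/param
bandwidth = 221e9                    # GB10 memory bandwidth, 221 GB/s

per_node_bytes = active_params * bytes_per_param / 2  # TP=2 shards the weights
tok_s_bound = bandwidth / per_node_bytes
print(f"~{tok_s_bound:.0f} tok/s weight-streaming ceiling")  # ~78
```

The observed ~26 tok/s sits well under this ~78 tok/s ceiling, consistent with per-step inter-node all-reduce latency, not memory bandwidth, dominating decode.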
## Running on 2× DGX Spark (Tensor Parallel)

At 130.6 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It is intended to run with `--tensor-parallel-size 2` across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is `eugr/spark-vllm-docker`.
### Environment variables

These are the ones that matter for NVFP4 on GB10 and for NCCL over the ConnectX-7 link. Interface names (`enp1s0f0np0`, `rocep1s0f0`, `roceP2p1s0f0`) may differ on your hardware; verify with `ip -br link` and `ibdev2netdev`.
```bash
# NVFP4 kernel selection (validated on GB10, SM 12.1)
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
VLLM_USE_FLASHINFER_MOE_FP4=0
SAFETENSORS_FAST_GPU=1
OMP_NUM_THREADS=8
TORCHINDUCTOR_MAX_AUTOTUNE=0

# NCCL + Gloo + Ray: route everything over the RoCE interface
NCCL_SOCKET_IFNAME=enp1s0f0np0
GLOO_SOCKET_IFNAME=enp1s0f0np0
TP_SOCKET_IFNAME=enp1s0f0np0
UCX_NET_DEVICES=enp1s0f0np0
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0  # both halves for full 200 Gb/s
NCCL_IB_DISABLE=0
NCCL_IGNORE_CPU_AFFINITY=1

# Per-node
VLLM_HOST_IP=<this node's RoCE IP>
```
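If you drive both nodes from a launcher script rather than exporting by hand, the same settings can be applied programmatically (a sketch; the interface and HCA names are the placeholders from the block above and must match your hardware, and per-node values like `VLLM_HOST_IP` still need to be filled in separately):

```python
import os

# Node-independent settings from the shell block above.
gb10_env = {
    "VLLM_NVFP4_GEMM_BACKEND": "flashinfer-cutlass",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "SAFETENSORS_FAST_GPU": "1",
    "OMP_NUM_THREADS": "8",
    "TORCHINDUCTOR_MAX_AUTOTUNE": "0",
    "NCCL_SOCKET_IFNAME": "enp1s0f0np0",
    "GLOO_SOCKET_IFNAME": "enp1s0f0np0",
    "TP_SOCKET_IFNAME": "enp1s0f0np0",
    "UCX_NET_DEVICES": "enp1s0f0np0",
    "NCCL_IB_HCA": "rocep1s0f0,roceP2p1s0f0",  # both halves for full 200 Gb/s
    "NCCL_IB_DISABLE": "0",
    "NCCL_IGNORE_CPU_AFFINITY": "1",
}
os.environ.update(gb10_env)  # inherited by ray / vllm child processes
```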
### Ray cluster

On the head node:

```bash
ray start --head --port=6379 \
  --node-ip-address=<HEAD_ROCE_IP> \
  --disable-usage-stats
```

On the worker node:

```bash
ray start --address=<HEAD_ROCE_IP>:6379 \
  --node-ip-address=<WORKER_ROCE_IP> \
  --disable-usage-stats --block
```
### vLLM server (head node, after Ray reports 2 GPUs)

```bash
vllm serve /models/MiniMax-M2.7-NVFP4-GB10 \
  --host 0.0.0.0 --port 30000 \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --compilation-config '{"cudagraph_mode":"none","inductor_compile_config":{"combo_kernels":false,"benchmark_combo_kernel":false,"max_autotune":false,"max_autotune_gemm":false}}'
```
First boot loads the weights and runs JIT compilation; plan for 10–15 minutes before `/v1/models` responds.
## Test it

```bash
curl http://<HEAD_HOST>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512
  }'
```
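The same request from Python, using only the standard library (a sketch: building the request needs no server, but actually sending it assumes the vLLM endpoint above is up, so the send is left commented out):

```python
import json
import urllib.request

payload = {
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://<HEAD_HOST>:30000/v1/chat/completions",  # replace <HEAD_HOST>
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```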
## Gotchas

- `--gpu-memory-utilization 0.85` is the safe default for TP=2. Activation and inter-node comm buffers grow under tensor parallel; raise cautiously.
- Both RoCE halves must be in `NCCL_IB_HCA`, or you get 100 Gb/s instead of 200.
- `"cudagraph_mode": "none"` in `--compilation-config` is load-bearing for MiniMax MoE on GB10: CUDA graphs deadlock on this architecture. Leave `torch.compile` otherwise enabled.
- Do not pass `--enable-expert-parallel`: it causes uneven per-node memory on MoE and breaks under current kernels.
- If `NCCL_DEBUG=INFO` shows `NET/Socket` instead of `NET/IB/0` + `NET/IB/1`, NCCL fell back to TCP; recheck interface names and `NCCL_IB_HCA`.
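The transport check in the last bullet is easy to script against the server log (the sample log lines below are illustrative fragments, not verbatim NCCL output):

```python
import re

def nccl_transport(log_text: str) -> str:
    """Classify which network backend the NCCL INFO log reports."""
    if re.search(r"NET/IB", log_text):
        return "roce"          # NET/IB/0 (+ NET/IB/1): RoCE path is active
    if re.search(r"NET/Socket", log_text):
        return "tcp-fallback"  # misconfigured IFNAME/HCA: decode will crawl
    return "unknown"

print(nccl_transport("... NCCL INFO ... NET/IB/0 ..."))        # roce
print(nccl_transport("... NCCL INFO NET/Socket : Using ..."))  # tcp-fallback
```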
## Recommended Sampling Parameters

```json
{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
```
## Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore list was tuned for Blackwell and will leave some performance on the table.
## Acknowledgments

- Base model by MiniMax
- Quantization tooling: NVIDIA TensorRT Model Optimizer
- GB10 quantization profile guidance: Scott Glover (scottgl)
- Multi-Spark runtime tuning: the `eugr/spark-vllm-docker` project and the NVIDIA Developer Forum community