MiniMax-M2.7-NVFP4-GB10

Custom GB10 NVFP4 quantization of MiniMaxAI/MiniMax-M2.7 (230B, 256 MoE experts, top-K=2) targeted at NVIDIA DGX Spark (GB10) and Blackwell-family hardware. 130.6 GB on disk, down from the 230.2 GB official FP8 release.

Model Details

Base Model: MiniMaxAI/MiniMax-M2.7
Architecture: MiniMaxM2ForCausalLM (MoE, 256 experts, 2 active per token)
Total Parameters: 230B
Active Parameters: ~10B per token
Hidden Layers: 62
Quantization: NVFP4 (4-bit floating point) with GB10-tuned ignore list
Format: compressed-tensors (safetensors)
Size on Disk: 130.6 GB
Deployment: 2× DGX Spark (does not fit in a single 128 GB Spark)
License: Other (inherited from MiniMaxAI/MiniMax-M2.7)

Quantization Details

  • Method: Post-training quantization via NVIDIA TensorRT Model Optimizer (nvidia-modelopt)
  • Scheme: mtq.NVFP4_DEFAULT_CFG + GB10-tuned disable list applied post-calibration
  • Calibration Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Calibration Samples: 64
  • Max Sequence Length: 2048 tokens
  • Preserved in BF16 (ignore list): lm_head, *block_sparse_moe.gate (MoE router gate, not per-expert gates)
  • Hardware Used: Hugging Face Jobs, 8× NVIDIA A100 80 GB
  • Recipe script: quantize-nvfp4-gb10.py — env-var-configurable; see the file header for the MoE-expert amax gotcha and how to adapt for other architectures.
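The ignore list uses glob-style patterns, so `*block_sparse_moe.gate` catches each layer's router gate without touching per-expert weights. A minimal sketch of that matching logic (the module names below are illustrative, not the model's exact layout):

```python
from fnmatch import fnmatch

# Patterns preserved in BF16, as listed on this card. Glob-style, matched
# against fully qualified module names.
IGNORE = ["lm_head", "*block_sparse_moe.gate"]

def is_ignored(name: str) -> bool:
    """True if a module should stay in BF16 rather than be quantized."""
    return any(fnmatch(name, pat) for pat in IGNORE)

# The per-layer MoE router gate is preserved...
print(is_ignored("model.layers.0.block_sparse_moe.gate"))          # True
# ...while per-expert projections are quantized to NVFP4.
print(is_ignored("model.layers.0.block_sparse_moe.experts.3.w1"))  # False
print(is_ignored("lm_head"))                                       # True
```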

Performance (2× NVIDIA DGX Spark — GB10, 128 GB each)

Benchmarked with llama-benchy 0.3.3, 3 runs per scenario, TP=2 over ConnectX-7 RoCE, 192K context window.

PP TG Prefill (tok/s) Decode (tok/s) TTFT (ms)
512 128 1,436 26.2 461
512 256 1,737 26.9 381
1024 128 2,878 26.9 443
1024 256 3,004 26.6 427
2048 128 3,637 26.3 663
2048 256 4,041 26.1 593
4096 128 3,967 25.7 1,170
4096 256 4,493 26.2 998

Decode throughput is stable at ~26 tok/s across all prompt lengths — characteristic of a dual-node tensor-parallel deployment where decode is bound by per-step inter-node communication rather than compute. Prefill scales cleanly with prompt length as the per-step batch amortizes fixed overhead.
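The comm-bound claim can be sanity-checked from the table: decode throughput varies under 5% across an 8× spread in prompt length, while each decode step costs a near-constant ~38 ms of inter-node round-tripping. A small arithmetic sketch over the TG=128 rows:

```python
# (PP, prefill tok/s, decode tok/s) from the TG=128 rows of the table above.
rows = [(512, 1436, 26.2), (1024, 2878, 26.9), (2048, 3637, 26.3), (4096, 3967, 25.7)]

decode = [d for _, _, d in rows]
spread = (max(decode) - min(decode)) / min(decode)
print(f"decode spread across prompt lengths: {spread:.1%}")  # under 5% -> comm-bound

for pp, prefill, d in rows:
    # Per-step latency is ~1/26 s regardless of context: the all-reduce dominates.
    print(f"PP={pp:5d}  prefill={prefill} tok/s  decode step ~ {1000 / d:.1f} ms/tok")
```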

Running on 2× DGX Spark (Tensor Parallel)

At 130.6 GB this model does not fit in a single DGX Spark's 128 GB unified memory. It is intended to run with tensor-parallel-size=2 across two Sparks connected via their ConnectX-7 200 GbE link, orchestrated by Ray. The community reference container is eugr/spark-vllm-docker.
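A back-of-envelope check on the memory math, using the sizes from this card and the 0.85 memory-utilization default from the serve command in this section (this ignores activation and comm buffers, so the headroom figure is an upper bound):

```python
# All figures from this card.
weights_gb = 130.6   # NVFP4 checkpoint on disk
spark_gb = 128.0     # unified memory per DGX Spark
util = 0.85          # --gpu-memory-utilization

per_node_weights = weights_gb / 2        # weights split across 2 nodes under TP=2
budget = spark_gb * util                 # usable memory per node
kv_headroom = budget - per_node_weights  # what's left for KV cache and buffers

print(f"single node: {weights_gb} GB > {spark_gb} GB -> does not fit")
print(f"TP=2: {per_node_weights:.1f} GB weights/node, ~{kv_headroom:.1f} GB headroom")
```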

Environment variables

These are the ones that matter for NVFP4 on GB10 and for NCCL over the ConnectX-7 link. Interface names (enp1s0f0np0, rocep1s0f0, roceP2p1s0f0) may differ on your hardware — verify with ip -br link and ibdev2netdev.

# NVFP4 kernel selection (validated on GB10 SM 12.1)
VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
VLLM_USE_FLASHINFER_MOE_FP4=0
SAFETENSORS_FAST_GPU=1
OMP_NUM_THREADS=8
TORCHINDUCTOR_MAX_AUTOTUNE=0

# NCCL + Gloo + Ray — route everything over the RoCE interface
NCCL_SOCKET_IFNAME=enp1s0f0np0
GLOO_SOCKET_IFNAME=enp1s0f0np0
TP_SOCKET_IFNAME=enp1s0f0np0
UCX_NET_DEVICES=enp1s0f0np0
NCCL_IB_HCA=rocep1s0f0,roceP2p1s0f0    # both halves for full 200 Gb/s
NCCL_IB_DISABLE=0
NCCL_IGNORE_CPU_AFFINITY=1

# Per-node
VLLM_HOST_IP=<this node's RoCE IP>
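A hypothetical preflight helper (the name and structure are mine, not part of the reference container) that flags unset variables before launching Ray or vLLM; `VLLM_HOST_IP` is per-node and easy to forget:

```python
import os

# Variables every node needs before `ray start` / `vllm serve`,
# per the list above.
REQUIRED = [
    "NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME", "TP_SOCKET_IFNAME",
    "NCCL_IB_HCA", "VLLM_HOST_IP",
]

def missing_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

if __name__ == "__main__":
    gaps = missing_vars()
    if gaps:
        raise SystemExit(f"set these before launching: {', '.join(gaps)}")
    print("env looks complete")
```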

Ray cluster

On the head node:

ray start --head --port=6379 \
  --node-ip-address=<HEAD_ROCE_IP> \
  --disable-usage-stats

On the worker node:

ray start --address=<HEAD_ROCE_IP>:6379 \
  --node-ip-address=<WORKER_ROCE_IP> \
  --disable-usage-stats --block

vLLM server (head node, after Ray reports 2 GPUs)

vllm serve /models/MiniMax-M2.7-NVFP4-GB10 \
  --host 0.0.0.0 --port 30000 \
  --served-model-name minimax-m2.7 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --trust-remote-code \
  --enable-auto-tool-choice --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think \
  --compilation-config '{"cudagraph_mode":"none","inductor_compile_config":{"combo_kernels":false,"benchmark_combo_kernel":false,"max_autotune":false,"max_autotune_gemm":false}}'

First boot loads the weights and runs JIT compilation — plan for 10–15 minutes before /v1/models responds.
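Given the long first boot, a readiness-poll sketch can save babysitting the terminal. This helper is my own (not part of vLLM); the probe is injected so it can be swapped out or stubbed:

```python
import time
import urllib.request

def wait_for_server(url, timeout_s=1200, interval_s=15, probe=None):
    """Poll until the endpoint answers; allow ~20 min for first-boot JIT."""
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=5) as r:
                    return r.status == 200
            except OSError:
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval_s)
    return False

# wait_for_server("http://<HEAD_HOST>:30000/v1/models")
```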

Test it

curl http://<HEAD_HOST>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01,
    "max_tokens": 512
  }'

Gotchas

  • --gpu-memory-utilization 0.85 is the safe default for TP=2. Activation and inter-node comm buffers grow under tensor parallel; raise cautiously.
  • Both RoCE halves must be in NCCL_IB_HCA, or you get 100 Gb/s instead of 200.
  • cudagraph_mode: none in --compilation-config is load-bearing for MiniMax MoE on GB10 — CUDA graphs deadlock on this architecture. Leave torch.compile otherwise enabled.
  • Do not pass --enable-expert-parallel: expert parallelism causes uneven per-node memory use on MoE models and breaks under current kernels.
  • If NCCL_DEBUG=INFO shows NET/Socket instead of NET/IB/0 + NET/IB/1, NCCL fell back to TCP — recheck interface names and NCCL_IB_HCA.
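The last check can be automated. A sketch that classifies the transport from NCCL_DEBUG=INFO output; the substring matching approximates NCCL's usual "via NET/..." channel lines, which vary somewhat by version:

```python
def nccl_transport(log_text: str) -> str:
    """Classify the transport NCCL picked, from NCCL_DEBUG=INFO output.
    Approximate: matches the 'via NET/...' substrings in channel-setup lines."""
    if "NET/Socket" in log_text:
        return "tcp-fallback"  # recheck interface names and NCCL_IB_HCA
    if "NET/IB" in log_text:
        return "roce"          # RDMA path in use
    return "unknown"

print(nccl_transport("... via NET/IB/0 ..."))      # roce
print(nccl_transport("... via NET/Socket/0 ..."))  # tcp-fallback
```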

Recommended Sampling Parameters

Per MiniMax documentation:

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.01
}
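The same defaults applied from Python, merged into a request body for the OpenAI-compatible endpoint; the helper name is illustrative:

```python
import json

# MiniMax-recommended sampling defaults from this card.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40, "min_p": 0.01}

def chat_request(prompt: str, model: str = "minimax-m2.7", max_tokens: int = 512) -> bytes:
    """Build a /v1/chat/completions body with the recommended sampling params."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **SAMPLING,
    }
    return json.dumps(body).encode()

payload = json.loads(chat_request("Hello!"))
print(payload["temperature"], payload["top_k"])  # 1.0 40
```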

Target Hardware

Quantized for and tested on NVIDIA DGX Spark (GB10, 128 GB unified memory, 221 GB/s bandwidth). Should work on other Blackwell-class GPUs with NVFP4 tensor-core support. On Hopper-class hardware (H100/H200) the model will load and run, but the ignore-list was tuned for Blackwell and will leave some performance on the table.
