Libraries

How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16"

# trust_remote_code=True lets Transformers run custom code from the model repository
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Text-generation model, so load it with the causal-LM auto class
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Local Apps

vLLM

How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with Docker Model Runner:
```
docker model run hf.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16
```

About The Model

NVIDIA-Nemotron-3-Super-120B-A12B has been REAP-pruned (512 -> 256 experts), fine-tuned, and quantized to reduce its size while retaining its math and tool-integrated reasoning abilities.

This is the unquantized BF16 model.

AWQ quantized variant: Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ
FP8 dynamic quantized variant: Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8

See the GitHub repo for details.

vLLM Patch

To run this model on vLLM, a patch needs to be applied first.

For example: uv run patches/vllm_grouped_topk.py

VRAM Usage

BF16: ~129GB
AWQ: ~43GB
FP8 dynamic: ~72GB
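
The VRAM figures above can be sanity-checked against a back-of-the-envelope estimate of weight storage alone. The sketch below assumes 64B total parameters (from the model name) and nominal widths of 16, 8, and 4 bits per weight; these widths are illustrative assumptions, not values read from the checkpoints, and the card does not say whether "GB" means 10^9 bytes or GiB.

```python
# Back-of-the-envelope weight-memory estimate; not a measurement.
PARAMS = 64e9  # total parameter count implied by the model name (64B)

# Assumed nominal bits per weight for each variant (illustrative only).
BITS_PER_WEIGHT = {"BF16": 16, "FP8 dynamic": 8, "AWQ": 4}

# VRAM figures reported in this card, in GB.
REPORTED_GB = {"BF16": 129, "FP8 dynamic": 72, "AWQ": 43}


def weight_gb(params: float, bits: int) -> float:
    """Size of `params` weights stored at `bits` bits each, in GB (1e9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    estimate = weight_gb(PARAMS, bits)
    gap = REPORTED_GB[name] - estimate
    print(f"{name}: weights ~{estimate:.0f} GB, reported ~{REPORTED_GB[name]} GB, gap ~{gap:.0f} GB")
```

Under these assumptions the weight-only estimates are 128, 64, and 32 GB. The quantized variants sit further above their estimates than BF16 does, which would be consistent with some tensors staying in higher precision plus quantization scales, but the card does not break this down.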

AIME 2026

| Variant | avg@4 | pass@4 | Tool use |
|---|---|---|---|
| 120B base model | 0.9000 | n/a | no |
| AWQ | 0.9083 | 0.9333 | no |
| FP8 | 0.9167 | 0.9667 | no |
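
The card does not define avg@4 and pass@4. A minimal sketch, assuming the common definitions (avg@k is the mean per-sample accuracy over k samples per problem, pass@k is the fraction of problems with at least one correct sample out of k), looks like this. The toy results are hypothetical and are not the benchmark data.

```python
from statistics import mean


def avg_at_k(results: list[list[bool]]) -> float:
    """Mean over problems of the fraction of correct samples."""
    return mean(sum(samples) / len(samples) for samples in results)


def pass_at_k(results: list[list[bool]]) -> float:
    """Fraction of problems with at least one correct sample."""
    return mean(1.0 if any(samples) else 0.0 for samples in results)


# Hypothetical toy data: 3 problems, 4 samples each (True = correct answer).
toy_results = [
    [True, True, True, True],
    [True, False, False, True],
    [False, False, False, False],
]

print(f"avg@4  = {avg_at_k(toy_results):.4f}")
print(f"pass@4 = {pass_at_k(toy_results):.4f}")
```

If the benchmark uses the 30 AIME problems of a year, a pass@4 difference of 0.0333 corresponds to one problem.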

Throughput

FP8 is ~40% slower than AWQ in this decode-heavy workload. The reason is that decode here is memory-bandwidth-bound, and W4 weights move half as many bytes as W8 weights per forward step. The A8-vs-A16 activation saving barely matters, because at low batch sizes activations are ~10⁴× smaller than the weights. FP8's tensor-core compute advantage does not pay off while the GPU is waiting on memory. However, the FP8 model converges to answers faster, which partly offsets its lower throughput.
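
The bandwidth argument can be made concrete with a rough upper-bound calculation. This is a sketch under stated assumptions: about 12B active parameters per token (the "A12B" in the model name), single-stream decode where reading the active weights is the only cost, and a hypothetical memory bandwidth of 3000 GB/s that stands in for whatever GPU is actually used.

```python
# Idealized bandwidth-bound decode model; real throughput also depends on
# KV-cache and state reads, kernel overhead, and batch size.
ACTIVE_PARAMS = 12e9          # assumed from "A12B" in the model name
BANDWIDTH_GB_PER_S = 3000.0   # hypothetical GPU memory bandwidth


def weight_bytes_per_token(active_params: float, weight_bits: int) -> float:
    """Bytes of weights read to produce one token at batch size 1."""
    return active_params * weight_bits / 8


def max_tokens_per_second(active_params: float, weight_bits: int, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if weight reads were the only cost."""
    return bandwidth_gb_s * 1e9 / weight_bytes_per_token(active_params, weight_bits)


w4 = max_tokens_per_second(ACTIVE_PARAMS, 4, BANDWIDTH_GB_PER_S)
w8 = max_tokens_per_second(ACTIVE_PARAMS, 8, BANDWIDTH_GB_PER_S)

print(f"W4 bound: {w4:.0f} tok/s, W8 bound: {w8:.0f} tok/s")
print(f"W8 relative to W4: {w8 / w4:.2f}x")
```

In this idealized model W8 reaches half the W4 rate regardless of the bandwidth chosen. The reported gap of ~40% is smaller than that, which suggests that costs independent of weight width also contribute.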

Note

AWQ for throughput: FP8 is ~40% slower in comparison, and AWQ's quality drop is ~1 point of avg@4 (0.9083 vs. 0.9167).
FP8 dynamic for quality: +1 solvable problem at pass@4 (0.9667 vs. 0.9333), at a ~40% throughput tax. It also converges to answers faster.
Instruction placement matters for this model: putting the instruction in the system role scores +5% absolute over prepending it to the user message on this benchmark. User-role placement leaks the instruction into the reasoning trace; system-role placement keeps it as a directive.
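
The two instruction placements can be sketched as chat message lists. The instruction text below is hypothetical (the card does not give the one used in the benchmark), and `chat` is an illustrative helper that posts to an OpenAI-compatible endpoint such as the vLLM server started above; use port 30000 for the SGLang server.

```python
import json
import urllib.request

MODEL = "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16"
# Hypothetical instruction; replace with the one you actually use.
INSTRUCTION = "Solve the problem and put the final answer in \\boxed{}."


def build_messages(problem: str, instruction: str, placement: str = "system") -> list[dict]:
    """Build a chat message list with the instruction in the chosen role."""
    if placement == "system":
        return [
            {"role": "system", "content": instruction},
            {"role": "user", "content": problem},
        ]
    if placement == "user":
        # Instruction is prepended to the user message instead.
        return [{"role": "user", "content": f"{instruction}\n\n{problem}"}]
    raise ValueError(f"unknown placement: {placement!r}")


def chat(messages: list[dict], base_url: str = "http://localhost:8000/v1") -> str:
    """POST the messages to an OpenAI-compatible /chat/completions endpoint."""
    body = json.dumps({"model": MODEL, "messages": messages}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]


# Inspect the payload without needing a running server.
messages = build_messages("What is 2 + 2?", INSTRUCTION, placement="system")
print(json.dumps(messages, indent=2))
```

With a server running, `chat(messages)` returns the assistant reply as a string. Switching `placement` to `"user"` reproduces the prefix variant for comparison.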

Training Data

nguyen599/AstralMath-v1 — HF dataset
AIMO3 competition data — Kaggle, AI Mathematical Olympiad - Progress Prize 3

Training Data Licensing Note

Due to Kaggle competition data redistribution restrictions, the AIMO3 training data is not bundled with this model. Users who want to reproduce the training need to accept the competition rules on Kaggle and download the data separately.

This model was fine-tuned on data including AIMO3 reference problems (CC BY-SA 4.0) and AstralMath-v1 (CC BY-SA 4.0). The applicability of CC BY-SA's ShareAlike provision to ML model weights is an unsettled legal question; industry practice generally treats trained model weights as not being derivatives of training data for the purposes of license propagation. This model is released under the licenses described in the License section on that basis.

Citations

@misc{nvidia_nemotron_3_2025,
  title  = {NVIDIA Nemotron 3: Efficient and Open Intelligence},
  author = {{NVIDIA}},
  year   = {2025},
  url    = {https://arxiv.org/abs/2512.20856},
  note   = {White Paper}
}

@misc{balunovic_srimatharena_2025,
  title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
  author = {Mislav Balunović and Jasper Dekoninck and Ivo Petrov and Nikola Jovanović and Martin Vechev},
  copyright = {MIT},
  url = {https://matharena.ai/},
  publisher = {SRI Lab, ETH Zurich},
  month = feb,
  year = {2025},
}

@misc{nguyen2026astralmath,
  title={AstralMath-v1: A Large-Scale Multi-Model Tool-Integrated Reasoning Dataset for Mathematical Problem Solving},
  author={Nguyen Nguyen},
  year={2026},
  url={https://huggingface.co/datasets/nguyen599/AstralMath-v1},
}

@inproceedings{lasby2026reap,
  title     = {{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author    = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=ukGxWd2aDG}
}

License

This model is a derivative work distributed under dual-layer licensing:

Base Model

The underlying NVIDIA Nemotron weights and architecture remain governed by the NVIDIA Nemotron Open Model License (last modified December 15, 2025).

See NVIDIA-Nemotron-Open-Model-License-12-12-25.pdf in this repository, or the official page:

https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/

"Licensed by NVIDIA Corporation under the NVIDIA Nemotron Model License."

Modifications

Modifications contributed by Max & Omnis Inc.

This modified model is licensed under the Apache License 2.0. See LICENSE-APACHE-MAX-AND-OMNIS.txt.

https://www.maxandomnis.com/en

Important: When redistributing this model or any derivative, you must comply with both licenses. The NVIDIA Nemotron Open Model License applies to the base weights; the Apache 2.0 license covers only the specific modifications listed above.
