How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- vLLM
How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16
- SGLang
How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16 with Docker Model Runner:
docker model run hf.co/Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16
About The Model
NVIDIA-Nemotron-3-Super-120B-A12B has been REAP-pruned (512 -> 256 experts), fine-tuned, and quantized to reduce its size while retaining its math and tool-integrated reasoning abilities.
This is the unquantized BF16 model.
- AWQ quantized variant: Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-AWQ
- FP8 dynamic quantized variant: Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-FP8
See the GitHub repo for details.
vLLM Patch
To run this model on vLLM, this patch must be applied. For example:
uv run patches/vllm_grouped_topk.py
VRAM Usage
- BF16: ~129GB
- AWQ: ~43GB
- FP8 dynamic: ~72GB
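As a rough sanity check on these figures, weight memory can be approximated as parameter count times bytes per weight. The sketch below assumes about 64B total parameters (read off the model name, not stated in this card) and nominal widths of 2, 1, and 0.5 bytes per weight for BF16, FP8, and 4-bit AWQ; `weight_gb` is an illustrative helper, not part of any library. The reported numbers sit above these floors, which would be consistent with some tensors staying in higher precision and with quantization scales adding overhead, though the card does not break this down.

```python
def weight_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_weight


# Assumption: ~64B total parameters, taken from the "64B" in the model name.
TOTAL_PARAMS_B = 64.0

# Nominal storage width per weight for each variant (assumed, not measured).
NOMINAL_BYTES = {"BF16": 2.0, "FP8 dynamic": 1.0, "AWQ": 0.5}

# VRAM figures reported in this card.
REPORTED_GB = {"BF16": 129, "FP8 dynamic": 72, "AWQ": 43}

for name, width in NOMINAL_BYTES.items():
    floor = weight_gb(TOTAL_PARAMS_B, width)
    print(f"{name}: weights-only floor ~{floor:.0f} GB, reported ~{REPORTED_GB[name]} GB")
```

KV cache and runtime buffers come on top of these numbers, so the serving footprint will be larger than the weights alone.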
AIME 2026
| Variant | avg@4 | pass@4 | tool use |
|---|---|---|---|
| 120B base model | 0.9000 | n/a | no |
| AWQ | 0.9083 | 0.9333 | no |
| FP8 | 0.9167 | 0.9667 | no |
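The card does not define the two metrics. Assuming the usual meaning (avg@4 is the mean accuracy over four sampled attempts per problem, and pass@4 is the fraction of problems solved by at least one of the four), they can be sketched as follows. The function names and the toy data are illustrative only and are not taken from the evaluation code.

```python
from typing import List


def avg_at_k(results: List[List[bool]]) -> float:
    """Mean correctness over every attempt of every problem."""
    attempts = sum(len(r) for r in results)
    return sum(sum(r) for r in results) / attempts


def pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of problems with at least one correct attempt."""
    return sum(any(r) for r in results) / len(results)


# Toy data: 3 problems, 4 attempts each (made up for illustration).
toy = [
    [True, True, True, True],      # always solved
    [True, False, True, False],    # solved half of the time
    [False, False, False, False],  # never solved
]

print(f"avg@4  = {avg_at_k(toy):.4f}")
print(f"pass@4 = {pass_at_k(toy):.4f}")
```

If the benchmark has 30 problems (an assumption), pass@4 values of 0.9333 and 0.9667 correspond to 28 and 29 problems solved at least once.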
Throughput
FP8 is ~40% slower than AWQ in this decode-heavy workload. Decode here is memory-bandwidth-bound, and W4 weights move half as many bytes as W8 weights per forward step. The A8-vs-A16 saving barely matters because activations are ~10⁴× smaller than the weights at low batch sizes. The FP8 tensor-core compute advantage does not pay off while the GPU is waiting on memory. However, the FP8 model converges to answers faster, which partly offsets its lower throughput.
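The bandwidth argument can be made concrete with a back-of-envelope bound: when decode is limited by weight reads, per-token latency is at least the bytes of active weights divided by memory bandwidth. The sketch below assumes about 12B active parameters per token (read off "A12B" in the model name) and a hypothetical 3000 GB/s of memory bandwidth; neither number is a measurement from this card, and `decode_floor_ms` is an illustrative helper.

```python
def decode_floor_ms(active_params_b: float, bytes_per_weight: float,
                    bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency (ms) when weight reads dominate."""
    weight_gb = active_params_b * bytes_per_weight  # GB read per forward step
    return weight_gb / bandwidth_gb_s * 1000.0


ACTIVE_PARAMS_B = 12.0   # assumption: "A12B" means ~12B active parameters
BANDWIDTH_GB_S = 3000.0  # hypothetical GPU memory bandwidth

w4 = decode_floor_ms(ACTIVE_PARAMS_B, 0.5, BANDWIDTH_GB_S)  # 4-bit weights (AWQ)
w8 = decode_floor_ms(ACTIVE_PARAMS_B, 1.0, BANDWIDTH_GB_S)  # 8-bit weights (FP8)

print(f"W4 floor: {w4:.1f} ms/token")
print(f"W8 floor: {w8:.1f} ms/token")
print(f"W8 / W4:  {w8 / w4:.1f}x")
```

The idealized ratio is 2x regardless of the bandwidth chosen. The measured gap is smaller, as one would expect when not every tensor is quantized to the nominal width and when attention, KV-cache reads, and kernel overheads also take time.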
Note
- AWQ for throughput: 40% faster, quality drop is ~1 avg@4 point.
- FP8 dynamic for quality: +1 solvable problem, 40% throughput tax. Converges faster.
- Instruction placement matters for this model: putting the instruction in the system role scored 5% absolute higher than prepending it to the user message on this benchmark. User-role placement leaks the instruction into the reasoning trace; system-role placement keeps it as a directive.
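The two instruction placements compared in the last note can be sketched as chat-message lists. The helper name `build_messages` and the instruction text are hypothetical and are not the prompt used for the benchmark; the resulting list has the same shape as the `messages` argument used in the usage snippets on this page.

```python
from typing import Dict, List


def build_messages(instruction: str, problem: str,
                   placement: str = "system") -> List[Dict[str, str]]:
    """Build a chat with the instruction in the system role or as a user prefix."""
    if placement == "system":
        # Instruction is a separate directive; the user turn holds only the problem.
        return [
            {"role": "system", "content": instruction},
            {"role": "user", "content": problem},
        ]
    if placement == "user":
        # Instruction is prepended to the problem inside a single user turn.
        return [{"role": "user", "content": f"{instruction}\n\n{problem}"}]
    raise ValueError(f"unknown placement: {placement!r}")


# Hypothetical instruction and problem, for illustration only.
instruction = "Give the final answer as a single integer."
problem = "What is the remainder when 2^10 is divided by 7?"

for placement in ("system", "user"):
    print(placement, build_messages(instruction, problem, placement))
```

Keeping the problem text identical across both variants makes the placement the only difference between the two runs.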
Training Data
- nguyen599/AstralMath-v1 — HF dataset
- AIMO3 competition data — Kaggle, AI Mathematical Olympiad - Progress Prize 3
Training Data Licensing Note
Due to Kaggle competition data redistribution restrictions, the AIMO3 training data is not bundled with this model. Users who want to reproduce the training need to accept the competition rules on Kaggle and download the data separately.
This model was fine-tuned on data including AIMO3 reference problems (CC BY-SA 4.0) and AstralMath-v1 (CC BY-SA 4.0). The applicability of CC BY-SA's ShareAlike provision to ML model weights is an unsettled legal question; industry practice generally treats trained model weights as not being derivatives of training data for the purposes of license propagation. This model is released under the licenses described in the License section below on that basis.
Citations
@misc{nvidia_nemotron_3_2025,
title = {NVIDIA Nemotron 3: Efficient and Open Intelligence},
author = {{NVIDIA}},
year = {2025},
url = {https://arxiv.org/abs/2512.20856},
note = {White Paper}
}
@misc{balunovic_srimatharena_2025,
title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
author = {Mislav Balunović and Jasper Dekoninck and Ivo Petrov and Nikola Jovanović and Martin Vechev},
copyright = {MIT},
url = {https://matharena.ai/},
publisher = {SRI Lab, ETH Zurich},
month = feb,
year = {2025},
}
@misc{nguyen2026astralmath,
title={AstralMath-v1: A Large-Scale Multi-Model Tool-Integrated Reasoning Dataset for Mathematical Problem Solving},
author={Nguyen Nguyen},
year={2026},
url={https://huggingface.co/datasets/nguyen599/AstralMath-v1},
}
@inproceedings{
lasby2026reap,
title={{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=ukGxWd2aDG}
}
License
This model is a derivative work distributed under dual-layer licensing:
Base Model
The underlying NVIDIA Nemotron weights and architecture remain governed by the NVIDIA Nemotron Open Model License (last modified December 15, 2025).
See NVIDIA-Nemotron-Open-Model-License-12-12-25.pdf in this repository, or the official page:
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
"Licensed by NVIDIA Corporation under the NVIDIA Nemotron Model License."
Modifications
Modifications contributed by Max & Omnis Inc.
This modified model is licensed under the Apache License 2.0. See LICENSE-APACHE-MAX-AND-OMNIS.txt.
© 2026 Max & Omnis Inc.
https://www.maxandomnis.com/en
Important: When redistributing this model or any derivative, you must comply with both licenses. The NVIDIA Nemotron Open Model License applies to the base weights; the Apache 2.0 license covers only the specific modifications listed above.