About The Model

NVIDIA-Nemotron-3-Super-120B-A12B has been REAP-pruned (512 -> 256 experts), fine-tuned and quantized to reduce its size, yet retain math & tool-integrated reasoning abilities.

This is the unquantized BF16 model.

See details in the github repo.

vLLM Patch

To run this model on vllm, this patch needs to be applied.

  • e.g.: uv run patches/vllm_grouped_topk.py

VRAM Usage

  • BF16: ~129GB
  • AWQ: ~43GB
  • FP8 dynamic: ~72GB

AIME 2026

Variant avg@4 pass@4 tool use
120B base model 0.9000 n\a no
AWQ 0.9083 0.9333 no
FP8 0.9167 0.9667 no

Throughput

FP8 is ~40% slower than AWQ in this decode-heavy workload. Reason: this is memory-bandwidth-bound decode, and W4 weights transfer half the bytes of W8 per forward step. The A8-vs-A16 saving barely matters because activations are ~10⁴× smaller than weights at low batch. FP8 tensor core compute advantage doesn't cash in when the GPU is waiting on memory. However, the FP8 model converges to answers faster, negating the slow throughput to a degree.

Note

  • AWQ for throughput: 40% faster, quality drop is ~1 avg@4 point.
  • FP8 dynamic for quality: +1 solvable problem, 40% throughput tax. Converges faster.
  • Instruction placement matters for this model: system-role +5% absolute over user-role prefix on this benchmark. User-role placement leaks the instruction into the reasoning trace; system-role keeps it as a directive.

Training Data

Training Data Licensing Note

Due to Kaggle competition data redistribution restrictions, the AIMO3 training data is not bundled with this model. Users who want to reproduce the training need to accept the competition rules on Kaggle and download the data separately.

This model was fine-tuned on data including AIMO3 reference problems (CC BY-SA 4.0) and AstralMath-v1 (CC BY-SA 4.0). The applicability of CC BY-SA's ShareAlike provision to ML model weights is an unsettled legal question; industry practice generally treats trained model weights as not being derivatives of training data for the purposes of license propagation. This model is released under the licenses described above on that basis.

Citations

@misc{nvidia_nemotron_3_2025,
  title  = {NVIDIA Nemotron 3: Efficient and Open Intelligence},
  author = {{NVIDIA}},
  year   = {2025},
  url    = {https://arxiv.org/abs/2512.20856},
  note   = {White Paper}
}

@misc{balunovic_srimatharena_2025,
  title = {MathArena: Evaluating LLMs on Uncontaminated Math Competitions},
  author = {Mislav Balunović and Jasper Dekoninck and Ivo Petrov and Nikola Jovanović and Martin Vechev},
  copyright = {MIT},
  url = {https://matharena.ai/},
  publisher = {SRI Lab, ETH Zurich},
  month = feb,
  year = {2025},
}

@misc{nguyen2026astralmath,
  title={AstralMath-v1: A Large-Scale Multi-Model Tool-Integrated Reasoning Dataset for Mathematical Problem Solving},
  author={Nguyen Nguyen},
  year={2026},
  url={https://huggingface.co/datasets/nguyen599/AstralMath-v1},
}

@inproceedings{
    lasby2026reap,
    title={{REAP} the Experts: Why Pruning Prevails for One-Shot MoE compression},
    author={Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=ukGxWd2aDG}
}

License

This model is a derivative work distributed under dual-layer licensing:

Base Model

The underlying NVIDIA Nemotron weights and architecture remain governed by the NVIDIA Nemotron Open Model License (last modified December 15, 2025).

See NVIDIA-Nemotron-Open-Model-License-12-12-25.pdf in this repository, or the official page:

https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/

"Licensed by NVIDIA Corporation under the NVIDIA Nemotron Model License."

Modifications

Modifications contributed by Max & Omnis Inc.

This modified model is licensed under the Apache License 2.0. See LICENSE-APACHE-MAX-AND-OMNIS.txt.

© 2026 Max & Omnis Inc.

https://www.maxandomnis.com/en

Important: When redistributing this model or any derivative, you must comply with both licenses. The NVIDIA Nemotron Open Model License applies to the base weights; the Apache 2.0 license covers only the specific modifications listed above.

Downloads last month
961
Safetensors
Model size
64B params
Tensor type
BF16
·
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16

Finetuned
(17)
this model
Quantizations
3 models

Dataset used to train Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16

Paper for Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-BF16