
NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16

Model Overview

  • Model Architecture: Hybrid Mamba-2 + Latent Mixture-of-Experts (LatentMoE) with Multi-Token Prediction (MTP)
    • Input: Text
    • Output: Text
    • Total Parameters: 550B
    • Active Parameters: 55B
  • Model Optimizations:
    • Weight quantization: INT4 (W4A16, group size 128)
  • Intended Use Cases:
    • Reasoning and complex problem solving.
    • Mathematics and science.
    • Code generation.
    • Instruction following.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
  • Release Date: 06/04/2025
  • Version: 1.0
  • Model Developers: Red Hat

Quantized version of nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16.

Model Optimizations

This model was obtained by quantizing the weights of nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 to INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
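The 75% figure follows from bit counting. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes that all 550B parameters are quantized and that each group of 128 weights stores one 16-bit scale. Both are simplifying assumptions: layers left in higher precision and any stored zero-points make the real checkpoint somewhat larger.

```python
# Back-of-the-envelope weight-memory estimate (assumptions, not measurements).
TOTAL_PARAMS = 550e9  # total parameters, from the model overview
GROUP_SIZE = 128      # weights per quantization group
SCALE_BITS = 16       # assumed: one 16-bit scale stored per group


def weight_gigabytes(bits_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Return the storage needed for `params` weights, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weight_gigabytes(16)
int4_ideal_gb = weight_gigabytes(4)
# Amortize the per-group scale over the 128 weights that share it.
int4_with_scales_gb = weight_gigabytes(4 + SCALE_BITS / GROUP_SIZE)
reduction = 1 - int4_with_scales_gb / bf16_gb

print(f"BF16 weights:        {bf16_gb:.0f} GB")
print(f"INT4 weights only:   {int4_ideal_gb:.0f} GB")
print(f"INT4 + group scales: {int4_with_scales_gb:.1f} GB")
print(f"Reduction:           {reduction:.1%}")
```

Under these assumptions the per-group scales add 0.125 bits per weight, so the estimated saving lands slightly below the ideal 75%.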

Only the weights of the linear operators within transformer blocks are quantized. Weights are quantized using an asymmetric per-group scheme with group size 128. The llm-compressor library is used for quantization.
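To make the per-group scheme concrete, here is a minimal pure-Python sketch of asymmetric 4-bit quantization with one scale and zero-point per group. It illustrates the general technique only; it is not llm-compressor's implementation, and the function names are hypothetical.

```python
# Illustrative asymmetric per-group INT4 quantization (not llm-compressor's code).
import math
from typing import List, Tuple

QMIN, QMAX = 0, 15  # unsigned 4-bit integer range


def quantize_group(weights: List[float]) -> Tuple[List[int], float, int]:
    """Quantize one group; return (codes, scale, zero_point)."""
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:  # all-zero group: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    codes = [min(QMAX, max(QMIN, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floating-point weights from 4-bit codes."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row: List[float], group_size: int = 128):
    """Split a weight row into consecutive groups and quantize each one."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]


# Toy example: a 256-element row yields two groups of 128 weights.
row = [math.sin(i) for i in range(256)]
for index, (codes, scale, zero_point) in enumerate(quantize_row(row)):
    restored = dequantize_group(codes, scale, zero_point)
    group = row[index * 128:(index + 1) * 128]
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"group {index}: scale={scale:.4f} zero_point={zero_point} max_error={worst:.4f}")
```

Because every group carries its own scale and zero-point, an outlier weight only degrades the resolution of the 128 weights that share its group rather than the whole tensor.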

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

Install dependencies:

uv pip install git+https://github.com/vllm-project/vllm.git
uv pip install llmcompressor

Launch the vLLM server:

vllm serve RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16 \
  --host 0.0.0.0 --port 8088 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser nemotron_v3 \
  --mamba-ssm-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}' \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 96}' \
  --trust-remote-code
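Loading a checkpoint of this size can take a long time, so it helps to wait for the server before sending traffic. The following is a minimal sketch that polls the server's `/health` endpoint on the host and port used above. The helper names, the one-hour limit, and the 10-second polling interval are illustrative choices, not recommendations from the model authors.

```python
# Poll the vLLM server until it reports healthy (illustrative helper).
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str = "http://localhost:8088/health",
    max_wait_s: float = 3600.0,  # assumed upper bound on load time
    poll_s: float = 10.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe `url` every `poll_s` seconds; give up after `max_wait_s`."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Calling `wait_until_ready()` returns `True` once the server answers and `False` if the time limit passes first. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.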

Send requests:

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8088/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16"

messages = [
    {"role": "user", "content": "Solve for x: 2x + 5 = 13"},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
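Because the server is launched with `--reasoning-parser`, the reasoning trace may be returned separately from the final answer. The sketch below assumes the trace is exposed on the message as `reasoning_content` or, in some vLLM versions, `reasoning`; check the field name against the vLLM version you run. `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` object stands in for a real response so the example runs without a server.

```python
# Separate the reasoning trace from the final answer (field names are assumptions).
from types import SimpleNamespace
from typing import Optional, Tuple


def split_reasoning(message) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from a chat completion message object."""
    reasoning = getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)
    return reasoning, message.content


# Stand-in for `outputs.choices[0].message` from the request above.
fake_message = SimpleNamespace(
    content="x = 4",
    reasoning_content="Subtract 5 from both sides: 2x = 8. Divide by 2.",
)
reasoning, answer = split_reasoning(fake_message)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a live server, pass `outputs.choices[0].message` instead of the stand-in object.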

Creation

This model was quantized using the llm-compressor library as shown below.

from llmcompressor import model_free_ptq

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16-G128"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="W4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "model.embed_tokens",
        "re:.*mixer.conv1d.*",
        "re:.*norm_f*",
        "re:.*bias$",
        "re:.*embed_tokens$",
        "backbone.embeddings"
    ],
    max_workers=15,
    device="cuda:0",
)
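The `ignore` list mixes exact names with `re:`-prefixed regular expressions. The sketch below shows how such a list can be evaluated, assuming that `re:` entries are applied with `re.match` (anchored at the start of the name) and that other entries must equal the name exactly. That matching rule is an assumption about llm-compressor's behavior, and the module names in the example are hypothetical rather than taken from the checkpoint.

```python
# How an ignore list with "re:" entries could be evaluated (matching rule assumed).
import re
from typing import Iterable

IGNORE = [
    "re:.*gate$",
    "lm_head",
    "model.embed_tokens",
    "re:.*mixer.conv1d.*",
    "re:.*norm_f*",
    "re:.*bias$",
    "re:.*embed_tokens$",
    "backbone.embeddings",
]


def is_ignored(name: str, ignore: Iterable[str] = IGNORE) -> bool:
    """Return True if `name` matches any exact or regex entry in `ignore`."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[len("re:"):], name):
                return True
        elif entry == name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "backbone.layers.3.mixer.gate",               # router gate: skipped
    "backbone.layers.3.mixer.conv1d",             # Mamba convolution: skipped
    "backbone.norm_f",                            # final norm: skipped
    "lm_head",                                    # output head: skipped
    "backbone.layers.3.mixer.experts.0.up_proj",  # expert linear: quantized
]:
    print(f"{name}: {'ignored' if is_ignored(name) else 'quantized'}")
```

Note that in `re:.*norm_f*` the trailing `f*` means "zero or more `f`", so under `re.match` the pattern matches any name that contains `norm_`, not only names containing `norm_f`.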

Evaluation

The model was evaluated on reasoning tasks using lighteval. vLLM was used as the serving backend for all evaluations.

Install dependencies:

uv pip install git+https://github.com/vllm-project/vllm.git
uv pip install lighteval==0.13.0
uv pip install "litellm[caching]>=1.66.0"

Launch the vLLM server:

vllm serve RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16 \
  --host 0.0.0.0 --port 8088 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser nemotron_v3 \
  --mamba-ssm-cache-dtype float16 \
  --mamba-backend flashinfer \
  --enable-mamba-cache-stochastic-rounding \
  --mamba-cache-philox-rounds 5 \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}' \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 96}' \
  --trust-remote-code

AIME 2025:

lighteval endpoint litellm \
  "model_name=hosted_vllm/RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16,provider=hosted_vllm,base_url=http://127.0.0.1:8088/v1,timeout=3600,concurrent_requests=32,generation_parameters={temperature:1.0,top_p:0.95,max_new_tokens:32768}" \
  "aime25|0" \
  --output-dir results --save-details

GPQA Diamond:

lighteval endpoint litellm \
  "model_name=hosted_vllm/RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16,provider=hosted_vllm,base_url=http://127.0.0.1:8088/v1,timeout=3600,concurrent_requests=32,generation_parameters={temperature:1.0,top_p:0.95,max_new_tokens:32768}" \
  "gpqa:diamond|0" \
  --output-dir results --save-details
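Both benchmarks are small, so a pass@1 score moves in coarse steps. The sketch below converts between percentages and question counts, using the public sizes of the two sets (30 problems for AIME 2025, 198 questions for GPQA Diamond). It assumes each question is scored once per run; the card does not state how many samples per question were drawn.

```python
# Relate pass@1 percentages to question counts (one sample per question assumed).
AIME_2025_PROBLEMS = 30
GPQA_DIAMOND_QUESTIONS = 198


def score_pct(correct: int, total: int) -> float:
    """Percentage score for `correct` answers out of `total`, to two decimals."""
    return round(100 * correct / total, 2)


def implied_correct(pct: float, total: int) -> int:
    """Nearest whole number of correct answers that yields `pct`."""
    return round(pct / 100 * total)


print("One AIME problem is worth", score_pct(1, AIME_2025_PROBLEMS), "points")
print("One GPQA question is worth", score_pct(1, GPQA_DIAMOND_QUESTIONS), "points")
print("86.67 on AIME 2025 implies", implied_correct(86.67, AIME_2025_PROBLEMS), "of 30 correct")
```

Under this assumption, a gap of 3.33 points on AIME 2025 corresponds to a single problem, which is worth keeping in mind when comparing models on that benchmark.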

Accuracy

Values in parentheses show recovery relative to the BF16 baseline.

| Model | AIME 2025 (pass@1) | GPQA Diamond (pass@1) | Average |
|---|---|---|---|
| nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 | 90.00 | 78.79 | 84.39 |
| nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 | 90.00 (100.0%) | 84.85 (107.7%) | 87.42 (103.6%) |
| RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-Dynamic | 93.33 (103.7%) | 82.32 (104.5%) | 87.83 (104.1%) |
| RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-BLOCK | 86.67 (96.3%) | 81.31 (103.2%) | 83.99 (99.5%) |
| RedHatAI/NVIDIA-Nemotron-3-Ultra-550B-A55B-quantized.w4a16 (this model) | 86.67 (96.3%) | 81.82 (103.8%) | 84.24 (99.8%) |
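The percentages in parentheses can be reproduced by dividing each quantized score by the BF16 score. A short sketch using the scores reported above for this model; the variable and function names are illustrative.

```python
# Recompute recovery percentages for the W4A16 model from the reported scores.
from statistics import mean

BASELINE_BF16 = {"AIME 2025": 90.00, "GPQA Diamond": 78.79}
QUANTIZED_W4A16 = {"AIME 2025": 86.67, "GPQA Diamond": 81.82}


def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100 * quantized / baseline


for task, baseline in BASELINE_BF16.items():
    pct = recovery(QUANTIZED_W4A16[task], baseline)
    print(f"{task}: {pct:.1f}%")

average_recovery = recovery(mean(QUANTIZED_W4A16.values()), mean(BASELINE_BF16.values()))
print(f"Average: {average_recovery:.1f}%")
```

Recovery above 100% on a single benchmark does not mean quantization improved the model; with sampling at temperature 1.0 and small test sets, differences of a few points are within run-to-run variation.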