Instructions to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

SGLang

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
```

Corrupted Weight Shards (6–10): Shards 6-10 are 40-byte "ghost" pointer files

by CRY24180339 - opened Feb 3

Discussion

CRY24180339

Feb 3

The main issue with the NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 repository is a hydration failure within the Xet storage system.

Here is a concise breakdown of the technical faults for your issue report:

Corrupted Weight Shards (6–10)

The Symptom: Shards model-00006-of-00010.safetensors through model-00010-of-00010.safetensors are served as tiny 40-byte text files rather than multi-gigabyte binary weights.

The Cause: These files are currently "ghost" pointers used by the Xet storage backend on the Hugging Face Hub. On this specific repository branch, the backend is failing to "hydrate" (materialize) these pointers into the actual tensor data during download.

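
One way to confirm this symptom locally is to scan the download directory for shards that are implausibly small or that begin with pointer text. The following is a minimal sketch, not part of any Hugging Face tool: the function name `find_pointer_stubs` and the 1024-byte threshold are assumptions chosen for illustration.

```python
from pathlib import Path

# Assumption: a real weight shard is far larger than this; pointer stubs are tiny text files.
MIN_PLAUSIBLE_SHARD_BYTES = 1024


def find_pointer_stubs(model_dir):
    """Return names of .safetensors files that look like unhydrated pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).glob("*.safetensors")):
        size = path.stat().st_size
        with path.open("rb") as handle:
            head = handle.read(16)
        # Flag files that are implausibly small or that start with pointer text.
        if size < MIN_PLAUSIBLE_SHARD_BYTES or head.startswith(b"version "):
            stubs.append(path.name)
    return stubs
```

Calling `find_pointer_stubs("/path/to/local/model")` (a hypothetical path) should list exactly the shards that need to be downloaded again.
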
Broken Safetensors Headers

The Symptom: Attempts to load the shards result in Error while deserializing header: header too large.

The Cause: A valid safetensors file must begin with an 8-byte little-endian unsigned integer giving the length of the JSON metadata header. Because these files instead contain Git LFS/Xet pointer text (version https://git-lfs...), the loader reads the first 8 bytes of that text as an enormous header size and rejects the file.

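
To see why pointer text triggers this error, the sketch below decodes the first 8 bytes of a file the same way a safetensors reader would. It is only an illustration: `read_header_length` is a hypothetical helper, and the sample bytes are the pointer prefix quoted above, not the contents of any actual shard.

```python
import struct


def read_header_length(first_bytes):
    """Interpret the first 8 bytes as a little-endian unsigned 64-bit header length."""
    if len(first_bytes) < 8:
        raise ValueError("need at least 8 bytes to read the header length")
    (length,) = struct.unpack("<Q", first_bytes[:8])
    return length


# Assumption: the stub begins with the pointer text quoted in the report.
pointer_prefix = b"version https://git-lfs"
bogus_length = read_header_length(pointer_prefix)
print(f"pointer text decodes to a header length of {bogus_length:,} bytes")
```

The ASCII bytes of "version " decode to a value on the order of 10^18, far larger than any real metadata header, which is consistent with the "header too large" message.
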
Repository Incompatibility

Environment Failure: Standard tools such as huggingface-cli, hf_hub_download, and even the huggingface_hub[xet] extra fail to resolve these specific pointers into real weight data.

Blocker: This prevents TensorRT-LLM from building an engine, as the weight map cannot be verified without readable shard headers.
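
A rough way to reproduce the weight-map check outside TensorRT-LLM is to walk the shard index and confirm that every referenced shard exists and declares a header that fits inside the file. This is a sketch under assumptions: it presumes the repository ships the usual `model.safetensors.index.json` with a `weight_map` key, and `check_weight_map` is a hypothetical helper, not TensorRT-LLM's actual verification logic.

```python
import json
from pathlib import Path


def check_weight_map(model_dir, index_name="model.safetensors.index.json"):
    """Return shard names from the index that are missing or lack a plausible header."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / index_name).read_text())
    shards = sorted(set(index["weight_map"].values()))
    bad = []
    for name in shards:
        path = model_dir / name
        if not path.is_file():
            bad.append(name)
            continue
        size = path.stat().st_size
        with path.open("rb") as handle:
            prefix = handle.read(8)
        # The declared header must fit inside the file; pointer text fails this check.
        if len(prefix) < 8 or int.from_bytes(prefix, "little") + 8 > size:
            bad.append(name)
    return bad
```

An empty result means every shard in the index at least has a readable header length; it does not prove the tensor data itself is intact.
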

Suggested Issue Title: Shards 6-10 are 40-byte Xet pointers; "header too large" error on load

Concise Description: Shards 6 through 10 of the NVFP4 model are currently stuck as 40-byte Xet pointer files. Standard materialization via huggingface_hub[xet] fails to hydrate these into valid safetensors files. This results in an "Error while deserializing header: header too large" failure, making it impossible to build a TensorRT-LLM engine or run inference on Blackwell hardware.

bkartal

NVIDIA org Feb 9

Thanks for raising this issue. The HF artifacts have just been updated; please let us know if you still face a build issue.

bkartal changed discussion status to closed Mar 4
