Corrupted Weight Shards (6–10): shards 6–10 are 40-byte "ghost" pointer files
The main issue with the NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 repository is a hydration failure within the Xet storage system.
Here is a concise breakdown of the technical faults for your issue report:
Corrupted Weight Shards (6–10)
The Symptom: Shards model-00006-of-00010.safetensors through model-00010-of-00010.safetensors are served as tiny 40-byte text files rather than multi-gigabyte binary weights.
The Cause: These files are currently "ghost" pointers of the kind used by the Hugging Face Xet storage backend. On this specific repository branch, the backend is failing to "hydrate" (materialize) these pointers into the actual tensor data during download.
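A quick way to confirm the symptom on a local copy is to list every shard that is implausibly small. The following is a minimal sketch, not part of any Hugging Face tooling: the function name `find_stub_shards` and the 1 KiB threshold are assumptions chosen for illustration, based only on the 40-byte size reported above.

```python
from pathlib import Path

# Assumption: any real weight shard is far larger than 1 KiB, while the
# stubs described in this report are only 40 bytes long.
MIN_PLAUSIBLE_BYTES = 1024


def find_stub_shards(model_dir, min_bytes=MIN_PLAUSIBLE_BYTES):
    """Return (file name, size in bytes) for each suspiciously small shard."""
    stubs = []
    for path in sorted(Path(model_dir).glob("*.safetensors")):
        size = path.stat().st_size
        if size < min_bytes:
            stubs.append((path.name, size))
    return stubs
```

Calling `find_stub_shards("/path/to/downloaded/model")` on the affected download would be expected to list shards 6 through 10 if they are still unresolved pointers; an empty list means every shard is at least above the threshold.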
Broken Safetensors Headers
The Symptom: Attempts to load the shards result in Error while deserializing header: header too large.
The Cause: A valid safetensors file must begin with an 8-byte little-endian unsigned integer giving the length of the JSON metadata header. Because these files contain Git LFS/Xet pointer text (version https://git-lfs...) instead, the loader reads the first eight text characters as an enormous header size and aborts.
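The misread can be reproduced without the library by decoding the first eight bytes the way the format prescribes. This is a sketch under stated assumptions: the helper names are hypothetical, and the 100 MB ceiling is an assumed stand-in for the loader's actual header-size limit.

```python
import struct

# Assumption: the loader refuses headers above roughly 100 MB; the exact
# limit may differ between safetensors versions.
ASSUMED_MAX_HEADER_BYTES = 100_000_000


def declared_header_len(first8):
    """Decode the leading 8 bytes as a little-endian unsigned 64-bit length."""
    if len(first8) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack("<Q", first8)[0]


def looks_like_pointer_text(first8):
    """True when the declared header length exceeds the assumed ceiling."""
    return declared_header_len(first8) > ASSUMED_MAX_HEADER_BYTES
```

Passing the ASCII bytes `b"version "` (the start of a pointer file) yields a length many orders of magnitude above the ceiling, which is consistent with the "header too large" message, whereas a genuine shard yields a small, plausible JSON header length.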
Repository Incompatibility
Environment Failure: Standard tools such as huggingface-cli, hf_hub_download, and even the huggingface_hub[xet] extra fail to resolve these specific pointers into real weight data.
Blocker: This prevents TensorRT-LLM from building an engine, as the weight map cannot be verified without readable shard headers.
Suggested Issue Title: Shards 6-10 are 40-byte Xet pointers; "header too large" error on load
Concise Description: Shards 6 through 10 of the NVFP4 model are currently stuck as 40-byte Xet pointer files. Standard materialization via huggingface_hub[xet] fails to hydrate these into valid safetensors files. This results in an "Error while deserializing header: header too large" failure, making it impossible to build a TensorRT-LLM engine or run inference on Blackwell hardware.
Thanks for raising this issue. The HF artifacts have just been updated; please let us know if you still face a build issue.