Instructions to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS",
    filename="IQ4_XS.gguf",
)
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    },
                },
            ],
        }
    ]
)
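The snippet above is the auto-generated multimodal example. The same llm object also handles plain text turns; a minimal sketch, reusing it with the sampling settings recommended later in this card:

# Text-only turn with the same model object (sketch)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.6,
    top_p=0.95,
)
print(response["choices"][0]["message"]["content"])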
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
# Run inference directly in the terminal:
llama-cli -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
# Run inference directly in the terminal:
llama-cli -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
# Run inference directly in the terminal:
./llama-cli -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
# Run inference directly in the terminal:
./build/bin/llama-cli -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Use Docker
docker model run hf.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
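Whichever install path you use, llama-server exposes an OpenAI-compatible API (the Pi and Hermes sections below assume the default port 8080), so any OpenAI client can talk to it. A minimal Python sketch with the openai package; the base URL, dummy API key, and model id are assumptions that may need adjusting for your setup:

from openai import OpenAI

# Point the client at the local llama-server instance (default http://localhost:8080)
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS",
    messages=[{"role": "user", "content": "Describe the Statue of Liberty in one sentence."}],
    temperature=0.6,
    top_p=0.95,
)
print(resp.choices[0].message.content)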
- LM Studio
- Jan
- vLLM
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
- Ollama
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Ollama:
ollama run hf.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
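Once pulled, Ollama also serves the model over its local HTTP API (default port 11434), so it can be scripted rather than used interactively. A minimal sketch, assuming the default port and the model tag from the command above (text-only, per the runtime quirks noted later in this card):

import requests

# Chat through Ollama's local /api/chat endpoint (default port 11434)
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])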
- Unsloth Studio
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run Unsloth Studio:
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS to start chatting
- Pi
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Run Hermes
hermes
- Docker Model Runner
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Docker Model Runner:
docker model run hf.co/majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
- Lemonade
How to use majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS:IQ4_XS
Run and chat with the model
lemonade run user.Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS-IQ4_XS
List all available models
lemonade list
Nemotron-3-Nano-Omni-30B-A3B-Reasoning - TurboQuant GGUF IQ4_XS
GGUF IQ4_XS quantization of Nemotron-3-Nano-Omni-30B-A3B-Reasoning (nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16) with the TurboQuant weight method.
The IQ4_XS.gguf binary in this repo is loaded by llama.cpp / llama-mtmd-cli.
For multimodal inference (text + image + audio + video), pair this with the
multimodal projector: majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16.
For the matched-KV stack — TurboQuant weights + TurboQuant KV-cache modifier —
see majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS-TQ-KV.
For the runtime KV-cache modifier itself (weight-agnostic), see
majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant.
Quickstart
# 1. Download the GGUF + the multimodal projector
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS IQ4_XS.gguf --local-dir ./model
huggingface-cli download majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16 mmproj-F16.gguf --local-dir ./mmproj
# 2. Multimodal inference (text + image + audio + video)
llama-mtmd-cli \
-m ./model/IQ4_XS.gguf \
--mmproj ./mmproj/mmproj-F16.gguf \
--image cat.jpg \
-p "Describe this image in detail" \
--temp 0.6 --top-p 0.95 -n 512
# 3. Text-only inference (no mmproj needed)
llama-cli \
-m ./model/IQ4_XS.gguf \
-p "What is the capital of France?" \
--temp 0.6 --top-p 0.95 -n 256
# Disable extended reasoning (default is on):
# add `--chat-template-kwargs '{"enable_thinking": false}'`
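The downloads in step 1 can also be scripted from Python; a minimal sketch with huggingface_hub, using the same repo and file names as above:

from huggingface_hub import hf_hub_download

# Fetch the quantized weights and the multimodal projector used in the quickstart
model_path = hf_hub_download(
    repo_id="majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-TurboQuant-GGUF-IQ4_XS",
    filename="IQ4_XS.gguf",
    local_dir="./model",
)
mmproj_path = hf_hub_download(
    repo_id="majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16",
    filename="mmproj-F16.gguf",
    local_dir="./mmproj",
)
print(model_path, mmproj_path)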
⚠️ Do NOT use llama.cpp built against CUDA 13.2; it produces gibberish output. Pin CUDA 12.x or use Metal/CPU.
Modality matrix
| Modality | Encoder | Quantization in this variant |
|---|---|---|
| Text | LLM backbone (Mamba-2 + Transformer hybrid Sparse MoE) | per the variant suffix |
| Image | CRADIO v4-H | BF16 (kept full-precision in every non-GGUF variant; GGUF uses mmproj-F16 split file) |
| Audio | Parakeet-TDT-0.6B-v2 | BF16 (same rationale) |
| Video | Parakeet-TDT-0.6B-v2 + frame sampler | BF16 (≤ 2 min, 256 frames @ 2 FPS) |
NVIDIA's official FP8 / NVFP4 recipe keeps both encoders + the cross-modal MLP projectors in BF16 to preserve multimodal accuracy. We follow that convention in every quantized variant we ship.
Runtime quirks
llama.cpp
Use llama-mtmd-cli for multimodal inference; pass --mmproj mmproj-F16.gguf
(see majentik/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-mmproj-F16).
Do NOT use CUDA 13.2; it produces gibberish output. Pin CUDA 12.x or use the Metal/CPU paths.
Ollama
Text-only; multimodal is blocked because Ollama doesn't yet support the mmproj split-file pattern.
Reasoning mode
enable_thinking defaults to True. To disable extended reasoning
(e.g., for latency-sensitive cases), pass enable_thinking=False
to the chat template / generate call. No separate "no-think"
variant card exists — this is a runtime flag, not a model variant.
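With llama-server the flag can also be sent per request. A minimal sketch that posts chat_template_kwargs to the local OpenAI-compatible endpoint; this assumes a recent llama.cpp build that forwards chat_template_kwargs to the chat template (the same mechanism as the --chat-template-kwargs CLI flag in the Quickstart) and the default port 8080:

import requests

# Request a completion with extended reasoning disabled
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "chat_template_kwargs": {"enable_thinking": False},
        "temperature": 0.6,
        "top_p": 0.95,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])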
Quant trade-off (GGUF lane)
| Quant | Approx size | Use case | Recommendation |
|---|---|---|---|
| Q2_K | ~17 GB | Lossy, low-RAM CPU/edge | Resource-constrained inference |
| Q3_K_M | ~19 GB | Smaller-than-Q4, modest quality drop | Edge devices with ~16 GB RAM |
| **IQ4_XS** | ~16 GB | Importance-quant 4-bit, smaller than Q4_K_M | Best size/quality at 4-bit |
| Q4_K_M | ~23 GB | Balanced default | Recommended for most users |
| Q5_K_M | ~24 GB | Higher fidelity than Q4 | Quality-sensitive applications |
| Q6_K | ~28 GB | Approaching FP16 quality | High-fidelity CPU/edge |
| Q8_0 | ~32 GB | Near-lossless reference | Fidelity-critical work |
| MXFP4_MOE | ~17 GB | Microscaling FP4 (MoE-aware) | vLLM / transformers users |
(Current variant — IQ4_XS — is bolded.)
Variants in this family
(Showing 56 sibling variants under majentik/nemotron3-nano-omni-30b-*. The current variant — TurboQuant-GGUF-IQ4_XS — is bolded.)
| Variant | Runtime | Approx size | Use case |
|---|---|---|---|
| mmproj-F16 | llama-mtmd-cli | ~1-2 GB | Multimodal projector (pair with any GGUF) |
| RotorQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| RotorQuant-GGUF-IQ4_XS | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| RotorQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| RotorQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| RotorQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| RotorQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| RotorQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| RotorQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| RotorQuant-GGUF-IQ4_XS-RQ-KV | llama.cpp | ~26 GB | IQ4_XS + RotorQuant KV |
| RotorQuant-GGUF-MXFP4_MOE-RQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + RotorQuant KV |
| RotorQuant-GGUF-Q2_K-RQ-KV | llama.cpp | ~18 GB | Q2_K + RotorQuant KV |
| RotorQuant-GGUF-Q3_K_M-RQ-KV | llama.cpp | ~23 GB | Q3_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q4_K_M-RQ-KV | llama.cpp | ~33 GB | Q4_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q5_K_M-RQ-KV | llama.cpp | ~40 GB | Q5_K_M + RotorQuant KV |
| RotorQuant-GGUF-Q8_0-RQ-KV | llama.cpp | ~63 GB | Q8_0 + RotorQuant KV |
| RotorQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| RotorQuant-MLX-2bit-RQ-KV | mlx-lm | ~9.6 GB | 2-bit + RotorQuant KV |
| RotorQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| RotorQuant-MLX-3bit-RQ-KV | mlx-lm | ~14 GB | 3-bit + RotorQuant KV |
| RotorQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| RotorQuant-MLX-4bit-RQ-KV | mlx-lm | ~19 GB | 4-bit + RotorQuant KV |
| RotorQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| RotorQuant-MLX-5bit-RQ-KV | mlx-lm | ~23 GB | 5-bit + RotorQuant KV |
| RotorQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| RotorQuant-MLX-6bit-RQ-KV | mlx-lm | ~27 GB | 6-bit + RotorQuant KV |
| RotorQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| RotorQuant-MLX-8bit-RQ-KV | mlx-lm | ~35 GB | 8-bit + RotorQuant KV |
| RotorQuant-MLX-MXFP4 | mlx-lm | ~19 GB | Apple Silicon MXFP4 |
| TurboQuant | runtime modifier | n/a | KV-cache root (weight-agnostic) |
| **TurboQuant-GGUF-IQ4_XS** | llama.cpp | ~26 GB | Lossy 4-bit, low-RAM CPU/edge |
| TurboQuant-GGUF-MXFP4_MOE | llama.cpp | ~30 GB | MXFP4 MoE quant |
| TurboQuant-GGUF-Q2_K | llama.cpp | ~18 GB | Lossy, low-RAM CPU/edge |
| TurboQuant-GGUF-Q3_K_M | llama.cpp | ~23 GB | Smaller 3-bit, CPU-friendly |
| TurboQuant-GGUF-Q4_K_M | llama.cpp | ~33 GB | Balanced default |
| TurboQuant-GGUF-Q5_K_M | llama.cpp | ~40 GB | Higher fidelity, more RAM |
| TurboQuant-GGUF-Q8_0 | llama.cpp | ~63 GB | Near-lossless reference |
| TurboQuant-GGUF-IQ4_XS-TQ-KV | llama.cpp | ~26 GB | IQ4_XS + TurboQuant KV |
| TurboQuant-GGUF-MXFP4_MOE-TQ-KV | llama.cpp | ~30 GB | MXFP4 MoE + TurboQuant KV |
| TurboQuant-GGUF-Q2_K-TQ-KV | llama.cpp | ~18 GB | Q2_K + TurboQuant KV |
| TurboQuant-GGUF-Q3_K_M-TQ-KV | llama.cpp | ~23 GB | Q3_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q4_K_M-TQ-KV | llama.cpp | ~33 GB | Q4_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q5_K_M-TQ-KV | llama.cpp | ~40 GB | Q5_K_M + TurboQuant KV |
| TurboQuant-GGUF-Q8_0-TQ-KV | llama.cpp | ~63 GB | Q8_0 + TurboQuant KV |
| TurboQuant-MLX-2bit | mlx-lm | ~9.6 GB | Apple Silicon, smallest |
| TurboQuant-MLX-2bit-TQ-KV | mlx-lm | ~9.6 GB | 2-bit + TurboQuant KV |
| TurboQuant-MLX-3bit | mlx-lm | ~14 GB | Apple Silicon, small |
| TurboQuant-MLX-3bit-TQ-KV | mlx-lm | ~14 GB | 3-bit + TurboQuant KV |
| TurboQuant-MLX-4bit | mlx-lm | ~19 GB | Apple Silicon balanced |
| TurboQuant-MLX-4bit-TQ-KV | mlx-lm | ~19 GB | 4-bit + TurboQuant KV |
| TurboQuant-MLX-5bit | mlx-lm | ~23 GB | Apple Silicon, higher fidelity |
| TurboQuant-MLX-5bit-TQ-KV | mlx-lm | ~23 GB | 5-bit + TurboQuant KV |
| TurboQuant-MLX-6bit | mlx-lm | ~27 GB | Apple Silicon, near-lossless |
| TurboQuant-MLX-6bit-TQ-KV | mlx-lm | ~27 GB | 6-bit + TurboQuant KV |
| TurboQuant-MLX-8bit | mlx-lm | ~35 GB | Apple Silicon reference |
| TurboQuant-MLX-8bit-TQ-KV | mlx-lm | ~35 GB | 8-bit + TurboQuant KV |