GGUF quants of Nex-N2-Pro, a fine-tune built on Qwen 3.5 397B A17B. Basically the Qwen 3.6 397B that we never got. Ships with an mmproj file for vision, but without MTP (multi-token prediction).

All quants target 16/24/32GB GPUs, with varying amounts of RAM depending on the quant.
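
As a rough way to choose between the quants, the sketch below compares available memory against the VRAM and RAM figures quoted for each quant further down this card. The `pick_quant` helper and its direct "fits or not" comparison are assumptions for illustration: it leaves no headroom for the OS or other processes, and it uses the lower-VRAM figure for IQ5_KS.

```python
from typing import Optional

# (quant name, VRAM in MB, RAM in GB) as quoted on this card for the
# listed launch configs, ordered from largest to smallest quant.
QUANTS = [
    ("IQ5_KS", 20822, 214),
    ("IQ4_KSS", 19450, 182),
    ("IQ3_M", 19600, 180),
    ("IQ3_XXS", 18930, 151),
    ("IQ2_M", 19050, 138),
    ("IQ1_M", 14210, 94),
]

# Quants the card marks as "ik fork only".
IK_ONLY = {"IQ5_KS", "IQ4_KSS"}


def pick_quant(vram_mb: int, ram_gb: int, ik_fork: bool = True) -> Optional[str]:
    """Return the largest quant whose quoted usage fits, or None.

    Hypothetical helper: a plain comparison with no safety margin.
    """
    for name, need_vram_mb, need_ram_gb in QUANTS:
        if name in IK_ONLY and not ik_fork:
            continue  # mainline llama.cpp cannot load these
        if need_vram_mb <= vram_mb and need_ram_gb <= ram_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_quant(vram_mb=24576, ram_gb=192))                 # ik_llama.cpp
    print(pick_quant(vram_mb=24576, ram_gb=192, ik_fork=False))  # mainline
```

In practice you would want some margin on top of these numbers, since the quoted usage is for the exact configs shown on this card.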

Specific quant details:

IQ5_KS - ik fork only

Only works on ik_llama.cpp; targets a 256GB RAM system plus a 24/32GB NVIDIA GPU.

With this config it eats 20822MB of VRAM and 214GB of RAM. It needs a strong CPU (such as a 9950X3D), otherwise prompt processing (PP) is slower:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ5_KS.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=1000,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

With this config it eats 23500MB of VRAM and 214GB of RAM, which speeds up prompt processing (PP) on weaker CPUs at the cost of more VRAM:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ5_KS.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=bf16
blk\..*\.attn_qkv\.weight=bf16
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=IQ5_KS
blk\..*\.ffn_(gate|up)_exps\.weight=IQ4_KS

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0
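
The two `-ot` overrides in the launch commands select tensors by a regex on the tensor name. The sketch below checks which names those patterns would send to the CPU. It assumes the override is applied with regex-search semantics and that Python's `re` treats these particular patterns the same way the server does. The tensor names are illustrative, built from the naming used in the recipe above.

```python
import re

# Patterns copied from the -ot flags, with the trailing "=CPU" target removed.
CPU_PATTERNS = [
    re.compile(r"blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*"),  # routed experts, layers 0-59
    re.compile(r"token_embd\.weight"),                       # token embedding table
]


def goes_to_cpu(tensor_name: str) -> bool:
    """True if any -ot pattern matches somewhere in the tensor name."""
    return any(p.search(tensor_name) for p in CPU_PATTERNS)


if __name__ == "__main__":
    # Illustrative names only; they follow the recipe's naming scheme.
    for name in [
        "blk.0.ffn_down_exps.weight",    # routed expert: CPU
        "blk.59.ffn_gate_exps.weight",   # routed expert: CPU
        "blk.7.ffn_down_shexp.weight",   # shared expert: not matched
        "blk.7.attn_q.weight",           # attention: not matched
        "token_embd.weight",             # embeddings: CPU
    ]:
        print(f"{name:32s} -> {'CPU' if goes_to_cpu(name) else 'default'}")
```

Under these assumptions, only the routed expert tensors and the token embeddings are matched, while attention and shared expert tensors are left to the default placement set by `-ngl`.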

IQ4_KSS - ik fork only

Works only with ik_llama.cpp; targets a 192GB RAM system plus any 24GB GPU.

Eats 19450MB of VRAM and 182GB of RAM with the standard config:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ4_KSS.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=bf16
blk\..*\.ssm_beta\.weight=bf16
blk\..*\.ssm_out\.weight=bf16

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0

IQ3_M - mainline compatible (uploading...)

Works with mainline llama.cpp and ik_llama.cpp; targets a 196GB RAM system plus any 24GB GPU.

Eats 19600MB of VRAM and 180GB of RAM with the standard config:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ3_M.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=IQ4_XS
blk\..*\.ffn_(gate|up)_exps\.weight=IQ3_S

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0

IQ3_XXS - mainline compatible

Works with mainline llama.cpp and ik_llama.cpp; targets a 196GB RAM system plus any 24GB GPU.

Eats 18930MB of VRAM and 151GB of RAM with the standard config:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ3_XXS.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=Q6_K
blk\..*\.ffn_(gate|up)_shexp\.weight=Q6_K

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=IQ3_XXS
blk\..*\.ffn_(gate|up)_exps\.weight=IQ3_XXS

# Non-Repeating Layers
token_embd\.weight=q6_k
output\.weight=q6_k

IQ2_M - mainline compatible

Works with mainline llama.cpp and ik_llama.cpp; targets a 196GB RAM system plus any 24GB GPU.

Eats 19050MB of VRAM and 138GB of RAM with the standard config:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ2_M.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=q8_0
blk\..*\.attn_qkv\.weight=q8_0
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=q8_0
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=IQ3_XXS
blk\..*\.ffn_(gate|up)_exps\.weight=IQ2_S

# Non-Repeating Layers
token_embd\.weight=q8_0
output\.weight=q8_0

IQ1_M - mainline compatible

Works with mainline llama.cpp and ik_llama.cpp; targets a 128GB RAM system plus any GPU with 16GB or more.

Eats 14210MB of VRAM and 94GB of RAM with the standard config:

./build/bin/llama-server \
    -m pmodels/Nex-397B-A17B-IQ1_M.gguf \
    --mmproj pmodels/Nex-397B-A17B-BF16-mmproj.gguf \
    --no-mmproj-offload \
    -a NexQ8 \
    --slot-save-path slots \
    --context-shift off \
    -ot "blk\.(?:[0-9]|[1-5][0-9])\.ffn.*_exps.*=CPU" \
    -ot "token_embd\.weight=CPU" \
    -c 196608 \
    --ctx-checkpoints 12 \
    --ctx-checkpoints-interval 512 \
    --ctx-checkpoints-tolerance 4 \
    --parallel 1 \
    -cram 0 \
    -b 4096 -ub 4096 \
    -wgt 1 \
    -ctk q8_0 -ctv q8_0 \
    -khad -vhad \
    -mqkv \
    --threads 7 --threads-batch 8 -ngl 100 \
    -cuda fusion=1,offload-batch-size=16,mmq-id-size=0,fa-offset=0 \
    --host 127.0.0.1 \
    --port 8080 \
    --webui none \
    --jinja

Quant recipe (tensor-name regex = quant type):

## Gated Attention/Delta Net [Blended 0-59]
blk\..*\.attn_gate\.weight=IQ4_XS
blk\..*\.attn_qkv\.weight=IQ4_XS
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=q8_0

# Normal attention
blk\..*\.attn_output\.weight=IQ4_XS
blk\..*\.attn_q\.weight=IQ4_XS
blk\..*\.attn_k\.weight=IQ4_XS
blk\..*\.attn_v\.weight=IQ4_XS

# Shared Expert Layers [0-59]
blk\..*\.ffn_down_shexp\.weight=IQ4_XS
blk\..*\.ffn_(gate|up)_shexp\.weight=IQ4_XS

# Routed Experts Layers [0-59]
blk\..*\.ffn_down_exps\.weight=IQ2_XXS
blk\..*\.ffn_(gate|up)_exps\.weight=IQ1_M

# Non-Repeating Layers
token_embd\.weight=Q6_K
output\.weight=Q6_K
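
All of the launch commands above bind to 127.0.0.1:8080 with the alias `NexQ8` and `--webui none`, so the server is meant to be driven through its API. Here is a minimal client sketch using only the Python standard library. It assumes the server exposes the usual OpenAI-compatible `/v1/chat/completions` route; the helper names `build_chat_request` and `chat` are made up for this example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # --host / --port from the launch commands
MODEL_ALIAS = "NexQ8"               # value passed to -a


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL_ALIAS,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server started with one of the configs above.
    print(chat("Say hello in one short sentence."))
```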

Each additional 65536 tokens of context window requires one additional GB of VRAM with a Q8 KV cache.
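
The arithmetic behind that rule of thumb can be sketched as follows, taking the `-c 196608` used by the launch configs on this card as the baseline. The helper name is hypothetical, and the linear scaling is simply the card's rule applied as-is. Contexts at or below the baseline return 0, because the card only states the cost of adding context.

```python
TOKENS_PER_GB = 65536   # from the rule of thumb above (Q8 KV cache)
BASELINE_CTX = 196608   # the -c value in the launch configs on this card


def extra_vram_gb(target_ctx: int, baseline_ctx: int = BASELINE_CTX) -> float:
    """Extra VRAM in GB needed to grow the context beyond the baseline."""
    extra_tokens = max(0, target_ctx - baseline_ctx)
    return extra_tokens / TOKENS_PER_GB


if __name__ == "__main__":
    for ctx in (196608, 262144, 524288):
        print(f"-c {ctx}: +{extra_vram_gb(ctx):.1f} GB VRAM")
```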

The model was natively trained with a 262144-token context window, so to go beyond 262144 you need the additional YaRN flags (for both ik_llama.cpp and mainline):

  --rope-scaling yarn
  --rope-scale N
  --yarn-orig-ctx 262144

Here N is the context ceiling multiplier (2 for 524288, 4 for 1M). There is close to no quality loss at scale 2 and some quality loss at scale 4.
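
To make the choice of N concrete, this sketch derives the three flags from a target context size. The card only gives N = 2 and N = 4 as examples, so rounding up to the next whole multiple of the native window is an assumption, and `yarn_flags` is a hypothetical helper.

```python
import math
from typing import List

NATIVE_CTX = 262144  # native training context stated on this card


def yarn_flags(target_ctx: int) -> List[str]:
    """Return the extra YaRN flags for a target context, or [] if none are needed."""
    if target_ctx <= NATIVE_CTX:
        return []  # fits in the native window, no scaling required
    # Assumption: N is the smallest whole multiplier that covers the target.
    scale = math.ceil(target_ctx / NATIVE_CTX)
    return [
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(NATIVE_CTX),
    ]


if __name__ == "__main__":
    for ctx in (196608, 524288, 1048576):
        print(ctx, " ".join(yarn_flags(ctx)) or "(no extra flags)")
```

The `-c` value in the launch command still has to be raised to the target context separately; these flags only change the RoPE scaling.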

Repository: paragon-of-brah/Nex-N2-Pro-397B-A17B-GGUF
Format: GGUF
Model size: 396B params
Architecture: qwen35moe
Base model: nex-agi/Nex-N2-Pro