Qwen3-Next-80B-A3B-Instruct-NVFP4
NVFP4 quantization of Qwen3-Next-80B-A3B-Instruct: 160 GB → 44.6 GB, ready for single-GPU deployment.
A high-quality NVFP4 (NVIDIA FP4) quantization of Qwen's flagship Mixture-of-Experts model, calibrated on Italian-language data with full expert coverage. Designed for production inference with vLLM on NVIDIA Blackwell, Hopper, and Ada GPUs.
Model Overview
| Property | Value |
|---|---|
| Architecture | Qwen3-Next: MoE with DeltaNet (linear attention) + standard attention |
| Parameters | 80B total, 3B active per token (512 experts, top-10 routing) |
| Quantization | NVFP4 (4-bit floating point) with FP8 KV cache |
| Size | 44.6 GB (from 160 GB BF16), a 72% reduction |
| Format | compressed-tensors, natively supported by vLLM |
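The size figures in the overview can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the bits-per-parameter estimate is a rough illustration that assumes 1 GB = 10^9 bytes and exactly 80B parameters, neither of which the card states.

```python
# Figures quoted in the overview table.
BF16_SIZE_GB = 160.0
NVFP4_SIZE_GB = 44.6
TOTAL_PARAMS_BILLIONS = 80.0  # "80B total" from the overview

# Fraction of the original checkpoint size that was removed.
reduction = 1.0 - NVFP4_SIZE_GB / BF16_SIZE_GB

# How many times smaller the quantized checkpoint is.
compression_ratio = BF16_SIZE_GB / NVFP4_SIZE_GB

# Rough average storage per parameter (assumption: 1 GB = 1e9 bytes).
# It lands above 4 bits because some layers stay in their original
# precision and the quantized weights also carry scale factors.
bits_per_param = NVFP4_SIZE_GB * 8 / TOTAL_PARAMS_BILLIONS

print(f"Size reduction: {reduction:.1%}")
print(f"Compression ratio: {compression_ratio:.2f}x")
print(f"Average bits per parameter: {bits_per_param:.2f}")
```

The result agrees with the 72% reduction quoted in the table.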
Quick Start
vLLM (recommended)
vllm serve Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8
vLLM with Docker
docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8
Python (OpenAI-compatible API)
from openai import OpenAI

# Point the client at the local vLLM server started above.
# The API key is a placeholder; the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
Python (Transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4"

# Load the quantized weights and place them on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is DeltaNet and how does it differ from standard attention?"},
]

# Render the chat template to text, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
Quantization Details
Method
NVFP4 quantization using llmcompressor v0.9.0 with the compressed-tensors format. Weights are quantized to 4-bit NVIDIA floating point with per-channel global scales, and the KV cache is quantized to FP8 for additional memory savings during inference.
Calibration
| Setting | Value |
|---|---|
| Samples | 512 |
| Sequence length | 1024 tokens |
| Calibration language | Italian |
| MoE coverage | All 512 experts calibrated (moe_calibrate_all_experts=True) |
| Pipeline | Basic (full GPU, no CPU offload) |
| Hardware | 2× NVIDIA B200 SXM (358 GB VRAM) |
| Total time | ~4 hours |
Preserved Layers (not quantized)
The following layers are kept in their original precision to preserve model quality:
| Pattern | Reason |
|---|---|
| lm_head | Output projection, critical for token prediction |
| mlp.gate | MoE routing gates: low parameter count, high impact |
| mlp.shared_expert_gate | Shared expert gating, controls the shared expert's contribution |
| linear_attn.* | DeltaNet layers, a specialized linear attention mechanism |
| self_attn.q_proj | Query projection on standard attention layers |
| self_attn.k_proj | Key projection on standard attention layers |
| self_attn.v_proj | Value projection on standard attention layers |
These exclusions follow NVIDIA's official quantization configuration for this architecture. A total of 385 modules are preserved in original precision.
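To illustrate how such patterns separate preserved modules from quantized ones, here is a minimal sketch that rewrites the table's patterns as regular expressions and checks a few module names against them. The regex form, the `is_preserved` helper, and the example module names are all assumptions made for illustration; they are not taken from the actual quantization recipe or the model's real module list.

```python
import re

# Patterns from the table, rewritten as regular expressions.
# This regex form is an assumption of the sketch, not the real recipe.
PRESERVED_PATTERNS = [
    r"(^|\.)lm_head$",               # output projection
    r"\.mlp\.gate$",                 # MoE router (must not match gate_proj)
    r"\.mlp\.shared_expert_gate$",   # shared expert gate
    r"\.linear_attn\.",              # anything inside a DeltaNet block
    r"\.self_attn\.[qkv]_proj$",     # q/k/v projections of standard attention
]


def is_preserved(module_name: str) -> bool:
    """Return True if the module name matches any preserved pattern."""
    return any(re.search(p, module_name) for p in PRESERVED_PATTERNS)


# Hypothetical module names, chosen only to exercise each pattern.
examples = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.17.gate_proj",
    "model.layers.0.linear_attn.in_proj",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.self_attn.o_proj",  # not listed in the table
]

for name in examples:
    status = "original precision" if is_preserved(name) else "NVFP4"
    print(f"{name}: {status}")
```

Anchoring the router pattern at the end of the name matters here: a looser match on `mlp.gate` would also catch the expert `gate_proj` weights, which make up most of the parameters that quantization is meant to shrink.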
Hardware Requirements
| Setup | VRAM used by weights | Notes |
|---|---|---|
| 1× B200 (192 GB) | ~45 GB | Recommended; plenty of headroom for KV cache |
| 1× H200 (141 GB) | ~45 GB | Works well |
| 1× A100 (80 GB) | ~45 GB | Works; monitor KV cache with long contexts |
| 1× H100 (80 GB) | ~45 GB | Works; same as A100 |
| 1× RTX 4090 (24 GB) | ~45 GB | Insufficient VRAM |
The FP8 KV cache (--kv-cache-dtype fp8) is recommended for all deployments to maximize context length within available VRAM.
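As a rough guide to how much memory each setup leaves for the KV cache, the sketch below subtracts the ~45 GB weight footprint from each GPU's total VRAM, using the figures from the table above. It is an upper bound only: it ignores activations, runtime overhead, and any memory the inference engine reserves, and the `headroom_gb` helper is a hypothetical name introduced for this example.

```python
WEIGHTS_GB = 45  # approximate weight footprint from the table

# Total VRAM per GPU, as listed in the table.
GPUS_GB = {
    "B200": 192,
    "H200": 141,
    "A100": 80,
    "H100": 80,
    "RTX 4090": 24,
}


def headroom_gb(total_vram_gb: int, weights_gb: int = WEIGHTS_GB) -> int:
    """VRAM left after loading the weights (negative if they do not fit)."""
    return total_vram_gb - weights_gb


for name, vram in GPUS_GB.items():
    spare = headroom_gb(vram)
    if spare > 0:
        print(f"{name}: up to ~{spare} GB left for KV cache and activations")
    else:
        print(f"{name}: weights do not fit (short by ~{-spare} GB)")
```

On the 80 GB cards this leaves roughly 35 GB at most, which is why the table suggests monitoring the KV cache for long contexts.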
Architecture Notes
Qwen3-Next introduces a hybrid attention architecture that alternates between:
- DeltaNet (linear attention): Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, ... (efficient linear-complexity attention)
- Standard attention: Layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47 (full quadratic attention every 4th layer)
This hybrid design enables efficient long-context processing while maintaining the representational power of standard attention at regular intervals. The MoE routing activates 10 out of 512 experts per token, keeping inference compute at ~3B active parameters despite the 80B total.
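The layer schedule and routing sparsity described above can be reproduced in a few lines. This is a sketch under one assumption: the total of 48 layers is inferred from the last standard-attention layer listed (47), not stated directly in this card.

```python
NUM_LAYERS = 48           # assumption: inferred from the last listed layer, 47
FULL_ATTENTION_EVERY = 4  # every 4th layer uses standard attention

NUM_EXPERTS = 512
ACTIVE_EXPERTS = 10       # top-10 routing


def layer_type(index: int) -> str:
    """Standard attention on layers 3, 7, 11, ...; DeltaNet everywhere else."""
    is_standard = index % FULL_ATTENTION_EVERY == FULL_ATTENTION_EVERY - 1
    return "standard" if is_standard else "deltanet"


standard_layers = [i for i in range(NUM_LAYERS) if layer_type(i) == "standard"]
deltanet_layers = [i for i in range(NUM_LAYERS) if layer_type(i) == "deltanet"]

# Share of routed experts that run for any single token.
active_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print("Standard attention layers:", standard_layers)
print("DeltaNet layer count:", len(deltanet_layers))
print(f"Experts active per token: {active_fraction:.2%}")
```

Under that assumption the model has 12 standard-attention layers and 36 DeltaNet layers, a 3:1 ratio.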
Important Notes
- Calibration language: calibrated on Italian data. The model retains its full multilingual capabilities, but quantization quality may be slightly better for Italian and similar Romance languages.
- Sequence length: calibrated at 1024 tokens. The model supports longer contexts, but quantization statistics are optimized for this range.
- vLLM recommended: the compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
- Benchmarks: coming soon. Community evaluations welcome.
License
This model inherits the Apache 2.0 license from the base model.
Quantized with ❤️ by Sophia AI
NVFP4 via llmcompressor • 512 experts fully calibrated • Ready for vLLM
Model tree for Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4
Base model
Qwen/Qwen3-Next-80B-A3B-Instruct