⚑ Qwen3-Next-80B-A3B-Instruct-NVFP4

NVFP4 quantization of Qwen3-Next-80B-A3B-Instruct β€” 160 GB β†’ 44.6 GB, ready for single-GPU deployment.

A high-quality NVFP4 (NVIDIA FP4) quantization of Qwen's flagship Mixture-of-Experts model, calibrated on Italian-language data with full expert coverage. Designed for production inference with vLLM on NVIDIA Blackwell, Hopper, and Ada GPUs.


πŸ—οΈ Model Overview

🧬 Architecture Qwen3-Next β€” MoE with DeltaNet (linear attention) + standard attention
πŸ“ Parameters 80B total, 3B active per token (512 experts, top-10 routing)
πŸ—œοΈ Quantization NVFP4 (4-bit floating point) with FP8 KV cache
πŸ“¦ Size 44.6 GB (from 160 GB BF16) β€” 72% reduction
πŸ”§ Format compressed-tensors β€” native vLLM support
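
The headline numbers above can be sanity-checked with back-of-envelope arithmetic β€” the ~72% reduction and the effective bit width follow directly from the quoted sizes:

```python
# Back-of-envelope check of the size figures quoted above.
total_params = 80e9    # 80B parameters
bf16_bytes   = 160e9   # ~2 bytes/param in BF16
nvfp4_bytes  = 44.6e9  # quantized checkpoint size

reduction = 1 - nvfp4_bytes / bf16_bytes
bits_per_param = nvfp4_bytes * 8 / total_params

print(f"reduction: {reduction:.1%}")                  # ~72%
print(f"effective bits/param: {bits_per_param:.2f}")  # ~4.5
```

The effective width lands above 4 bits because scale metadata and the preserved full-precision layers (attention projections, gates, DeltaNet) are included in the checkpoint size.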

πŸš€ Quick Start

vLLM (recommended)

vllm serve Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8

vLLM with Docker

docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4 \
    --kv-cache-dtype fp8

Python (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Python (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-Next-80B-A3B-Instruct-NVFP4"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is DeltaNet and how does it differ from standard attention?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ”¬ Quantization Details

Method

NVFP4 quantization using llmcompressor v0.9.0 with the compressed-tensors format. Weights are quantized to 4-bit NVIDIA floating point (FP4 E2M1) with FP8 scales over small weight groups plus a per-tensor global scale, and the KV cache is quantized to FP8 for additional memory savings during inference.
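
A quantization run along these lines can be sketched with llmcompressor's oneshot API. This is a simplified sketch, not the exact production script: the calibration dataset is a placeholder and the ignore list is abbreviated (the full list appears in the Preserved Layers section below).

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Abbreviated preserved-layer patterns; see the Preserved Layers table for the full set.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
        "re:.*self_attn.(q|k|v)_proj$",
    ],
)

calibration_dataset = ...  # placeholder: 512 Italian-language samples, 1024 tokens each

oneshot(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    dataset=calibration_dataset,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=512,
)
# moe_calibrate_all_experts=True was also enabled so that every one of the
# 512 experts receives calibration data, not only the routed top-10.
```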

Calibration

πŸ“Š Samples 512
πŸ“ Sequence length 1024 tokens
🌍 Calibration language Italian
πŸ”€ MoE coverage All 512 experts calibrated (moe_calibrate_all_experts=True)
βš™οΈ Pipeline Basic (full GPU, no CPU offload)
πŸ–₯️ Hardware 2Γ— NVIDIA B200 SXM (358 GB VRAM)
⏱️ Total time ~4 hours

Preserved Layers (not quantized)

The following layers are kept in their original precision to preserve model quality:

Pattern Reason
lm_head Output projection β€” critical for token prediction
mlp.gate MoE routing gates β€” low parameter count, high impact
mlp.shared_expert_gate Shared expert gating β€” controls expert selection
linear_attn.* DeltaNet layers β€” specialized linear attention mechanism
self_attn.q_proj Query projection on standard attention layers
self_attn.k_proj Key projection on standard attention layers
self_attn.v_proj Value projection on standard attention layers

These exclusions follow NVIDIA's official quantization configuration for this architecture. A total of 385 modules are preserved in original precision.
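
The effect of these exclusion patterns can be illustrated with a simple glob matcher over module names. The module names and glob spellings below are illustrative, not the exact state-dict keys or config syntax:

```python
from fnmatch import fnmatch

# Preserved-layer patterns, re-expressed as globs for illustration.
IGNORE = [
    "lm_head", "*.mlp.gate", "*.mlp.shared_expert_gate", "*.linear_attn.*",
    "*.self_attn.q_proj", "*.self_attn.k_proj", "*.self_attn.v_proj",
]

def is_preserved(name: str) -> bool:
    """True if the module stays in original precision, False if quantized to NVFP4."""
    return any(fnmatch(name, pat) for pat in IGNORE)

# Hypothetical module names, for demonstration only.
for name in [
    "lm_head",                                  # preserved
    "model.layers.0.linear_attn.in_proj",       # preserved (DeltaNet)
    "model.layers.3.self_attn.q_proj",          # preserved
    "model.layers.3.self_attn.o_proj",          # quantized
    "model.layers.5.mlp.experts.0.down_proj",   # quantized (expert weights carry most of the 80B)
]:
    print(f"{name}: {'preserved' if is_preserved(name) else 'NVFP4'}")
```

Note that the expert projections β€” the bulk of the 80B parameters β€” are quantized, which is why the exclusions cost little size while protecting routing and attention quality.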


πŸ’» Hardware Requirements

Setup VRAM Notes
1Γ— B200 (192 GB) ~45 GB βœ… Recommended β€” ample headroom for KV cache
1Γ— H200 (141 GB) ~45 GB βœ… Works well
1Γ— H100 (80 GB) ~45 GB βœ… Works β€” monitor KV cache usage with long contexts
1Γ— A100 (80 GB) ~45 GB ⚠️ Ampere predates FP4 β€” verify vLLM kernel support before deploying
1Γ— RTX 4090 (24 GB) ~45 GB ❌ Insufficient VRAM

The FP8 KV cache (--kv-cache-dtype fp8) is recommended for all deployments to maximize context length within available VRAM.
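
As a rough illustration of why the FP8 KV cache matters here: only the model's 12 full-attention layers (see Architecture Notes) accumulate a per-token KV cache, while DeltaNet layers keep a constant-size recurrent state. The head count and head dimension below are placeholder values, not Qwen3-Next's actual config:

```python
# Rough KV-cache sizing sketch. Only full-attention layers grow with context.
full_attn_layers = 12   # layers 3, 7, ..., 47 in the 48-layer stack
num_kv_heads     = 4    # placeholder, not the real config
head_dim         = 128  # placeholder, not the real config
bytes_per_scalar = 1    # FP8

# K and V per token, summed over all full-attention layers
bytes_per_token = 2 * full_attn_layers * num_kv_heads * head_dim * bytes_per_scalar
context = 32_768
cache_gb = bytes_per_token * context / 1e9
print(f"{bytes_per_token} B/token -> {cache_gb:.2f} GB at {context:,} tokens (2x that in FP16)")
```

Whatever the exact geometry, halving bytes-per-scalar with FP8 halves cache growth, which is what buys longer contexts on 80 GB cards.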


πŸ›οΈ Architecture Notes

Qwen3-Next introduces a hybrid attention architecture that alternates between:

  • DeltaNet (linear attention): Layers 0, 1, 2, 4, 5, 6, 8, 9, 10, ... β€” efficient linear-complexity attention
  • Standard attention: Layers 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47 β€” full quadratic attention every 4th layer

This hybrid design enables efficient long-context processing while maintaining the representational power of standard attention at regular intervals. The MoE routing activates 10 out of 512 experts per token, keeping inference compute at ~3B active parameters despite the 80B total.
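
The alternation described above follows a simple rule β€” in the 48-layer stack, every 4th layer (index ≑ 3 mod 4) uses full attention and the rest use DeltaNet β€” which reproduces the layer lists quoted earlier:

```python
# Hybrid layer schedule: every 4th layer is full attention, the rest DeltaNet.
NUM_LAYERS = 48
full_attn = [i for i in range(NUM_LAYERS) if i % 4 == 3]
deltanet  = [i for i in range(NUM_LAYERS) if i % 4 != 3]

print(full_attn)  # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47]
print(f"{len(deltanet)} DeltaNet layers, {len(full_attn)} full-attention layers")
```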


⚠️ Important Notes

  • 🎯 Calibration language β€” calibrated on Italian data. The model retains its full multilingual capabilities, but quantization quality may be slightly optimized for Italian and similar Romance languages.
  • πŸ“ Sequence length β€” calibrated at 1024 tokens. The model supports longer contexts but quantization statistics are optimized for this range.
  • πŸ”§ vLLM recommended β€” compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
  • πŸ“Š Benchmarks β€” coming soon. Community evaluations welcome.

πŸ“œ License

This model inherits the Apache 2.0 license from the base model.


Quantized with ❀️ by Sophia AI
NVFP4 via llmcompressor β€’ 512 experts fully calibrated β€’ Ready for vLLM
