# Qwen3-30B-A3B-Instruct-2507-NVFP4
NVFP4 quantization of Qwen3-30B-A3B-Instruct-2507: 61 GB → 16.9 GB, ready for single-GPU deployment.
A high-quality NVFP4 (NVIDIA FP4) quantization of Qwen's updated Mixture-of-Experts instruct model (July 2025), calibrated on Italian-language data with full expert coverage. Designed for production inference with vLLM on NVIDIA Blackwell, Hopper, and Ada GPUs.
## Model Overview
| | |
|---|---|
| **Architecture** | Qwen3-MoE: standard transformer with Mixture-of-Experts FFNs |
| **Parameters** | 30B total, 3B active per token (128 experts, top-8 routing) |
| **Quantization** | NVFP4 (4-bit floating-point weights and activations) |
| **Size** | 16.9 GB (down from 61 GB BF16), a 72% reduction |
| **Format** | `compressed-tensors`, natively supported by vLLM |
| **Context** | 262,144 tokens natively |
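The size figure above can be sanity-checked with simple arithmetic. The sketch below assumes one 8-bit scale per group of 16 values on top of the 4-bit weights (a common NVFP4 storage layout, stated here as an assumption), and ignores the preserved BF16 layers and file metadata:

```python
# Back-of-the-envelope NVFP4 footprint: 4-bit weights plus an assumed
# 8-bit scale per group of 16 values, i.e. 4.5 bits per parameter.

def nvfp4_bytes(n_params, group_size=16, scale_bits=8):
    """Approximate storage in bytes for n_params NVFP4-quantized weights."""
    bits_per_param = 4 + scale_bits / group_size  # 4.5 bits/param for group=16
    return n_params * bits_per_param / 8

total = nvfp4_bytes(30_000_000_000)
print(total / 1e9)  # 16.875, i.e. roughly the 16.9 GB reported above
```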
## What's New in 2507
This is the quantized version of the July 2025 update, which brings significant improvements over the original Qwen3-30B-A3B:
- Improved instruction following, logical reasoning, math, science, coding and tool usage
- Better long-tail knowledge coverage across multiple languages
- Enhanced alignment with user preferences in open-ended tasks
- Improved 256K long-context understanding
Note: This model supports only non-thinking mode and does not generate `<think></think>` blocks.
## Quick Start
### vLLM (recommended)
```bash
vllm serve Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4
```
### vLLM with Docker
```bash
docker run --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4
```
### Python (OpenAI-compatible API)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
### Python (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the key improvements in Qwen3's July 2025 update?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quantization Details
### Method
NVFP4 quantization using `llmcompressor` with the `compressed-tensors` format. Weights and activations are quantized to 4-bit NVIDIA floating point with per-group scales (group size 16).
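As a rough illustration of what per-group 4-bit quantization does, here is a pure-Python toy that snaps each group of 16 values onto an FP4-style (E2M1) value grid using a per-group absmax scale. The real NVFP4 pipeline (FP8 scale encoding, two-level scaling, packed storage) is more involved; this is only a sketch of the core idea:

```python
# Representable FP4 (E2M1) magnitudes; signs are handled separately.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values, grid=E2M1):
    """Quantize one group: scale its absmax onto the grid's max (6.0),
    then snap each value to the nearest representable magnitude."""
    scale = max(abs(v) for v in values) / grid[-1] or 1.0
    out = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale * (1 if v >= 0 else -1))
    return out

def quantize(weights, group_size=16):
    """Fake-quantize a flat weight list group by group (per-group scales)."""
    return [q for i in range(0, len(weights), group_size)
              for q in quantize_group(weights[i:i + group_size])]
```

Values already on the scaled grid round-trip exactly; everything else lands on the nearest representable point, which is why per-group (rather than per-tensor) scales matter for outlier-heavy weights.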
### Calibration
| Setting | Value |
|---|---|
| Samples | 512 |
| Sequence length | 1024 tokens (p90-optimized) |
| Calibration language | Italian |
| MoE coverage | All 128 experts calibrated (`moe_calibrate_all_experts=True`) |
| Pipeline | Basic (full GPU, no CPU offload) |
| Hardware | 2× NVIDIA B200 SXM (366 GB VRAM) |
| Quantization time | ~18 minutes |
| Compression time | ~14 minutes |
### Preserved Layers (not quantized)
The following layers are kept in their original BF16 precision to preserve model quality:
| Pattern | Count | Reason |
|---|---|---|
| `lm_head` | 1 | Output projection, critical for token prediction |
| `mlp.gate` | 48 | MoE routing gates: low parameter count, high impact on expert selection |
A total of 49 modules are preserved in original precision.
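A quick way to convince yourself of the count: one router gate per layer across 48 layers, plus the single `lm_head`. The module names below follow the usual Hugging Face layout (`model.layers.{i}.mlp.gate`) and are illustrative, not read from the checkpoint:

```python
import re

# Hypothetical module list mirroring the card's preservation patterns.
modules = ["lm_head"] + [f"model.layers.{i}.mlp.gate" for i in range(48)]

# Match exactly the two preserved patterns: lm_head and the MoE router gates.
preserved = [m for m in modules if re.search(r"^lm_head$|mlp\.gate$", m)]
print(len(preserved))  # 49
```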
## Hardware Requirements
| Setup | VRAM | Notes |
|---|---|---|
| 1× RTX 4090 (24 GB) | ~17 GB | ✅ Fits with room for KV cache |
| 1× RTX 5090 (32 GB) | ~17 GB | ✅ Comfortable fit |
| 1× A100 (40/80 GB) | ~17 GB | ✅ Plenty of headroom |
| 1× H100 (80 GB) | ~17 GB | ✅ Ideal for long contexts |
| 1× B200 (192 GB) | ~17 GB | ✅ Maximum KV cache capacity |
At only 16.9 GB, this model fits comfortably on consumer GPUs. NVFP4 inference requires NVIDIA GPUs with compute capability ≥ 8.9 (Ada Lovelace, Hopper, Blackwell).
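The capability floor reduces to a tuple comparison; at runtime the usual query is `torch.cuda.get_device_capability()`. A minimal sketch, kept GPU-free so it runs anywhere:

```python
# Compute-capability gate for NVFP4, per the >= 8.9 floor stated above.
# In a real deployment, obtain (major, minor) from
# torch.cuda.get_device_capability() instead of hard-coding it.
MIN_CC = (8, 9)  # Ada Lovelace and newer

def supports_nvfp4(major, minor):
    """True if the GPU's compute capability meets the NVFP4 floor."""
    return (major, minor) >= MIN_CC

# Ada (8.9) and Hopper (9.0) pass; Turing (7.5) does not.
print(supports_nvfp4(8, 9), supports_nvfp4(9, 0), supports_nvfp4(7, 5))
```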
## Architecture Notes
Qwen3-30B-A3B is a standard transformer with Mixture-of-Experts (MoE) feed-forward layers:
- 48 transformer layers, each with a MoE FFN block
- 128 experts per layer, with top-8 routing per token
- ~3B parameters active per token out of 30B total
- Standard multi-head attention (not hybrid like Qwen3-Next)
- 262,144 native context length
This architecture enables strong performance at a fraction of the compute cost of a dense 30B model, while maintaining the full capacity of 128 specialized experts.
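The top-8 routing described above can be sketched in a few lines: score all 128 experts per token, keep the best 8, and renormalize their gate weights with a softmax. This is a pure-Python toy, not the batched GPU implementation:

```python
import math
import random

def route(logits, k=8):
    """Return (expert_index, weight) pairs for the top-k experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (a common MoE renormalization),
    # shifted by the max for numerical stability.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one token's gate scores
chosen = route(logits)
print(len(chosen))  # 8 experts active; their weights sum to 1
```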
## Important Notes
- **Calibration language** – calibrated on Italian data. The model retains its full multilingual capabilities (100+ languages), but quantization quality may be slightly better for Italian and similar Romance languages.
- **Sequence length** – calibrated at 1024 tokens. The model supports up to 262K context, but the quantization statistics are optimized for this range.
- **vLLM recommended** – the `compressed-tensors` format is natively supported by vLLM; other inference engines may require conversion.
- **Non-thinking mode only** – this model does not generate `<think></think>` blocks. For reasoning mode, use the Thinking variant.
- **Benchmarks** – coming soon. Community evaluations welcome.
## License
This model inherits the Apache 2.0 license from the base model.
Quantized with ❤️ by Sophia AI

*NVFP4 via `llmcompressor` • 128 experts fully calibrated • Ready for vLLM*
**Base model:** Qwen/Qwen3-30B-A3B-Instruct-2507