⚑ Qwen3-30B-A3B-Instruct-2507-NVFP4

NVFP4 quantization of Qwen3-30B-A3B-Instruct-2507 β€” 61 GB β†’ 16.9 GB, ready for single-GPU deployment.

A high-quality NVFP4 (NVIDIA FP4) quantization of Qwen's updated Mixture-of-Experts instruct model (July 2025), calibrated on Italian-language data with full expert coverage. Designed for production inference with vLLM on NVIDIA Blackwell, Hopper, and Ada GPUs.


πŸ—οΈ Model Overview

🧬 Architecture Qwen3-MoE β€” standard transformer with Mixture-of-Experts
πŸ“ Parameters 30B total, 3B active per token (128 experts, top-8 routing)
πŸ—œοΈ Quantization NVFP4 (4-bit floating point weights and activations)
πŸ“¦ Size 16.9 GB (from 61 GB BF16) β€” 72% reduction
πŸ”§ Format compressed-tensors β€” native vLLM support
πŸ“ Context 262,144 tokens natively

πŸ†• What's New in 2507

This is the quantized version of the July 2025 update, which brings significant improvements over the original Qwen3-30B-A3B:

  • Improved instruction following, logical reasoning, math, science, coding and tool usage
  • Better long-tail knowledge coverage across multiple languages
  • Enhanced alignment with user preferences in open-ended tasks
  • Improved 256K long-context understanding

Note: This model supports only non-thinking mode and does not generate <think></think> blocks.


πŸš€ Quick Start

vLLM (recommended)

vllm serve Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4

vLLM with Docker

docker run --gpus all \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4

Python (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain mixture-of-experts architectures in simple terms."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

Python (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Sophia-AI/Qwen3-30B-A3B-Instruct-2507-NVFP4"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the key improvements in Qwen3's July 2025 update?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

πŸ”¬ Quantization Details

Method

NVFP4 quantization using llmcompressor with the compressed-tensors format. Weights and activations are quantized to 4-bit NVIDIA floating point (FP4 E2M1) with per-group FP8 (E4M3) scales, group size 16.
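To build intuition for what group-scaled FP4 does to the weights, here is a minimal, illustrative sketch of E2M1 fake-quantization with a per-group scale. This is not the llmcompressor implementation (which stores scales in FP8 and packs the 4-bit codes); it only shows the rounding behavior.

```python
# Illustrative sketch of NVFP4-style group quantization (NOT the llmcompressor
# implementation): each group of up to 16 values is scaled so its max magnitude
# maps to the largest E2M1 value (6.0), then snapped to the nearest grid point.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 E2M1 grid

def quantize_group(values, group_size=16):
    """Fake-quantize one group: scale, snap to the E2M1 grid, rescale."""
    assert len(values) <= group_size
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # per-group scale (stored as FP8 E4M3 in real NVFP4)
    out = []
    for v in values:
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(v) / scale - m))
        out.append(mag * scale * (1 if v >= 0 else -1))
    return out

print(quantize_group([0.1, -0.3, 0.75, 1.5]))  # β†’ [0.125, -0.25, 0.75, 1.5]
```

Note how values near the group maximum are represented almost exactly, while small values are rounded more coarsely; the per-group scale keeps this error local to 16 elements instead of a whole tensor.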

Calibration

πŸ“Š Samples 512
πŸ“ Sequence length 1024 tokens (p90-optimized)
🌍 Calibration language Italian
πŸ”€ MoE coverage All 128 experts calibrated (moe_calibrate_all_experts=True)
βš™οΈ Pipeline Basic (full GPU, no CPU offload)
πŸ–₯️ Hardware 2Γ— NVIDIA B200 SXM (366 GB VRAM)
⏱️ Quantization time ~18 minutes
πŸ’Ύ Compression time ~14 minutes

Preserved Layers (not quantized)

The following layers are kept in their original BF16 precision to preserve model quality:

Pattern    Count  Reason
lm_head    1      Output projection β€” critical for token prediction
mlp.gate   48     MoE routing gates β€” low parameter count, high impact on expert selection

A total of 49 modules are preserved in original precision.
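The 49-module count follows directly from the layout: one `mlp.gate` router per each of the 48 layers, plus the single `lm_head`. A quick sketch, using hypothetical module names that mirror the Qwen3-MoE naming scheme:

```python
import re

# Hypothetical module names mirroring the Qwen3-MoE layout: 48 decoder layers,
# each with an MoE router at `mlp.gate`, plus the final `lm_head` projection.
modules = ["lm_head"] + [f"model.layers.{i}.mlp.gate" for i in range(48)]

# The two preserved patterns from the table above, expressed as regexes.
preserved = [re.compile(r"(^|\.)lm_head$"), re.compile(r"(^|\.)mlp\.gate$")]

count = sum(any(p.search(name) for p in preserved) for name in modules)
print(count)  # β†’ 49 modules kept in BF16
```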


πŸ’» Hardware Requirements

Setup VRAM Notes
1Γ— RTX 4090 (24 GB) ~17 GB βœ… Fits with room for KV cache
1Γ— RTX 5090 (32 GB) ~17 GB βœ… Comfortable fit
1Γ— A100 (40/80 GB) ~17 GB βœ… Plenty of headroom
1Γ— H100 (80 GB) ~17 GB βœ… Ideal for long contexts
1Γ— B200 (192 GB) ~17 GB βœ… Maximum KV cache capacity

At only 16.9 GB, this model fits comfortably on consumer GPUs. NVFP4 inference requires NVIDIA GPUs with compute capability β‰₯ 8.9 (Ada Lovelace, Hopper, Blackwell).
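A minimal gate for the capability requirement above can be written as a tuple comparison; in a real deployment you would feed it the `(major, minor)` pair from `torch.cuda.get_device_capability()`. The helper name below is illustrative, not part of any library.

```python
# Minimal capability gate, assuming the >= 8.9 requirement stated above.
MIN_NVFP4_CAPABILITY = (8, 9)  # Ada Lovelace

def supports_nvfp4(capability):
    """Return True if a (major, minor) compute capability can run NVFP4."""
    return tuple(capability) >= MIN_NVFP4_CAPABILITY

print(supports_nvfp4((8, 6)))   # Ampere (RTX 30xx): False
print(supports_nvfp4((8, 9)))   # Ada (RTX 40xx): True
print(supports_nvfp4((9, 0)))   # Hopper (H100): True
print(supports_nvfp4((10, 0)))  # Blackwell (B200): True
```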


πŸ›οΈ Architecture Notes

Qwen3-30B-A3B is a standard transformer with Mixture-of-Experts (MoE) feed-forward layers:

  • 48 transformer layers, each with a MoE FFN block
  • 128 experts per layer, with top-8 routing per token
  • ~3B parameters active per token out of 30B total
  • Standard multi-head attention (not hybrid like Qwen3-Next)
  • 262,144 native context length

This architecture enables strong performance at a fraction of the compute cost of a dense 30B model, while maintaining the full capacity of 128 specialized experts.
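The routing scheme above can be sketched in a few lines. This is a toy illustration of top-8-of-128 gating, not Qwen3's actual router code: the gate scores all 128 experts per token, keeps the 8 highest, and renormalizes their softmax weights, so only those experts' parameters are exercised.

```python
import math
import random

# Toy top-k expert routing (illustrative, not Qwen3's exact gating code).
NUM_EXPERTS, TOP_K = 128, 8

def route(logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    exps = [math.exp(logits[i] - max(logits)) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # one token's gate scores
selected = route(logits)
print(len(selected), round(sum(w for _, w in selected), 6))  # β†’ 8 1.0
```

With only 8 of 128 expert FFNs active per token, roughly a tenth of the parameters participate in each forward step, which is how 30B total parameters yield ~3B active.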


⚠️ Important Notes

  • 🎯 Calibration language β€” calibrated on Italian data. The model retains its full multilingual capabilities (100+ languages), but quantization quality may be slightly optimized for Italian and similar Romance languages.
  • πŸ“ Sequence length β€” calibrated at 1024 tokens. The model supports up to 262K context but quantization statistics are optimized for this range.
  • πŸ”§ vLLM recommended β€” compressed-tensors format is natively supported by vLLM. Other inference engines may require conversion.
  • 🧠 Non-thinking mode only β€” this model does not generate <think></think> blocks. For reasoning mode, use the Thinking variant.
  • πŸ“Š Benchmarks β€” coming soon. Community evaluations welcome.

πŸ“œ License

This model inherits the Apache 2.0 license from the base model.


Quantized with ❀️ by Sophia AI
NVFP4 via llmcompressor β€’ 128 experts fully calibrated β€’ Ready for vLLM
