## Model Description
Qwen3.5-397B-A17B-NVFP4 is an NVFP4-quantized version of Qwen/Qwen3.5-397B-A17B, a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).
The original BF16 weights were quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer v0.37.0.
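The core idea of NVFP4 can be sketched numerically: each 16-element block gets one scale derived from its absolute maximum, and each element is rounded to the nearest 4-bit E2M1 value. The sketch below is a simplified illustration (real NVFP4 also stores the block scale itself in FP8 E4M3 with a per-tensor FP32 scale, which is omitted here):

```python
import numpy as np

# Representable E2M1 (4-bit float) magnitudes
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block):
    """Fake-quantize one 16-element block: blockwise scale + 4-bit values."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0  # map amax onto the largest FP4 value
    scaled = block / scale
    # round each scaled value to the nearest representable E2M1 magnitude
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale  # dequantized back to float for comparison

w = np.random.randn(16).astype(np.float32)
wq = quantize_nvfp4_block(w)
```

Because the block's amax maps exactly onto the largest FP4 value, the worst-case absolute error within a block is bounded by amax/6 (half the widest gap in the E2M1 grid, times the scale).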
## What's quantized
Only the routed MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:
- Shared expert MLPs (these process every token, unlike routed experts; thanks to Festr for pointing this out)
- Attention layers (softmax attention and DeltaNet linear attention)
- Vision encoder (ViT)
- Router / gate weights
- MTP (multi-token prediction) draft model
- Embeddings, layer norms, lm_head
Since the expert weights constitute the vast majority of the 397B parameters, this still yields significant memory savings (~233 GB on disk).
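A back-of-envelope check of that figure, assuming (hypothetically) that roughly 390B of the 397B parameters sit in routed expert weights and the rest stays in BF16:

```python
# NVFP4 cost per weight: 4 bits + one 8-bit scale shared across 16 elements
expert_params = 390e9   # hypothetical split: routed expert weights
other_params = 7e9      # hypothetical split: everything left in BF16
nvfp4_bits = 4 + 8 / 16

expert_gb = expert_params * nvfp4_bits / 8 / 1e9
other_gb = other_params * 2 / 1e9  # BF16 = 2 bytes/param
total_gb = expert_gb + other_gb
print(round(total_gb))  # prints 233
```

At 0.5625 bytes per expert weight versus 2 bytes in BF16, the quantized experts alone shrink by roughly 3.5x, which dominates the total because the experts are the vast majority of parameters.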
## Calibration methodology
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. Because natural routing risks leaving rarely-used experts under-calibrated, calibration was run on a much larger number of samples than typical (~16k+ samples across multiple datasets) to ensure broad expert coverage through natural routing alone.
Calibration datasets:
- Agentic coding — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
- Multimodal VQA — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
- Deep reasoning — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
- Diverse instruction following — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)
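The relationship between calibration volume and expert coverage can be simulated. This sketch uses a hypothetical random router (real routing is learned and far less uniform, which is exactly why extra samples are needed), just to show how coverage grows with token count:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k = 512, 10

def experts_hit(num_tokens):
    """Count distinct experts activated under simulated top-k routing."""
    hit = np.zeros(num_experts, dtype=bool)
    # hypothetical router: random logits per token, top-10 of 512 selected
    logits = rng.standard_normal((num_tokens, num_experts))
    top = np.argpartition(logits, -top_k, axis=1)[:, -top_k:]
    hit[np.unique(top)] = True
    return int(hit.sum())

# more calibration tokens -> broader expert coverage per layer
print(experts_hit(50), experts_hit(5000))
```

With only a few dozen tokens many experts go unseen even under uniform routing; with thousands, coverage saturates at all 512, matching the outcome reported below.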
All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 out of 184k routed expert quantizers (0.14%) had their amaxes floored to median/10 of their peer group to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.
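The amax-flooring step described above can be sketched as follows. This is an illustration of the idea, not Model Optimizer's actual code; the peer group here is assumed to be the set of same-projection quantizers within one layer:

```python
import numpy as np

def floor_rare_expert_amaxes(amaxes, factor=10.0):
    """Floor per-expert amax values to median/factor of their peer group.

    Rarely-routed experts see few calibration tokens and can end up with
    implausibly tiny amaxes; flooring them stabilizes their scales.
    """
    floor = np.median(amaxes) / factor
    return np.maximum(amaxes, floor)

# one rarely-hit expert with a degenerate amax among its peers
amaxes = np.array([3.1, 2.9, 3.4, 0.002, 3.0])
print(floor_rare_expert_amaxes(amaxes))
```

Only outliers below median/10 are touched; well-calibrated experts keep their observed amaxes unchanged.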
## Quality
MMLU Pro results:
| Subject | Correct / Total | Accuracy |
|---|---|---|
| math | 1295/1351 | 95.9% |
| biology | 680/717 | 94.8% |
| physics | 1211/1299 | 93.2% |
| chemistry | 1045/1132 | 92.3% |
| business | 724/789 | 91.8% |
| economics | 762/844 | 90.3% |
| computer_science | 368/410 | 89.8% |
| psychology | 702/798 | 88.0% |
| philosophy | 431/499 | 86.4% |
| engineering | 833/969 | 86.0% |
| other | 767/924 | 83.0% |
| health | 665/818 | 81.3% |
| history | 299/381 | 78.5% |
| law | 812/1101 | 73.8% |
| TOTAL | 10594/12032 | 88.0% |
(BF16 baseline is 87.8%)
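The quoted totals can be checked directly from the per-subject rows:

```python
# Per-subject (correct, total) pairs copied from the table above
rows = {
    "math": (1295, 1351), "biology": (680, 717), "physics": (1211, 1299),
    "chemistry": (1045, 1132), "business": (724, 789), "economics": (762, 844),
    "computer_science": (368, 410), "psychology": (702, 798),
    "engineering": (833, 969), "philosophy": (431, 499), "other": (767, 924),
    "health": (665, 818), "history": (299, 381), "law": (812, 1101),
}
correct = sum(c for c, _ in rows.values())
total = sum(t for _, t in rows.values())
print(correct, total, f"{100 * correct / total:.1f}%")  # 10594 12032 88.0%
```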
## How to Run
### SGLang
Tested on 4x RTX Pro 6000 Blackwell GPUs over PCIe.
```shell
python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000
```
Notes:
- The MTP draft model is included for speculative decoding (`--speculative-algo NEXTN`).
- If you experience NCCL hangs with P2P, make sure you have `iommu=pt` (and `amd_iommu=pt` on AMD platforms) on your kernel command line.
- Vision/multimodal inference is supported; the full ViT encoder is included unquantized.
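Once the server is up, it speaks the OpenAI-compatible chat API. A minimal request sketch, assuming the host/port from the launch command above (the model name must match `--served-model-name`):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "Qwen3.5",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 quantization in one sentence."}
    ],
    "max_tokens": 128,
}
# e.g. with the `requests` package:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```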
## License
Apache 2.0, following the base model.