## Model Description
Qwen3.5-397B-A17B-NVFP4 is an NVFP4-quantized version of Qwen/Qwen3.5-397B-A17B, a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).
The original BF16 weights were quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer v0.37.0.
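The core idea of NVFP4 can be sketched numerically: each 16-element block gets one scale derived from its absolute maximum, and each element is rounded to the nearest 4-bit E2M1 value. The sketch below is a simplified illustration (real NVFP4 also stores the block scale itself in FP8 E4M3 with a per-tensor FP32 scale, which is omitted here):

```python
import numpy as np

# Representable E2M1 (4-bit float) magnitudes
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(block):
    """Fake-quantize one 16-element block: blockwise scale + 4-bit values."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0  # map amax onto the largest FP4 value
    scaled = block / scale
    # round each scaled value to the nearest representable E2M1 magnitude
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale  # dequantized back to float for comparison

w = np.random.randn(16).astype(np.float32)
wq = quantize_nvfp4_block(w)
```

Because the block's amax maps exactly onto the largest FP4 value, the worst-case absolute error within a block is bounded by amax/6 (half the widest gap in the E2M1 grid, times the scale).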
## What's quantized
Only the routed MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:
- Shared expert MLPs (these process every token, unlike routed experts; thanks to Festr for pointing this out)
- Attention layers (softmax attention and DeltaNet linear attention)
- Vision encoder (ViT)
- Router / gate weights
- MTP (multi-token prediction) draft model
- Embeddings, layer norms, lm_head
Since the expert weights constitute the vast majority of the 397B parameters, this still yields significant memory savings (~233 GB on disk).
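A back-of-envelope check of that figure, assuming (hypothetically) that roughly 390B of the 397B parameters sit in routed expert weights and the rest stays in BF16:

```python
# NVFP4 cost per weight: 4 bits + one 8-bit scale shared across 16 elements
expert_params = 390e9   # hypothetical split: routed expert weights
other_params = 7e9      # hypothetical split: everything left in BF16
nvfp4_bits = 4 + 8 / 16

expert_gb = expert_params * nvfp4_bits / 8 / 1e9
other_gb = other_params * 2 / 1e9  # BF16 = 2 bytes/param
total_gb = expert_gb + other_gb
print(round(total_gb))  # prints 233
```

At 0.5625 bytes per expert weight versus 2 bytes in BF16, the quantized experts alone shrink by roughly 3.5x, which dominates the total because the experts are the vast majority of parameters.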
## Calibration methodology
Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. Because natural routing risks leaving rarely-used experts under-calibrated, calibration was run on a much larger number of samples than typical (~16k+ samples across multiple datasets) to ensure broad expert coverage through natural routing alone.
Calibration datasets:
- Agentic coding — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
- Multimodal VQA — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
- Deep reasoning — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
- Diverse instruction following — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)
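The relationship between calibration volume and expert coverage can be simulated. This sketch uses a hypothetical random router (real routing is learned and far less uniform, which is exactly why extra samples are needed), just to show how coverage grows with token count:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k = 512, 10

def experts_hit(num_tokens):
    """Count distinct experts activated under simulated top-k routing."""
    hit = np.zeros(num_experts, dtype=bool)
    # hypothetical router: random logits per token, top-10 of 512 selected
    logits = rng.standard_normal((num_tokens, num_experts))
    top = np.argpartition(logits, -top_k, axis=1)[:, -top_k:]
    hit[np.unique(top)] = True
    return int(hit.sum())

# more calibration tokens -> broader expert coverage per layer
print(experts_hit(50), experts_hit(5000))
```

With only a few dozen tokens many experts go unseen even under uniform routing; with thousands, coverage saturates at all 512, matching the outcome reported below.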
All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 out of 184k routed expert quantizers (0.14%) had their amaxes floored to median/10 of their peer group to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.
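The amax-flooring step described above can be sketched as follows. This is an illustration of the idea, not Model Optimizer's actual code; the peer group here is assumed to be the set of same-projection quantizers within one layer:

```python
import numpy as np

def floor_rare_expert_amaxes(amaxes, factor=10.0):
    """Floor per-expert amax values to median/factor of their peer group.

    Rarely-routed experts see few calibration tokens and can end up with
    implausibly tiny amaxes; flooring them stabilizes their scales.
    """
    floor = np.median(amaxes) / factor
    return np.maximum(amaxes, floor)

# one rarely-hit expert with a degenerate amax among its peers
amaxes = np.array([3.1, 2.9, 3.4, 0.002, 3.0])
print(floor_rare_expert_amaxes(amaxes))
```

Only outliers below median/10 are touched; well-calibrated experts keep their observed amaxes unchanged.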
## Quality
MMLU Pro results:
| Subject | Correct / Total | Accuracy |
|---|---|---|
| math | 1295/1351 | 95.9% |
| biology | 680/717 | 94.8% |
| physics | 1211/1299 | 93.2% |
| chemistry | 1045/1132 | 92.3% |
| business | 724/789 | 91.8% |
| economics | 762/844 | 90.3% |
| computer_science | 368/410 | 89.8% |
| psychology | 702/798 | 88.0% |
| philosophy | 431/499 | 86.4% |
| engineering | 833/969 | 86.0% |
| other | 767/924 | 83.0% |
| health | 665/818 | 81.3% |
| history | 299/381 | 78.5% |
| law | 812/1101 | 73.8% |
| TOTAL | 10594/12032 | 88.0% |
(BF16 baseline is 87.8%)
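The quoted totals can be checked directly from the per-subject rows:

```python
# Per-subject (correct, total) pairs copied from the table above
rows = {
    "math": (1295, 1351), "biology": (680, 717), "physics": (1211, 1299),
    "chemistry": (1045, 1132), "business": (724, 789), "economics": (762, 844),
    "computer_science": (368, 410), "psychology": (702, 798),
    "engineering": (833, 969), "philosophy": (431, 499), "other": (767, 924),
    "health": (665, 818), "history": (299, 381), "law": (812, 1101),
}
correct = sum(c for c, _ in rows.values())
total = sum(t for _, t in rows.values())
print(correct, total, f"{100 * correct / total:.1f}%")  # 10594 12032 88.0%
```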
## How to Run
### SGLang
Tested on 4x RTX Pro 6000 Blackwell GPUs over PCIe.
```shell
python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000
```
Notes:
- The MTP draft model is included for speculative decoding (`--speculative-algo NEXTN`).
- If you experience NCCL hangs with P2P, make sure you have `iommu=pt` (and `amd_iommu=pt` on AMD platforms) on your kernel command line.
- Vision/multimodal inference is supported; the full ViT encoder is included unquantized.
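Once the server is up, it speaks the OpenAI-compatible chat API. A minimal request sketch, assuming the host/port from the launch command above (the model name must match `--served-model-name`):

```python
import json

# Request body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "Qwen3.5",
    "messages": [
        {"role": "user", "content": "Explain NVFP4 quantization in one sentence."}
    ],
    "max_tokens": 128,
}
# e.g. with the `requests` package:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```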
## License
Apache 2.0, following the base model.