Model Description

Qwen3.5-397B-A17B-NVFP4 is an NVFP4-quantized version of Qwen/Qwen3.5-397B-A17B, a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).

The original BF16 weights were quantized to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using NVIDIA Model Optimizer v0.37.0.
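A minimal sketch of the blockwise scheme, for intuition only (a real NVFP4 kernel stores 4-bit FP4 E2M1 codes plus an FP8 E4M3 scale per 16-element block; the grid below lists the E2M1-representable magnitudes, but Model Optimizer's actual implementation differs in detail):

```python
# Illustrative blockwise 4-bit quantization in the NVFP4 style.
# FP4 (E2M1) can represent exactly these magnitudes:
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 uses one scale per 16 elements

def quantize_block(block):
    """One 16-element block: per-block scale + nearest-FP4 rounding."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # 6.0 = largest |FP4| value
    q = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        q.append(mag if x >= 0 else -mag)
    return scale, q  # real NVFP4 would also quantize `scale` to FP8

def dequantize_block(scale, q):
    return [scale * v for v in q]

weights = [0.31, -0.07, 1.2, 0.0, -2.5, 0.9, 0.05, -0.6,
           1.7, -1.1, 0.4, 2.2, -0.02, 0.8, -3.0, 0.15]
scale, q = quantize_block(weights)
recon = dequantize_block(scale, q)
```

The per-block scale is what makes 4 bits workable: each 16-element block is normalized to its own dynamic range, so one outlier only degrades its own block.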

What's quantized

Only the routed MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:

  • Shared expert MLPs (these process every token, unlike routed experts; thanks to Festr for pointing this out)
  • Attention layers (softmax attention and DeltaNet linear attention)
  • Vision encoder (ViT)
  • Router / gate weights
  • MTP (multi-token prediction) draft model
  • Embeddings, layer norms, lm_head

Since the routed expert weights constitute the vast majority of the 397B parameters, quantizing only them still yields substantial memory savings: the checkpoint is ~233 GB on disk, versus ~794 GB for the BF16 original.
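As a rough sanity check on the ~233 GB figure, the arithmetic below uses an assumed routed-expert/other parameter split (the exact split is not stated here), 4.5 bits per NVFP4 weight (4-bit value plus one 8-bit scale per 16 elements), and 2 bytes per BF16 weight:

```python
# Back-of-the-envelope checkpoint size estimate. The 385B/12B split is an
# assumption for illustration, not a published figure.
TOTAL_PARAMS = 397e9
ROUTED_EXPERT_PARAMS = 385e9           # assumed: experts dominate the 397B
OTHER_PARAMS = TOTAL_PARAMS - ROUTED_EXPERT_PARAMS

nvfp4_bits = 4 + 8 / 16                # 4-bit weight + FP8 scale per 16 elems
expert_bytes = ROUTED_EXPERT_PARAMS * nvfp4_bits / 8
other_bytes = OTHER_PARAMS * 2         # BF16 = 2 bytes/param

total_gb = (expert_bytes + other_bytes) / 1e9
```

Under these assumptions the estimate lands in the same ballpark as the reported on-disk size.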

Calibration methodology

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate for the sparser per-expert coverage this produces, calibration was run on far more samples than is typical (~16k+ samples across multiple datasets), ensuring broad expert coverage through natural routing alone.

Calibration datasets:

  • Agentic coding — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
  • Multimodal VQA — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
  • Deep reasoning — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
  • Diverse instruction following — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)

All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 out of 184k routed expert quantizers (0.14%) had their amaxes floored to median/10 of their peer group to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.
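The amax flooring step can be sketched as follows (the function name and peer-group structure here are assumptions for illustration, not Model Optimizer internals): any expert quantizer whose calibrated amax falls below one tenth of its peer group's median is raised to that floor, so a rarely routed expert doesn't end up with a degenerate scale.

```python
import statistics

# Illustrative sketch of post-calibration amax flooring: amaxes below
# median(peer group)/ratio are raised to that floor.
def floor_amaxes(amaxes, ratio=10.0):
    floor = statistics.median(amaxes) / ratio
    n_floored = sum(1 for a in amaxes if a < floor)
    return [max(a, floor) for a in amaxes], n_floored

# One peer group with a rarely-hit expert whose amax is suspiciously small:
peer_group = [2.1, 1.9, 2.4, 0.03, 2.0, 1.8]
floored, n = floor_amaxes(peer_group)
```

Only the outlier is touched; well-calibrated experts keep their measured amaxes.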

Quality

MMLU Pro results:

Subject            Score         Accuracy
math               1295/1351     95.9%
biology             680/717      94.8%
physics            1211/1299     93.2%
chemistry          1045/1132     92.3%
business            724/789      91.8%
economics           762/844      90.3%
computer_science    368/410      89.8%
psychology          702/798      88.0%
philosophy          431/499      86.4%
engineering         833/969      86.0%
other               767/924      83.0%
health              665/818      81.3%
history             299/381      78.5%
law                 812/1101     73.8%
TOTAL             10594/12032    88.0%

(BF16 baseline is 87.8%)
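The per-subject numbers above are internally consistent with the reported total, which is easy to verify:

```python
# Check that the per-subject MMLU Pro scores sum to the reported totals.
scores = {
    "math": (1295, 1351), "biology": (680, 717), "physics": (1211, 1299),
    "chemistry": (1045, 1132), "business": (724, 789), "economics": (762, 844),
    "computer_science": (368, 410), "psychology": (702, 798),
    "philosophy": (431, 499), "engineering": (833, 969), "other": (767, 924),
    "health": (665, 818), "history": (299, 381), "law": (812, 1101),
}
correct = sum(c for c, _ in scores.values())   # 10594
total = sum(t for _, t in scores.values())     # 12032
accuracy = 100 * correct / total               # ≈ 88.0%
```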

How to Run

SGLang

Tested on 4x RTX Pro 6000 Blackwell GPUs connected via PCIe.

python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000

Notes:

  • The MTP draft model is included for speculative decoding (--speculative-algo NEXTN).
  • If you experience NCCL hangs with P2P, make sure you have iommu=pt (and amd_iommu=pt on AMD platforms) in your kernel command line.
  • Vision/multimodal inference is supported — the full ViT encoder is included unquantized.
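With the server above running, requests go to SGLang's OpenAI-compatible chat completions endpoint. A minimal sketch of the request shape (host/port follow the launch flags; the prompt and max_tokens value are arbitrary examples):

```python
import json

# OpenAI-compatible chat request payload for the SGLang server launched above.
payload = {
    "model": "Qwen3.5",   # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Explain NVFP4 in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)

# Sending it, e.g. with urllib (requires the server to be up):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```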

License

Apache 2.0, following the base model.
