---
base_model:
- Qwen/Qwen3.5-397B-A17B
license: apache-2.0
tags:
- qwen3.5
- moe
- quantized
- nvfp4
- fp4
- multimodal
- vision
pipeline_tag: text-generation
---

## Model Description

**Qwen3.5-397B-A17B-NVFP4** is an NVFP4-quantized version of [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B), a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).

The original BF16 weights were quantized to NVFP4 (4-bit weights with blockwise FP8 scales, one scale per 16 elements) using [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) v0.37.0.

### What's quantized

Only the **routed** MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:

- **Shared expert MLPs** (these process every token, unlike routed experts; thanks to Festr for pointing this out)
- **Attention layers** (softmax attention and DeltaNet linear attention)
- **Vision encoder** (ViT)
- **Router / gate weights**
- **MTP (multi-token prediction) draft model**
- **Embeddings, layer norms, lm_head**

Since the routed expert weights constitute the vast majority of the 397B parameters, this still yields significant memory savings (~233 GB on disk).

### Calibration methodology

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than is typical (~16k+ samples across multiple datasets) to ensure broad expert coverage through natural routing alone.
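The NVFP4 rounding step can be illustrated in a few lines. This is a minimal NumPy sketch, not Model Optimizer's implementation: it fake-quantizes one 16-element block using the FP4 (E2M1) value grid and a per-block scale, and omits the FP8 encoding of the scale and the 2-codes-per-byte packing of the real format. The helper name `quantize_block_nvfp4` is illustrative.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the NVFP4 element format.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray) -> np.ndarray:
    """Fake-quantize one 16-element block: per-block scale + FP4 rounding.

    Simplified sketch only; the real format stores the scale in FP8 (E4M3)
    and packs two 4-bit codes per byte. Here we model just the rounding.
    """
    assert block.size == 16
    amax = np.abs(block).max()
    if amax == 0:
        return block.copy()
    scale = amax / FP4_VALUES[-1]          # map amax onto FP4's max magnitude (6.0)
    scaled = np.abs(block) / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(scaled[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_VALUES[idx] * scale

w = np.array([0.01, -0.2, 0.5, 1.2, -0.03, 0.7, -1.5, 0.9,
              0.05, -0.6, 0.3, 1.0, -0.8, 0.4, -0.1, 2.0])
wq = quantize_block_nvfp4(w)
print(np.abs(w - wq).max())  # worst-case rounding error within this block
```

Because the scale is chosen per 16-element block rather than per tensor, a single outlier weight only degrades the resolution of its own block, which is why the per-expert calibration of amax values described above matters.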
Calibration datasets:

- **Agentic coding** — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
- **Multimodal VQA** — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
- **Deep reasoning** — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
- **Diverse instruction following** — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)

All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 of the ~184k routed expert quantizers (~0.14%) had their amax values floored to one-tenth of their peer group's median to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.

### Quality

MMLU-Pro results:

| Subject | Score | Accuracy |
|---|---|---|
| math | 1295/1351 | **95.9%** |
| biology | 680/717 | **94.8%** |
| physics | 1211/1299 | **93.2%** |
| chemistry | 1045/1132 | **92.3%** |
| business | 724/789 | **91.8%** |
| economics | 762/844 | **90.3%** |
| computer_science | 368/410 | **89.8%** |
| psychology | 702/798 | **88.0%** |
| philosophy | 431/499 | **86.4%** |
| engineering | 833/969 | **86.0%** |
| other | 767/924 | **83.0%** |
| health | 665/818 | **81.3%** |
| history | 299/381 | **78.5%** |
| law | 812/1101 | **73.8%** |
| **TOTAL** | **10594/12032** | **88.0%** |

(BF16 baseline: 87.8%)

### How to Run

#### SGLang

Tested on 4x RTX Pro 6000 Blackwell over PCIe.
```bash
python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000
```

Notes:

- The MTP draft model is included for speculative decoding (`--speculative-algo NEXTN`).
- If you experience NCCL hangs with P2P, make sure `iommu=pt` (and `amd_iommu=pt` on AMD platforms) is on your kernel command line.
- Vision/multimodal inference is supported — the full ViT encoder is included unquantized.

### License

[Apache 2.0](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE), following the base model.
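Once launched, the server exposes an OpenAI-compatible HTTP API. Below is a minimal stdlib-only client sketch; the `build_chat_request` and `chat` helpers are illustrative (not part of SGLang), and the base URL, port, and model name are assumptions taken from the launch command above.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "Qwen3.5",  # matches --served-model-name above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Usage (requires the server above to be running):
#   print(chat("Briefly explain NVFP4 quantization."))
```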