---
base_model:
- Qwen/Qwen3.5-397B-A17B
license: apache-2.0
tags:
- qwen3.5
- moe
- quantized
- nvfp4
- fp4
- multimodal
- vision
pipeline_tag: text-generation
---

## Model Description

**Qwen3.5-397B-A17B-NVFP4** is an NVFP4-quantized version of [Qwen/Qwen3.5-397B-A17B](https://huggingface.co/Qwen/Qwen3.5-397B-A17B), a 397B-parameter Mixture-of-Experts vision-language model with 17B active parameters, 512 experts per layer (10 active), and hybrid attention (softmax + linear/DeltaNet).

The original BF16 weights were quantized to NVFP4 (4-bit weights with blockwise FP8 scales, one scale per 16 elements) using [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) v0.37.0.

### What's quantized

Only the **routed** MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. The following are left unquantized in BF16:

- **Shared expert MLPs** (these process every token, unlike routed experts; thanks to Festr for pointing this out)
- **Attention layers** (softmax attention and DeltaNet linear attention)
- **Vision encoder** (ViT)
- **Router / gate weights**
- **MTP (multi-token prediction) draft model**
- **Embeddings, layer norms, lm_head**

Since the routed expert weights constitute the vast majority of the 397B parameters, this still yields significant memory savings (~233 GB on disk).

### Calibration methodology

Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than is typical (~16k+ samples across multiple datasets) to ensure broad expert coverage through natural routing alone.
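The NVFP4 rounding step can be illustrated in a few lines. This is a minimal NumPy sketch, not Model Optimizer's implementation: it fake-quantizes one 16-element block using the FP4 (E2M1) value grid and a per-block scale, and omits the FP8 encoding of the scale and the 2-codes-per-byte packing of the real format. The helper name `quantize_block_nvfp4` is illustrative.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1, the NVFP4 element format.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block: np.ndarray) -> np.ndarray:
    """Fake-quantize one 16-element block: per-block scale + FP4 rounding.

    Simplified sketch only; the real format stores the scale in FP8 (E4M3)
    and packs two 4-bit codes per byte. Here we model just the rounding.
    """
    assert block.size == 16
    amax = np.abs(block).max()
    if amax == 0:
        return block.copy()
    scale = amax / FP4_VALUES[-1]          # map amax onto FP4's max magnitude (6.0)
    scaled = np.abs(block) / scale
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(scaled[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_VALUES[idx] * scale

w = np.array([0.01, -0.2, 0.5, 1.2, -0.03, 0.7, -1.5, 0.9,
              0.05, -0.6, 0.3, 1.0, -0.8, 0.4, -0.1, 2.0])
wq = quantize_block_nvfp4(w)
print(np.abs(w - wq).max())  # worst-case rounding error within this block
```

Because the scale is chosen per 16-element block rather than per tensor, a single outlier weight only degrades the resolution of its own block, which is why the per-expert calibration of amax values described above matters.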
Calibration datasets:

- **Agentic coding** — function calling, tool use, and code generation samples (4096 samples, max 1024 tokens)
- **Multimodal VQA** — image + text visual question answering from COCO val2017 (~200 samples, max 2048 tokens)
- **Deep reasoning** — long-context math, science, and multi-step reasoning (all samples, max 8192 tokens)
- **Diverse instruction following** — multilingual, multi-domain instruction/response pairs (4096 samples, max 1024 tokens)

All 512 routed experts per layer were activated during calibration — no experts were missed entirely. Post-calibration, 256 of the ~184k routed expert quantizers (~0.14%) had their amax values floored to one-tenth of their peer group's median to stabilize rarely-hit experts. Gate/up projection weight scales were tied for fused w13 export compatibility.

### Quality

MMLU-Pro results:

| Subject | Score | Accuracy |
|---|---|---|
| math | 1295/1351 | **95.9%** |
| biology | 680/717 | **94.8%** |
| physics | 1211/1299 | **93.2%** |
| chemistry | 1045/1132 | **92.3%** |
| business | 724/789 | **91.8%** |
| economics | 762/844 | **90.3%** |
| computer_science | 368/410 | **89.8%** |
| psychology | 702/798 | **88.0%** |
| philosophy | 431/499 | **86.4%** |
| engineering | 833/969 | **86.0%** |
| other | 767/924 | **83.0%** |
| health | 665/818 | **81.3%** |
| history | 299/381 | **78.5%** |
| law | 812/1101 | **73.8%** |
| **TOTAL** | **10594/12032** | **88.0%** |

(BF16 baseline: 87.8%)

### How to Run

#### SGLang

Tested on 4x RTX Pro 6000 Blackwell over PCIe.
```bash
python3 -m sglang.launch_server \
  --model lukealonso/Qwen3.5-397B-A17B-NVFP4 \
  --served-model-name Qwen3.5 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --tensor-parallel-size 4 \
  --expert-parallel-size 4 \
  --quantization modelopt_fp4 \
  --kv-cache-dtype fp8_e4m3 \
  --trust-remote-code \
  --attention-backend triton \
  --moe-runner-backend flashinfer_cutlass \
  --fp4-gemm-backend flashinfer_cudnn \
  --cuda-graph-max-bs 10 \
  --max-running-requests 10 \
  --chunked-prefill-size 32768 \
  --speculative-algo NEXTN \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 6 \
  --mamba-scheduler-strategy extra_buffer \
  --page-size 64 \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0 --port 8000
```

Notes:

- The MTP draft model is included for speculative decoding (`--speculative-algo NEXTN`).
- If you experience NCCL hangs with P2P, make sure `iommu=pt` (and `amd_iommu=pt` on AMD platforms) is on your kernel command line.
- Vision/multimodal inference is supported — the full ViT encoder is included unquantized.

### License

[Apache 2.0](https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE), following the base model.
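Once launched, the server exposes an OpenAI-compatible HTTP API. Below is a minimal stdlib-only client sketch; the `build_chat_request` and `chat` helpers are illustrative (not part of SGLang), and the base URL, port, and model name are assumptions taken from the launch command above.

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "Qwen3.5",  # matches --served-model-name above
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Usage (requires the server above to be running):
#   print(chat("Briefly explain NVFP4 quantization."))
```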