---
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- quantized
- nvfp4
- vllm
- dgx-spark
- qwen3.5
- deltanet
library_name: transformers
quantization: compressed-tensors
---

# Qwen3.5-27B-NVFP4-Full (W4A4)

NVFP4 quantization of [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) with **all linear layers quantized**, including the DeltaNet linear-attention projections that are typically excluded.

## Key differences from standard NVFP4 checkpoints

| | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| **DeltaNet (`in_proj_qkv`, `in_proj_z`, `out_proj`)** | **BF16** | **FP4** |
| DeltaNet (`in_proj_a`, `in_proj_b`) | BF16 | BF16 (N=48, below the CUTLASS tile minimum) |
| Model size | 27 GB | **20 GB** |

## Performance (DGX Spark / GB10 / SM121)

Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:

| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | **11.98 tok/s** | **+51%** |
| Decode @ d4096 | 7.66 tok/s | **11.90 tok/s** | **+55%** |
| Decode @ d8192 | 7.92 tok/s | **11.80 tok/s** | **+49%** |
| Prefill (pp2048) | 1855 tok/s | **2383 tok/s** | **+28%** |

The speedup comes from eliminating ~5 GB of BF16 weight loads per decoded token in the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.

## Quantization details

- **Method**: llm-compressor `oneshot` with calibrated NVFP4 (W4A4)
- **Calibration**: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- **Format**: compressed-tensors `nvfp4-pack-quantized` with calibrated `input_global_scale`
- **Excluded layers**: `in_proj_a`, `in_proj_b` (N=48; CUTLASS FP4 requires N % 64 == 0), `conv1d` (3D weight), norms, `A_log`, `dt_bias`, `lm_head`, `embed_tokens`

A reproduction sketch is included at the end of this card.

## Usage

### vLLM (recommended)

Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.

```bash
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

A minimal client example is included at the end of this card.

### Quality notes

FP4 activation quantization of DeltaNet layers has been widely assumed to be destructive to model quality. Our analysis shows the quantization error on these layers (SNR ~24 dB, relative error ~26%) is in line with that of the other quantized layer types (likewise SNR ~24 dB, relative error ~26%), and the model produces coherent output with its reasoning capabilities intact.

## Required llm-compressor fix

Quantizing the DeltaNet layers requires [vllm-project/llm-compressor#2566](https://github.com/vllm-project/llm-compressor/pull/2566), which fixes `model_free_ptq` for models with non-contiguous fused attention layers (Qwen3.5's interleaved `self_attn` + `linear_attn` architecture).

## Acknowledgments

- [Sehyo](https://huggingface.co/Sehyo) for the original Qwen3.5 NVFP4 quantization work and llm-compressor PR #2383
- [eugr](https://github.com/eugr) for the spark-vllm-docker infrastructure
- Built on DGX Spark (GB10, SM121)
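
## Reproducing the quantization (sketch)

The script below is a minimal sketch of the recipe described under "Quantization details", not the exact script used to produce this checkpoint. It follows llm-compressor's standard NVFP4 `oneshot` example and assumes a build with [#2566](https://github.com/vllm-project/llm-compressor/pull/2566) applied. The `re:` ignore patterns for the DeltaNet `in_proj_a`/`in_proj_b` modules are illustrative; the exact module paths in Qwen3.5 may differ.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-27B"
SAVE_DIR = "Qwen3.5-27B-NVFP4-Full"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration set: 256 chat samples rendered through the chat template,
# then tokenized to at most 4096 tokens each.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"
).shuffle(seed=42)
ds = ds.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)
ds = ds.map(
    lambda ex: tokenizer(
        ex["text"],
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# targets="Linear" already skips conv1d, norms, A_log, and dt_bias (none are
# nn.Linear modules), so only the Linear-layer exclusions go in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*embed_tokens$",
        "re:.*in_proj_a$",  # N=48, below the CUTLASS FP4 tile minimum
        "re:.*in_proj_b$",  # N=48, below the CUTLASS FP4 tile minimum
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```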
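
To sanity-check what actually ended up excluded in the published checkpoint, the `ignore` list is recorded in `config.json` under `quantization_config`. A quick check (key layout per compressed-tensors conventions, which may shift across versions):

```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included", "config.json")
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg["format"])  # expected: nvfp4-pack-quantized
print(qcfg["ignore"])  # excluded modules (in_proj_a, in_proj_b, norms, ...)
```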
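
## Example client request

`vllm serve` exposes an OpenAI-compatible endpoint, so the served model can be queried with the standard `openai` client. A minimal sketch, assuming the default host and port (`localhost:8000`) and no API key configured:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```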