---
license: apache-2.0
base_model: Qwen/Qwen3.5-27B
tags:
- quantized
- nvfp4
- vllm
- dgx-spark
- qwen3.5
- deltanet
library_name: transformers
quantization: compressed-tensors
---

# Qwen3.5-27B-NVFP4-Full (W4A4)

NVFP4 quantization of [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) with **all linear layers quantized**, including the DeltaNet linear-attention projections that are typically excluded.

## Key differences from standard NVFP4 checkpoints

| | Standard NVFP4 (e.g., Sehyo) | This checkpoint |
|---|---|---|
| MoE experts | FP4 | FP4 |
| Shared experts | FP4 | FP4 |
| Self-attention (q/k/v/o) | FP4 | FP4 |
| **DeltaNet (`in_proj_qkv`, `in_proj_z`, `out_proj`)** | **BF16** | **FP4** |
| DeltaNet (`in_proj_a`, `in_proj_b`) | BF16 | BF16 (N=48, below the CUTLASS tile minimum) |
| Model size | 27 GB | **20 GB** |

## Performance (DGX Spark / GB10 / SM121)

Measured with vLLM 0.19.1 + FlashInfer 0.6.7, CUTLASS W4A4 backend, no MTP:

| Metric | Standard NVFP4 | This checkpoint | Improvement |
|---|---|---|---|
| Decode (tg32) | 7.93 tok/s | **11.98 tok/s** | **+51%** |
| Decode @ d4096 | 7.66 tok/s | **11.90 tok/s** | **+55%** |
| Decode @ d8192 | 7.92 tok/s | **11.80 tok/s** | **+49%** |
| Prefill (pp2048) | 1855 tok/s | **2383 tok/s** | **+28%** |

The speedup comes from eliminating ~5 GB of BF16 weight loads per decoded token in the DeltaNet layers, replacing them with ~1.4 GB of FP4 loads.

## Quantization details

- **Method**: llm-compressor `oneshot` with calibrated NVFP4 (W4A4)
- **Calibration**: 256 samples from HuggingFaceH4/ultrachat_200k, max_seq_length=4096
- **Format**: compressed-tensors `nvfp4-pack-quantized` with calibrated `input_global_scale`
- **Excluded layers**: `in_proj_a`, `in_proj_b` (N=48; CUTLASS FP4 requires N % 64 == 0), `conv1d` (3D weight), norms, `A_log`, `dt_bias`, `lm_head`, `embed_tokens`

A reproduction sketch is included at the end of this card.

## Usage

### vLLM (recommended)

Requires vLLM >= 0.19.1 with PR #38423 (W4A4 SM120/SM121 support) and FlashInfer >= 0.6.7.

```bash
vllm serve rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```

A minimal client example is included at the end of this card.

### Quality notes

FP4 activation quantization of DeltaNet layers has been widely assumed to be destructive to model quality. Our analysis shows the quantization error on these layers (SNR ~24 dB, relative error ~26%) is in line with that of the other quantized layer types (likewise SNR ~24 dB, relative error ~26%), and the model produces coherent output with its reasoning capabilities intact.

## Required llm-compressor fix

Quantizing the DeltaNet layers requires [vllm-project/llm-compressor#2566](https://github.com/vllm-project/llm-compressor/pull/2566), which fixes `model_free_ptq` for models with non-contiguous fused attention layers (Qwen3.5's interleaved `self_attn` + `linear_attn` architecture).

## Acknowledgments

- [Sehyo](https://huggingface.co/Sehyo) for the original Qwen3.5 NVFP4 quantization work and llm-compressor PR #2383
- [eugr](https://github.com/eugr) for the spark-vllm-docker infrastructure
- Built on DGX Spark (GB10, SM121)
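
## Reproducing the quantization (sketch)

The script below is a minimal sketch of the recipe described under "Quantization details", not the exact script used to produce this checkpoint. It follows llm-compressor's standard NVFP4 `oneshot` example and assumes a build with [#2566](https://github.com/vllm-project/llm-compressor/pull/2566) applied. The `re:` ignore patterns for the DeltaNet `in_proj_a`/`in_proj_b` modules are illustrative; the exact module paths in Qwen3.5 may differ.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-27B"
SAVE_DIR = "Qwen3.5-27B-NVFP4-Full"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Calibration set: 256 chat samples rendered through the chat template,
# then tokenized to at most 4096 tokens each.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]"
).shuffle(seed=42)
ds = ds.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)
ds = ds.map(
    lambda ex: tokenizer(
        ex["text"],
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# targets="Linear" already skips conv1d, norms, A_log, and dt_bias (none are
# nn.Linear modules), so only the Linear-layer exclusions go in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*embed_tokens$",
        "re:.*in_proj_a$",  # N=48, below the CUTLASS FP4 tile minimum
        "re:.*in_proj_b$",  # N=48, below the CUTLASS FP4 tile minimum
    ],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```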
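
To sanity-check what actually ended up excluded in the published checkpoint, the `ignore` list is recorded in `config.json` under `quantization_config`. A quick check (key layout per compressed-tensors conventions, which may shift across versions):

```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download("rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included", "config.json")
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg["format"])  # expected: nvfp4-pack-quantized
print(qcfg["ignore"])  # excluded modules (in_proj_a, in_proj_b, norms, ...)
```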
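
## Example client request

`vllm serve` exposes an OpenAI-compatible endpoint, so the served model can be queried with the standard `openai` client. A minimal sketch, assuming the default host and port (`localhost:8000`) and no API key configured:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="rdtand/Qwen3.5-27B-NVFP4-DeltaNet-Included",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```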