Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-NVFP4
Overview
This is a partial NVFP4 quantization of Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v by lightx2v, produced using convert_to_quant by silveroxides.
Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v is an image-to-video generation model built on Wan2.1-I2V-14B-480P. It applies step distillation and classifier-free guidance distillation to reduce inference to 4 steps without CFG, cutting generation time substantially while preserving output quality.
IMPORTANT
NVFP4 is only supported on NVIDIA Blackwell architecture GPUs. Running this model therefore requires a Blackwell GPU with NVFP4 support enabled in torch, along with a recent version of ComfyUI and a comfy-kitchen build compiled against CUDA 13.
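A quick way to verify the hardware requirement is to check the GPU's CUDA compute capability: Blackwell parts report major version 10 (datacenter) or 12 (consumer, e.g. the RTX 50 series). The helper below is a minimal sketch, not part of any shipped tool; the `supports_nvfp4` name is hypothetical.

```python
def supports_nvfp4(compute_capability):
    """NVFP4 tensor-core kernels require Blackwell, i.e. CUDA compute
    capability with major version 10 or newer (12.x for RTX 50-series)."""
    major, _minor = compute_capability
    return major >= 10

# In a CUDA-enabled environment, query the capability with:
#   import torch
#   supports_nvfp4(torch.cuda.get_device_capability(0))

print(supports_nvfp4((12, 0)))  # Blackwell consumer (e.g. RTX 5060) -> True
print(supports_nvfp4((8, 9)))   # Ada Lovelace -> False
```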
Quantization
The model weights have been partially quantized to NVFP4 (NVIDIA Floating Point 4-bit), a quantization format supported on NVIDIA Blackwell architecture GPUs. Of the 480 layers eligible for quantization, only a subset has been quantized to NVFP4; the remaining eligible layers are either quantized to FP8 or kept in BF16 to preserve output quality.
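For intuition, NVFP4 stores weights as 4-bit E2M1 values in blocks of 16 elements, each block sharing a higher-precision scale factor. The sketch below illustrates the per-block rounding step only; it is not the convert_to_quant kernel, and the real format stores the block scales in FP8 E4M3 rather than full precision.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the sign bit is handled separately).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block):
    """Quantize one 16-element block: choose a scale so the largest magnitude
    maps to 6.0 (the E2M1 maximum), then round each scaled element to the
    nearest representable value. Illustrative sketch only."""
    scale = float(np.abs(block).max()) / 6.0 or 1.0
    signs = np.sign(block)
    mags = np.abs(block) / scale
    idx = np.abs(mags[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
    return signs * E2M1_VALUES[idx], scale

rng = np.random.default_rng(0)
w = rng.standard_normal(16).astype(np.float32)
q, s = quantize_block_nvfp4(w)
print(np.abs(w - q * s).max())  # worst-case rounding error for this block
```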
The quantization format assigned to each layer is based on a sensitivity analysis performed with a custom script, which scores each weight tensor using excess kurtosis, dynamic range, and aspect ratio. Thresholds are derived automatically from the model's own score distribution.
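The statistics named above can be computed per weight tensor along these lines. This is a hypothetical reimplementation of the idea, not the author's script: heavy-tailed weights (high excess kurtosis) and a wide dynamic range are classic signs that a tensor will quantize poorly at 4 bits.

```python
import numpy as np

def sensitivity_stats(w):
    """Score one 2-D weight tensor; assumes at least one nonzero element."""
    x = w.astype(np.float64).ravel()
    mu, sigma = x.mean(), x.std()
    # Excess kurtosis: 0 for a Gaussian, large for heavy-tailed weights.
    excess_kurtosis = ((x - mu) ** 4).mean() / sigma**4 - 3.0
    # Dynamic range in bits between the largest and smallest nonzero magnitude.
    nonzero = np.abs(x[x != 0])
    dynamic_range = np.log2(nonzero.max() / nonzero.min())
    aspect_ratio = max(w.shape) / min(w.shape)
    return {"excess_kurtosis": excess_kurtosis,
            "dynamic_range_bits": dynamic_range,
            "aspect_ratio": aspect_ratio}
```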
The analysis yields the following convert_to_quant parameters. The conversion takes about 140 minutes on an RTX 5060 and produces a 9.76 GB safetensors file.
```shell
convert_to_quant -i Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-bf16.safetensors \
  --nvfp4 --wan --comfy_quant --save-quant-metadata \
  --custom-type fp8 \
  --custom-layers "blocks\.(1|2|3)\.cross_attn\.k\.weight|blocks\.(6|8|9|10)\.cross_attn\.k\.weight|blocks\.(0|1|2|3)\.cross_attn\.v\.weight|blocks\.(6)\.cross_attn\.q\.weight|blocks\.(6|14)\.cross_attn\.o\.weight|blocks\.(0|1|2|3)\.cross_attn\.v_img\.weight|blocks\.(0|1|2|3)\.ffn\.0\.weight|blocks\.(36|37|38|39)\.ffn\.0\.weight" \
  --exclude-layers "blocks\.(4|5|7)\.cross_attn\.k\.weight|blocks\.(0)\.cross_attn\.q\.weight|blocks\.(5|7|9|10|11|12|19|20)\.cross_attn\.o\.weight" \
  -o Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-nvfp4.safetensors
```
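The two regex flags partition the eligible layers into three buckets: excluded layers stay in BF16, custom layers get FP8, and everything else becomes NVFP4. The sketch below demonstrates this selection logic on excerpts of the patterns above; the `assign_format` helper and the use of `fullmatch` are assumptions for illustration, not convert_to_quant internals.

```python
import re

# Excerpts of the --custom-layers and --exclude-layers patterns above.
CUSTOM_FP8 = re.compile(
    r"blocks\.(1|2|3)\.cross_attn\.k\.weight|blocks\.(0|1|2|3)\.ffn\.0\.weight")
EXCLUDE_BF16 = re.compile(r"blocks\.(4|5|7)\.cross_attn\.k\.weight")

def assign_format(name):
    """Excluded layers are left in BF16, custom layers become FP8,
    and any other eligible layer is quantized to NVFP4."""
    if EXCLUDE_BF16.fullmatch(name):
        return "BF16"
    if CUSTOM_FP8.fullmatch(name):
        return "FP8"
    return "NVFP4"

print(assign_format("blocks.2.cross_attn.k.weight"))  # FP8
print(assign_format("blocks.5.cross_attn.k.weight"))  # BF16
print(assign_format("blocks.20.ffn.0.weight"))        # NVFP4
```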
The table below details the quantization format applied per layer type across block ranges:
| Layer | 0–3 | 4–9 | 10–15 | 16–22 | 23–29 | 30–35 | 36–39 |
|---|---|---|---|---|---|---|---|
| self_attn.q | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| self_attn.k | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| self_attn.v | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| self_attn.o | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.q | BF16 (25%) / NVFP4 (75%) | FP8 (17%) / NVFP4 (83%) | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.k | FP8 (75%) / NVFP4 (25%) | BF16 (50%) / FP8 (50%) | FP8 (17%) / NVFP4 (83%) | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.v | FP8 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.o | NVFP4 | BF16 (50%) / FP8 (17%) / NVFP4 (33%) | BF16 (50%) / FP8 (17%) / NVFP4 (33%) | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.k_img | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| cross_attn.v_img | FP8 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
| ffn.0 | FP8 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | FP8 |
| ffn.2 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 | NVFP4 |
Inference
The model can be used in ComfyUI with the following parameters, based on the distilled model's own recommendations:
| Parameter | Value |
|---|---|
| Shift | 5.0 |
| Sampler | LCM |
| Scheduler | normal |
| CFG | 1.0 |
| Steps | 4 |
The sampler/scheduler combinations euler/simple and heun/linear_quadratic are also known to produce good results.
The model is designed to generate 81 frames and is not compatible with LoRAs. Sampling completes in under 60 seconds on an RTX 5060, making it possible to produce a full 81-frame video in under two minutes; with RIFE, those 81 frames convert to a 10-second video.
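The timing claim above works out as follows, assuming Wan2.1's native 16 fps output and 2x RIFE interpolation (both assumptions, since the card does not state them explicitly):

```python
# 81 native frames at 16 fps run about 5 s; 2x interpolation inserts one
# frame between each pair, yielding 161 frames, or roughly 10 s of video.
native_frames = 81
fps = 16
interpolated = 2 * native_frames - 1
print(native_frames / fps)   # seconds of native footage
print(interpolated / fps)    # seconds after 2x interpolation
```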
Abrupt camera movements or fast subject motion may produce artifacts. This is an inherent limitation of applying aggressive quantization to an already distilled model.
License Agreement
This model is licensed under the Apache 2.0 License. You retain full ownership of your generated content, but are solely responsible for its use in compliance with the license terms and applicable laws.
Acknowledgements
Big kudos to the contributors to the Wan2.1 and Self-Forcing repositories for their open research, and to silveroxides for their quantization tools.
Model tree for InsecureErasure/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-NVFP4
Base model
Wan-AI/Wan2.1-I2V-14B-480P