# Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only
This is an FP8 W8A8 quantized version of Qwen/Qwen2.5-VL-7B-Instruct, created with llm-compressor. Only the LLM decoder is quantized; the Vision Transformer (ViT) encoder remains in BF16 precision.
## Model Summary
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Quantization | FP8 W8A8 (8-bit float weights, 8-bit float activations) |
| Quantization Scope | LLM decoder only (ViT encoder in BF16) |
| Strategy | Per-tensor, static, symmetric (minmax observer) |
| Format | compressed-tensors (float-quantized) |
| Model Size | ~9.8 GB (3 shards) |
| Ignored Layers | `lm_head`, all `model.visual.*` layers |
| Tool | llm-compressor v0.7.1 |
| Supported Runtime | vLLM (with compressed-tensors) |
## Quantization Details
- Weights: FP8 (`float8_e4m3fn`), per-tensor static quantization with minmax observer
- Activations: FP8 (`float8_e4m3fn`), per-tensor static quantization with minmax observer
- Ignored: `lm_head` (kept in BF16) and all ViT encoder layers (`model.visual.*`)
- Calibration: 512 samples from CNN/DailyMail, max sequence length 2048
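A quantization run with these settings would look roughly like the sketch below. This is a reconstruction from the recipe and calibration settings on this card, not the original script; in particular, the preset `scheme="FP8"` is assumed to expand to the per-tensor static symmetric config shown in the recipe section.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
SAVE_DIR = "Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 preset: per-tensor, static, symmetric weights and activations.
# lm_head and every ViT layer are excluded, so only the LM decoder is quantized.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head", "re:model.visual.*"],
)

# Text-only calibration suffices here because the vision tower is ignored.
oneshot(
    model=model,
    dataset="cnn_dailymail",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```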
## Quantization Recipe
```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:model.visual.*"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          targets: ["Linear"]
```
## Usage

### With vLLM

```bash
export VLLM_ATTENTION_BACKEND=TORCH_SDPA

vllm serve JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only \
  --trust-remote-code \
  --max-model-len 4096 \
  --enforce-eager
```
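Once the server is up, it exposes vLLM's OpenAI-compatible API, so any OpenAI client can send multimodal chat requests. A minimal sketch, assuming the default endpoint `http://localhost:8000/v1`:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```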
### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the quantized checkpoint (compressed-tensors must be installed).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

# Build the chat prompt and collect the image/video inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens.
output = model.generate(**inputs, max_new_tokens=256)
result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result[0])
```
## Environment
| Component | Version |
|---|---|
| llm-compressor | 0.7.1 |
| compressed-tensors | 0.11.0 |
| transformers | 4.55.2 |
| torch | 2.8.0+cu128 |
| GPU | NVIDIA RTX PRO 6000 (98GB) |
## Acknowledgments
- Base model by Qwen Team
- Quantization powered by llm-compressor and compressed-tensors