# Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only

This is an FP8 W8A8 quantized version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), created with [llm-compressor](https://github.com/vllm-project/llm-compressor).

Only the LLM decoder is quantized. The Vision Transformer (ViT) encoder remains in BF16 precision.

## Model Summary

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Quantization | FP8 W8A8 (8-bit float weights, 8-bit float activations) |
| Quantization Scope | LLM decoder only (ViT encoder in BF16) |
| Strategy | Per-tensor, static, symmetric (minmax observer) |
| Format | compressed-tensors (float-quantized) |
| Model Size | ~9.8 GB (3 shards) |
| Ignored Layers | `lm_head`, all `model.visual.*` layers |
| Tool | llm-compressor v0.7.1 |
| Supported Runtime | vLLM (with compressed-tensors) |
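
To confirm what was and was not quantized without downloading the full checkpoint, you can inspect the `quantization_config` block that compressed-tensors writes into the repo's `config.json`. A minimal sketch, assuming only `huggingface_hub` is installed (key names follow the compressed-tensors format):

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json, not the ~9.8 GB of weights.
path = hf_hub_download(
    "JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only", "config.json"
)
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg["format"])         # expected: "float-quantized"
print(qcfg["ignore"])         # expected: lm_head and the model.visual.* layers
print(qcfg["config_groups"])  # per-tensor FP8 for weights and input activations
```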

## Quantization Details

- **Weights:** FP8 (`float8_e4m3fn`), per-tensor static quantization with a minmax observer (see the numeric sketch after this list)
- **Activations:** FP8 (`float8_e4m3fn`), per-tensor static quantization with a minmax observer
- **Ignored:** `lm_head` (kept in BF16) and all ViT encoder layers (`model.visual.*`)
- **Calibration:** 512 samples from CNN/DailyMail, max sequence length 2048
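
To illustrate what per-tensor, static, symmetric FP8 quantization does numerically (a sketch of the scheme, not llm-compressor's actual code): the minmax observer takes the tensor's absolute maximum, derives a single scale against the `float8_e4m3fn` range, and stores the scaled values in FP8.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_per_tensor_fp8(x: torch.Tensor):
    """Per-tensor, static, symmetric FP8 quantization (minmax observer)."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX  # one scale per tensor
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_tensor_fp8(w)
w_hat = q.to(torch.float32) * scale  # dequantize to check the error
print(f"max abs error: {(w - w_hat).abs().max().item():.4f}")
```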

## Quantization Recipe

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:model.visual.*"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          targets: ["Linear"]
```

## Usage

### With vLLM

```bash
# Select the PyTorch SDPA attention backend.
export VLLM_ATTENTION_BACKEND=TORCH_SDPA

vllm serve JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only \
    --trust-remote-code \
    --max-model-len 4096 \
    --enforce-eager
```
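
Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client-side sketch (the demo image URL is reused from the Transformers example below):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {
            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
        }},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```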

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    torch_dtype="auto",
    device_map="auto",
)
# The processor comes from the base repo, which ships the tokenizer
# and image-processor configs.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result[0])
```

## Environment

| Component | Version |
|---|---|
| llm-compressor | 0.7.1 |
| compressed-tensors | 0.11.0 |
| transformers | 4.55.2 |
| torch | 2.8.0+cu128 |
| GPU | NVIDIA RTX PRO 6000 (98 GB) |
