# Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only

This is an FP8 W8A8 quantized version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), created with [llm-compressor](https://github.com/vllm-project/llm-compressor).

Only the LLM decoder is quantized. The Vision Transformer (ViT) encoder remains in BF16 precision.

## Model Summary

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
| Quantization | FP8 W8A8 (8-bit float weights, 8-bit float activations) |
| Quantization Scope | LLM decoder only (ViT encoder in BF16) |
| Strategy | Per-tensor, static, symmetric (minmax observer) |
| Format | compressed-tensors (float-quantized) |
| Model Size | ~9.8 GB (3 shards) |
| Ignored Layers | `lm_head`, all `model.visual.*` layers |
| Tool | llm-compressor v0.7.1 |
| Supported Runtime | vLLM (with compressed-tensors) |
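
To confirm what was and was not quantized without downloading the full checkpoint, you can inspect the `quantization_config` block that compressed-tensors writes into the repo's `config.json`. A minimal sketch, assuming only `huggingface_hub` is installed (key names follow the compressed-tensors format):

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json, not the ~9.8 GB of weights.
path = hf_hub_download(
    "JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only", "config.json"
)
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg["format"])         # expected: "float-quantized"
print(qcfg["ignore"])         # expected: lm_head and the model.visual.* layers
print(qcfg["config_groups"])  # per-tensor FP8 for weights and input activations
```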

## Quantization Details

- **Weights:** FP8 (`float8_e4m3fn`), per-tensor static quantization with a minmax observer (see the numeric sketch after this list)
- **Activations:** FP8 (`float8_e4m3fn`), per-tensor static quantization with a minmax observer
- **Ignored:** `lm_head` (kept in BF16) and all ViT encoder layers (`model.visual.*`)
- **Calibration:** 512 samples from CNN/DailyMail, max sequence length 2048
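
To illustrate what per-tensor, static, symmetric FP8 quantization does numerically (a sketch of the scheme, not llm-compressor's actual code): the minmax observer takes the tensor's absolute maximum, derives a single scale against the `float8_e4m3fn` range, and stores the scaled values in FP8.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_per_tensor_fp8(x: torch.Tensor):
    """Per-tensor, static, symmetric FP8 quantization (minmax observer)."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX  # one scale per tensor
    q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_per_tensor_fp8(w)
w_hat = q.to(torch.float32) * scale  # dequantize to check the error
print(f"max abs error: {(w - w_hat).abs().max().item():.4f}")
```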

## Quantization Recipe

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:model.visual.*"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          targets: ["Linear"]
```

## Usage

### With vLLM

```bash
# Select the PyTorch SDPA attention backend.
export VLLM_ATTENTION_BACKEND=TORCH_SDPA

vllm serve JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only \
    --trust-remote-code \
    --max-model-len 4096 \
    --enforce-eager
```
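
Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal client-side sketch (the demo image URL is reused from the Transformers example below):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    messages=[{"role": "user", "content": [
        {"type": "image_url", "image_url": {
            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
        }},
        {"type": "text", "text": "Describe this image in detail."},
    ]}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```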

### With Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "JongYeop/Qwen2.5-VL-7B-Instruct-FP8-W8A8-LM-Only",
    torch_dtype="auto",
    device_map="auto",
)
# The processor comes from the base repo, which ships the tokenizer
# and image-processor configs.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
result = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result[0])
```

## Environment

| Component | Version |
|---|---|
| llm-compressor | 0.7.1 |
| compressed-tensors | 0.11.0 |
| transformers | 4.55.2 |
| torch | 2.8.0+cu128 |
| GPU | NVIDIA RTX PRO 6000 (98 GB) |
