Qwen2.5-VL-72B-Instruct-heretic: FP8 (compressed-tensors)

FP8 weight-only quantization of coder3101/Qwen2.5-VL-72B-Instruct-heretic in compressed-tensors format for native vLLM inference.

Built for and tested on the AMD Instinct MI300X (192GB) with ROCm 7.1.

Why This Exists

There are very few FP8-quantized vision-language models available for the ROCm ecosystem. Running Qwen2.5-VL-72B in bf16 needs ~144GB of VRAM for the weights alone, which fits on an MI300X but leaves limited headroom for KV cache and concurrent requests. This FP8 quantization cuts weight memory in half (~72GB) while preserving vision encoder quality, making the model practical for production VL workloads on AMD hardware.

Model Details

Base model: Qwen/Qwen2.5-VL-72B-Instruct
Abliteration: coder3101/Qwen2.5-VL-72B-Instruct-heretic (Heretic v1.2.0, MPOA; KL divergence 0.0156; refusals 8/100)
Quantization: FP8 weight-only (float8_e4m3fn) with per-channel scales stored as float16
Format: compressed-tensors (quant_method: compressed-tensors), auto-detected by vLLM
Size on disk: ~72GB (31 safetensors shards)
Quantized layers: 560 Linear layers (q/k/v/o_proj and gate/up/down_proj across 80 transformer layers)
Preserved in original precision: vision encoder, lm_head, embeddings, RMSNorm layers, RoPE

ROCm Compatibility

Tested Configuration

GPU: AMD Instinct MI300X (192GB HBM3)
ROCm: 7.1
vLLM: 0.17.2 via the rocm/vllm-dev:nightly Docker image
KV cache: FP8 (--kv-cache-dtype fp8)
Driver: amdgpu (gfx942)

ROCm-Specific Notes

  • Use rocm/vllm-dev:nightly, not :main. The :main tag ships vLLM 0.7.4, which predates Qwen2.5-VL support; the :nightly tag (0.17.2+) has full support.
  • HSA_OVERRIDE_GFX_VERSION=9.4.2 is required for the MI300X.
  • VLLM_USE_TRITON_FLASH_ATTN=0 disables Triton flash attention in favor of CK (Composable Kernel) flash attention, which is more stable on ROCm.
  • VLLM_USE_AITER=1 enables AMD's AIter optimizations for improved throughput.
  • PYTORCH_ALLOC_CONF=expandable_segments:True reduces memory fragmentation on AMD GPUs.
  • No --quantization flag is needed: vLLM reads the quantization_config from config.json and handles the FP8 weights automatically via the compressed-tensors format.
  • Should also work on MI250X and MI210 with the appropriate HSA_OVERRIDE_GFX_VERSION, but this is untested.

VRAM Budget (MI300X, 192GB)

Model weights (FP8): ~72GB
KV cache (FP8, 32K context, 16 seqs): ~20-25GB
Overhead: ~5GB
Total at --gpu-memory-utilization 0.50: ~96GB
Free for other workloads: ~96GB

This leaves substantial headroom. On the MI300X you can comfortably run this model alongside other GPU workloads, or raise --max-model-len and --max-num-seqs for higher throughput.
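The KV-cache line in the budget above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming the published Qwen2.5-72B backbone geometry (80 layers, 8 KV heads under GQA, head dim 128 = 8192 hidden / 64 attention heads); these numbers come from the upstream model config, not from anything specific to this repo:

```python
# Back-of-envelope KV-cache sizing for a Qwen2.5-72B-class backbone.
NUM_LAYERS = 80      # num_hidden_layers
NUM_KV_HEADS = 8     # num_key_value_heads (GQA)
HEAD_DIM = 128       # hidden_size / num_attention_heads = 8192 / 64

def kv_bytes_per_token(dtype_bytes: int = 1) -> int:
    """Bytes of KV cache one token occupies (K and V, across all layers)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * dtype_bytes

def tokens_that_fit(pool_gib: float, dtype_bytes: int = 1) -> int:
    """How many tokens a KV pool of `pool_gib` GiB can hold."""
    return int(pool_gib * 2**30 // kv_bytes_per_token(dtype_bytes))

print(kv_bytes_per_token())    # 163840 bytes = 160 KiB per token at fp8
print(tokens_that_fit(20.0))   # 131072 tokens, i.e. 4 full 32K sequences
```

So a ~20GB fp8 pool holds roughly 131K tokens; with fp16 KV cache that capacity would halve, which is why --kv-cache-dtype fp8 is used here.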

Quick Start โ€” vLLM + ROCm

Docker Compose (recommended for ROCm)

services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: qwen25-vl-72b
    restart: unless-stopped

    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp=unconfined
    ipc: host
    shm_size: "16gb"

    environment:
      - HIP_VISIBLE_DEVICES=0
      - HSA_OVERRIDE_GFX_VERSION=9.4.2
      - PYTORCH_ROCM_ARCH=gfx942
      - VLLM_USE_TRITON_FLASH_ATTN=0
      - VLLM_USE_AITER=1
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - AMD_LOG_LEVEL=0

    volumes:
      - /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8:/models/qwen25-vl:ro
      - /path/to/cache/aiter:/root/.cache/aiter
      - /path/to/cache/huggingface:/root/.cache/huggingface

    ports:
      - "127.0.0.1:8000:8000"

    command: >
      vllm serve /models/qwen25-vl
      --host 0.0.0.0
      --port 8000
      --kv-cache-dtype fp8
      --dtype bfloat16
      --tensor-parallel-size 1
      --max-model-len 32768
      --gpu-memory-utilization 0.50
      --enable-chunked-prefill
      --max-num-seqs 16
      --served-model-name qwen2.5-vl-72b
      --limit-mm-per-prompt '{"image": 4}'
      --trust-remote-code

CLI (inside rocm/vllm-dev:nightly container)

vllm serve /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --kv-cache-dtype fp8 \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.50 \
    --enable-chunked-prefill \
    --max-num-seqs 16 \
    --served-model-name qwen2.5-vl-72b \
    --limit-mm-per-prompt '{"image": 4}' \
    --trust-remote-code

Example โ€” Vision-Language Request

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl-72b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }],
    max_tokens=512
)
print(response.choices[0].message.content)

Quantization Method

Direct per-channel FP8 quantization applied offline to the safetensors shards; no calibration data and no GPU are required for the conversion. For each 2D Linear weight matrix:

  1. Compute per-channel absolute max: amax = weight.abs().amax(dim=1, keepdim=True)
  2. Compute scale: scale = amax / 448.0 (448 = float8_e4m3fn max representable value)
  3. Quantize: (weight / scale).clamp(-448, 448).to(float8_e4m3fn)
  4. Store scale alongside weight as float16 with shape [out_features, 1]
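The four steps above can be sketched as follows (a numpy stand-in: numpy has no float8 dtype, so the final e4m3 cast is only simulated by the clamp; the actual conversion casts with torch's float8_e4m3fn):

```python
import numpy as np

F8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_channel(weight: np.ndarray):
    """Steps 1-4 for one 2D Linear weight of shape [out_features, in_features]."""
    amax = np.abs(weight).max(axis=1, keepdims=True)        # 1. per-channel abs max
    scale = amax / F8_E4M3_MAX                              # 2. scale
    q = np.clip(weight / scale, -F8_E4M3_MAX, F8_E4M3_MAX)  # 3. quantize (fp8 cast simulated)
    return q, scale.astype(np.float16)                      # 4. fp16 scale, shape [out, 1]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
assert scale.shape == (4, 1)
assert np.allclose(q * scale, w, atol=1e-2)  # dequantization round-trips closely
```

Per-channel (rather than per-tensor) scales keep outlier rows from inflating the quantization error of the whole matrix.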

The quantization_config in config.json uses the compressed-tensors spec, so vLLM auto-detects the format with no --quantization flag needed.
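For orientation, a weight-only FP8 quantization_config in this format looks roughly like the snippet below. This is abridged and illustrative, with field names following the compressed-tensors spec; it is not copied verbatim from this repo's config.json:

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "float-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "type": "float",
        "strategy": "channel",
        "symmetric": true,
        "dynamic": false
      },
      "input_activations": null
    }
  },
  "ignore": ["lm_head", "re:visual.*"]
}
```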

What's preserved (not quantized)

Vision encoder (visual.*): quantizing it degrades image understanding quality, and it is small relative to the LLM backbone
lm_head: quantizing the output projection hurts token distribution quality
Embeddings (embed_tokens): a lookup table, not a matrix multiply
RMSNorm layers: must stay high precision for numerical stability
RoPE embeddings: positional encoding, not learned weights
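The skip rules above amount to a simple filter on tensor names. A hypothetical sketch (names follow the Hugging Face Qwen2.5-VL checkpoint layout; `should_quantize` is an illustrative helper, not code from the actual conversion script):

```python
# The 7 backbone projections; 80 layers x 7 = the 560 quantized Linears.
QUANT_SUFFIXES = tuple(
    f"{name}.weight"
    for name in ("q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj")
)

def should_quantize(tensor_name: str) -> bool:
    """Quantize only LLM-backbone Linear weights; keep everything else."""
    if tensor_name.startswith("visual."):     # vision encoder stays full precision
        return False
    return tensor_name.endswith(QUANT_SUFFIXES)

print(should_quantize("model.layers.0.self_attn.q_proj.weight"))  # True
print(should_quantize("visual.blocks.0.mlp.up_proj.weight"))      # False
print(should_quantize("lm_head.weight"))                          # False
print(should_quantize("model.layers.0.input_layernorm.weight"))   # False
```

Norm layers, embeddings, and lm_head fall through the suffix check automatically, since none of them end in a `*_proj.weight` name.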

Abliteration Details

From the heretic base model card:

Method: Heretic v1.2.0, Magnitude-Preserving Orthogonal Ablation (MPOA)
KL divergence from original: 0.0156 (minimal quality impact)
Refusals: 8/100 (down from 100/100)
direction_index: 56.27
attn.o_proj.max_weight: 1.33
mlp.down_proj.max_weight: 1.48

Known Issues

  • Tokenizer regex warning: you'll see a warning about an "incorrect regex pattern" referencing a Mistral HF discussion. This is a false positive from the transformers tokenizer validator; tokenization works correctly, so the warning can be ignored.
  • rocm/vllm-dev:main does NOT work: That image ships vLLM 0.7.4, which fails with AssertionError on config.text_config.num_attention_heads for Qwen2.5-VL models. Use :nightly (0.17.2+).
