Qwen2.5-VL-72B-Instruct-heretic: FP8 (compressed-tensors)

FP8 weight-only quantization of coder3101/Qwen2.5-VL-72B-Instruct-heretic in compressed-tensors format for native vLLM inference.

Built for and tested on the AMD Instinct MI300X (192GB) with ROCm 7.1.

Why This Exists

There are very few FP8-quantized vision-language models available for the ROCm ecosystem. Running Qwen2.5-VL-72B in bf16 needs ~144GB of VRAM for the weights alone, which fits on an MI300X but leaves limited headroom for KV cache and concurrent requests. This FP8 quantization cuts weight memory in half (~72GB) while preserving vision encoder quality, making the model practical for production VL workloads on AMD hardware.

Model Details

Base model: Qwen/Qwen2.5-VL-72B-Instruct
Abliteration: coder3101/Qwen2.5-VL-72B-Instruct-heretic (Heretic v1.2.0, MPOA; KL divergence 0.0156; refusals 8/100)
Quantization: FP8 weight-only (float8_e4m3fn) with per-channel scales stored as float16
Format: compressed-tensors (quant_method: compressed-tensors), auto-detected by vLLM
Size on disk: ~72GB (31 safetensors shards)
Quantized layers: 560 Linear layers (q/k/v/o_proj and gate/up/down_proj across 80 transformer layers)
Preserved in original precision: vision encoder, lm_head, embeddings, RMSNorm layers, RoPE

ROCm Compatibility

Tested Configuration

GPU: AMD Instinct MI300X (192GB HBM3)
ROCm: 7.1
vLLM: 0.17.2 via the rocm/vllm-dev:nightly Docker image
KV cache: FP8 (--kv-cache-dtype fp8)
Driver: amdgpu (gfx942)

ROCm-Specific Notes

  • Use rocm/vllm-dev:nightly, not :main. The :main tag ships vLLM 0.7.4, which predates Qwen2.5-VL support; the :nightly tag (0.17.2+) has full support.
  • HSA_OVERRIDE_GFX_VERSION=9.4.2 is required for the MI300X.
  • VLLM_USE_TRITON_FLASH_ATTN=0 disables Triton flash attention in favor of CK (Composable Kernel) flash attention, which is more stable on ROCm.
  • VLLM_USE_AITER=1 enables AMD's AIter optimizations for improved throughput.
  • PYTORCH_ALLOC_CONF=expandable_segments:True reduces memory fragmentation on AMD GPUs.
  • No --quantization flag is needed: vLLM reads the quantization_config from config.json and handles the FP8 weights automatically via the compressed-tensors format.
  • Should also work on MI250X and MI210 with the appropriate HSA_OVERRIDE_GFX_VERSION, but this is untested.

VRAM Budget (MI300X, 192GB)

Model weights (FP8): ~72GB
KV cache (FP8, 32K context, 16 seqs): ~20-25GB
Overhead: ~5GB
Total at --gpu-memory-utilization 0.50: ~96GB
Free for other workloads: ~96GB

This leaves substantial headroom. On the MI300X you can comfortably run this model alongside other GPU workloads, or raise --max-model-len and --max-num-seqs for higher throughput.
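The KV-cache line in the budget above can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, assuming the published Qwen2.5-72B backbone geometry (80 layers, 8 KV heads under GQA, head dim 128 = 8192 hidden / 64 attention heads); these numbers come from the upstream model config, not from anything specific to this repo:

```python
# Back-of-envelope KV-cache sizing for a Qwen2.5-72B-class backbone.
NUM_LAYERS = 80      # num_hidden_layers
NUM_KV_HEADS = 8     # num_key_value_heads (GQA)
HEAD_DIM = 128       # hidden_size / num_attention_heads = 8192 / 64

def kv_bytes_per_token(dtype_bytes: int = 1) -> int:
    """Bytes of KV cache one token occupies (K and V, across all layers)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * dtype_bytes

def tokens_that_fit(pool_gib: float, dtype_bytes: int = 1) -> int:
    """How many tokens a KV pool of `pool_gib` GiB can hold."""
    return int(pool_gib * 2**30 // kv_bytes_per_token(dtype_bytes))

print(kv_bytes_per_token())    # 163840 bytes = 160 KiB per token at fp8
print(tokens_that_fit(20.0))   # 131072 tokens, i.e. 4 full 32K sequences
```

So a ~20GB fp8 pool holds roughly 131K tokens; with fp16 KV cache that capacity would halve, which is why --kv-cache-dtype fp8 is used here.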

Quick Start โ€” vLLM + ROCm

Docker Compose (recommended for ROCm)

services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: qwen25-vl-72b
    restart: unless-stopped

    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp=unconfined
    ipc: host
    shm_size: "16gb"

    environment:
      - HIP_VISIBLE_DEVICES=0
      - HSA_OVERRIDE_GFX_VERSION=9.4.2
      - PYTORCH_ROCM_ARCH=gfx942
      - VLLM_USE_TRITON_FLASH_ATTN=0
      - VLLM_USE_AITER=1
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - AMD_LOG_LEVEL=0

    volumes:
      - /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8:/models/qwen25-vl:ro
      - /path/to/cache/aiter:/root/.cache/aiter
      - /path/to/cache/huggingface:/root/.cache/huggingface

    ports:
      - "127.0.0.1:8000:8000"

    command: >
      vllm serve /models/qwen25-vl
      --host 0.0.0.0
      --port 8000
      --kv-cache-dtype fp8
      --dtype bfloat16
      --tensor-parallel-size 1
      --max-model-len 32768
      --gpu-memory-utilization 0.50
      --enable-chunked-prefill
      --max-num-seqs 16
      --served-model-name qwen2.5-vl-72b
      --limit-mm-per-prompt '{"image": 4}'
      --trust-remote-code

CLI (inside rocm/vllm-dev:nightly container)

vllm serve /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --kv-cache-dtype fp8 \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.50 \
    --enable-chunked-prefill \
    --max-num-seqs 16 \
    --served-model-name qwen2.5-vl-72b \
    --limit-mm-per-prompt '{"image": 4}' \
    --trust-remote-code

Example โ€” Vision-Language Request

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl-72b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }],
    max_tokens=512
)
print(response.choices[0].message.content)

Quantization Method

Direct per-channel FP8 quantization applied offline to the safetensors shards; no calibration data and no GPU are required for the conversion. For each 2D Linear weight matrix:

  1. Compute per-channel absolute max: amax = weight.abs().amax(dim=1, keepdim=True)
  2. Compute scale: scale = amax / 448.0 (448 = float8_e4m3fn max representable value)
  3. Quantize: (weight / scale).clamp(-448, 448).to(float8_e4m3fn)
  4. Store scale alongside weight as float16 with shape [out_features, 1]
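The four steps above can be sketched as follows (a numpy stand-in: numpy has no float8 dtype, so the final e4m3 cast is only simulated by the clamp; the actual conversion casts with torch's float8_e4m3fn):

```python
import numpy as np

F8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_channel(weight: np.ndarray):
    """Steps 1-4 for one 2D Linear weight of shape [out_features, in_features]."""
    amax = np.abs(weight).max(axis=1, keepdims=True)        # 1. per-channel abs max
    scale = amax / F8_E4M3_MAX                              # 2. scale
    q = np.clip(weight / scale, -F8_E4M3_MAX, F8_E4M3_MAX)  # 3. quantize (fp8 cast simulated)
    return q, scale.astype(np.float16)                      # 4. fp16 scale, shape [out, 1]

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_per_channel(w)
assert scale.shape == (4, 1)
assert np.allclose(q * scale, w, atol=1e-2)  # dequantization round-trips closely
```

Per-channel (rather than per-tensor) scales keep outlier rows from inflating the quantization error of the whole matrix.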

The quantization_config in config.json uses the compressed-tensors spec, so vLLM auto-detects the format with no --quantization flag needed.
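For orientation, a weight-only FP8 quantization_config in this format looks roughly like the snippet below. This is abridged and illustrative, with field names following the compressed-tensors spec; it is not copied verbatim from this repo's config.json:

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "float-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "type": "float",
        "strategy": "channel",
        "symmetric": true,
        "dynamic": false
      },
      "input_activations": null
    }
  },
  "ignore": ["lm_head", "re:visual.*"]
}
```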

What's preserved (not quantized)

Vision encoder (visual.*): quantizing it degrades image understanding quality, and it is small relative to the LLM backbone
lm_head: quantizing the output projection hurts token distribution quality
Embeddings (embed_tokens): a lookup table, not a matrix multiply
RMSNorm layers: must stay high precision for numerical stability
RoPE embeddings: positional encoding, not learned weights
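The skip rules above amount to a simple filter on tensor names. A hypothetical sketch (names follow the Hugging Face Qwen2.5-VL checkpoint layout; `should_quantize` is an illustrative helper, not code from the actual conversion script):

```python
# The 7 backbone projections; 80 layers x 7 = the 560 quantized Linears.
QUANT_SUFFIXES = tuple(
    f"{name}.weight"
    for name in ("q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj")
)

def should_quantize(tensor_name: str) -> bool:
    """Quantize only LLM-backbone Linear weights; keep everything else."""
    if tensor_name.startswith("visual."):     # vision encoder stays full precision
        return False
    return tensor_name.endswith(QUANT_SUFFIXES)

print(should_quantize("model.layers.0.self_attn.q_proj.weight"))  # True
print(should_quantize("visual.blocks.0.mlp.up_proj.weight"))      # False
print(should_quantize("lm_head.weight"))                          # False
print(should_quantize("model.layers.0.input_layernorm.weight"))   # False
```

Norm layers, embeddings, and lm_head fall through the suffix check automatically, since none of them end in a `*_proj.weight` name.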

Abliteration Details

From the heretic base model card:

Method: Heretic v1.2.0, Magnitude-Preserving Orthogonal Ablation (MPOA)
KL divergence from original: 0.0156 (minimal quality impact)
Refusals: 8/100 (down from 100/100)
direction_index: 56.27
attn.o_proj.max_weight: 1.33
mlp.down_proj.max_weight: 1.48

Known Issues

  • Tokenizer regex warning: you'll see a warning about an "incorrect regex pattern" referencing a Mistral HF discussion. This is a false positive from the transformers tokenizer validator; tokenization works correctly, so the warning can be ignored.
  • rocm/vllm-dev:main does NOT work: That image ships vLLM 0.7.4, which fails with AssertionError on config.text_config.num_attention_heads for Qwen2.5-VL models. Use :nightly (0.17.2+).
