# Qwen2.5-VL-72B-Instruct-heretic (FP8, compressed-tensors)
FP8 weight-only quantization of coder3101/Qwen2.5-VL-72B-Instruct-heretic in compressed-tensors format for native vLLM inference.
Built for and tested on AMD Instinct MI-300x (192GB) with ROCm 7.1.
## Why This Exists
There are very few FP8-quantized vision-language models available for the ROCm ecosystem. Running Qwen2.5-VL-72B in bf16 requires 144GB of VRAM, which fits on MI-300x but leaves limited headroom for KV cache and concurrent requests. This FP8 quantization cuts model weight memory in half (72GB) while preserving vision encoder quality, making it practical for production VL workloads on AMD hardware.
## Model Details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-72B-Instruct |
| Abliteration | coder3101/Qwen2.5-VL-72B-Instruct-heretic, Heretic v1.2.0 (MPOA), KL divergence 0.0156, refusals 8/100 |
| Quantization | FP8 weight-only (`float8_e4m3fn`), per-channel scales (`float16`) |
| Format | compressed-tensors (`quant_method: compressed-tensors`), auto-detected by vLLM |
| Size on disk | ~72GB (31 safetensors shards) |
| Quantized layers | 560 Linear layers (q/k/v/o_proj, gate/up/down_proj across 80 transformer layers) |
| Preserved in original precision | Vision encoder, lm_head, embeddings, RMSNorm layers, RoPE |
## ROCm Compatibility
### Tested Configuration
| Component | Version / Details |
|---|---|
| GPU | AMD Instinct MI-300x (192GB HBM3) |
| ROCm | 7.1 |
| vLLM | 0.17.2 via rocm/vllm-dev:nightly Docker image |
| KV cache | FP8 (--kv-cache-dtype fp8) |
| Driver | amdgpu (gfx942) |
### ROCm-Specific Notes
- Use `rocm/vllm-dev:nightly`, not `:main`. The `:main` tag ships vLLM 0.7.4, which predates Qwen2.5-VL support; the `:nightly` tag (0.17.2+) has full support.
- `HSA_OVERRIDE_GFX_VERSION=9.4.2` is required for MI-300x.
- `VLLM_USE_TRITON_FLASH_ATTN=0` disables Triton flash attention in favor of CK (Composable Kernel) flash attention, which is more stable on ROCm.
- `VLLM_USE_AITER=1` enables AMD's AIter optimizations for improved throughput.
- `PYTORCH_ALLOC_CONF=expandable_segments:True` reduces memory fragmentation on AMD GPUs.
- No `--quantization` flag is needed. vLLM reads the `quantization_config` from `config.json` and handles the FP8 weights automatically via the compressed-tensors format.
- Should also work on MI-250X and MI-210 with the appropriate `HSA_OVERRIDE_GFX_VERSION`, but this is untested.
## VRAM Budget (MI-300x, 192GB)
| Component | VRAM |
|---|---|
| Model weights (FP8) | ~72GB |
| KV cache (FP8, 32K context, 16 seqs) | ~20-25GB |
| Overhead | ~5GB |
| Total at `gpu-memory-utilization 0.50` | ~96GB |
| Free for other workloads | ~96GB |
This leaves substantial headroom. On MI-300x you can comfortably run this model alongside other GPU workloads, or increase `--max-model-len` and `--max-num-seqs` for higher throughput.
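The KV-cache row above can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming Qwen2.5-72B's published geometry (80 layers, 8 KV heads via GQA, head dim 128) and 1 byte per value for the FP8 KV cache; the pool is shared across all concurrent sequences rather than pre-allocated per sequence:

```python
# KV cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
layers, kv_heads, head_dim, fp8_bytes = 80, 8, 128, 1
per_token = 2 * layers * kv_heads * head_dim * fp8_bytes
print(f"{per_token / 1024:.0f} KiB per token")  # 160 KiB

# A pool in the ~20-25GB range therefore holds roughly this many tokens
# in total, shared across all in-flight requests:
pool_gib = 24
print(f"~{pool_gib * 2**30 // per_token:,} tokens")  # ~150K tokens
```

At 160 KiB/token, a single full 32K-token sequence costs about 5GB, which is why the pooled budget comfortably covers a mix of shorter concurrent requests.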
## Quick Start: vLLM + ROCm
### Docker Compose (recommended for ROCm)
```yaml
services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: qwen25-vl-72b
    restart: unless-stopped
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    ipc: host
    shm_size: "16gb"
    environment:
      - HIP_VISIBLE_DEVICES=0
      - HSA_OVERRIDE_GFX_VERSION=9.4.2
      - PYTORCH_ROCM_ARCH=gfx942
      - VLLM_USE_TRITON_FLASH_ATTN=0
      - VLLM_USE_AITER=1
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - AMD_LOG_LEVEL=0
    volumes:
      - /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8:/models/qwen25-vl:ro
      - /path/to/cache/aiter:/root/.cache/aiter
      - /path/to/cache/huggingface:/root/.cache/huggingface
    ports:
      - "127.0.0.1:8000:8000"
    command: >
      vllm serve /models/qwen25-vl
      --host 0.0.0.0
      --port 8000
      --kv-cache-dtype fp8
      --dtype bfloat16
      --tensor-parallel-size 1
      --max-model-len 32768
      --gpu-memory-utilization 0.50
      --enable-chunked-prefill
      --max-num-seqs 16
      --served-model-name qwen2.5-vl-72b
      --limit-mm-per-prompt '{"image": 4}'
      --trust-remote-code
```
### CLI (inside the `rocm/vllm-dev:nightly` container)
```bash
vllm serve /path/to/Qwen2.5-VL-72B-Instruct-heretic-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --kv-cache-dtype fp8 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.50 \
  --enable-chunked-prefill \
  --max-num-seqs 16 \
  --served-model-name qwen2.5-vl-72b \
  --limit-mm-per-prompt '{"image": 4}' \
  --trust-remote-code
```
## Example: Vision-Language Request
```python
from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2.5-vl-72b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
## Quantization Method
Direct per-channel FP8 quantization applied to the safetensors shards offline: no calibration data required, no GPU required for conversion. For each 2D Linear weight matrix:

1. Compute the per-channel absolute max: `amax = weight.abs().amax(dim=1, keepdim=True)`
2. Compute the scale: `scale = amax / 448.0` (448 is the largest value representable in `float8_e4m3fn`)
3. Quantize: `(weight / scale).clamp(-448, 448).to(float8_e4m3fn)`
4. Store the scale alongside the weight as `float16` with shape `[out_features, 1]`
The `quantization_config` in `config.json` uses the compressed-tensors spec, so vLLM auto-detects the format; no `--quantization` flag is needed.
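The per-channel steps above can be sketched in a few lines. This is an illustrative numpy simulation, not the actual conversion script: numpy has no FP8 dtype, so the final cast to `torch.float8_e4m3fn` is approximated here by clamping to the ±448 range, while the scale computation and storage layout match the scheme described above.

```python
import numpy as np

F8_MAX = 448.0  # largest finite value of float8_e4m3fn

def quantize_per_channel(weight: np.ndarray):
    """Per-channel (per-output-row) FP8-style quantization of a 2D weight."""
    amax = np.abs(weight).max(axis=1, keepdims=True)   # [out_features, 1]
    scale = (amax / F8_MAX).astype(np.float16)         # stored alongside the weight
    q = np.clip(weight / scale.astype(np.float32), -F8_MAX, F8_MAX)
    return q, scale  # real pipeline: q would be cast to float8_e4m3fn here

def dequantize(q: np.ndarray, scale: np.ndarray):
    return q * scale.astype(np.float32)

w = np.random.randn(8, 16).astype(np.float32)
q, scale = quantize_per_channel(w)
print(scale.shape)                                  # (8, 1)
print(np.abs(dequantize(q, scale) - w).max())       # small reconstruction error
```

Dividing by the scale maps each row's `amax` exactly to 448, so the clamp only catches rounding edge cases; the real quantization error comes from the FP8 cast omitted in this simulation.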
### What's preserved (not quantized)
| Component | Reason |
|---|---|
| Vision encoder (`visual.*`) | Quantizing it degrades image-understanding quality, and the vision encoder is small relative to the LLM backbone |
| `lm_head` | Quantizing the output projection hurts token distribution quality |
| Embeddings (`embed_tokens`) | Lookup table, not a matrix multiply |
| RMSNorm layers | Must remain high precision for numerical stability |
| RoPE embeddings | Positional encoding, not learned weights |
## Abliteration Details
From the heretic base model card:
| Parameter | Value |
|---|---|
| Method | Heretic v1.2.0, Magnitude-Preserving Orthogonal Ablation (MPOA) |
| KL divergence from original | 0.0156 (minimal quality impact) |
| Refusals | 8/100 (down from 100/100) |
| direction_index | 56.27 |
| attn.o_proj.max_weight | 1.33 |
| mlp.down_proj.max_weight | 1.48 |
## Known Issues
- Tokenizer regex warning: You'll see a warning about an "incorrect regex pattern" referencing a Mistral HF discussion. This is a false positive from the transformers tokenizer validator; tokenization works correctly. Ignore it.
- `rocm/vllm-dev:main` does NOT work: That image ships vLLM 0.7.4, which fails with an `AssertionError` on `config.text_config.num_attention_heads` for Qwen2.5-VL models. Use `:nightly` (0.17.2+).
## Credits
- Qwen team for Qwen2.5-VL-72B-Instruct
- coder3101 for the heretic abliteration
- vLLM project for inference serving
- AMD ROCm team for the open-source GPU compute platform