# olmOCR-2-7B-1025-GPTQ-W4A16
INT4 weight-only quantization of allenai/olmOCR-2-7B-1025 using GPTQ with group activation ordering.
## Benchmark results
Evaluated on olmOCR-bench sample data (23 pages) with vLLM 0.18.0:
| Model | Aggregate | absent | baseline | math | order | present | table |
|---|---|---|---|---|---|---|---|
| This model (GPTQ W4A16) | 0.949 | 0.875 | 1.000 | 0.833 | 0.923 | 0.733 | 0.950 |
| Base BF16 | 0.909 | 0.875 | 0.952 | 0.750 | 1.000 | 0.700 | 0.650 |
The quantized model exceeds the BF16 baseline on aggregate (+4.4% relative, 0.949 vs 0.909) on this benchmark, driven primarily by stronger table and math handling.
Inference time for 23 pages at concurrency 4: 144s vs 201s for the official FP8 model (with cuBLAS fallback).
## Quantization details
| Parameter | Value |
|---|---|
| Method | GPTQ |
| Weight dtype | INT4 |
| Activation dtype | BF16 (weight-only, no activation quantization) |
| Group size | 128 |
| Symmetric | Yes |
| Activation ordering | Group (actorder="group") |
| Format | pack-quantized (compressed-tensors) |
| Excluded layers | lm_head, re:model.visual.* (entire visual encoder kept in BF16) |
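To make the table concrete, here is a minimal NumPy sketch of what symmetric INT4 weight-only quantization with group size 128 does to a weight row. This is an illustration of the scheme, not llm-compressor's actual implementation (which also applies the GPTQ error-compensation update and packs the codes into the compressed-tensors format):

```python
import numpy as np

def quantize_w4_sym_grouped(w, group_size=128):
    """Symmetric INT4 weight-only quantization with one scale per group.

    w: 1-D float weight row; length must be a multiple of group_size.
    Returns (q, scales): INT4 codes in [-8, 7] and per-group FP scales.
    """
    groups = w.reshape(-1, group_size)
    # Symmetric: the scale maps the max magnitude in each group onto 7,
    # so zero is represented exactly and no zero-point is stored.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # At inference the W4A16 kernel fuses this dequantize into the matmul.
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # two groups of 128
q, s = quantize_w4_sym_grouped(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by scale / 2 per group
```

Activations stay in BF16 throughout ("W4A16"): only the stored weights are compressed, and they are expanded back to high precision inside the matmul kernel.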
### Activation ordering
actorder="group" reorders weight columns by descending Hessian diagonal magnitude within each group before packing. This allocates quantization precision to the most sensitive weight dimensions and consistently improves perplexity over naive column ordering at negligible extra calibration cost.
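The reordering itself is just a permutation built from the Hessian diagonal. A small sketch of the idea, assuming the within-group reordering described above (llm-compressor's own logic additionally tracks the inverse permutation so packed weights still apply to unpermuted activations):

```python
import numpy as np

def group_actorder_perm(hessian_diag, group_size=128):
    """Permutation that sorts columns by descending Hessian diagonal
    within each contiguous group (sketch of actorder="group").

    Group boundaries are preserved, so each group keeps its own scale.
    """
    idx = np.arange(hessian_diag.size).reshape(-1, group_size)
    order = np.argsort(-hessian_diag.reshape(-1, group_size), axis=1)
    return np.take_along_axis(idx, order, axis=1).reshape(-1)

# Toy example: two groups of four columns.
h = np.array([0.1, 3.0, 0.5, 2.0,   # group 0
              1.0, 0.2, 4.0, 0.3])  # group 1
perm = group_actorder_perm(h, group_size=4)
# Most-sensitive columns (largest Hessian diagonal) are quantized first
# within each group, so they absorb the least accumulated error.
```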
### Visual encoder exclusion
The Qwen2.5-VL visual encoder (model.visual.*) is excluded from quantization and kept in BF16. INT4 quantization of vision transformer weights causes significant image quality degradation without offsetting memory savings (the encoder is a small fraction of total parameters). Only the language model backbone is quantized.
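The exclusion is driven by the recipe's `ignore` list, where plain entries match module names exactly and `re:`-prefixed entries are treated as regexes. A small sketch of those semantics (the real matching lives in llm-compressor/compressed-tensors; this just illustrates which modules the two patterns cover):

```python
import re

IGNORE = ["lm_head", "re:model.visual.*"]

def is_ignored(module_name, patterns=IGNORE):
    """Sketch of the ignore-list semantics: entries prefixed with 're:'
    are regexes matched against the module name; others match exactly."""
    for p in patterns:
        if p.startswith("re:"):
            if re.match(p[3:], module_name):
                return True
        elif p == module_name:
            return True
    return False

# Every module under model.visual (the whole vision tower) is skipped,
# while language-model projections like model.layers.*.self_attn.q_proj
# fall through to the INT4 config group.
```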
## Calibration
| Parameter | Value |
|---|---|
| Dataset | allenai/olmOCR-mix-1025, 00_documents config, train split |
| Samples | 2000 (randomly shuffled, seed 42) |
| Max sequence length | 8192 tokens |
| Input modality | Text-only (document natural_text wrapped in model chat template) |
Each calibration sample consists of a document page's extracted text wrapped in the standard olmOCR instruction prompt ("Convert this document to markdown:\n\n{text}"), tokenized with the model's processor. Images were not used during calibration — the visual encoder is excluded from quantization so image-side statistics are not needed.
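The sample-construction step above can be sketched in a few lines. This is a simplified stand-in (the `build_calibration_texts` helper is hypothetical): the real pipeline applies the model's chat template via its processor and truncates each sample to 8192 tokens, but the shuffle-seed and prompt-wrapping logic look like this:

```python
import random

PROMPT = "Convert this document to markdown:\n\n{text}"

def build_calibration_texts(documents, num_samples=2000, seed=42):
    """Shuffle document texts with a fixed seed and wrap each in the
    olmOCR instruction prompt. In the real run these strings are then
    tokenized by the model's processor and truncated to 8192 tokens."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # seed 42, matching the table above
    return [PROMPT.format(text=d) for d in docs[:num_samples]]

samples = build_calibration_texts(
    ["Page one text.", "Page two text."], num_samples=2
)
```

Because only the language-model backbone is calibrated, plain text is sufficient: no image tensors ever reach the quantizer.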
## Software
| Package | Version |
|---|---|
| llm-compressor | 0.10.1.dev48+g464d0004.d20260405 (editable dev install) |
| compressed-tensors | 0.14.1a20260326 |
| torch | 2.10.0+cu130 |
| Python | 3.13 |
| Hardware | NVIDIA GB10 (SM121), 128 GB unified memory |
## Reproduction
```shell
git clone https://github.com/charitarthchugh/olmocr-nvfp4
cd olmocr-nvfp4
uv sync --extra gpu

python quantize.py \
  --recipe quantization_configs/qwen2_5vl_gptq_w4a16.yaml \
  --output models/gptq_w4a16_2k \
  --num-samples 2000 \
  --max-seq-length 8192
```
The recipe (quantization_configs/qwen2_5vl_gptq_w4a16.yaml):
```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore:
        - "lm_head"
        - "re:model.visual.*"
      actorder: "group"
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: ["Linear"]
```
## Intended use
Drop-in replacement for allenai/olmOCR-2-7B-1025 in vLLM-based OCR pipelines. Requires vLLM ≥ 0.18.0 with compressed-tensors support. vLLM automatically selects the Marlin W4A16 GPTQ kernel for fused dequantize-and-multiply on Ampere (SM80) and newer GPUs.
```python
from vllm import LLM

llm = LLM(
    model="Charitarth/olmOCR-2-7B-1025-GPTQ-W4A16",
    gpu_memory_utilization=0.4,
    max_model_len=16384,
)
```