# olmOCR-2-7B-1025-GPTQ-W4A16
INT4 weight-only quantization of allenai/olmOCR-2-7B-1025 using GPTQ with group activation ordering.
## Benchmark results
Evaluated on olmOCR-bench sample data (23 pages) with vLLM 0.18.0:
| Model | Aggregate | absent | baseline | math | order | present | table |
|---|---|---|---|---|---|---|---|
| This model (GPTQ W4A16) | 0.949 | 0.875 | 1.000 | 0.833 | 0.923 | 0.733 | 0.950 |
| Base BF16 | 0.909 | 0.875 | 0.952 | 0.750 | 1.000 | 0.700 | 0.650 |
The quantized model exceeds the BF16 baseline on aggregate (+4.4% relative, 0.949 vs 0.909) on this benchmark, driven primarily by stronger table and math handling.
Inference time for 23 pages at concurrency 4: 144s vs 201s for the official FP8 model (with cuBLAS fallback).
## Quantization details
| Parameter | Value |
|---|---|
| Method | GPTQ |
| Weight dtype | INT4 |
| Activation dtype | BF16 (weight-only, no activation quantization) |
| Group size | 128 |
| Symmetric | Yes |
| Activation ordering | Group (actorder="group") |
| Format | pack-quantized (compressed-tensors) |
| Excluded layers | lm_head, re:model.visual.* (entire visual encoder kept in BF16) |
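To make the table concrete, here is a minimal NumPy sketch of what symmetric INT4 weight-only quantization with group size 128 does to a weight row. This is an illustration of the scheme, not llm-compressor's actual implementation (which also applies the GPTQ error-compensation update and packs the codes into the compressed-tensors format):

```python
import numpy as np

def quantize_w4_sym_grouped(w, group_size=128):
    """Symmetric INT4 weight-only quantization with one scale per group.

    w: 1-D float weight row; length must be a multiple of group_size.
    Returns (q, scales): INT4 codes in [-8, 7] and per-group FP scales.
    """
    groups = w.reshape(-1, group_size)
    # Symmetric: the scale maps the max magnitude in each group onto 7,
    # so zero is represented exactly and no zero-point is stored.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # At inference the W4A16 kernel fuses this dequantize into the matmul.
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)  # two groups of 128
q, s = quantize_w4_sym_grouped(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by scale / 2 per group
```

Activations stay in BF16 throughout ("W4A16"): only the stored weights are compressed, and they are expanded back to high precision inside the matmul kernel.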
### Activation ordering
actorder="group" reorders weight columns by descending Hessian diagonal magnitude within each group before packing. This allocates quantization precision to the most sensitive weight dimensions and consistently improves perplexity over naive column ordering at negligible extra calibration cost.
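The reordering itself is just a permutation built from the Hessian diagonal. A small sketch of the idea, assuming the within-group reordering described above (llm-compressor's own logic additionally tracks the inverse permutation so packed weights still apply to unpermuted activations):

```python
import numpy as np

def group_actorder_perm(hessian_diag, group_size=128):
    """Permutation that sorts columns by descending Hessian diagonal
    within each contiguous group (sketch of actorder="group").

    Group boundaries are preserved, so each group keeps its own scale.
    """
    idx = np.arange(hessian_diag.size).reshape(-1, group_size)
    order = np.argsort(-hessian_diag.reshape(-1, group_size), axis=1)
    return np.take_along_axis(idx, order, axis=1).reshape(-1)

# Toy example: two groups of four columns.
h = np.array([0.1, 3.0, 0.5, 2.0,   # group 0
              1.0, 0.2, 4.0, 0.3])  # group 1
perm = group_actorder_perm(h, group_size=4)
# Most-sensitive columns (largest Hessian diagonal) are quantized first
# within each group, so they absorb the least accumulated error.
```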
### Visual encoder exclusion
The Qwen2.5-VL visual encoder (model.visual.*) is excluded from quantization and kept in BF16. INT4 quantization of vision transformer weights causes significant image quality degradation without offsetting memory savings (the encoder is a small fraction of total parameters). Only the language model backbone is quantized.
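The exclusion is driven by the recipe's `ignore` list, where plain entries match module names exactly and `re:`-prefixed entries are treated as regexes. A small sketch of those semantics (the real matching lives in llm-compressor/compressed-tensors; this just illustrates which modules the two patterns cover):

```python
import re

IGNORE = ["lm_head", "re:model.visual.*"]

def is_ignored(module_name, patterns=IGNORE):
    """Sketch of the ignore-list semantics: entries prefixed with 're:'
    are regexes matched against the module name; others match exactly."""
    for p in patterns:
        if p.startswith("re:"):
            if re.match(p[3:], module_name):
                return True
        elif p == module_name:
            return True
    return False

# Every module under model.visual (the whole vision tower) is skipped,
# while language-model projections like model.layers.*.self_attn.q_proj
# fall through to the INT4 config group.
```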
## Calibration
| Parameter | Value |
|---|---|
| Dataset | allenai/olmOCR-mix-1025, 00_documents config, train split |
| Samples | 2000 (randomly shuffled, seed 42) |
| Max sequence length | 8192 tokens |
| Input modality | Text-only (document natural_text wrapped in model chat template) |
Each calibration sample consists of a document page's extracted text wrapped in the standard olmOCR instruction prompt ("Convert this document to markdown:\n\n{text}"), tokenized with the model's processor. Images were not used during calibration — the visual encoder is excluded from quantization so image-side statistics are not needed.
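The sample-construction step above can be sketched in a few lines. This is a simplified stand-in (the `build_calibration_texts` helper is hypothetical): the real pipeline applies the model's chat template via its processor and truncates each sample to 8192 tokens, but the shuffle-seed and prompt-wrapping logic look like this:

```python
import random

PROMPT = "Convert this document to markdown:\n\n{text}"

def build_calibration_texts(documents, num_samples=2000, seed=42):
    """Shuffle document texts with a fixed seed and wrap each in the
    olmOCR instruction prompt. In the real run these strings are then
    tokenized by the model's processor and truncated to 8192 tokens."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # seed 42, matching the table above
    return [PROMPT.format(text=d) for d in docs[:num_samples]]

samples = build_calibration_texts(
    ["Page one text.", "Page two text."], num_samples=2
)
```

Because only the language-model backbone is calibrated, plain text is sufficient: no image tensors ever reach the quantizer.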
## Software
| Package | Version |
|---|---|
| llm-compressor | 0.10.1.dev48+g464d0004.d20260405 (editable dev install) |
| compressed-tensors | 0.14.1a20260326 |
| torch | 2.10.0+cu130 |
| Python | 3.13 |
| Hardware | NVIDIA GB10 (SM121), 128 GB unified memory |
## Reproduction
```shell
git clone https://github.com/charitarthchugh/olmocr-nvfp4
cd olmocr-nvfp4
uv sync --extra gpu

python quantize.py \
  --recipe quantization_configs/qwen2_5vl_gptq_w4a16.yaml \
  --output models/gptq_w4a16_2k \
  --num-samples 2000 \
  --max-seq-length 8192
```
The recipe (quantization_configs/qwen2_5vl_gptq_w4a16.yaml):
```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore:
        - "lm_head"
        - "re:model.visual.*"
      actorder: "group"
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: ["Linear"]
```
## Intended use
Drop-in replacement for allenai/olmOCR-2-7B-1025 in vLLM-based OCR pipelines. Requires vLLM ≥ 0.18.0 with compressed-tensors support. vLLM automatically selects the Marlin W4A16 GPTQ kernel for fused dequantize-and-multiply on Ampere (SM80) and newer GPUs.
```python
from vllm import LLM

llm = LLM(
    model="Charitarth/olmOCR-2-7B-1025-GPTQ-W4A16",
    gpu_memory_utilization=0.4,
    max_model_len=16384,
)
```