# olmOCR-2-7B-1025-GPTQ-W4A16

INT4 weight-only quantization of allenai/olmOCR-2-7B-1025 using GPTQ with group activation ordering.

## Benchmark results

Evaluated on olmOCR-bench sample data (23 pages) with vLLM 0.18.0:

| Model | Aggregate | absent | baseline | math | order | present | table |
|---|---|---|---|---|---|---|---|
| This model (GPTQ W4A16) | 0.949 | 0.875 | 1.000 | 0.833 | 0.923 | 0.733 | 0.950 |
| Base BF16 | 0.909 | 0.875 | 0.952 | 0.750 | 1.000 | 0.700 | 0.650 |

The quantized model exceeds the BF16 baseline by 4.0 points (+4.4% relative) on this small benchmark sample, driven primarily by stronger table and math scores.

Inference time for 23 pages at concurrency 4: 144s vs 201s for the official FP8 model (with cuBLAS fallback).

## Quantization details

| Parameter | Value |
|---|---|
| Method | GPTQ |
| Weight dtype | INT4 |
| Activation dtype | BF16 (weight-only, no activation quantization) |
| Group size | 128 |
| Symmetric | Yes |
| Activation ordering | Group (`actorder="group"`) |
| Format | `pack-quantized` (compressed-tensors) |
| Excluded layers | `lm_head`, `re:model.visual.*` (entire visual encoder kept in BF16) |
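With group size 128 and symmetric INT4 weights, the effective storage cost per weight is easy to estimate: 4 bits for the packed value plus one shared scale per 128 weights. The sketch below assumes scales are stored in a 16-bit format (common for compressed-tensors checkpoints, though not stated in the table above):

```python
def w4a16_bits_per_weight(weight_bits=4, scale_bits=16, group_size=128):
    """Effective bits per weight for grouped weight-only quantization:
    packed INT4 values plus one shared scale per group (symmetric
    quantization, so no zero-points are stored)."""
    return weight_bits + scale_bits / group_size

bpw = w4a16_bits_per_weight()
print(bpw)  # 4.125 effective bits, vs. 16 for BF16 (~3.9x smaller)
```

This is why the excluded visual encoder costs little: the bulk of the savings comes from the much larger language-model backbone.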

### Activation ordering

`actorder="group"` reorders weight columns by descending Hessian diagonal magnitude within each group before packing. This allocates quantization precision to the most sensitive weight dimensions and generally improves perplexity over naive column ordering, at negligible extra calibration cost.
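The reordering step can be sketched as follows. This is an illustration of the idea as described above, not the llm-compressor implementation:

```python
def group_activation_order(hessian_diag, group_size):
    """Build a column permutation that sorts each group of columns by
    descending Hessian diagonal, so the most sensitive weight dimensions
    are quantized first within their group."""
    n = len(hessian_diag)
    perm = []
    for start in range(0, n, group_size):
        block = list(range(start, min(start + group_size, n)))
        # Largest Hessian diagonal (most sensitive) first within the group.
        block.sort(key=lambda j: -hessian_diag[j])
        perm.extend(block)
    return perm

diag = [0.1, 3.0, 0.5, 2.0, 4.0, 0.2]  # toy per-column Hessian diagonal
perm = group_activation_order(diag, group_size=3)
print(perm)  # [1, 2, 0, 4, 3, 5]
```

Because the permutation is local to each group, the packed layout stays group-aligned and dequantization kernels only need a per-group index map.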

### Visual encoder exclusion

The Qwen2.5-VL visual encoder (`model.visual.*`) is excluded from quantization and kept in BF16. INT4 quantization of vision transformer weights causes significant degradation on image inputs, while the memory savings are minimal (the encoder is a small fraction of total parameters). Only the language model backbone is quantized.
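The exclusion is driven by the `ignore` list in the recipe below. A hypothetical helper mirroring its apparent semantics (literal names, plus `re:`-prefixed regex patterns; this is an assumption based on the recipe, not llm-compressor's exact matching code):

```python
import re

IGNORE = ["lm_head", "re:model.visual.*"]

def is_ignored(module_name, patterns=IGNORE):
    """Return True if a module name matches a literal ignore entry
    or a 're:'-prefixed regular expression."""
    for p in patterns:
        if p.startswith("re:"):
            if re.match(p[3:], module_name):
                return True
        elif module_name == p:
            return True
    return False

# The whole visual encoder stays in BF16; backbone Linears are quantized.
print(is_ignored("model.visual.blocks.0.attn.qkv"))   # True
print(is_ignored("model.layers.0.mlp.gate_proj"))     # False
```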

## Calibration

| Parameter | Value |
|---|---|
| Dataset | `allenai/olmOCR-mix-1025`, `00_documents` config, train split |
| Samples | 2000 (randomly shuffled, seed 42) |
| Max sequence length | 8192 tokens |
| Input modality | Text-only (document `natural_text` wrapped in model chat template) |

Each calibration sample consists of a document page's extracted text wrapped in the standard olmOCR instruction prompt ("Convert this document to markdown:\n\n{text}"), tokenized with the model's processor. Images were not used during calibration: the visual encoder is excluded from quantization, so image-side statistics are not needed.
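Assembling one such sample looks roughly like this. The prompt wording comes from this card; the final tokenization step (shown in comments) assumes the standard Hugging Face processor API:

```python
PROMPT = "Convert this document to markdown:\n\n{text}"

def build_calibration_messages(natural_text):
    """Wrap a page's extracted text in the olmOCR instruction prompt,
    as a chat message list ready for apply_chat_template."""
    return [{"role": "user", "content": PROMPT.format(text=natural_text)}]

messages = build_calibration_messages("Page text extracted from a PDF page")
# With the model's processor (not run here):
#   chat = processor.apply_chat_template(messages, tokenize=False,
#                                        add_generation_prompt=True)
#   sample = processor(text=chat, truncation=True, max_length=8192)
```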

## Software

| Component | Version / details |
|---|---|
| llm-compressor | 0.10.1.dev48+g464d0004.d20260405 (editable dev install) |
| compressed-tensors | 0.14.1a20260326 |
| torch | 2.10.0+cu130 |
| Python | 3.13 |
| Hardware | NVIDIA GB10 (SM121), 128 GB unified memory |

## Reproduction

```shell
git clone https://github.com/charitarthchugh/olmocr-nvfp4
cd olmocr-nvfp4
uv sync --extra gpu

python quantize.py \
  --recipe quantization_configs/qwen2_5vl_gptq_w4a16.yaml \
  --output models/gptq_w4a16_2k \
  --num-samples 2000 \
  --max-seq-length 8192
```

The recipe (`quantization_configs/qwen2_5vl_gptq_w4a16.yaml`):

```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore:
        - "lm_head"
        - "re:model.visual.*"
      actorder: "group"
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: ["Linear"]
```
## Intended use

Drop-in replacement for allenai/olmOCR-2-7B-1025 in vLLM-based OCR pipelines. Requires vLLM ≥ 0.18.0 with compressed-tensors support. The Marlin W4A16 GPTQ kernel is selected automatically for fast fused dequantize-and-matmul on GPUs with compute capability 8.0+ (SM80+).

```python
from vllm import LLM

llm = LLM(
    model="Charitarth/olmOCR-2-7B-1025-GPTQ-W4A16",
    gpu_memory_utilization=0.4,
    max_model_len=16384,
)
```