metadata
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/*.jsonl.gz
tags:
  - arxiv
  - ocr
  - chandra
  - chandra-ocr-2
  - markdown
  - html
  - hf-jobs
  - uv-script

arXiv OCR with Chandra OCR 2

This output bundle stores OCR results for arXiv PDFs, produced with datalab-to/chandra-ocr-2.

Summary

  • Output bucket: hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11
  • Source paper IDs in input list: 27,584
  • Processed IDs recorded in state/processed_ids.txt: 1,772
  • Successes: 1,768
  • Partial successes: 0
  • Errors: 4
  • Next shard index: 178
  • Updated at: 2026-04-04T06:50:37.863820+00:00
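The counters above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that hard-codes the numbers from this summary; the variable names are illustrative and are not claimed to match the keys in state/summary.json.

```python
# Counters copied from the summary above (variable names are illustrative).
total_input_ids = 27_584
processed = 1_772
successes = 1_768
partial_successes = 0
errors = 4

# Every processed ID should land in exactly one outcome bucket.
outcomes_match = successes + partial_successes + errors == processed

# Progress through the input list and how many IDs are still pending.
fraction_done = processed / total_input_ids
remaining = total_input_ids - processed

print(f"outcomes consistent: {outcomes_match}")
print(f"progress: {fraction_done:.2%}, remaining IDs: {remaining:,}")
```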

Files

  • data/part-*.jsonl.gz: OCR result shards, one JSON object per paper
  • state/processed_ids.txt: processed paper IDs, used to resume an interrupted run
  • state/summary.json: aggregate counters and bookkeeping
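The shards and the resume file can also be read with the standard library alone. The sketch below assumes that each shard is gzip-compressed JSON Lines (as the file extension suggests) and that state/processed_ids.txt holds one paper ID per line; the second point is an assumption, not something this card states. The helper names and the sample IDs are hypothetical.

```python
import gzip
import io
import json
from typing import Iterator, List


def iter_shard(source) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSONL file.

    `source` may be a path or an open binary file object.
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pending_ids(input_ids: List[str], processed_text: str) -> List[str]:
    """Return input IDs that do not appear in the processed-IDs text.

    Assumes one ID per line (an assumption about the file layout).
    """
    processed = {ln.strip() for ln in processed_text.splitlines() if ln.strip()}
    return [pid for pid in input_ids if pid not in processed]


# Demo with an in-memory shard and made-up paper IDs.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as out:
    out.write(json.dumps({"paper_id": "0000.00001", "markdown": "# A"}) + "\n")
    out.write(json.dumps({"paper_id": "0000.00002", "markdown": "# B"}) + "\n")
buffer.seek(0)

records = list(iter_shard(buffer))
todo = pending_ids(
    ["0000.00001", "0000.00002", "0000.00003"],
    "0000.00001\n0000.00002\n",
)
```

For a real shard, pass the local path of a downloaded data/part-*.jsonl.gz file to `iter_shard` instead of the in-memory buffer.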

Besides paper_id and markdown, each paper record includes:

  • num_pages: total number of pages in the source PDF
  • num_pages_processed: number of pages actually sent to OCR
  • pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
  • max_pages_per_paper: configured OCR page cap for the run
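These fields make it possible to tell which papers were only partially OCR'd because of the page cap. A minimal sketch, assuming pdf_exceeds_page_limit is a boolean and the page counts are integers; the sample records and helper names are made up for illustration.

```python
from typing import Dict, List

# Hypothetical records using only the documented fields plus paper_id.
records: List[Dict] = [
    {"paper_id": "0000.00001", "num_pages": 12, "num_pages_processed": 12,
     "pdf_exceeds_page_limit": False, "max_pages_per_paper": 30},
    {"paper_id": "0000.00002", "num_pages": 45, "num_pages_processed": 30,
     "pdf_exceeds_page_limit": True, "max_pages_per_paper": 30},
]


def truncated_ids(rows: List[Dict]) -> List[str]:
    """IDs of papers whose PDF had more pages than the OCR cap."""
    return [row["paper_id"] for row in rows if row["pdf_exceeds_page_limit"]]


def page_coverage(row: Dict) -> float:
    """Fraction of the source PDF's pages that were sent to OCR."""
    if row["num_pages"] == 0:
        return 0.0
    return row["num_pages_processed"] / row["num_pages"]


for row in records:
    print(row["paper_id"], f"{page_coverage(row):.0%}")
```

Downstream users who need complete documents may want to drop or flag the IDs returned by `truncated_ids`.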

Load the results

from datasets import load_dataset

dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["paper_id"])
print(dataset[0]["markdown"][:1000])

Job config

  • Prompt type: ocr_layout
  • Page batch size: 28
  • Max output tokens: 12384
  • Max model length: 18000
  • GPU memory utilization: 0.85
  • Minimum arXiv request interval: 3.1 seconds
  • Max pages per paper sent to OCR: 30
  • Bucket backend: hf-cli
  • Paginate output: False
  • Include headers/footers: False
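The minimum arXiv request interval puts a floor on how quickly PDFs can be fetched, independent of OCR speed. The sketch below estimates that floor for the full input list, assuming one arXiv request per paper issued sequentially; both are assumptions about the script and are not confirmed by this card.

```python
# Values taken from this card.
min_request_interval_s = 3.1
total_input_ids = 27_584

# Assumption: one sequential arXiv request per paper, no retries.
min_fetch_seconds = total_input_ids * min_request_interval_s
min_fetch_hours = min_fetch_seconds / 3600

print(f"lower bound on fetch time: {min_fetch_hours:.2f} hours")
```

Retries, additional requests per paper, or OCR time would all push the real wall-clock time above this figure.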

Reproduction

hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11 \
  --output-bucket hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11 \
  --paper-ids-url https://.../hf_missing_paper_ids.txt
