metadata
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.jsonl.gz
tags:
- arxiv
- ocr
- chandra
- chandra-ocr-2
- markdown
- html
- hf-jobs
- uv-script
arXiv OCR with Chandra OCR 2
This output bundle stores OCR results for arXiv PDFs using datalab-to/chandra-ocr-2.
Summary
- Output bucket: hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11
- Source paper IDs in input list: 27,584
- Processed IDs recorded in state/processed_ids.txt: 1,772
- Successes: 1,768
- Partial successes: 0
- Errors: 4
- Next shard index: 178
- Updated at: 2026-04-04T06:50:37.863820+00:00
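The counters above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that takes the numbers from this summary as plain arguments; the function name and the returned keys are hypothetical and are not claimed to match the layout of state/summary.json.

```python
def summarize_progress(total_ids, successes, partial_successes, errors):
    """Derive processed, remaining, and fraction-done counts from run counters."""
    processed = successes + partial_successes + errors
    return {
        "processed": processed,
        "remaining": total_ids - processed,
        "fraction_done": processed / total_ids,
    }


# Values copied from the summary above.
progress = summarize_progress(
    total_ids=27_584, successes=1_768, partial_successes=0, errors=4
)
print(progress["processed"], progress["remaining"])
print(f"{progress['fraction_done']:.1%}")
```

With these inputs the processed count matches the 1,772 IDs reported for state/processed_ids.txt.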
Files
- data/part-*.jsonl.gz: OCR result shards, one JSON object per paper
- state/processed_ids.txt: completed paper IDs used for resume
- state/summary.json: aggregate counters and bookkeeping
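A shard can also be read without the datasets library, since each one is gzip-compressed JSON Lines. The sketch below assumes one JSON object per line; the helper name iter_records, the shard file name, and the sample record are hypothetical stand-ins so the example runs without the real data.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_records(shard_path):
    """Yield one dict per non-empty line of a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Build a tiny stand-in shard so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "part-00000.jsonl.gz"  # hypothetical shard name
    sample = {"paper_id": "0000.00000", "markdown": "# Title"}  # made-up record
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        handle.write(json.dumps(sample) + "\n")
    records = list(iter_records(shard))

print(records[0]["paper_id"])
```

Against a downloaded copy of the bundle, the same helper could be pointed at any local data/part-*.jsonl.gz file.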
Each paper record includes:
- num_pages: total number of pages in the source PDF
- num_pages_processed: number of pages actually sent to OCR
- pdf_exceeds_page_limit: whether the PDF had more pages than the configured OCR cap
- max_pages_per_paper: configured OCR page cap for the run
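The relationship between these four fields can be illustrated as follows. This is a sketch under an assumption, not the job script's actual logic: it supposes that a capped paper has exactly its first max_pages_per_paper pages sent to OCR, so num_pages_processed is the smaller of the two values. The function name is hypothetical.

```python
def page_cap_fields(num_pages, max_pages_per_paper):
    """Return the page-cap fields for one paper, assuming a simple min() cap."""
    return {
        "num_pages": num_pages,
        # Assumption: pages beyond the cap are skipped, nothing else is dropped.
        "num_pages_processed": min(num_pages, max_pages_per_paper),
        "pdf_exceeds_page_limit": num_pages > max_pages_per_paper,
        "max_pages_per_paper": max_pages_per_paper,
    }


# A 45-page PDF under a cap of 30 pages, and a 12-page PDF under the same cap.
long_paper = page_cap_fields(num_pages=45, max_pages_per_paper=30)
short_paper = page_cap_fields(num_pages=12, max_pages_per_paper=30)
print(long_paper["num_pages_processed"], long_paper["pdf_exceeds_page_limit"])
print(short_paper["num_pages_processed"], short_paper["pdf_exceeds_page_limit"])
```

Filtering on pdf_exceeds_page_limit is a quick way to find records whose markdown covers only part of the source PDF.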
Load the results
from datasets import load_dataset
dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["paper_id"])
print(dataset[0]["markdown"][:1000])
Job config
- Prompt type: ocr_layout
- Page batch size: 28
- Max output tokens: 12384
- Max model length: 18000
- GPU memory utilization: 0.85
- Minimum arXiv request interval: 3.1 seconds
- Max pages per paper sent to OCR: 30
- Bucket backend: hf-cli
- Paginate output: False
- Include headers/footers: False
Reproduction
hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
-s HF_TOKEN --timeout 2d \
./chandra2-arxiv-ocr.py --output-dataset hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11 \
--output-bucket hf://buckets/nielsr/arxiv-chandra-ocr-full-20260402-l40sx1-s11 \
--paper-ids-url https://.../hf_missing_paper_ids.txt