diff --git a/experiments/olmocr-bench-oldscans/BENCHMARKING.md b/experiments/olmocr-bench-oldscans/BENCHMARKING.md
new file mode 100644
index 0000000..e837fd6
--- /dev/null
+++ b/experiments/olmocr-bench-oldscans/BENCHMARKING.md
@@ -0,0 +1,93 @@
+# olmOCR-bench old_scans: multi-model comparison
+
+Extends the PaddleOCR-VL experiment ([README.md](README.md)) to two more open
+document models, scored through the same harness, for my own benchmarking. Same
+subset (`old_scans`, 98 Library-of-Congress scans), same scorer (stock
+`olmocr.bench.benchmark`), greedy decoding, no tuning. Run 2026-06-27.
+
+## Results
+
+| Model | params | **old_scans** | present | absent | order | baseline |
+|---|---|---|---|---|---|---|
+| PaddleOCR-VL v1.6 | 0.9B | **38.6** | 31.2 | 95.7 | 27.7 | 84.7 |
+| PaddleOCR-VL v1 | 0.9B | **38.2** | 32.3 | 95.7 | 24.9 | 88.8 |
+| NuExtract3 | 4.5B | **37.8** | **41.6** | 41.4 | 30.5 | **100.0** |
+| Unlimited-OCR | 3.3B | **30.6** | 29.0 | 50.0 | 25.4 | 89.8 |
+
+`old_scans` = the `present`, `absent` and `order` tests pooled into one pass rate (the leaderboard's "Old scans" column).
+
+## ⚠️ Read the sub-scores, not just the headline
+
+The single `old_scans` number **conflates two different abilities**, and is
+misleading on its own:
+
+- **`present`**: did the model transcribe the body text? *(transcription quality)*
+- **`absent`**: did it *exclude* the boilerplate the bench wants dropped:
+  letterheads (`LUCIEN BECKNER`, `TELEPHONE 478`), archival stamps (`ack 5/27/14`),
+  page numbers (`31`, `2590`)? *(boilerplate exclusion)*
+
+These pull in opposite directions **by architecture**:
+
+- **NuExtract3 is the best *transcriber***: `present` 41.6, well above PaddleOCR-VL's
+  31.2, and it never hallucinates CJK (`baseline` 100%). It scores low on `absent`
+  (41.4) only because markdown-mode transcribes the letterhead/stamps. Inspected:
+  those appear as **unmarked plain body text** at the top of the page, *not* in
+  markup/HTML you could filter on, so the `absent` failures are real, not a
+  formatting artifact you could strip away.
+- **PaddleOCR-VL is the best *boilerplate-excluder***: `absent` 95.7, because its
+  layout pipeline drops running headers/footers. But it reads less of the body
+  (`present` 31.2) and hallucinates CJK glyphs on the hardest scans (`baseline`
+  84.7).
+
+So **NuExtract "losing" on `old_scans` (37.8 vs 38.6) is not a transcription
+deficit**: it reads *better*; it just doesn't do boilerplate exclusion, which a
+layout pipeline gets for free. Pick by use case:
+
+- want the most faithful text → **NuExtract3** (`present` 41.6).
+- want clean body-only markdown without boilerplate → **PaddleOCR-VL** (`absent` 95.7).
+
+*(Deliberately not done: a heuristic "drop the leading letterhead + standalone page
+numbers" pass that would lift NuExtract's `absent`. That's gaming the very
+boilerplate-exclusion criterion the bench is testing the model to do natively, so
+it's left out; the honest move is to report the sub-scores.)*
+
+## Why it matters
+
+olmOCR-bench encodes olmOCR's own goal: clean, linearised **reading-order body
+text** for LLM training/RAG, where headers, letterheads and page numbers are noise
+to drop. So the headline rewards leaving them out, which is reasonable if that's what you
+want. For **faithful OCR of the whole page** (archives, where the letterhead /
+stamp / marginal note *is* the record), the same number ranks the model you'd
+prefer *lower*. A benchmark score measures fitness for one purpose; check it's
+yours before trusting the ranking.
+
+## Fairness / processing notes
+
+- **DPI**: each model's recommended PDF→PNG render DPI: NuExtract **170**,
+  Unlimited-OCR **300**; paddle rasterizes internally. Different DPI is a confound;
+  using each model's own default is the per-model-fair choice (footnoted here).
+- **Unlimited-OCR** emits DeepSeek-OCR-style `<|det|>category [bbox]<|/det|>`
+  grounding before each span; we **strip it** to recover plain text comparable to
+  the others' clean markdown.
+- **NuExtract3**: `mode="markdown"`, **non-thinking + greedy**, the
+  decoding-comparable setting (thinking mode is temperature 0.6, non-deterministic).
+- **Decoding**: greedy for all four.
+- These two models take **images**, so the convert scripts render PDF→PNG (the
+  paddle pipeline rasterizes internally). They run as `hf jobs uv run` (root), so
+  they write the bucket mount **directly**, with no `sync_bucket`, because
+  `transformers==4.57.1` pins an older `huggingface_hub` that lacks it.
+- **Not size-matched**: NuExtract (4.5B) and Unlimited-OCR (3.3B) are roughly 5× and 3.7× larger
+  than PaddleOCR-VL (0.9B). Fine for "what's the best number," but note it.
+
+## Run
+
+```bash
+B=hf://buckets/davanstrien/paddleocr-vl16-oldscans
+# convert (each writes its own candidate folder; LIMIT=3 for a smoke)
+hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket unlimited_ocr.py
+hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket nuextract3.py
+# score every candidate folder in the bucket together
+hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $B:/bucket:ro score.py
+```
+
+`NUM_SHARDS`/`SHARD` on either convert for data-parallel fan-out (default 1).
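The `NUM_SHARDS`/`SHARD` fan-out can be pictured as a strided slice over the sorted PDF list, which is what both convert scripts do with `items[SHARD::NUM_SHARDS]`, followed by the optional `LIMIT` cap. The sketch below is a minimal standalone illustration of that selection; `select_shard` and the `scan_XX.pdf` names are hypothetical and do not appear in the scripts.

```python
def select_shard(items, num_shards=1, shard=0, limit=0):
    """Return the part of `items` that one job would process.

    Mirrors the selection in the convert scripts: sort, take every
    `num_shards`-th item starting at `shard`, then apply the optional cap.
    """
    items = sorted(items)
    if num_shards > 1:
        items = items[shard::num_shards]
    if limit:
        items = items[:limit]
    return items


# Hypothetical file names, used only to show how the work is split.
pdfs = [f"scan_{i:02d}.pdf" for i in range(10)]
shards = [select_shard(pdfs, num_shards=3, shard=s) for s in range(3)]
for s, part in enumerate(shards):
    print(s, part)
```

Because the slice is strided, every PDF lands in exactly one shard and shard sizes differ by at most one, so the jobs need no coordination beyond agreeing on `NUM_SHARDS`.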
diff --git a/experiments/olmocr-bench-oldscans/nuextract3.py b/experiments/olmocr-bench-oldscans/nuextract3.py
new file mode 100644
index 0000000..9db2f80
--- /dev/null
+++ b/experiments/olmocr-bench-oldscans/nuextract3.py
@@ -0,0 +1,156 @@
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "transformers",
+#     "torch",
+#     "torchvision",
+#     "accelerate",
+#     "pillow",
+#     "pymupdf",
+#     "einops",
+#     "huggingface-hub>=0.34",
+# ]
+# ///
+"""
+Convert the olmOCR-bench `old_scans` subset with numind/NuExtract3 (markdown mode)
+and write candidate markdown in the layout `olmocr.bench.benchmark` expects -- the
+same convention as convert.py / unlimited_ocr.py, so the score job ranks them all
+together.
+
+NuExtract3 is extraction-first but has a native image->Markdown mode; we use
+`mode="markdown"`, **non-thinking + greedy** (`enable_thinking=False`,
+`do_sample=False`) -- the decoding-comparable setting vs the other models (thinking
+mode would be temperature 0.6, non-deterministic). It takes IMAGES, so we render
+each PDF page to PNG at the card's recommended DPI (170).
+
+Runs as a uv script (`hf jobs uv run`), root, writing the bucket mount directly:
+
+    hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket \
+        nuextract3.py
+
+Env:
+  CANDIDATE   output subfolder + label (default nuextract3)
+  DPI         PDF->PNG render DPI (default 170, the card's recommendation)
+  OUT_ROOT    where to write (default /bucket, i.e. the bucket mount)
+  NUM_SHARDS  data-parallel fan-out (default 1); SHARD = 0..N-1
+  SHARD       which shard this job handles (default 0)
+  LIMIT       cap number of PDFs (plumbing smoke test; 0 = all)
+"""
+import json
+import os
+import shutil
+import tempfile
+from collections import defaultdict
+from pathlib import Path
+
+import fitz  # PyMuPDF
+from huggingface_hub import hf_hub_download
+
+BENCH_REPO = "allenai/olmOCR-bench"
+JSONL_PATH = "bench_data/old_scans.jsonl"
+MODEL_ID = "numind/NuExtract3"
+CANDIDATE = os.environ.get("CANDIDATE", "nuextract3")
+DPI = int(os.environ.get("DPI", "170"))
+OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/bucket"))
+NUM_SHARDS = int(os.environ.get("NUM_SHARDS", "1"))
+SHARD = int(os.environ.get("SHARD", "0"))
+LIMIT = int(os.environ.get("LIMIT", "0"))
+
+# ---- test manifest ----------------------------------------------------------
+jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
+tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()]
+
+pages_by_pdf = defaultdict(set)
+for t in tests:
+    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
+print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs candidate={CANDIDATE} dpi={DPI}", flush=True)
+
+# ---- model ------------------------------------------------------------------
+import torch  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoModelForImageTextToText, AutoProcessor  # noqa: E402
+
+print(f"cuda available: {torch.cuda.is_available()}", flush=True)
+processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = AutoModelForImageTextToText.from_pretrained(
+    MODEL_ID, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
+).eval()
+
+
+def resolve_pdf(pdf_field):
+    mounted = Path("/bucket/pdfs") / pdf_field
+    if mounted.is_file():
+        return str(mounted)
+    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
+        try:
+            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
+        except Exception:
+            continue
+        dest = OUT_ROOT / "pdfs" / pdf_field
+        dest.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(local, dest)
+        return local
+    raise FileNotFoundError(pdf_field)
+
+
+def render_page(pdf_path, page_num, png_path):
+    doc = fitz.open(pdf_path)
+    mat = fitz.Matrix(DPI / 72, DPI / 72)
+    doc[page_num - 1].get_pixmap(matrix=mat).save(png_path)
+    doc.close()
+
+
+def infer_markdown(image_path):
+    """NuExtract3 image->Markdown, non-thinking + greedy."""
+    image = Image.open(image_path).convert("RGB")
+    messages = [{"role": "user", "content": [{"type": "image", "image": image}]}]
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        mode="markdown",
+        enable_thinking=False,
+    ).to(model.device)
+    with torch.inference_mode():
+        gen = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
+    gen = gen[:, inputs.input_ids.shape[1]:]
+    return processor.batch_decode(
+        gen, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )[0].strip()
+
+
+# ---- convert ----------------------------------------------------------------
+cand_dir = OUT_ROOT / CANDIDATE
+items = sorted(pages_by_pdf.items())
+if NUM_SHARDS > 1:
+    items = items[SHARD::NUM_SHARDS]
+    print(f"shard {SHARD}/{NUM_SHARDS}: {len(items)} PDFs", flush=True)
+if LIMIT:
+    items = items[:LIMIT]
+    print(f"LIMIT={LIMIT} (plumbing smoke test)", flush=True)
+
+for i, (pdf_field, pages) in enumerate(items, 1):
+    try:
+        pdf_path = resolve_pdf(pdf_field)
+        with tempfile.TemporaryDirectory() as td:
+            mds = {}
+            for pg in pages:
+                png = Path(td) / f"pg{pg}.png"
+                render_page(pdf_path, pg, str(png))
+                mds[pg] = infer_markdown(png)
+    except Exception as e:
+        print(f"[WARN] {pdf_field}: {e}", flush=True)
+        mds = {pg: "" for pg in pages}
+    md_base = os.path.splitext(pdf_field)[0]
+    for pg in pages:
+        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
+        fp.parent.mkdir(parents=True, exist_ok=True)
+        fp.write_text(mds.get(pg, ""))
+    n = len(next(iter(mds.values()), "")) if mds else 0
+    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True)
+
+(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text())
+print(f"Done. Wrote {CANDIDATE} candidate to {OUT_ROOT}", flush=True)
diff --git a/experiments/olmocr-bench-oldscans/unlimited_ocr.py b/experiments/olmocr-bench-oldscans/unlimited_ocr.py
new file mode 100644
index 0000000..6609215
--- /dev/null
+++ b/experiments/olmocr-bench-oldscans/unlimited_ocr.py
@@ -0,0 +1,176 @@
+# /// script
+# requires-python = ">=3.12,<3.13"
+# dependencies = [
+#     "torch==2.10.0",
+#     "torchvision==0.25.0",
+#     "transformers==4.57.1",
+#     "pillow==12.1.1",
+#     "einops==0.8.2",
+#     "addict==2.4.0",
+#     "easydict==1.13",
+#     "pymupdf==1.27.2.2",
+#     "matplotlib==3.10.8",
+#     "psutil==7.2.2",
+#     "huggingface-hub>=0.34",
+# ]
+# ///
+"""
+Convert the olmOCR-bench `old_scans` subset with baidu/Unlimited-OCR and write
+candidate markdown in the layout `olmocr.bench.benchmark` expects -- the same
+convention as convert.py (PaddleOCR-VL), so the score job ranks them together.
+
+Unlimited-OCR is a native document parser; it takes IMAGES, so we render each
+PDF page to PNG (PyMuPDF) at the model's recommended DPI (300) before inference.
+Decoding is deterministic per the card: temperature 0 + the DeepSeek-OCR-style
+no-repeat-ngram processor (no_repeat_ngram_size=35), single-image "gundam" config.
+
+Runs as a uv script (`hf jobs uv run`) -- transformers + the model live here, NOT
+olmocr (that is only in the score job). Pins transformers==4.57.1 per the card,
+which in turn pins an older huggingface_hub without `sync_bucket`; since the uv
+image runs as ROOT it can write the bucket FUSE mount directly, so we mount it
+read-write and write candidate files straight to /bucket (no sync needed).
+
+    hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket \
+        unlimited_ocr.py
+
+Env:
+  CANDIDATE   output subfolder + label (default unlimited_ocr)
+  DPI         PDF->PNG render DPI (default 300, the card's recommendation)
+  OUT_ROOT    where to write (default /bucket, i.e. the bucket mount)
+  NUM_SHARDS  data-parallel fan-out: run this many jobs with SHARD=0..N-1 (default 1)
+  SHARD       which shard this job handles (default 0)
+  LIMIT       cap number of PDFs (plumbing smoke test; 0 = all)
+"""
+import json
+import os
+import re
+import shutil
+import tempfile
+from collections import defaultdict
+from pathlib import Path
+
+import fitz  # PyMuPDF
+from huggingface_hub import hf_hub_download
+
+# Unlimited-OCR's "document parsing." prompt emits DeepSeek-OCR-style grounding:
+# `<|det|>category [x1,y1,x2,y2]<|/det|>` before each text span. Strip it to
+# recover the plain transcription, matching the other models' clean-text output.
+_DET = re.compile(r"<\|det\|>.*?<\|/det\|>", re.S)
+_SPECIAL = re.compile(r"<\|[^|]*\|>")
+
+
+def clean(text):
+    return _SPECIAL.sub("", _DET.sub("", text)).strip()
+
+BENCH_REPO = "allenai/olmOCR-bench"
+JSONL_PATH = "bench_data/old_scans.jsonl"
+MODEL_ID = "baidu/Unlimited-OCR"
+CANDIDATE = os.environ.get("CANDIDATE", "unlimited_ocr")
+DPI = int(os.environ.get("DPI", "300"))
+OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/bucket"))
+NUM_SHARDS = int(os.environ.get("NUM_SHARDS", "1"))
+SHARD = int(os.environ.get("SHARD", "0"))
+LIMIT = int(os.environ.get("LIMIT", "0"))
+
+# ---- test manifest ----------------------------------------------------------
+jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
+tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()]
+
+pages_by_pdf = defaultdict(set)
+for t in tests:
+    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
+print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs candidate={CANDIDATE} dpi={DPI}", flush=True)
+
+# ---- model ------------------------------------------------------------------
+import torch  # noqa: E402
+from transformers import AutoModel, AutoTokenizer  # noqa: E402
+
+print(f"cuda available: {torch.cuda.is_available()}", flush=True)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = AutoModel.from_pretrained(
+    MODEL_ID, trust_remote_code=True, use_safetensors=True, torch_dtype=torch.bfloat16
+).eval().cuda()
+
+
+def resolve_pdf(pdf_field):
+    """Reuse the PDF on the bucket mount if present; else download once and stage
+    it under OUT_ROOT/pdfs (written directly to the mount) for the scorer."""
+    mounted = Path("/bucket/pdfs") / pdf_field
+    if mounted.is_file():
+        return str(mounted)
+    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
+        try:
+            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
+        except Exception:
+            continue
+        dest = OUT_ROOT / "pdfs" / pdf_field
+        dest.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(local, dest)
+        return local
+    raise FileNotFoundError(pdf_field)
+
+
+def render_page(pdf_path, page_num, png_path):
+    doc = fitz.open(pdf_path)
+    mat = fitz.Matrix(DPI / 72, DPI / 72)
+    doc[page_num - 1].get_pixmap(matrix=mat).save(png_path)
+    doc.close()
+
+
+def infer_markdown(image_path, work):
+    """Unlimited-OCR document parsing. The transformers `model.infer` saves to
+    output_path; capture its return if it gives text, else read the saved file."""
+    out_dir = work / "out"
+    out_dir.mkdir(parents=True, exist_ok=True)
+    result = model.infer(
+        tokenizer,
+        prompt="document parsing.",
+        image_file=str(image_path),
+        output_path=str(out_dir),
+        base_size=1024, image_size=640, crop_mode=True,  # single-image "gundam"
+        max_length=32768,
+        no_repeat_ngram_size=35, ngram_window=128,
+        save_results=True,
+    )
+    if isinstance(result, str) and result.strip():
+        return clean(result), "return"
+    for f in sorted(out_dir.glob("**/*")):
+        if f.is_file() and f.suffix.lower() in (".mmd", ".md", ".txt"):
+            return clean(f.read_text()), f"file:{f.name}"
+    return "", "empty"
+
+
+# ---- convert ----------------------------------------------------------------
+cand_dir = OUT_ROOT / CANDIDATE
+items = sorted(pages_by_pdf.items())
+if NUM_SHARDS > 1:
+    items = items[SHARD::NUM_SHARDS]
+    print(f"shard {SHARD}/{NUM_SHARDS}: {len(items)} PDFs", flush=True)
+if LIMIT:
+    items = items[:LIMIT]
+    print(f"LIMIT={LIMIT} (plumbing smoke test)", flush=True)
+
+for i, (pdf_field, pages) in enumerate(items, 1):
+    try:
+        pdf_path = resolve_pdf(pdf_field)
+        with tempfile.TemporaryDirectory() as td:
+            work = Path(td)
+            mds = {}
+            for pg in pages:
+                png = work / f"pg{pg}.png"
+                render_page(pdf_path, pg, str(png))
+                mds[pg], src = infer_markdown(png, work)
+    except Exception as e:
+        print(f"[WARN] {pdf_field}: {e}", flush=True)
+        mds, src = {pg: "" for pg in pages}, "error"
+    md_base = os.path.splitext(pdf_field)[0]
+    for pg in pages:
+        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
+        fp.parent.mkdir(parents=True, exist_ok=True)
+        fp.write_text(mds.get(pg, ""))
+    n = len(next(iter(mds.values()), "")) if mds else 0
+    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars ({src})", flush=True)
+
+(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text())
+print(f"Done. Wrote {CANDIDATE} candidate to {OUT_ROOT}", flush=True)
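The grounding-tag stripping done by `clean()` in `unlimited_ocr.py` can be tried in isolation. The sketch below copies the two regular expressions from the script; the sample string is made up for illustration (its `category [x1,y1,x2,y2]` shape follows the script's own comment, and `<|end|>` is a hypothetical leftover token), so real model output may look different.

```python
import re

# Same patterns as in unlimited_ocr.py.
_DET = re.compile(r"<\|det\|>.*?<\|/det\|>", re.S)  # one grounding tag pair
_SPECIAL = re.compile(r"<\|[^|]*\|>")               # any other <|...|> token


def clean(text):
    """Drop grounding spans first, then any remaining special tokens."""
    return _SPECIAL.sub("", _DET.sub("", text)).strip()


# Made-up sample output, not captured from the model.
sample = (
    "<|det|>title [40,32,560,70]<|/det|>LIBRARY OF CONGRESS\n"
    "<|det|>text [40,90,560,400]<|/det|>Dear Sir, your letter is received.<|end|>"
)
print(clean(sample))
```

The non-greedy `.*?` keeps each `<|det|>` pair separate, so the text between two grounded spans survives; a greedy match would delete everything from the first opening tag to the last closing tag.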