Merged
93 changes: 93 additions & 0 deletions experiments/olmocr-bench-oldscans/BENCHMARKING.md
@@ -0,0 +1,93 @@
# olmOCR-bench old_scans — multi-model comparison

Extends the PaddleOCR-VL experiment ([README.md](README.md)) to two more open
document models, scored through the same harness — for my own benchmarking. Same
subset (`old_scans`, 98 Library-of-Congress scans), same scorer (stock
`olmocr.bench.benchmark`), greedy decoding, no tuning. Run 2026-06-27.

## Results

| Model | params | **old_scans** | present | absent | order | baseline |
|---|---|---|---|---|---|---|
| PaddleOCR-VL v1.6 | 0.9B | **38.6** | 31.2 | 95.7 | 27.7 | 84.7 |
| PaddleOCR-VL v1 | 0.9B | **38.2** | 32.3 | 95.7 | 24.9 | 88.8 |
| NuExtract3 | 4.5B | **37.8** | **41.6** | 41.4 | 30.5 | **100.0** |
| Unlimited-OCR | 3.3B | **30.6** | 29.0 | 50.0 | 25.4 | 89.8 |

`old_scans` aggregates the `present`, `absent` and `order` tests (the leaderboard's "Old scans" column); it is not the arithmetic sum of those three columns.

## ⚠️ Read the sub-scores, not just the headline

The single `old_scans` number **conflates two different abilities**, and is
misleading on its own:

- **`present`** — did the model transcribe the body text? *(transcription quality)*
- **`absent`** — did it *exclude* the boilerplate the bench wants dropped:
letterheads (`LUCIEN BECKNER`, `TELEPHONE 478`), archival stamps (`ack 5/27/14`),
page numbers (`31`, `2590`)? *(boilerplate exclusion)*

These pull in opposite directions **by architecture**:

- **NuExtract3 is the best *transcriber***: `present` 41.6, well above PaddleOCR-VL
v1.6's 31.2, and it never hallucinates CJK (`baseline` 100%). It scores low on
`absent` (41.4) only because markdown mode transcribes the letterhead and stamps.
On inspection, these appear as **unmarked plain body text** at the top of the
page, *not* inside `<figure>`/HTML markup, so the `absent` failures are real, not
a formatting artifact you could strip away.
- **PaddleOCR-VL is the best *boilerplate-excluder*** — `absent` 95.7, because its
layout pipeline drops running headers/footers. But it reads less of the body
(`present` 31.2) and hallucinates CJK glyphs on the hardest scans (`baseline`
84.7).

So **NuExtract "losing" on `old_scans` (37.8 vs 38.6) is not a transcription
deficit** — it reads *better*; it just doesn't do boilerplate exclusion, which a
layout pipeline gets for free. Pick by use case:

- want the most faithful text → **NuExtract3** (`present` 41.6).
- want clean body-only markdown without boilerplate → **PaddleOCR-VL** (`absent` 95.7).

*(Deliberately not done: a heuristic "drop the leading letterhead + standalone page
numbers" pass that would lift NuExtract's `absent`. That would game the very
criterion the bench tests, namely whether the model excludes boilerplate natively,
so it's left out. The honest move is to report the sub-scores.)*

## Why it matters

olmOCR-bench encodes olmOCR's own goal: clean, linearised **reading-order body
text** for LLM training/RAG, where headers, letterheads and page numbers are noise
to drop. So the headline rewards leaving them out — reasonable if that's what you
want. For **faithful OCR of the whole page** (archives, where the letterhead /
stamp / marginal note *is* the record), the same number ranks the model you'd
prefer *lower*. A benchmark score measures fitness for one purpose — check it's
yours before trusting the ranking.

## Fairness / processing notes

- **DPI**: PDF→PNG rendering uses each model's recommended DPI: NuExtract **170**,
Unlimited-OCR **300**; PaddleOCR-VL rasterizes internally. Differing DPI is a
confound; using each model's own default is the fairest per-model choice, and it
is noted here for that reason.
- **Unlimited-OCR** emits DeepSeek-OCR-style `<|det|>category [bbox]<|/det|>`
grounding before each span; we **strip it** to recover plain text comparable to
the others' clean markdown.
- **NuExtract3**: `mode="markdown"`, **non-thinking + greedy**, the setting whose
decoding is comparable to the other models (thinking mode samples at temperature
0.6, so it is non-deterministic).
- **Decoding**: greedy for all four.
- NuExtract3 and Unlimited-OCR take **images**, so their convert scripts render
PDF→PNG (the PaddleOCR-VL pipeline rasterizes internally). They run as
`hf jobs uv run` (root), so they write to the bucket mount **directly**, with no
`sync_bucket`, because `transformers==4.57.1` pins an older `huggingface_hub`
that lacks it.
- **Not size-matched**: NuExtract (4.5B) and Unlimited-OCR (3.3B) are 5× and
roughly 3.7× larger than PaddleOCR-VL (0.9B). Fine for "what's the best number,"
but keep it in mind when comparing.
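
The grounding-tag stripping mentioned for Unlimited-OCR could look something like the sketch below. This is not the code from `unlimited_ocr.py` (which is not shown here); the regex, the helper name `strip_grounding`, and the sample tag contents are assumptions based only on the `<|det|>category [bbox]<|/det|>` shape described above.

```python
import re

# Assumed tag shape: <|det|>category [x1, y1, x2, y2]<|/det|>
# Non-greedy match so each tag is removed on its own, even when
# several tags share a line.
DET_TAG = re.compile(r"<\|det\|>.*?<\|/det\|>", re.DOTALL)


def strip_grounding(text: str) -> str:
    """Remove grounding tags and tidy the whitespace they leave behind."""
    cleaned = DET_TAG.sub("", text)
    # Trim each line so a removed tag does not leave stray leading spaces.
    lines = [ln.strip() for ln in cleaned.splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    # Hypothetical model output, for illustration only.
    raw = (
        "<|det|>title [10, 12, 300, 40]<|/det|>Dear Sir\n"
        "<|det|>text [10, 50, 300, 90]<|/det|> Your letter was received."
    )
    print(strip_grounding(raw))
```

Stripping rather than parsing the tags is enough here because the scorer only compares plain text; the boxes would only matter if layout were being evaluated.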

## Run

```bash
B=hf://buckets/davanstrien/paddleocr-vl16-oldscans
# convert (each writes its own candidate folder; LIMIT=3 for a smoke test)
hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket unlimited_ocr.py
hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket nuextract3.py
# score every candidate folder in the bucket together
hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $B:/bucket:ro score.py
```

Set `NUM_SHARDS`/`SHARD` on either convert job for data-parallel fan-out (default: `NUM_SHARDS=1`, i.e. no sharding).
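
As a rough illustration of the fan-out, the convert script slices its sorted PDF list with a stride (`items[SHARD::NUM_SHARDS]`). The sketch below reproduces that slicing on hypothetical file names to show that the shards are disjoint and together cover every PDF; the helper name `shard_items` and the file names are made up for the example.

```python
def shard_items(items, num_shards, shard):
    """Return the strided slice of `items` that one shard handles."""
    if num_shards > 1:
        return items[shard::num_shards]
    return items


# Hypothetical PDF names, already sorted as in the convert script.
pdfs = [f"doc_{i:02d}.pdf" for i in range(10)]

# One list per job: SHARD=0, SHARD=1, SHARD=2 with NUM_SHARDS=3.
shards = [shard_items(pdfs, 3, s) for s in range(3)]

if __name__ == "__main__":
    for s, part in enumerate(shards):
        print(f"shard {s}: {part}")
```

Because the slice is strided rather than contiguous, shard sizes differ by at most one, and every job gets the same split provided all jobs see the same sorted list.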
156 changes: 156 additions & 0 deletions experiments/olmocr-bench-oldscans/nuextract3.py
@@ -0,0 +1,156 @@
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "transformers",
# "torch",
# "torchvision",
# "accelerate",
# "pillow",
# "pymupdf",
# "einops",
# "huggingface-hub>=0.34",
# ]
# ///
"""
Convert the olmOCR-bench `old_scans` subset with numind/NuExtract3 (markdown mode)
and write candidate markdown in the layout `olmocr.bench.benchmark` expects -- the
same convention as convert.py / unlimited_ocr.py, so the score job ranks them all
together.

NuExtract3 is extraction-first but has a native image->Markdown mode; we use
`mode="markdown"`, **non-thinking + greedy** (`enable_thinking=False`,
`do_sample=False`) -- the decoding-comparable setting vs the other models (thinking
mode would be temperature 0.6, non-deterministic). It takes IMAGES, so we render
each PDF page to PNG at the card's recommended DPI (170).

Runs as a uv script (`hf jobs uv run`), root, writing the bucket mount directly:

    hf jobs uv run --flavor l4x1 -s HF_TOKEN \
        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket \
        nuextract3.py

Env:
  CANDIDATE   output subfolder + label (default nuextract3)
  DPI         PDF->PNG render DPI (default 170, the card's recommendation)
  OUT_ROOT    where to write (default /bucket, i.e. the bucket mount)
  NUM_SHARDS  data-parallel fan-out (default 1); SHARD = 0..N-1
  SHARD       which shard this job handles (default 0)
  LIMIT       cap number of PDFs (plumbing smoke test; 0 = all)
"""
import json
import os
import shutil
import tempfile
from collections import defaultdict
from pathlib import Path

import fitz # PyMuPDF
from huggingface_hub import hf_hub_download

BENCH_REPO = "allenai/olmOCR-bench"
JSONL_PATH = "bench_data/old_scans.jsonl"
MODEL_ID = "numind/NuExtract3"
CANDIDATE = os.environ.get("CANDIDATE", "nuextract3")
DPI = int(os.environ.get("DPI", "170"))
OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/bucket"))
NUM_SHARDS = int(os.environ.get("NUM_SHARDS", "1"))
SHARD = int(os.environ.get("SHARD", "0"))
LIMIT = int(os.environ.get("LIMIT", "0"))

# ---- test manifest ----------------------------------------------------------
jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
tests = [json.loads(ln) for ln in Path(jsonl_local).read_text().splitlines() if ln.strip()]

pages_by_pdf = defaultdict(set)
for t in tests:
    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs candidate={CANDIDATE} dpi={DPI}", flush=True)

# ---- model ------------------------------------------------------------------
import torch # noqa: E402
from PIL import Image # noqa: E402
from transformers import AutoModelForImageTextToText, AutoProcessor # noqa: E402

print(f"cuda available: {torch.cuda.is_available()}", flush=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()


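def resolve_pdf(pdf_field):
    """Return a local path for a bench PDF, downloading and caching it if needed."""
    # Prefer a copy already cached on the bucket mount. This path is fixed at
    # /bucket/pdfs, so it matches the cache written below only when OUT_ROOT
    # is left at its default.
    mounted = Path("/bucket/pdfs") / pdf_field
    if mounted.is_file():
        return str(mounted)
    # Otherwise try the candidate locations inside the benchmark dataset repo.
    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
        try:
            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
        except Exception:
            continue
        # Cache the PDF under OUT_ROOT for later jobs.
        dest = OUT_ROOT / "pdfs" / pdf_field
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local, dest)
        return local
    raise FileNotFoundError(pdf_field)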


def render_page(pdf_path, page_num, png_path):
    doc = fitz.open(pdf_path)
    mat = fitz.Matrix(DPI / 72, DPI / 72)
    doc[page_num - 1].get_pixmap(matrix=mat).save(png_path)
    doc.close()


def infer_markdown(image_path):
    """NuExtract3 image->Markdown, non-thinking + greedy."""
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user", "content": [{"type": "image", "image": image}]}]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        mode="markdown",
        enable_thinking=False,
    ).to(model.device)
    with torch.inference_mode():
        gen = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
    gen = gen[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(
        gen, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0].strip()


# ---- convert ----------------------------------------------------------------
cand_dir = OUT_ROOT / CANDIDATE
items = sorted(pages_by_pdf.items())
if NUM_SHARDS > 1:
    items = items[SHARD::NUM_SHARDS]
    print(f"shard {SHARD}/{NUM_SHARDS}: {len(items)} PDFs", flush=True)
if LIMIT:
    items = items[:LIMIT]
    print(f"LIMIT={LIMIT} (plumbing smoke test)", flush=True)

for i, (pdf_field, pages) in enumerate(items, 1):
    try:
        pdf_path = resolve_pdf(pdf_field)
        with tempfile.TemporaryDirectory() as td:
            mds = {}
            for pg in pages:
                png = Path(td) / f"pg{pg}.png"
                render_page(pdf_path, pg, str(png))
                mds[pg] = infer_markdown(png)
    except Exception as e:
        print(f"[WARN] {pdf_field}: {e}", flush=True)
        mds = {pg: "" for pg in pages}
    md_base = os.path.splitext(pdf_field)[0]
    for pg in pages:
        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
        fp.parent.mkdir(parents=True, exist_ok=True)
        fp.write_text(mds.get(pg, ""))
    n = len(next(iter(mds.values()), "")) if mds else 0
    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True)

(OUT_ROOT / "old_scans.jsonl").write_text(Path(jsonl_local).read_text())
print(f"Done. Wrote {CANDIDATE} candidate to {OUT_ROOT}", flush=True)