davanstrien · Jun 29, 2026 · Jun 27, 2026
diff --git a/experiments/olmocr-bench-oldscans/BENCHMARKING.md b/experiments/olmocr-bench-oldscans/BENCHMARKING.md
@@ -0,0 +1,93 @@
+# olmOCR-bench old_scans — multi-model comparison
+
+Extends the PaddleOCR-VL experiment ([README.md](README.md)) to two more open
+document models, scored through the same harness — for my own benchmarking. Same
+subset (`old_scans`, 98 Library-of-Congress scans), same scorer (stock
+`olmocr.bench.benchmark`), greedy decoding, no tuning. Run 2026-06-27.
+
+## Results
+
+| Model | params | **old_scans** | present | absent | order | baseline |
+|---|---|---|---|---|---|---|
+| PaddleOCR-VL v1.6 | 0.9B | **38.6** | 31.2 | 95.7 | 27.7 | 84.7 |
+| PaddleOCR-VL v1 | 0.9B | **38.2** | 32.3 | 95.7 | 24.9 | 88.8 |
+| NuExtract3 | 4.5B | **37.8** | **41.6** | 41.4 | 30.5 | **100.0** |
+| Unlimited-OCR | 3.3B | **30.6** | 29.0 | 50.0 | 25.4 | 89.8 |
+
+`old_scans` = pass rate over the pooled present + absent + order tests, not a sum of those columns (the leaderboard's "Old scans" column).
+
+## ⚠️ Read the sub-scores, not just the headline
+
+The single `old_scans` number **conflates two different abilities**, and is
+misleading on its own:
+
+- **`present`** — did the model transcribe the body text? *(transcription quality)*
+- **`absent`** — did it *exclude* the boilerplate the bench wants dropped:
+  letterheads (`LUCIEN BECKNER`, `TELEPHONE 478`), archival stamps (`ack 5/27/14`),
+  page numbers (`31`, `2590`)? *(boilerplate exclusion)*
+
+These pull in opposite directions **by architecture**:
+
+- **NuExtract3 is the best *transcriber***: `present` 41.6, well above PaddleOCR-VL
+  v1.6's 31.2, and it never hallucinates CJK (`baseline` 100%). It scores low on
+  `absent` (41.4) only because markdown mode transcribes the letterhead and stamps.
+  On inspection, those appear as **unmarked plain body text** at the top of the
+  page, *not* inside `<figure>`/HTML you could filter on, so the `absent` failures
+  are real, not a formatting artifact you could strip away.
+- **PaddleOCR-VL is the best *boilerplate-excluder*** — `absent` 95.7, because its
+  layout pipeline drops running headers/footers. But it reads less of the body
+  (`present` 31.2) and hallucinates CJK glyphs on the hardest scans (`baseline`
+  84.7).
+
+So **NuExtract "losing" on `old_scans` (37.8 vs 38.6) is not a transcription
+deficit** — it reads *better*; it just doesn't do boilerplate exclusion, which a
+layout pipeline gets for free. Pick by use case:
+
+- want the most faithful text → **NuExtract3** (`present` 41.6).
+- want clean body-only markdown without boilerplate → **PaddleOCR-VL** (`absent` 95.7).
+
+*(Deliberately not done: a heuristic "drop the leading letterhead + standalone page
+numbers" pass that would lift NuExtract3's `absent`. That would game the very
+criterion the bench tests, namely whether the model excludes boilerplate natively,
+so it's left out; the honest move is to report the sub-scores.)*
+
+## Why it matters
+
+olmOCR-bench encodes olmOCR's own goal: clean, linearised **reading-order body
+text** for LLM training/RAG, where headers, letterheads and page numbers are noise
+to drop. So the headline rewards leaving them out — reasonable if that's what you
+want. For **faithful OCR of the whole page** (archives, where the letterhead /
+stamp / marginal note *is* the record), the same number ranks the model you'd
+prefer *lower*. A benchmark score measures fitness for one purpose — check it's
+yours before trusting the ranking.
+
+## Fairness / processing notes
+
+- **DPI**: each model's recommended PDF→PNG render DPI — NuExtract **170**,
+  Unlimited-OCR **300**; paddle rasterizes internally. Different DPI is a confound;
+  using each model's own default is the per-model-fair choice (footnoted here).
+- **Unlimited-OCR** emits DeepSeek-OCR-style `<|det|>category [bbox]<|/det|>`
+  grounding before each span; we **strip it** to recover plain text comparable to
+  the others' clean markdown.
+- **NuExtract3**: `mode="markdown"`, **non-thinking + greedy** — the decoding-
+  comparable setting (thinking mode is temperature 0.6, non-deterministic).
+- **Decoding**: greedy for all four.
+- NuExtract3 and Unlimited-OCR take **images**, so their convert scripts render
+  PDF→PNG (the paddle pipeline rasterizes internally). They run as `hf jobs uv run`
+  (root), so they write to the bucket mount **directly**, with no `sync_bucket`,
+  because `transformers==4.57.1` pins an older `huggingface_hub` that lacks it.
+- **Not size-matched**: NuExtract (4.5B) and Unlimited-OCR (3.3B) are 3–5× larger
+  than PaddleOCR-VL (0.9B). Fine for "what's the best number," but note it.
+
+## Run
+
+```bash
+B=hf://buckets/davanstrien/paddleocr-vl16-oldscans
+# convert (each writes its own candidate folder; LIMIT=3 for a smoke test)
+hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket unlimited_ocr.py
+hf jobs uv run --flavor l4x1 --timeout 1h -s HF_TOKEN -v $B:/bucket nuextract3.py
+# score every candidate folder in the bucket together
+hf jobs uv run --flavor cpu-upgrade -s HF_TOKEN -v $B:/bucket:ro score.py
+```
+
+Set `NUM_SHARDS`/`SHARD` on either convert job for data-parallel fan-out (default `NUM_SHARDS=1`, i.e. no sharding).
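The processing notes above say the Unlimited-OCR output has its `<|det|>category [bbox]<|/det|>` grounding stripped before scoring. `unlimited_ocr.py` is not shown here, so the following is only a minimal sketch of how such a pass could look. The function name `strip_grounding`, the exact bbox syntax in the sample, and the whitespace cleanup are assumptions, not the script's actual code.

```python
import re

# Assumed tag shape, taken from the README wording: <|det|>category [bbox]<|/det|>
DET_TAG = re.compile(r"<\|det\|>.*?<\|/det\|>", flags=re.DOTALL)


def strip_grounding(text: str) -> str:
    """Remove grounding tags and tidy the whitespace they leave behind."""
    without_tags = DET_TAG.sub("", text)
    lines = [ln.strip() for ln in without_tags.splitlines()]
    # Collapse runs of blank lines so paragraphs stay separated by one blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()


# Hypothetical model output; the category names and bbox numbers are made up.
sample = (
    "<|det|>title [10, 12, 480, 40]<|/det|>Annual Report\n"
    "<|det|>text [10, 60, 480, 300]<|/det|>The committee met in May."
)
print(strip_grounding(sample))
```

The non-greedy `.*?` keeps each match inside one tag pair, so text between two tags on the same line survives.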
diff --git a/experiments/olmocr-bench-oldscans/nuextract3.py b/experiments/olmocr-bench-oldscans/nuextract3.py
@@ -0,0 +1,156 @@
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#     "transformers",
+#     "torch",
+#     "torchvision",
+#     "accelerate",
+#     "pillow",
+#     "pymupdf",
+#     "einops",
+#     "huggingface-hub>=0.34",
+# ]
+# ///
+"""
+Convert the olmOCR-bench `old_scans` subset with numind/NuExtract3 (markdown mode)
+and write candidate markdown in the layout `olmocr.bench.benchmark` expects -- the
+same convention as convert.py / unlimited_ocr.py, so the score job ranks them all
+together.
+
+NuExtract3 is extraction-first but has a native image->Markdown mode; we use
+`mode="markdown"`, **non-thinking + greedy** (`enable_thinking=False`,
+`do_sample=False`) -- the decoding-comparable setting vs the other models (thinking
+mode would be temperature 0.6, non-deterministic). It takes IMAGES, so we render
+each PDF page to PNG at the card's recommended DPI (170).
+
+Runs as a uv script (`hf jobs uv run`), root, writing the bucket mount directly:
+
+    hf jobs uv run --flavor l4x1 -s HF_TOKEN \
+        -v hf://buckets/davanstrien/paddleocr-vl16-oldscans:/bucket \
+        nuextract3.py
+
+Env:
+  CANDIDATE    output subfolder + label (default nuextract3)
+  DPI          PDF->PNG render DPI (default 170, the card's recommendation)
+  OUT_ROOT     where to write (default /bucket, i.e. the bucket mount)
+  NUM_SHARDS   data-parallel fan-out (default 1); SHARD = 0..N-1
+  SHARD        which shard this job handles (default 0)
+  LIMIT        cap number of PDFs (plumbing smoke test; 0 = all)
+"""
+import json
+import os
+import shutil
+import tempfile
+from collections import defaultdict
+from pathlib import Path
+
+import fitz  # PyMuPDF
+from huggingface_hub import hf_hub_download
+
+BENCH_REPO = "allenai/olmOCR-bench"
+JSONL_PATH = "bench_data/old_scans.jsonl"
+MODEL_ID = "numind/NuExtract3"
+CANDIDATE = os.environ.get("CANDIDATE", "nuextract3")
+DPI = int(os.environ.get("DPI", "170"))
+OUT_ROOT = Path(os.environ.get("OUT_ROOT", "/bucket"))
+NUM_SHARDS = int(os.environ.get("NUM_SHARDS", "1"))
+SHARD = int(os.environ.get("SHARD", "0"))
+LIMIT = int(os.environ.get("LIMIT", "0"))
+
+# ---- test manifest ----------------------------------------------------------
+jsonl_local = hf_hub_download(BENCH_REPO, JSONL_PATH, repo_type="dataset")
+tests = [json.loads(ln) for ln in Path(jsonl_local).read_text(encoding="utf-8").splitlines() if ln.strip()]
+
+pages_by_pdf = defaultdict(set)
+for t in tests:
+    pages_by_pdf[t["pdf"]].add(int(t.get("page", 1)))
+print(f"{len(tests)} tests across {len(pages_by_pdf)} PDFs  candidate={CANDIDATE} dpi={DPI}", flush=True)
+
+# ---- model ------------------------------------------------------------------
+import torch  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoModelForImageTextToText, AutoProcessor  # noqa: E402
+
+print(f"cuda available: {torch.cuda.is_available()}", flush=True)
+processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
+model = AutoModelForImageTextToText.from_pretrained(
+    MODEL_ID, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
+).eval()
+
+
+def resolve_pdf(pdf_field):
+    mounted = Path("/bucket/pdfs") / pdf_field
+    if mounted.is_file():
+        return str(mounted)
+    for cand in (f"bench_data/{pdf_field}", f"bench_data/pdfs/{pdf_field}", pdf_field):
+        try:
+            local = hf_hub_download(BENCH_REPO, cand, repo_type="dataset")
+        except Exception:
+            continue
+        dest = OUT_ROOT / "pdfs" / pdf_field
+        dest.parent.mkdir(parents=True, exist_ok=True)
+        shutil.copy(local, dest)
+        return local
+    raise FileNotFoundError(pdf_field)
+
+
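+def render_page(pdf_path, page_num, png_path):
+    # page_num is 1-based (as in the bench JSONL); PyMuPDF pages are 0-based.
+    mat = fitz.Matrix(DPI / 72, DPI / 72)  # PDF user space is 72 units per inch
+    with fitz.open(pdf_path) as doc:  # closes the document even if rendering fails
+        doc[page_num - 1].get_pixmap(matrix=mat).save(png_path)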
+
+
+def infer_markdown(image_path):
+    """NuExtract3 image->Markdown, non-thinking + greedy."""
+    image = Image.open(image_path).convert("RGB")
+    messages = [{"role": "user", "content": [{"type": "image", "image": image}]}]
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        mode="markdown",
+        enable_thinking=False,
+    ).to(model.device)
+    with torch.inference_mode():
+        gen = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
+    gen = gen[:, inputs.input_ids.shape[1]:]
+    return processor.batch_decode(
+        gen, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )[0].strip()
+
+
+# ---- convert ----------------------------------------------------------------
+cand_dir = OUT_ROOT / CANDIDATE
+items = sorted(pages_by_pdf.items())
+if NUM_SHARDS > 1:
+    items = items[SHARD::NUM_SHARDS]
+    print(f"shard {SHARD}/{NUM_SHARDS}: {len(items)} PDFs", flush=True)
+if LIMIT:
+    items = items[:LIMIT]
+    print(f"LIMIT={LIMIT} (plumbing smoke test)", flush=True)
+
+for i, (pdf_field, pages) in enumerate(items, 1):
+    mds = {}  # page -> markdown; pages that fail stay missing and are written empty
+    try:
+        pdf_path = resolve_pdf(pdf_field)
+        with tempfile.TemporaryDirectory() as td:
+            for pg in sorted(pages):
+                png = Path(td) / f"pg{pg}.png"
+                render_page(pdf_path, pg, str(png))
+                mds[pg] = infer_markdown(png)
+    except Exception as e:
+        # Keep pages converted before the failure instead of blanking the whole PDF.
+        print(f"[WARN] {pdf_field}: {e}", flush=True)
+    md_base = os.path.splitext(pdf_field)[0]
+    for pg in sorted(pages):
+        fp = cand_dir / f"{md_base}_pg{pg}_repeat1.md"
+        fp.parent.mkdir(parents=True, exist_ok=True)
+        fp.write_text(mds.get(pg, ""), encoding="utf-8")
+    n = sum(len(md) for md in mds.values())  # total characters across pages
+    print(f"[{i}/{len(items)}] {pdf_field} -> {n} chars", flush=True)
+
+shutil.copy(jsonl_local, OUT_ROOT / "old_scans.jsonl")
+print(f"Done. Wrote {CANDIDATE} candidate to {OUT_ROOT}", flush=True)
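The fan-out in the script relies on the strided slice `items[SHARD::NUM_SHARDS]` over the sorted PDF list. The sketch below illustrates why that partitions the work: each PDF lands in exactly one shard, and because the list is sorted, every job computes the same split. The helper `shard_items` and the file names are hypothetical. The range check is an addition for illustration, since the script itself does not validate `SHARD`.

```python
def shard_items(items, shard, num_shards):
    """Return the slice of `items` that job `shard` of `num_shards` handles."""
    if not 0 <= shard < num_shards:
        # Without this check an out-of-range SHARD would silently duplicate
        # another shard's work or process nothing.
        raise ValueError(f"SHARD must be in 0..{num_shards - 1}, got {shard}")
    return items[shard::num_shards]


pdfs = [f"old_scans/doc_{i:02d}.pdf" for i in range(7)]  # hypothetical names
shards = [shard_items(pdfs, s, 3) for s in range(3)]
for s, part in enumerate(shards):
    print(s, part)

# Every PDF lands in exactly one shard.
assert sorted(p for part in shards for p in part) == pdfs
```

Strided slicing keeps shard sizes within one item of each other, which matters little for 98 scans but avoids one job getting a long tail.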