Native Rust PDF extraction — text, tables, metadata, quality scoring. On the full ICDAR 2013 Table Competition corpus (67 unique PDFs) in strict mode, with time-to-failure on 3 CID-font PDFs included: 1.19× faster than pymupdf, 26.8× faster than pdfminer.six, 41.5× faster than pdfplumber, with character-count parity within ±10% on 92% of documents vs pymupdf. Pure Rust, no C dependencies. Per-PDF data is in `data/icdar2013-results.csv`; reproduce with `python scripts/bench_icdar2013.py`. (While benchmarking, we found and patched a real defect in upstream lopdf that caused 4 of the 67 PDFs to extract at 31–89% of pymupdf's char count — fix vendored at `lopdf-fork/`, upstream PR open at J-F-Liu/lopdf#492.)
```rust
let bytes = std::fs::read("doc.pdf")?;
let text = spectre_rs::extract_text(&bytes)?;
let pages = spectre_rs::extract_pages(&bytes)?;
let tables = spectre_rs::extract_tables(&bytes, None)?;
let meta = spectre_rs::extract_metadata(&bytes)?;
let score = spectre_rs::score_text(&text); // 0.0 garbage .. 1.0 clean
```

```python
import spectre_rs

text = spectre_rs.extract_text(open("doc.pdf", "rb").read())
tables = spectre_rs.extract_tables(open("doc.pdf", "rb").read())
```

Most "fast" Python PDF tooling either takes a C dependency (pymupdf/MuPDF, AGPL) or falls off a cliff above ~100 pages (pdfminer.six is pure Python and takes roughly ten seconds on a 200-page document; pdfplumber adds another order of magnitude on top for tables). spectre-rs replaces all three with a single Rust crate built on lopdf, exposed to Python via PyO3.
The table extractor uses the same whitespace-column strategy as pdfplumber's `vertical_strategy="text"` / `horizontal_strategy="text"` — the only strategy that handles borderless tables correctly, and the format virtually every long-form financial document uses (10-Ks, prospectuses, municipal bond official statements). Geometric line-detection ("find the boxes") fails silently on these documents; whitespace columns succeed.
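The whitespace-column idea can be sketched in a few lines. This is a toy illustration on a character-grid coordinate space, not spectre's actual implementation (which works in PDF point coordinates); `whitespace_columns` and its inputs are invented for the example:

```python
def whitespace_columns(rows, page_width, min_gap=3):
    """Find x-ranges that are blank in EVERY row: those shared gaps are
    the column gutters of a borderless table. Each row is a list of
    (x_start, x_end) word extents measured in character cells."""
    covered = [False] * page_width
    for row in rows:
        for x0, x1 in row:
            for x in range(x0, x1):
                covered[x] = True
    # Collect maximal uncovered runs wide enough to count as a gutter.
    gaps, run = [], None
    for x, is_covered in enumerate(covered):
        if not is_covered:
            if run is None:
                run = x
        else:
            if run is not None and x - run >= min_gap:
                gaps.append((run, x))
            run = None
    if run is not None and page_width - run >= min_gap:
        gaps.append((run, page_width))
    return gaps

rows = [
    [(0, 6), (12, 18), (24, 30)],   # a "Name  Qty  Price"-shaped row
    [(0, 4), (12, 15), (24, 29)],
    [(0, 7), (12, 16), (24, 28)],
]
print(whitespace_columns(rows, 30))  # [(7, 12), (18, 24)]
```

A ruled-line detector would return nothing on this input; the shared whitespace gutters are the only structure a borderless table has.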
On the ICDAR 2013 Table Competition corpus (67 unique PDFs, the canonical public PDF-extraction benchmark — used by pdfplumber, Camelot, Tabula, GROBID for their own evaluation), spectre_rs running extract_text (strict mode) is:
| Library | Total time | Successful PDFs | vs spectre_rs |
|---|---|---|---|
| spectre_rs (extract_text, strict) | 310 ms | 64/67 (3 raise; time-to-failure included) | 1.0× (baseline) |
| pymupdf (page.get_text) | 368 ms | 67/67 | 1.19× slower |
| pdfminer.six (extract_text) | 8,301 ms | 67/67 | 26.81× slower |
| pdfplumber (extract_text/page) | 12,858 ms | 67/67 | 41.52× slower |
Char-count parity vs pymupdf (closest SOTA reference): median Δ +0.6%, mean Δ −0.1%, 92% of PDFs within ±10% (60 of 65 PDFs where both tools returned non-empty output). spectre_rs raises ExtractError::PageExtractFailed on 3 of 67 PDFs; the other libraries silently produce output on those documents — see Correctness finding below for why that matters.
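The parity aggregates can be recomputed from any set of (spectre_chars, pymupdf_chars) pairs, e.g. the per-PDF columns in `data/icdar2013-results.csv`. A minimal sketch; the `parity_stats` helper is invented for illustration, and the four sample pairs are taken from the spot-check rows further down:

```python
from statistics import mean, median

def parity_stats(pairs):
    """pairs: (spectre_chars, pymupdf_chars) per PDF, both non-empty.
    Delta is signed and relative to the pymupdf reference count."""
    deltas = [(s - p) / p for s, p in pairs]
    in_band = sum(1 for d in deltas if abs(d) <= 0.10)
    return {
        "median_delta_pct": round(100 * median(deltas), 1),
        "mean_delta_pct": round(100 * mean(deltas), 1),
        "within_10pct": f"{in_band}/{len(deltas)}",
    }

# eu-001, eu-004, eu-002 (over-extraction outlier), us-001
print(parity_stats([(4809, 4849), (31941, 31882), (1884, 1600), (12479, 13188)]))
```

On the full 65-document set this reproduces the headline median/mean deltas; on these four sample rows the over-extraction outlier eu-002 falls outside the ±10% band, giving 3/4 in-band.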
Methodology note: these numbers depend on a small patch to `lopdf` 0.39.0 (the underlying PDF parser) that we discovered and fixed while investigating the parity tail — it adds match arms for the `'`, `"`, and `T*` PDF content-stream operators that upstream silently drops. Net effect of the patch on this corpus: +19,767 characters of previously-dropped content (12 PDFs moved by ≥2 percentage points toward char-count parity); spectre's wall time was unchanged (309.70 ms pre-patch, 309.65 ms post-patch — the patch is timing-flat). Before the patch, 4 PDFs extracted at 31% to 89% of `pymupdf`'s char count; after the patch they extract within ±1% (3 of the 4) or ±5.4% (us-001, where a separate whitespace + ligature issue accounts for the residual). The fix is vendored in this repo at `lopdf-fork/` (referenced via `[patch.crates-io]` in `Cargo.toml` so `cargo build` works on a fresh clone) and is open as upstream PR J-F-Liu/lopdf#492, scoped narrowly to the missing operator dispatch — full text-state tracking and ligature decomposition at the encoding layer are explicitly not in the PR. See `data/lopdf-patch.diff` for the verbatim patch, `data/lopdf-patch-impact.md` for the side-by-side pre/post-patch CSVs and the full per-PDF impact table, and `data/parity-tail-investigation.md` for the investigation that surfaced it.
- Corpus: ICDAR 2013 Table Competition (27 EU government documents + 40 US Federal Reserve / BLS / agency reports = 67 unique PDFs, ~25 MB). Pulled from the `liminghao1630/ICDAR_2013_table_evaluate` mirror. The mirror's directory layout duplicates every PDF into both `pdf/` and `competition-dataset-{eu,us}/`; the harness dedupes by content SHA-256 so each unique PDF is benchmarked exactly once.
- Conditions: single-threaded, release build, warm cache, 2 warm iterations per file, median.
- What's timed: for `spectre_rs`, the cold call is followed by two warm calls; the bench reads the warm-2 line. For the Python tools, the harness calls the function twice per file in-process and takes the median. All four tools are timed warm against the same on-disk PDF.
- Strict-mode timing audit: the 3 CID-font failure PDFs are included in `spectre_rs`'s total time. `lopdf` raises `MissingDictionaryKey("ToUnicode")` early during page extraction; that fail-time is counted, not excluded. No empty-string-in-0-ms inflation.
- Run-to-run noise: single-run measurements; per-PDF times are committed in the CSV for reproducibility. The long ratios (pdfminer.six, pdfplumber) drift ~2% across runs due to OS scheduling noise on small absolute times; readers wanting tighter bounds can rerun the bench script with their own corpus.
- Reproduce:

  ```bash
  pip install pdfminer.six pdfplumber pymupdf
  cargo build --release --example parity
  curl -sL -o icdar2013.zip https://codeload.github.com/liminghao1630/ICDAR_2013_table_evaluate/zip/refs/heads/master
  unzip -q icdar2013.zip
  export ICDAR2013_CORPUS=$PWD/ICDAR_2013_table_evaluate-master/icdar2013-competition-dataset-with-gt
  python scripts/bench_icdar2013.py
  ```
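The content-hash dedupe described in the corpus note can be sketched as follows. The `dedupe_by_content` helper is hypothetical, not the harness's actual code, but it shows the mechanism: identical bytes hash identically regardless of which mirror directory they live in.

```python
import hashlib

def dedupe_by_content(files):
    """files: iterable of (path, raw_bytes). Keep the first path per
    distinct SHA-256 so each unique PDF is benchmarked exactly once,
    even though the corpus mirror stores every file in two directories."""
    seen, unique = set(), []
    for path, data in files:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

corpus = [
    ("pdf/eu-001.pdf", b"%PDF-1.4 eu-001"),
    ("competition-dataset-eu/eu-001.pdf", b"%PDF-1.4 eu-001"),  # duplicate copy
    ("pdf/us-005.pdf", b"%PDF-1.5 us-005"),
]
print(dedupe_by_content(corpus))  # ['pdf/eu-001.pdf', 'pdf/us-005.pdf']
```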
Full table — every PDF, every tool, every time, every char count, every spectre status — is committed in data/icdar2013-results.csv. 67 rows + a TOTAL row, no aggregation tricks; readers who want to verify any aggregate claim can do so directly from the per-document data.
Five-row spot check showing the shape of the data (one average, one large, one over-extraction outlier, the largest under-residual within the parity band, one CID-font failure):
| | spectre strict | pymupdf | pdfminer | pdfplumber | spectre chars | pymupdf chars | Δ vs pymupdf | spectre status |
|---|---|---|---|---|---|---|---|---|
| eu-001.pdf (avg) | 4.85 ms | 5.86 ms | 114.67 ms | 181.55 ms | 4,809 | 4,849 | −0.8% | OK |
| eu-004.pdf (large, fast) | 11.10 ms | 13.92 ms | 378.56 ms | 775.65 ms | 31,941 | 31,882 | +0.2% | OK |
| eu-002.pdf (over-extract) | 2.40 ms | 2.97 ms | 38.73 ms | 53.23 ms | 1,884 | 1,600 | +17.8% | OK |
| us-001.pdf (under-residual) | 5.05 ms | 7.77 ms | 129.29 ms | 225.33 ms | 12,479 | 13,188 | −5.4% | OK |
| us-005.pdf (CID failure) | (raises in 1.20 ms) | 1.57 ms | 21.37 ms | 36.79 ms | 0 | 2,259 | — | STRICT_FAIL_CID |
| TOTAL (67 PDFs) | 310 ms | 368 ms | 8,301 ms | 12,858 ms | 561,179 | 568,990 | mean −0.1%, median +0.6% | 64 OK / 3 STRICT_FAIL_CID |
3 of 67 PDFs (us-005, us-010, us-011a) contain composite CID fonts that omit the /ToUnicode mapping. Without that map, the font's character codes cannot be unambiguously decoded back to Unicode — the document still renders correctly in any PDF viewer (Adobe Reader uses font-internal encoding tables and glyph names as a fallback), but a text extractor that wants to produce correct, portable Unicode output is missing the information it needs.
spectre_rs (via lopdf) raises ExtractError::PageExtractFailed { page, source }. The Python comparators all return non-empty text on these documents, and the bulk of the prose is correctly extracted. The failure mode is concentrated in specific glyphs whose CID font lacks a /ToUnicode map — typically bullet symbols rendered from a Wingdings-style font. What each tool emits in place of those glyphs:
| | pymupdf | pdfminer.six | pdfplumber |
|---|---|---|---|
| us-005.pdf | raw U+0099 C1 control character (×5) | `(cid:153)` text placeholder (×5) | `(cid:153)` text placeholder (×5) |
| us-010.pdf | PUA codepoint U+F0B7 (×8) | U+F0B7 (×8) | U+F0B7 (×8) |
| us-011a.pdf | U+F0B7 (×4) | U+F0B7 (×4) | U+F0B7 (×4) |
Each of these is invalid Unicode for a strict downstream consumer:

- `U+F0B7` sits in the Unicode Private Use Area — by definition it has no defined meaning. The same codepoint renders as different glyphs depending on which font happens to be present.
- `U+0099` is a C1 control character — not text content, a control byte. Most logging pipelines, LLM tokenizers, and audit databases under SOC 2 strip or reject control characters silently.
- `(cid:153)` is just a literal nine-character ASCII string — a reader sees the placeholder, but an automated downstream consumer matches it character-by-character against the wrong content.
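A strict downstream consumer can detect all three residues with the standard library alone. A sketch; the `unicode_findings` name is invented for the example:

```python
import re
import unicodedata

CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")

def unicode_findings(text):
    """Flag the three failure residues: Private Use Area codepoints,
    C0/C1 control characters (Unicode category Cc, excluding ordinary
    whitespace), and literal (cid:NNN) placeholder strings."""
    findings = []
    for ch in text:
        if 0xE000 <= ord(ch) <= 0xF8FF:                     # BMP Private Use Area
            findings.append(("PUA", f"U+{ord(ch):04X}"))
        elif unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            findings.append(("control", f"U+{ord(ch):04X}"))
    findings += [("cid-placeholder", m) for m in CID_PLACEHOLDER.findall(text)]
    return findings

print(unicode_findings("bullet \uf0b7 then \x99 then (cid:153)"))
# [('PUA', 'U+F0B7'), ('control', 'U+0099'), ('cid-placeholder', '(cid:153)')]
```

Running a check like this over the comparators' output on the 3 failing PDFs is exactly how the table above was assembled; clean prose from the other 64 documents produces no findings.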
The full extracted text from each tool on the 3 failing PDFs is committed in data/cid-font-evidence/ so you can verify any of these claims directly. None of these failure modes is catastrophic — the bullet glyphs are not load-bearing content. But all three are silent: nothing in the Python libraries' return values tells the caller that the extractor couldn't fully decode the document. For compliance pipelines, audit trails, legal-document ingestion, or LLM context construction where "we know exactly what's in this document" is a requirement, surfacing the failure is the better default.
spectre_rs ships two extraction surfaces:
- `extract_text` (strict, default): surfaces per-page failures as a typed error. Callers know which pages they aren't getting and can route them to OCR / human review.
- `extract_text_lenient`: drops failing pages silently, matching the established libraries' silent-skip behaviour. Available for callers who explicitly want feature parity rather than transparency.
The trade-off is real: lenient mode loses fidelity silently in exchange for "100% success rate" on benchmark sheets. spectre_rs defaults to the side that compliance customers actually need.
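The strict default enables a routing pattern like the following. Everything here is a stand-in sketch: `route` and `fake_extract` are invented for the example, and in real code the call would be the PyO3 binding's strict `extract_text`, which raises rather than returning partial text.

```python
def route(documents, extract, review_queue):
    """Strict-mode routing: fully-decoded documents go downstream,
    failures go to a review/OCR queue instead of silently degrading.
    `extract` is any callable that raises on a partial decode."""
    clean = {}
    for name, data in documents.items():
        try:
            clean[name] = extract(data)
        except ValueError as err:          # stand-in for the typed extract error
            review_queue.append((name, str(err)))
    return clean

def fake_extract(data):                    # stub standing in for the real binding
    if b"NoToUnicode" in data:
        raise ValueError("page 1: CID font without /ToUnicode")
    return data.decode("ascii")

queue = []
ok = route({"a.pdf": b"hello", "b.pdf": b"NoToUnicode"}, fake_extract, queue)
print(sorted(ok), [name for name, _ in queue])  # ['a.pdf'] ['b.pdf']
```

With a lenient extractor the same pipeline would put both documents in `clean` and the review queue would stay empty, which is precisely the silent-degradation mode the strict default avoids.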
After the lopdf patch (see methodology note above), 4 PDFs remain outside the ±10% band, all on the over-extraction side: eu-002 (+18%), eu-025 (+15%), us-022 (+14%), us-037 (+10%). Cause: lopdf's positional text-emission inserts more inter-word spaces between visually-separated glyph runs than pymupdf's reading-order-aware extractor. Same content, more whitespace. (Plus us-010 shows a large negative delta because the strict mode raises and lenient extracts a partial page; that document is also in the strict-failure set below.)
Before the patch, an additional 4 PDFs (us-001, us-034, us-038, us-039) appeared in the under-extraction tail at −11% to −69%, with the cause not yet pinned. Investigation traced it to three PDF content-stream operators (', ", T*) that lopdf's extract_text silently dropped — see data/parity-tail-investigation.md for the full investigation, data/lopdf-patch.diff for the fix, and lopdf-fork/CHANGES.md for the patch details and upstream-PR plan. Post-patch, us-034 extracts at −0.8%, us-038 at −0.3%, us-039 at −1.0%, and us-001 at −5.4% (all within ±10%).
us-001's residual −5.4% decomposes into two mechanisms, neither of which is the operator-dispatch issue the patch addresses:
- Whitespace handling in tabular content (~600 chars of the −737-char delta). lopdf emits inter-word spacing differently from pymupdf inside table cells; same content, different whitespace. This is in lopdf's positional-emission layer, not the operator dispatch.
- Ligature codepoints (~21–42 chars). The PDF uses precomposed ligature glyphs (`ff`, `fi`, `ffi`, `ffl`, `fl` — Unicode range U+FB00–U+FB06) which pymupdf preserves and lopdf drops somewhere in its encoding layer. This is a separate concern from operator dispatch — it touches font CMap resolution and Unicode glyph-name tables, and a fix needs to be scoped carefully so it doesn't affect Arabic presentation forms or CJK compatibility ideographs, which carry compatibility decompositions of their own. It is not bundled into the operator-dispatch upstream PR.
Neither remaining mechanism is spectre_rs silently dropping pages or returning wrong text. The over-extraction is whitespace handling, fully characterized; the under-extraction residual is whitespace + ligatures, both isolated to lopdf's layout/encoding internals and tracked separately.
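For a sense of why ligature decomposition needs careful scoping: Unicode's own compatibility normalization (NFKC) maps the five ligatures to their ASCII letter sequences, but it is a blunt instrument that also rewrites Arabic presentation forms, which is exactly the over-reach a scoped fix must avoid. A stdlib illustration:

```python
import unicodedata

# The five precomposed ligatures the document uses (U+FB00–U+FB06 block).
ligatures = ["\ufb00", "\ufb01", "\ufb03", "\ufb04", "\ufb02"]  # ff, fi, ffi, ffl, fl
decomposed = [unicodedata.normalize("NFKC", ch) for ch in ligatures]
print(decomposed)  # ['ff', 'fi', 'ffi', 'ffl', 'fl']

# NFKC is too blunt for a parser-level fix: it also rewrites Arabic
# presentation forms (U+FB50 onward), which a ligature-only fix must
# leave untouched.
print(unicodedata.normalize("NFKC", "\ufb51") == "\ufb51")  # False: it decomposes too
```

A scoped fix would restrict the mapping to the Latin ligature range rather than applying blanket compatibility normalization to the whole extracted string.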
```bash
# 1. Build the parity example (release profile)
cargo build --release --example parity

# 2. Install the Python comparators
pip install pdfminer.six pdfplumber pymupdf

# 3a. Single-PDF comparison
python scripts/compare.py /path/to/your.pdf

# 3b. Full ICDAR 2013 corpus (134 file-runs deduped to 67 unique PDFs
#     by content hash; fetches the corpus automatically)
python scripts/bench_icdar2013.py
```

Or the criterion bench, which gives variance + throughput stats:

```bash
export SPECTRE_BENCH_PDF=/path/to/your.pdf
cargo bench --bench extract_text
```

(With no env var set, criterion builds a 5-page synthetic PDF in memory so CI runs without external test data — useful for sanity but not for headline numbers.)
| Function | Returns | Notes |
|---|---|---|
| `extract_text(bytes)` | `String` | Full document, page breaks normalized |
| `extract_pages(bytes)` | `Vec<String>` | One string per page, indices line up |
| `extract_tables(bytes, page_num?)` | `Vec<Vec<Vec<String>>>` | `[table[row[cell]]]`; whitespace columns |
| `extract_metadata(bytes)` | `HashMap<String, String>` | pages, pdf_version, Info dict |
| `score_text(text)` | `f64` in `[0.0, 1.0]` | Garbage / binary detector |
| `score_batch(&[text])` | `Vec<f64>` | Parallel via `rayon` |
All functions take &[u8] raw PDF bytes — no path handling, no implicit IO. Bring your own fs::read or tokio::fs::read.
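For intuition about what `score_text` measures, here is a deliberately naive garbage detector. It is NOT spectre's actual heuristic (which is language-fair across CJK and accented Latin); a bare printable-ratio like this just shows the shape of the 0.0-to-1.0 contract:

```python
def score_text_sketch(text):
    """Toy quality score in the spirit of score_text: the fraction of
    characters that are printable or ordinary whitespace. Illustrative
    only; spectre's real scorer is considerably more careful."""
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isprintable() or ch in "\t\n\r")
    return good / len(text)

print(score_text_sketch("Hello, world!\n"))   # 1.0  (clean prose)
print(score_text_sketch("\x00\x01\x02ab"))    # 0.4  (mostly binary noise)
```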
```toml
[dependencies]
spectre_rs = "0.4"
```

```bash
pip install maturin
maturin develop --release --features python
```

The Python build needs the `pyo3` and `python` features enabled. The pre-built wheel publishing pipeline (PyPI) is on the roadmap but not yet shipped — for now, build from source.
- v0.4.x is the current public release line. v0.4 surfaces per-page extraction errors as a typed `ExtractError::PageExtractFailed` (pre-0.4 they were silently dropped), enforces resource bounds (`MAX_PAGES`, `MAX_OUTPUT_BYTES`, `MAX_TABLES`), fixes a chars-vs-bytes bug in the column-break detector that produced wrong cell splits on non-Latin tables, and rewrites `score_text` to be language-fair (CJK / accented Latin is no longer rated as binary noise).
- v0.2.x and v0.1.x existed inside Crucible Engineering as the Python-only `RustValidator`. The `RustValidator` class is retained behind `--features python` for backward compat; new code should use the module-level functions.
- spectre-rs is not yet a complete replacement for every `pymupdf` use case (no image rendering, no OCR, no annotation editing). Scope is read-only structured-text extraction.
```bash
cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --bench extract_text -- --quick
```

CI runs the same on every push and PR. PRs that change extraction behavior should include a regression test under `src/tables.rs::tests` (or a new `tests/` integration test), a benchmark delta in the PR description, and — if the change is structural — a note on why pdfminer.six / pdfplumber don't already do this.
Built by Ryan Stewart