
spectre-rs


Native Rust PDF extraction — text, tables, metadata, quality scoring. On the full ICDAR 2013 Table Competition corpus (67 unique PDFs) in strict mode, with the time-to-failure on 3 CID-font PDFs included: 1.19× faster than pymupdf, 26.8× faster than pdfminer.six, 41.5× faster than pdfplumber, with character-count parity within ±10% on 92% of documents vs pymupdf. Pure Rust, no C dependencies.

Per-PDF data is in data/icdar2013-results.csv; reproduce with python scripts/bench_icdar2013.py. (While benchmarking, we found and patched a real defect in upstream lopdf that caused 4 of the 67 PDFs to extract at 31–89% of pymupdf's char count — fix vendored at lopdf-fork/, upstream PR open at J-F-Liu/lopdf#492.)

let bytes  = std::fs::read("doc.pdf")?;
let text   = spectre_rs::extract_text(&bytes)?;
let pages  = spectre_rs::extract_pages(&bytes)?;
let tables = spectre_rs::extract_tables(&bytes, None)?;
let meta   = spectre_rs::extract_metadata(&bytes)?;
let score  = spectre_rs::score_text(&text);  // 0.0 garbage .. 1.0 clean

import spectre_rs

pdf    = open("doc.pdf", "rb").read()
text   = spectre_rs.extract_text(pdf)
tables = spectre_rs.extract_tables(pdf)

Why

Most "fast" Python PDF tooling either takes a C dependency (pymupdf/MuPDF, AGPL) or falls off a cliff above ~100 pages (pdfminer.six is pure Python and takes roughly ten seconds on a 200-page document; pdfplumber adds another order of magnitude on top of that for tables). spectre-rs replaces all three with a single Rust crate built on lopdf, exposed to Python via PyO3.

The table extractor uses the same whitespace-column strategy as pdfplumber's vertical_strategy="text" / horizontal_strategy="text" — the only strategy that handles borderless tables correctly, and the format virtually every long-form financial document uses (10-Ks, prospectuses, municipal bond official statements). Geometric line-detection ("find the boxes") fails silently on these documents; whitespace columns succeed.
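As a rough illustration of the whitespace-column idea (not spectre-rs's actual implementation), assume each candidate table row is a list of (x0, x1, text) word boxes; a column separator is then any x-interval that no word in any row intersects:

```python
def column_gaps(rows, min_gap=2.0):
    """Find x-intervals that no word in any row intersects.

    rows: list of rows; each row is a list of (x0, x1, text) word boxes.
    Returns (gap_start, gap_end) separators shared by every row.
    """
    spans = sorted((x0, x1) for row in rows for (x0, x1, _) in row)
    if not spans:
        return []
    gaps, cur_end = [], spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - cur_end >= min_gap:   # whitespace common to all rows
            gaps.append((cur_end, x0))
        cur_end = max(cur_end, x1)
    return gaps


def split_row(row, gaps):
    """Assign each word to the cell left of the first gap that starts after it."""
    bounds = [start for start, _ in gaps] + [float("inf")]
    cells = [[] for _ in bounds]
    for x0, _x1, text in sorted(row):
        for i, b in enumerate(bounds):
            if x0 < b:
                cells[i].append(text)
                break
    return [" ".join(c) for c in cells]
```

The separators come from shared whitespace rather than drawn rules, which is why a borderless 10-K table still splits into columns where geometric line detection finds nothing.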

Performance

Headline — full ICDAR 2013 corpus, strict mode

On the ICDAR 2013 Table Competition corpus (67 unique PDFs, the canonical public PDF-extraction benchmark — used by pdfplumber, Camelot, Tabula, GROBID for their own evaluation), spectre_rs running extract_text (strict mode) is:

| Library | Total time | Successful PDFs | vs spectre_rs |
| --- | --- | --- | --- |
| spectre_rs (extract_text, strict) | 310 ms | 64/67 (3 raise; time-to-failure included) | 1.0× (baseline) |
| pymupdf (page.get_text) | 368 ms | 67/67 | 1.19× slower |
| pdfminer.six (extract_text) | 8,301 ms | 67/67 | 26.81× slower |
| pdfplumber (extract_text/page) | 12,858 ms | 67/67 | 41.52× slower |

Char-count parity vs pymupdf (closest SOTA reference): median Δ +0.6%, mean Δ −0.1%, 92% of PDFs within ±10% (60 of 65 PDFs where both tools returned non-empty output). spectre_rs raises ExtractError::PageExtractFailed on 3 of 67 PDFs; the other libraries silently produce output on those documents — see Correctness finding below for why that matters.

Methodology note: these numbers depend on a small patch to lopdf 0.39.0 (the underlying PDF parser) that we discovered and fixed while investigating the parity tail — it adds match arms for the ', ", and T* PDF content-stream operators that upstream silently drops. Net effect of the patch on this corpus: +19,767 characters of previously-dropped content (12 PDFs moved by ≥2 percentage points toward char-count parity); spectre's wall time was unchanged (309.70 ms pre-patch, 309.65 ms post-patch — the patch is timing-flat). Before the patch, 4 PDFs extracted at 31% to 89% of pymupdf's char count; after the patch they extract within ±1% (3 of the 4) or ±5.4% (us-001, where a separate whitespace + ligature issue accounts for the residual).

The fix is vendored in this repo at lopdf-fork/ (referenced via [patch.crates-io] in Cargo.toml so cargo build works on a fresh clone) and is open as upstream PR J-F-Liu/lopdf#492, scoped narrowly to the missing operator dispatch — full text-state tracking and ligature-decomposition at the encoding layer are explicitly not in the PR. See data/lopdf-patch.diff for the verbatim patch, data/lopdf-patch-impact.md for the side-by-side pre/post-patch CSVs and the full per-PDF impact table, and data/parity-tail-investigation.md for the investigation that surfaced it.

Methodology + reproducibility

  • Corpus: ICDAR 2013 Table Competition (27 EU government documents + 40 US Federal Reserve / BLS / agency reports = 67 unique PDFs, ~25 MB). Pulled from the liminghao1630/ICDAR_2013_table_evaluate mirror. The mirror's directory layout duplicates every PDF into both pdf/ and competition-dataset-{eu,us}/; the harness dedupes by content SHA-256 so each unique PDF is benchmarked exactly once.

  • Conditions: single-threaded, release build, warm cache, 2 warm iterations per file, median.

  • What's timed: for spectre_rs, one cold call is followed by two warm calls, and the second warm time is recorded. For the Python tools, the harness calls each function twice per file in-process and takes the median. All four tools are timed warm against the same on-disk PDF.

  • Strict-mode timing audit: the 3 CID-font failure PDFs are included in spectre_rs's total time. lopdf raises MissingDictionaryKey("ToUnicode") early during page-extraction; that fail-time is counted, not excluded. No empty-string-in-0ms inflation.

  • Run-to-run noise: single-run measurements; per-PDF times are committed in the CSV for reproducibility. Long ratios (pdfminer.six, pdfplumber) drift ~2% across runs due to OS scheduling noise on small absolute times — readers wanting tighter bounds can rerun the bench script with their own corpus.

  • Reproduce:

    pip install pdfminer.six pdfplumber pymupdf
    cargo build --release --example parity
    curl -sL -o icdar2013.zip https://codeload.github.com/liminghao1630/ICDAR_2013_table_evaluate/zip/refs/heads/master
    unzip -q icdar2013.zip
    export ICDAR2013_CORPUS=$PWD/ICDAR_2013_table_evaluate-master/icdar2013-competition-dataset-with-gt
    python scripts/bench_icdar2013.py
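The warm-call protocol in the bullets above amounts to the following (illustrative only; the actual harness lives in scripts/bench_icdar2013.py):

```python
import statistics
import time


def time_warm_median(fn, *args, cold=1, warm=2):
    """Bench protocol from the notes above: discard the cold call(s) to warm
    the cache, then take the median of the warm timings (single-threaded)."""
    for _ in range(cold):
        fn(*args)                      # warms the cache; not timed
    samples = []
    for _ in range(warm):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```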

Per-PDF results

Full table — every PDF, every tool, every time, every char count, every spectre status — is committed in data/icdar2013-results.csv. 67 rows + a TOTAL row, no aggregation tricks; readers who want to verify any aggregate claim can do so directly from the per-document data.
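A minimal recomputation of the headline aggregates from that CSV might look like the following. The column names used here (spectre_chars, pymupdf_chars, spectre_status) are assumptions about the CSV layout; check them against the actual header row:

```python
import csv
import statistics


def parity_stats(path="data/icdar2013-results.csv"):
    """Recompute median/mean char-count delta vs pymupdf and the ±10% band
    from the per-PDF CSV, skipping the TOTAL row and strict failures.
    Column names are assumed, not verified against the repo."""
    deltas = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("pdf") == "TOTAL" or row.get("spectre_status") != "OK":
                continue
            s, p = int(row["spectre_chars"]), int(row["pymupdf_chars"])
            if p > 0:
                deltas.append(100.0 * (s - p) / p)
    within = sum(1 for d in deltas if abs(d) <= 10.0)
    return {
        "median_pct": statistics.median(deltas),
        "mean_pct": statistics.fmean(deltas),
        "within_10pct": f"{within}/{len(deltas)}",
    }
```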

Five-row spot check showing the shape of the data (one average, one large, one over-extraction outlier, the largest under-residual within the parity band, one CID-font failure):

| pdf | spectre strict | pymupdf | pdfminer | pdfplumber | spectre chars | pymupdf chars | Δ vs pymupdf | spectre status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eu-001.pdf (avg) | 4.85 ms | 5.86 ms | 114.67 ms | 181.55 ms | 4,809 | 4,849 | −0.8% | OK |
| eu-004.pdf (large, fast) | 11.10 ms | 13.92 ms | 378.56 ms | 775.65 ms | 31,941 | 31,882 | +0.2% | OK |
| eu-002.pdf (over-extract) | 2.40 ms | 2.97 ms | 38.73 ms | 53.23 ms | 1,884 | 1,600 | +17.8% | OK |
| us-001.pdf (under-residual) | 5.05 ms | 7.77 ms | 129.29 ms | 225.33 ms | 12,479 | 13,188 | −5.4% | OK |
| us-005.pdf (CID failure) | (raises in 1.20 ms) | 1.57 ms | 21.37 ms | 36.79 ms | 0 | 2,259 | — | STRICT_FAIL_CID |
| TOTAL (67 PDFs) | 310 ms | 368 ms | 8,301 ms | 12,858 ms | 561,179 | 568,990 | mean −0.1%, median +0.6% | 64 OK / 3 STRICT_FAIL_CID |

Correctness finding: CID fonts without /ToUnicode

3 of 67 PDFs (us-005, us-010, us-011a) contain composite CID fonts that omit the /ToUnicode mapping. Without that map, the font's character codes cannot be unambiguously decoded back to Unicode — the document still renders correctly in any PDF viewer (Adobe Reader uses font-internal encoding tables and glyph names as a fallback), but a text extractor that wants to produce correct, portable Unicode output is missing the information it needs.
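A quick (and deliberately naive) way to triage a PDF for this condition is to scan uncompressed object dictionaries for composite fonts that never declare a /ToUnicode CMap. This sketch is a smell test only; fonts inside compressed object streams require a real parser:

```python
import re


def type0_fonts_without_tounicode(pdf_bytes: bytes) -> int:
    """Count uncompressed font dictionaries that declare a composite
    (/Subtype /Type0) font but carry no /ToUnicode CMap.

    Deliberately naive: dictionaries inside compressed object streams are
    invisible to a byte scan, and nested dictionaries confuse the regex,
    so treat a nonzero result as a smell, not a verdict."""
    count = 0
    # Match innermost << ... >> dictionary literals (no nesting handling).
    for m in re.finditer(rb"<<(?:(?!<<|>>).)*>>", pdf_bytes, re.S):
        d = m.group(0)
        if b"/Type0" in d and b"/Font" in d and b"/ToUnicode" not in d:
            count += 1
    return count
```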

spectre_rs (via lopdf) raises ExtractError::PageExtractFailed { page, source }. The Python comparators all return non-empty text on these documents, and the bulk of the prose is correctly extracted. The failure mode is concentrated in specific glyphs whose CID font lacks a /ToUnicode map — typically bullet symbols rendered from a Wingdings-style font. What each tool emits in place of those glyphs:

| PDF | pymupdf | pdfminer.six | pdfplumber |
| --- | --- | --- | --- |
| us-005.pdf | raw U+0099 C1 control character (×5) | (cid:153) text placeholder (×5) | (cid:153) text placeholder (×5) |
| us-010.pdf | PUA codepoint U+F0B7 (×8) | U+F0B7 (×8) | U+F0B7 (×8) |
| us-011a.pdf | U+F0B7 (×4) | U+F0B7 (×4) | U+F0B7 (×4) |

Each of these is invalid Unicode for a strict downstream consumer:

  • U+F0B7 sits in the Unicode Private Use Area — by definition it has no standardized meaning, and the same codepoint renders as different glyphs depending on which font happens to be present.
  • U+0099 is a C1 control character — not text content, a control byte. Most logging pipelines, LLM tokenizers, and audit databases under SOC 2 strip or reject control characters silently.
  • (cid:153) is a literal nine-character ASCII string — a human reader sees the placeholder, but an automated downstream consumer ingests it as if it were document text.

The full extracted text from each tool on the 3 failing PDFs is committed in data/cid-font-evidence/ so you can verify any of these claims directly. None of these failure modes is catastrophic — the bullet glyphs are not load-bearing content. But all three are silent: nothing in the Python libraries' return values tells the caller that the extractor couldn't fully decode the document. For compliance pipelines, audit trails, legal-document ingestion, or LLM context construction where "we know exactly what's in this document" is a requirement, surfacing the failure is the better default.
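A strict downstream consumer can detect all three placeholder patterns from the table above with a few lines of Python (illustrative; not part of spectre_rs):

```python
import re


def suspect_codepoints(text: str):
    """Flag the three placeholder patterns from the comparison above:
    Private Use Area codepoints, C1 control characters, and literal
    (cid:NNN) strings."""
    findings = []
    for ch in sorted(set(text)):
        cp = ord(ch)
        if 0xE000 <= cp <= 0xF8FF:
            findings.append(f"PUA U+{cp:04X}")
        elif 0x80 <= cp <= 0x9F:
            findings.append(f"C1 control U+{cp:04X}")
    findings += [f"cid placeholder {m}" for m in re.findall(r"\(cid:\d+\)", text)]
    return findings
```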

spectre_rs ships two extraction surfaces:

  • extract_text (strict, default): surfaces per-page failures as a typed error. Callers know which pages they aren't getting and can route them to OCR / human review.
  • extract_text_lenient: drops failing pages silently, matching the established libraries' silent-skip behaviour. Available for callers who explicitly want feature parity rather than transparency.

The trade-off is real: lenient mode silently loses fidelity in exchange for a "100% success rate" on benchmark sheets. spectre_rs defaults to the side that compliance customers actually need.
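The two surfaces also compose: a caller who wants lenient-style coverage without losing the failure signal can wrap a strict per-page extractor. A pure-Python sketch of that pattern, with hypothetical names (PageExtractFailed, extract_page) standing in for the real API:

```python
class PageExtractFailed(Exception):
    """Stand-in for a typed per-page error like ExtractError::PageExtractFailed."""
    def __init__(self, page, reason):
        super().__init__(f"page {page}: {reason}")
        self.page = page


def extract_keeping_failures(extract_page, n_pages):
    """Extract every page, tolerating per-page failures, but return the
    failed page indices so the caller can route them to OCR or review
    instead of discovering the gap later."""
    pages, failed = [], []
    for i in range(n_pages):
        try:
            pages.append(extract_page(i))
        except PageExtractFailed as err:
            pages.append("")
            failed.append(err.page)
    return pages, failed
```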

Parity tail

After the lopdf patch (see methodology note above), 4 PDFs remain outside the ±10% band, all on the over-extraction side: eu-002 (+18%), eu-025 (+15%), us-022 (+14%), us-037 (+10%). Cause: lopdf's positional text-emission inserts more inter-word spaces between visually-separated glyph runs than pymupdf's reading-order-aware extractor. Same content, more whitespace. (us-010 also shows a large negative delta because strict mode raises and lenient mode extracts only a partial page; that document is in the strict-failure set described above.)

Before the patch, an additional 4 PDFs (us-001, us-034, us-038, us-039) appeared in the under-extraction tail at −11% to −69%, with the cause not yet pinned. Investigation traced it to three PDF content-stream operators (', ", T*) that lopdf's extract_text silently dropped — see data/parity-tail-investigation.md for the full investigation, data/lopdf-patch.diff for the fix, and lopdf-fork/CHANGES.md for the patch details and upstream-PR plan. Post-patch, us-034 extracts at −0.8%, us-038 at −0.3%, us-039 at −1.0%, and us-001 at −5.4% (all within ±10%).

us-001's residual −5.4% decomposes into two mechanisms, neither of which is the operator-dispatch issue the patch addresses:

  • Whitespace handling in tabular content (~600 chars of the −737-char delta). lopdf emits inter-word spacing differently from pymupdf inside table cells; same content, different whitespace. This is in lopdf's positional-emission layer, not the operator dispatch.
  • Ligature codepoints (~21–42 chars). The PDF uses precomposed ligature glyphs (ﬀ, ﬁ, ﬂ, ﬃ, ﬄ — Unicode block U+FB00–U+FB06) which pymupdf preserves and lopdf drops somewhere in its encoding layer. This is a separate concern from operator dispatch — it touches font CMap resolution and Unicode glyph-name tables, and a fix needs to be scoped carefully so it doesn't affect Arabic presentation forms or CJK compatibility ideographs that share the same plane. It is not bundled into the operator-dispatch upstream PR.

Neither remaining mechanism is spectre_rs silently dropping pages or returning wrong text. The over-extraction is whitespace handling, fully characterized; the under-extraction residual is whitespace + ligatures, both isolated to lopdf's layout/encoding internals and tracked separately.
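The "same content, more whitespace" claim is easy to check for any document pair: collapse whitespace runs before comparing character counts, and a spacing-only delta goes to zero. A small sketch:

```python
import re


def whitespace_normalized_delta(ours: str, theirs: str) -> float:
    """Percent char-count delta after collapsing whitespace runs. If an
    over-extraction outlier vanishes under this normalization, the gap
    was spacing-only: same content, more whitespace."""
    a = re.sub(r"\s+", " ", ours).strip()
    b = re.sub(r"\s+", " ", theirs).strip()
    return 100.0 * (len(a) - len(b)) / len(b)
```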

Reproduce in 60 seconds

# 1. Build the parity example (release profile)
cargo build --release --example parity

# 2. Install the Python comparators
pip install pdfminer.six pdfplumber pymupdf

# 3a. Single-PDF comparison
python scripts/compare.py /path/to/your.pdf

# 3b. Full ICDAR 2013 corpus (134 file-runs deduped to 67 unique PDFs
#     by content hash; fetches the corpus automatically)
python scripts/bench_icdar2013.py

Or the criterion bench, which gives variance + throughput stats:

export SPECTRE_BENCH_PDF=/path/to/your.pdf
cargo bench --bench extract_text

(With no env var set, criterion builds a 5-page synthetic PDF in memory so CI runs without external test data — useful for sanity but not for headline numbers.)

API

| Function | Returns | Notes |
| --- | --- | --- |
| extract_text(bytes) | String | Full document, page breaks normalized |
| extract_pages(bytes) | Vec<String> | One string per page, indices line up |
| extract_tables(bytes, page_num?) | Vec<Vec<Vec<String>>> | [table[row[cell]]]; whitespace columns |
| extract_metadata(bytes) | HashMap<String, String> | pages, pdf_version, Info dict |
| score_text(text) | f64 in [0.0, 1.0] | Garbage / binary detector |
| score_batch(&[text]) | Vec<f64> | Parallel via rayon |

All functions take &[u8] raw PDF bytes — no path handling, no implicit IO. Bring your own fs::read or tokio::fs::read.
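For intuition about the score_text contract (not the crate's actual algorithm, which the Status notes describe as language-fair), a naive printable-ratio scorer over the same range might look like:

```python
def naive_text_score(text: str) -> float:
    """Illustrative quality score over the same 0.0 (garbage) to 1.0 (clean)
    range as score_text; NOT the crate's actual algorithm. Counts the
    fraction of codepoints that are letters, digits, whitespace, or common
    punctuation. Any Unicode letter counts, so CJK and accented Latin text
    is not penalized as binary noise."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text
             if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-%/")
    return ok / len(text)
```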

Install

Rust

[dependencies]
spectre_rs = "0.4"

Python (via maturin)

pip install maturin
maturin develop --release --features python

The Python build needs pyo3 and python features enabled. The pre-built wheel publishing pipeline (PyPI) is on the roadmap but not yet shipped — for now, build from source.

Status

  • v0.4.x is the current public release line. v0.4 surfaces per-page extraction errors as a typed ExtractError::PageExtractFailed (pre-0.4 they were silently dropped), enforces resource bounds (MAX_PAGES, MAX_OUTPUT_BYTES, MAX_TABLES), fixes a chars-vs-bytes bug in the column-break detector that produced wrong cell splits on non-Latin tables, and rewrites score_text to be language-fair (CJK / accented Latin no longer rated as binary noise).
  • v0.2.x and v0.1.x existed inside Crucible Engineering as the Python-only RustValidator. The RustValidator class is retained behind --features python for backward compat; new code should use the module-level functions.
  • spectre-rs is not yet a complete replacement for every pymupdf use case (no image rendering, no OCR, no annotation editing). Scope is read-only structured-text extraction.

Contributing

cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --bench extract_text -- --quick

CI runs the same on every push and PR. PRs that change extraction behavior should include a regression test under src/tables.rs::tests (or a new tests/ integration test), a benchmark delta in the PR description, and — if the change is structural — a note on why pdfminer.six / pdfplumber don't already do this.

License

MIT


Built by Ryan Stewart
