
spectre-rs


Native Rust PDF extraction — text, tables, metadata, quality scoring. On the full ICDAR 2013 Table Competition corpus (67 unique PDFs) in strict mode, with the time-to-failure on 3 CID-font PDFs included: 1.19× faster than pymupdf, 26.8× faster than pdfminer.six, 41.5× faster than pdfplumber, with character-count parity within ±10% on 92% of documents vs pymupdf. Pure Rust, no C dependencies.

Per-PDF data is in data/icdar2013-results.csv; reproduce with python scripts/bench_icdar2013.py. (While benchmarking, we found and patched a real defect in upstream lopdf that caused 4 of the 67 PDFs to extract at 31–89% of pymupdf's char count — fix vendored at lopdf-fork/, upstream PR open at J-F-Liu/lopdf#492.)

let bytes  = std::fs::read("doc.pdf")?;
let text   = spectre_rs::extract_text(&bytes)?;
let pages  = spectre_rs::extract_pages(&bytes)?;
let tables = spectre_rs::extract_tables(&bytes, None)?;
let meta   = spectre_rs::extract_metadata(&bytes)?;
let score  = spectre_rs::score_text(&text);  // 0.0 garbage .. 1.0 clean

import spectre_rs

pdf    = open("doc.pdf", "rb").read()
text   = spectre_rs.extract_text(pdf)
tables = spectre_rs.extract_tables(pdf)

Why

Most "fast" Python PDF tooling either takes a C dependency (pymupdf/MuPDF, AGPL) or falls off a cliff above ~100 pages (pdfminer.six is pure Python and takes roughly ten seconds on a 200-page document; pdfplumber adds another order of magnitude on top of that for tables). spectre-rs replaces all three with a single Rust crate built on lopdf, exposed to Python via PyO3.

The table extractor uses the same whitespace-column strategy as pdfplumber's vertical_strategy="text" / horizontal_strategy="text" — the only strategy that handles borderless tables correctly, and the format virtually every long-form financial document uses (10-Ks, prospectuses, municipal bond official statements). Geometric line-detection ("find the boxes") fails silently on these documents; whitespace columns succeed.
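As a rough illustration of the whitespace-column idea (not spectre-rs's actual implementation), assume each candidate table row is a list of (x0, x1, text) word boxes; a column separator is then any x-interval that no word in any row intersects:

```python
def column_gaps(rows, min_gap=2.0):
    """Find x-intervals that no word in any row intersects.

    rows: list of rows; each row is a list of (x0, x1, text) word boxes.
    Returns (gap_start, gap_end) separators shared by every row.
    """
    spans = sorted((x0, x1) for row in rows for (x0, x1, _) in row)
    if not spans:
        return []
    gaps, cur_end = [], spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - cur_end >= min_gap:   # whitespace common to all rows
            gaps.append((cur_end, x0))
        cur_end = max(cur_end, x1)
    return gaps


def split_row(row, gaps):
    """Assign each word to the cell left of the first gap that starts after it."""
    bounds = [start for start, _ in gaps] + [float("inf")]
    cells = [[] for _ in bounds]
    for x0, _x1, text in sorted(row):
        for i, b in enumerate(bounds):
            if x0 < b:
                cells[i].append(text)
                break
    return [" ".join(c) for c in cells]
```

The separators come from shared whitespace rather than drawn rules, which is why a borderless 10-K table still splits into columns where geometric line detection finds nothing.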

Performance

Headline — full ICDAR 2013 corpus, strict mode

On the ICDAR 2013 Table Competition corpus (67 unique PDFs, the canonical public PDF-extraction benchmark — used by pdfplumber, Camelot, Tabula, GROBID for their own evaluation), spectre_rs running extract_text (strict mode) is:

| Library | Total time | Successful PDFs | vs spectre_rs |
| --- | --- | --- | --- |
| spectre_rs (extract_text, strict) | 310 ms | 64/67 (3 raise; time-to-failure included) | 1.0× (baseline) |
| pymupdf (page.get_text) | 368 ms | 67/67 | 1.19× slower |
| pdfminer.six (extract_text) | 8,301 ms | 67/67 | 26.81× slower |
| pdfplumber (extract_text/page) | 12,858 ms | 67/67 | 41.52× slower |

Char-count parity vs pymupdf (closest SOTA reference): median Δ +0.6%, mean Δ −0.1%, 92% of PDFs within ±10% (60 of 65 PDFs where both tools returned non-empty output). spectre_rs raises ExtractError::PageExtractFailed on 3 of 67 PDFs; the other libraries silently produce output on those documents — see Correctness finding below for why that matters.

Methodology note: these numbers depend on a small patch to lopdf 0.39.0 (the underlying PDF parser) that we discovered and fixed while investigating the parity tail — it adds match arms for the ', ", and T* PDF content-stream operators that upstream silently drops. Net effect of the patch on this corpus: +19,767 characters of previously-dropped content (12 PDFs moved by ≥2 percentage points toward char-count parity); spectre's wall time was unchanged (309.70 ms pre-patch, 309.65 ms post-patch — the patch is timing-flat). Before the patch, 4 PDFs extracted at 31% to 89% of pymupdf's char count; after the patch they extract within ±1% (3 of the 4) or ±5.4% (us-001, where a separate whitespace + ligature issue accounts for the residual).

The fix is vendored in this repo at lopdf-fork/ (referenced via [patch.crates-io] in Cargo.toml so cargo build works on a fresh clone) and is open as upstream PR J-F-Liu/lopdf#492, scoped narrowly to the missing operator dispatch — full text-state tracking and ligature-decomposition at the encoding layer are explicitly not in the PR. See data/lopdf-patch.diff for the verbatim patch, data/lopdf-patch-impact.md for the side-by-side pre/post-patch CSVs and the full per-PDF impact table, and data/parity-tail-investigation.md for the investigation that surfaced it.

Methodology + reproducibility

  • Corpus: ICDAR 2013 Table Competition (27 EU government documents + 40 US Federal Reserve / BLS / agency reports = 67 unique PDFs, ~25 MB). Pulled from the liminghao1630/ICDAR_2013_table_evaluate mirror. The mirror's directory layout duplicates every PDF into both pdf/ and competition-dataset-{eu,us}/; the harness dedupes by content SHA-256 so each unique PDF is benchmarked exactly once.

  • Conditions: single-threaded, release build, warm cache, 2 warm iterations per file, median.

  • What's timed: for spectre_rs, one cold call is followed by two warm calls, and the second warm time is recorded. For the Python tools, the harness calls each function twice per file in-process and takes the median. All four tools are timed warm against the same on-disk PDF.

  • Strict-mode timing audit: the 3 CID-font failure PDFs are included in spectre_rs's total time. lopdf raises MissingDictionaryKey("ToUnicode") early during page-extraction; that fail-time is counted, not excluded. No empty-string-in-0ms inflation.

  • Run-to-run noise: single-run measurements; per-PDF times are committed in the CSV for reproducibility. Long ratios (pdfminer.six, pdfplumber) drift ~2% across runs due to OS scheduling noise on small absolute times — readers wanting tighter bounds can rerun the bench script with their own corpus.

  • Reproduce:

    pip install pdfminer.six pdfplumber pymupdf
    cargo build --release --example parity
    curl -sL -o icdar2013.zip https://codeload.github.com/liminghao1630/ICDAR_2013_table_evaluate/zip/refs/heads/master
    unzip -q icdar2013.zip
    export ICDAR2013_CORPUS=$PWD/ICDAR_2013_table_evaluate-master/icdar2013-competition-dataset-with-gt
    python scripts/bench_icdar2013.py
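The warm-call protocol in the bullets above amounts to the following (illustrative only; the actual harness lives in scripts/bench_icdar2013.py):

```python
import statistics
import time


def time_warm_median(fn, *args, cold=1, warm=2):
    """Bench protocol from the notes above: discard the cold call(s) to warm
    the cache, then take the median of the warm timings (single-threaded)."""
    for _ in range(cold):
        fn(*args)                      # warms the cache; not timed
    samples = []
    for _ in range(warm):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```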

Per-PDF results

Full table — every PDF, every tool, every time, every char count, every spectre status — is committed in data/icdar2013-results.csv. 67 rows + a TOTAL row, no aggregation tricks; readers who want to verify any aggregate claim can do so directly from the per-document data.
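A minimal recomputation of the headline aggregates from that CSV might look like the following. The column names used here (spectre_chars, pymupdf_chars, spectre_status) are assumptions about the CSV layout; check them against the actual header row:

```python
import csv
import statistics


def parity_stats(path="data/icdar2013-results.csv"):
    """Recompute median/mean char-count delta vs pymupdf and the ±10% band
    from the per-PDF CSV, skipping the TOTAL row and strict failures.
    Column names are assumed, not verified against the repo."""
    deltas = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("pdf") == "TOTAL" or row.get("spectre_status") != "OK":
                continue
            s, p = int(row["spectre_chars"]), int(row["pymupdf_chars"])
            if p > 0:
                deltas.append(100.0 * (s - p) / p)
    within = sum(1 for d in deltas if abs(d) <= 10.0)
    return {
        "median_pct": statistics.median(deltas),
        "mean_pct": statistics.fmean(deltas),
        "within_10pct": f"{within}/{len(deltas)}",
    }
```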

Five-row spot check showing the shape of the data (one average, one large, one over-extraction outlier, the largest under-residual within the parity band, one CID-font failure):

| pdf | spectre strict | pymupdf | pdfminer | pdfplumber | spectre chars | pymupdf chars | Δ vs pymupdf | spectre status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eu-001.pdf (avg) | 4.85 ms | 5.86 ms | 114.67 ms | 181.55 ms | 4,809 | 4,849 | −0.8% | OK |
| eu-004.pdf (large, fast) | 11.10 ms | 13.92 ms | 378.56 ms | 775.65 ms | 31,941 | 31,882 | +0.2% | OK |
| eu-002.pdf (over-extract) | 2.40 ms | 2.97 ms | 38.73 ms | 53.23 ms | 1,884 | 1,600 | +17.8% | OK |
| us-001.pdf (under-residual) | 5.05 ms | 7.77 ms | 129.29 ms | 225.33 ms | 12,479 | 13,188 | −5.4% | OK |
| us-005.pdf (CID failure) | (raises in 1.20 ms) | 1.57 ms | 21.37 ms | 36.79 ms | 0 | 2,259 | — | STRICT_FAIL_CID |
| TOTAL (67 PDFs) | 310 ms | 368 ms | 8,301 ms | 12,858 ms | 561,179 | 568,990 | mean −0.1%, median +0.6% | 64 OK / 3 STRICT_FAIL_CID |

Correctness finding: CID fonts without /ToUnicode

3 of 67 PDFs (us-005, us-010, us-011a) contain composite CID fonts that omit the /ToUnicode mapping. Without that map, the font's character codes cannot be unambiguously decoded back to Unicode — the document still renders correctly in any PDF viewer (Adobe Reader uses font-internal encoding tables and glyph names as a fallback), but a text extractor that wants to produce correct, portable Unicode output is missing the information it needs.
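A quick (and deliberately naive) way to triage a PDF for this condition is to scan uncompressed object dictionaries for composite fonts that never declare a /ToUnicode CMap. This sketch is a smell test only; fonts inside compressed object streams require a real parser:

```python
import re


def type0_fonts_without_tounicode(pdf_bytes: bytes) -> int:
    """Count uncompressed font dictionaries that declare a composite
    (/Subtype /Type0) font but carry no /ToUnicode CMap.

    Deliberately naive: dictionaries inside compressed object streams are
    invisible to a byte scan, and nested dictionaries confuse the regex,
    so treat a nonzero result as a smell, not a verdict."""
    count = 0
    # Match innermost << ... >> dictionary literals (no nesting handling).
    for m in re.finditer(rb"<<(?:(?!<<|>>).)*>>", pdf_bytes, re.S):
        d = m.group(0)
        if b"/Type0" in d and b"/Font" in d and b"/ToUnicode" not in d:
            count += 1
    return count
```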

spectre_rs (via lopdf) raises ExtractError::PageExtractFailed { page, source }. The Python comparators all return non-empty text on these documents, and the bulk of the prose is correctly extracted. The failure mode is concentrated in specific glyphs whose CID font lacks a /ToUnicode map — typically bullet symbols rendered from a Wingdings-style font. What each tool emits in place of those glyphs:

| PDF | pymupdf | pdfminer.six | pdfplumber |
| --- | --- | --- | --- |
| us-005.pdf | raw U+0099 C1 control character (×5) | (cid:153) text placeholder (×5) | (cid:153) text placeholder (×5) |
| us-010.pdf | PUA codepoint U+F0B7 (×8) | U+F0B7 (×8) | U+F0B7 (×8) |
| us-011a.pdf | U+F0B7 (×4) | U+F0B7 (×4) | U+F0B7 (×4) |

Each of these is invalid Unicode for a strict downstream consumer:

  • U+F0B7 sits in the Unicode Private Use Area — by definition it has no standardized meaning, and the same codepoint renders as different glyphs depending on which font happens to be present.
  • U+0099 is a C1 control character — not text content, a control byte. Most logging pipelines, LLM tokenizers, and audit databases under SOC 2 strip or reject control characters silently.
  • (cid:153) is a literal nine-character ASCII string — a human reader sees the placeholder, but an automated downstream consumer ingests it as if it were document text.

The full extracted text from each tool on the 3 failing PDFs is committed in data/cid-font-evidence/ so you can verify any of these claims directly. None of these failure modes is catastrophic — the bullet glyphs are not load-bearing content. But all three are silent: nothing in the Python libraries' return values tells the caller that the extractor couldn't fully decode the document. For compliance pipelines, audit trails, legal-document ingestion, or LLM context construction where "we know exactly what's in this document" is a requirement, surfacing the failure is the better default.
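A strict downstream consumer can detect all three placeholder patterns from the table above with a few lines of Python (illustrative; not part of spectre_rs):

```python
import re


def suspect_codepoints(text: str):
    """Flag the three placeholder patterns from the comparison above:
    Private Use Area codepoints, C1 control characters, and literal
    (cid:NNN) strings."""
    findings = []
    for ch in sorted(set(text)):
        cp = ord(ch)
        if 0xE000 <= cp <= 0xF8FF:
            findings.append(f"PUA U+{cp:04X}")
        elif 0x80 <= cp <= 0x9F:
            findings.append(f"C1 control U+{cp:04X}")
    findings += [f"cid placeholder {m}" for m in re.findall(r"\(cid:\d+\)", text)]
    return findings
```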

spectre_rs ships two extraction surfaces:

  • extract_text (strict, default): surfaces per-page failures as a typed error. Callers know which pages they aren't getting and can route them to OCR / human review.
  • extract_text_lenient: drops failing pages silently, matching the established libraries' silent-skip behaviour. Available for callers who explicitly want feature parity rather than transparency.

The trade-off is real: lenient mode silently loses fidelity in exchange for a "100% success rate" on benchmark sheets. spectre_rs defaults to the side that compliance customers actually need.
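The two surfaces also compose: a caller who wants lenient-style coverage without losing the failure signal can wrap a strict per-page extractor. A pure-Python sketch of that pattern, with hypothetical names (PageExtractFailed, extract_page) standing in for the real API:

```python
class PageExtractFailed(Exception):
    """Stand-in for a typed per-page error like ExtractError::PageExtractFailed."""
    def __init__(self, page, reason):
        super().__init__(f"page {page}: {reason}")
        self.page = page


def extract_keeping_failures(extract_page, n_pages):
    """Extract every page, tolerating per-page failures, but return the
    failed page indices so the caller can route them to OCR or review
    instead of discovering the gap later."""
    pages, failed = [], []
    for i in range(n_pages):
        try:
            pages.append(extract_page(i))
        except PageExtractFailed as err:
            pages.append("")
            failed.append(err.page)
    return pages, failed
```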

Parity tail

After the lopdf patch (see methodology note above), 4 PDFs remain outside the ±10% band, all on the over-extraction side: eu-002 (+18%), eu-025 (+15%), us-022 (+14%), us-037 (+10%). Cause: lopdf's positional text-emission inserts more inter-word spaces between visually-separated glyph runs than pymupdf's reading-order-aware extractor. Same content, more whitespace. (us-010 also shows a large negative delta because strict mode raises and lenient mode extracts only a partial page; that document is in the strict-failure set described above.)

Before the patch, an additional 4 PDFs (us-001, us-034, us-038, us-039) appeared in the under-extraction tail at −11% to −69%, with the cause not yet pinned. Investigation traced it to three PDF content-stream operators (', ", T*) that lopdf's extract_text silently dropped — see data/parity-tail-investigation.md for the full investigation, data/lopdf-patch.diff for the fix, and lopdf-fork/CHANGES.md for the patch details and upstream-PR plan. Post-patch, us-034 extracts at −0.8%, us-038 at −0.3%, us-039 at −1.0%, and us-001 at −5.4% (all within ±10%).

us-001's residual −5.4% decomposes into two mechanisms, neither of which is the operator-dispatch issue the patch addresses:

  • Whitespace handling in tabular content (~600 chars of the −737-char delta). lopdf emits inter-word spacing differently from pymupdf inside table cells; same content, different whitespace. This is in lopdf's positional-emission layer, not the operator dispatch.
  • Ligature codepoints (~21–42 chars). The PDF uses precomposed ligature glyphs (ﬀ, ﬁ, ﬂ, ﬃ, ﬄ — Unicode block U+FB00–U+FB06) which pymupdf preserves and lopdf drops somewhere in its encoding layer. This is a separate concern from operator dispatch — it touches font CMap resolution and Unicode glyph-name tables, and a fix needs to be scoped carefully so it doesn't affect Arabic presentation forms or CJK compatibility ideographs that share the same plane. It is not bundled into the operator-dispatch upstream PR.

Neither remaining mechanism is spectre_rs silently dropping pages or returning wrong text. The over-extraction is whitespace handling, fully characterized; the under-extraction residual is whitespace + ligatures, both isolated to lopdf's layout/encoding internals and tracked separately.
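The "same content, more whitespace" claim is easy to check for any document pair: collapse whitespace runs before comparing character counts, and a spacing-only delta goes to zero. A small sketch:

```python
import re


def whitespace_normalized_delta(ours: str, theirs: str) -> float:
    """Percent char-count delta after collapsing whitespace runs. If an
    over-extraction outlier vanishes under this normalization, the gap
    was spacing-only: same content, more whitespace."""
    a = re.sub(r"\s+", " ", ours).strip()
    b = re.sub(r"\s+", " ", theirs).strip()
    return 100.0 * (len(a) - len(b)) / len(b)
```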

Reproduce in 60 seconds

# 1. Build the parity example (release profile)
cargo build --release --example parity

# 2. Install the Python comparators
pip install pdfminer.six pdfplumber pymupdf

# 3a. Single-PDF comparison
python scripts/compare.py /path/to/your.pdf

# 3b. Full ICDAR 2013 corpus (134 file-runs deduped to 67 unique PDFs
#     by content hash; fetches the corpus automatically)
python scripts/bench_icdar2013.py

Or the criterion bench, which gives variance + throughput stats:

export SPECTRE_BENCH_PDF=/path/to/your.pdf
cargo bench --bench extract_text

(With no env var set, criterion builds a 5-page synthetic PDF in memory so CI runs without external test data — useful for sanity but not for headline numbers.)

API

| Function | Returns | Notes |
| --- | --- | --- |
| extract_text(bytes) | String | Full document, page breaks normalized |
| extract_pages(bytes) | Vec<String> | One string per page, indices line up |
| extract_tables(bytes, page_num?) | Vec<Vec<Vec<String>>> | [table[row[cell]]]; whitespace columns |
| extract_metadata(bytes) | HashMap<String, String> | pages, pdf_version, Info dict |
| score_text(text) | f64 in [0.0, 1.0] | Garbage / binary detector |
| score_batch(&[text]) | Vec<f64> | Parallel via rayon |

All functions take &[u8] raw PDF bytes — no path handling, no implicit IO. Bring your own fs::read or tokio::fs::read.
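For intuition about the score_text contract (not the crate's actual algorithm, which the Status notes describe as language-fair), a naive printable-ratio scorer over the same range might look like:

```python
def naive_text_score(text: str) -> float:
    """Illustrative quality score over the same 0.0 (garbage) to 1.0 (clean)
    range as score_text; NOT the crate's actual algorithm. Counts the
    fraction of codepoints that are letters, digits, whitespace, or common
    punctuation. Any Unicode letter counts, so CJK and accented Latin text
    is not penalized as binary noise."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text
             if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-%/")
    return ok / len(text)
```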

Install

Rust

[dependencies]
spectre_rs = "0.4"

Python (via maturin)

pip install maturin
maturin develop --release --features python

The Python build needs pyo3 and python features enabled. The pre-built wheel publishing pipeline (PyPI) is on the roadmap but not yet shipped — for now, build from source.

Status

  • v0.4.x is the current public release line. v0.4 surfaces per-page extraction errors as a typed ExtractError::PageExtractFailed (pre-0.4 they were silently dropped), enforces resource bounds (MAX_PAGES, MAX_OUTPUT_BYTES, MAX_TABLES), fixes a chars-vs-bytes bug in the column-break detector that produced wrong cell splits on non-Latin tables, and rewrites score_text to be language-fair (CJK / accented Latin no longer rated as binary noise).
  • v0.2.x and v0.1.x existed inside Crucible Engineering as the Python-only RustValidator. The RustValidator class is retained behind --features python for backward compat; new code should use the module-level functions.
  • spectre-rs is not yet a complete replacement for every pymupdf use case (no image rendering, no OCR, no annotation editing). Scope is read-only structured-text extraction.

Contributing

cargo fmt
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-features
cargo bench --bench extract_text -- --quick

CI runs the same on every push and PR. PRs that change extraction behavior should include a regression test under src/tables.rs::tests (or a new tests/ integration test), a benchmark delta in the PR description, and — if the change is structural — a note on why pdfminer.six / pdfplumber don't already do this.

License

MIT


Built by Ryan Stewart
