Support transformers 5.x and huggingface-hub 1.x by gkriegspeedbay · Pull Request #520 · datalab-to/surya

gkriegspeedbay · 2026-06-13T07:39:31Z

What

Make surya-ocr work under transformers 5.x (and the huggingface-hub 1.x it pulls in). Fixes #492. Supersedes #487, which patched surya/common/donut/encoder.py — a file that no longer exists on master.

surya declares transformers>=4.56.1 with no upper bound, so fresh installs already resolve transformers 5.x and break; CI stays green only because uv sync --frozen pins transformers 4.57 in uv.lock.

Changes

transformers 5.x removed several symbols surya imports. Each is guarded with try/except ImportError, so transformers 4.x behavior is unchanged:

pytorch_utils.find_pruneable_heads_and_indices → vendored polyfill (ocr_error/model/encoder.py)
transformers.onnx (built-in ONNX exporter, moved to optimum) → guarded import; the export-only DistilBertOnnxConfig is skipped when unavailable (ocr_error/model/config.py)
tokenization_utils._is_control / _is_punctuation / _is_whitespace → vendored (ocr_error/tokenizer.py)
ModuleUtilsMixin.get_head_mask → vendored onto DistilBertModel (ocr_error/model/encoder.py)
Embeddings position ids: derived from arange in forward instead of the persistent=False position_ids buffer. Under 5.x's meta-device init that buffer is never materialized (and isn't in the checkpoint), giving out-of-range indices into position_embeddings.

Silent weight-load failure (common/s3.py): under transformers 5.x, from_pretrained's meta-device path does not materialize checkpoint weights into these custom model classes — it reports a clean load (no missing/unexpected keys, no error) while leaving every parameter at its random init, which collapses model output. S3DownloaderMixin now re-materializes weights from the local single-file model.safetensors after from_pretrained, scoped to transformers>=5 and to nn.Module results (so 4.x and the config/tokenizer/processor classes are untouched). This affected the OCR-error DistilBert and the EfficientViT detector; stock transformers.DistilBertForSequenceClassification loads the same checkpoint fine under 5.x, so this is specific to the vendored classes' interaction with the 5.x loader.
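The re-materialization step can be sketched roughly as below. This is an assumption-laden outline, not the code in common/s3.py: the function names `needs_weight_reload`, `reload_from_safetensors`, and `maybe_reload` are hypothetical, and the sketch assumes the model's parameters are already real (randomly initialized) tensors rather than meta tensors, as the description states.

```python
from importlib import metadata


def needs_weight_reload(transformers_version: str) -> bool:
    """Return True when the major version of transformers is 5 or higher."""
    major = int(transformers_version.split(".")[0])
    return major >= 5


def reload_from_safetensors(model, checkpoint_path: str):
    """Copy tensors from a single-file safetensors checkpoint into `model`."""
    # Imported lazily so the version gate above works without these packages.
    from safetensors.torch import load_file

    state_dict = load_file(checkpoint_path)
    # strict=False reports missing/unexpected keys in the return value
    # instead of raising on them; shape mismatches still raise.
    return model.load_state_dict(state_dict, strict=False)


def maybe_reload(model, checkpoint_path: str):
    """Reload weights only for nn.Module results under transformers >= 5."""
    import torch.nn as nn

    if not isinstance(model, nn.Module):
        return None  # configs, tokenizers, processors are left untouched
    if not needs_weight_reload(metadata.version("transformers")):
        return None  # transformers 4.x path is unchanged
    return reload_from_safetensors(model, checkpoint_path)
```

Checking the returned missing and unexpected key lists after the reload is one way to confirm that the checkpoint really covers every parameter.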

Dependencies:

huggingface-hub: <1 → <2 (transformers 5.x requires hub ≥ 1.3)
requests: added as a direct dependency (imported in common/s3.py and debug/fonts.py; previously satisfied transitively via huggingface-hub 0.x, which hub 1.x dropped)

Testing

Full test suite passes on both transformers 5.12.0 and 4.57.6 (Python 3.11, Pillow 12.2.0, CPU): test_ocr_errors, test_detection, test_layout, test_recognition, test_table_rec. The VLM-backed suites ran against a local llama-server (llama.cpp) backend.

Note: I did not touch uv.lock, so CI (uv sync --frozen) still exercises transformers 4.57. Happy to regenerate the lockfile onto transformers 5.x in this PR, or leave that as a separate step — your call.

🤖 Generated with Claude Code

surya-ocr 0.20.0 declares `transformers>=4.56.1` with no upper bound, so fresh installs resolve transformers 5.x and break, while CI stays green against the pinned uv.lock (transformers 4.57). This makes the library actually work under transformers 5.x. Fixes datalab-to#492, supersedes datalab-to#487.

transformers 5.x removals handled (all guarded so 4.x is unchanged):

- pytorch_utils.find_pruneable_heads_and_indices -> vendored polyfill (surya/ocr_error/model/encoder.py)
- transformers.onnx (ONNX exporter, moved to `optimum`) -> guarded import; the ONNX-export-only DistilBertOnnxConfig is skipped when absent
- tokenization_utils._is_control/_is_punctuation/_is_whitespace -> vendored
- ModuleUtilsMixin.get_head_mask -> vendored onto DistilBertModel
- Embeddings position_ids: derive from arange in forward instead of a persistent=False buffer; under 5.x's meta-device init that buffer is never materialized (and is absent from the checkpoint), yielding out-of-range indices into position_embeddings

Silent weight-load failure (surya/common/s3.py): transformers 5.x's meta-device from_pretrained does not materialize checkpoint weights into these vendored model classes -- it reports a clean load (no missing/unexpected keys) while leaving every parameter at random init, collapsing model output. S3DownloaderMixin now re-materializes weights from the local single-file safetensors after from_pretrained. Scoped to transformers >= 5 and to nn.Module results, so 4.x and the config/tokenizer/processor classes are untouched. Affected the OCR-error DistilBert and the EfficientViT detector.

Dependencies:

- huggingface-hub: <1 -> <2 (transformers 5.x requires hub >= 1.3)
- requests: added as a direct dep (used in common/s3.py + debug/fonts.py; was previously transitive via huggingface-hub 0.x, dropped by hub 1.x)

Verified: full test suite passes on both transformers 5.12.0 and 4.57.6 (ocr_error, detection, layout, recognition, table_rec).

github-actions · 2026-06-13T07:39:42Z

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

gkriegspeedbay · 2026-06-13T07:41:39Z

I have read the CLA Document and I hereby sign the CLA

gkriegspeedbay · 2026-06-13T08:05:29Z

For downstream context on why this matters: surya's transitive transformers<5 / pillow<11 constraints (via marker-pdf) block security updates in tools that depend on marker, so these two PRs are the root unblock.

One caveat I verified for the marker side: marker 1.10.2 can't consume a surya 0.20 build directly — it imports surya.common.surya.schema.TaskNames and other 0.17-era internals that 0.20 restructured — so propagating this fix downstream also needs marker's surya-0.20 migration (separate from these PRs). Flagging in case it helps prioritize; happy to help where useful.

gkriegspeedbay · 2026-06-13T20:15:33Z

Heads-up for whoever reviews: this pairs with #519 (relaxes pillow<11 → <13). Together they let downstream consumers (e.g. marker-pdf) move to transformers 5 + Pillow 12; merging only one leaves the other ceiling in place. Glad to combine the two into one PR if that's simpler on your end.

This was referenced Jun 13, 2026

Allow Pillow 11 and 12 (relax pillow<11 to pillow<13) #519

Open

Support surya-ocr 0.20 (transformers 5.x / Pillow 12 / huggingface-hub 1.x) datalab-to/marker#1048

Open

gkriegspeedbay wants to merge 1 commit into datalab-to:master from gkriegspeedbay:transformers-5-compat
