Support transformers 5.x and huggingface-hub 1.x #520

Open
gkriegspeedbay wants to merge 1 commit into datalab-to:master from gkriegspeedbay:transformers-5-compat

@gkriegspeedbay

What

Make surya-ocr work under transformers 5.x (and the huggingface-hub 1.x it pulls in). Fixes #492. Supersedes #487, which patched surya/common/donut/encoder.py — a file that no longer exists on master.

surya declares transformers>=4.56.1 with no upper bound, so fresh installs already resolve transformers 5.x and break; CI stays green only because uv sync --frozen pins transformers 4.57 in uv.lock.

Changes

transformers 5.x removed several symbols surya imports. Each is guarded with try/except ImportError, so transformers 4.x behavior is unchanged:

  • pytorch_utils.find_pruneable_heads_and_indices → vendored polyfill (ocr_error/model/encoder.py)
  • transformers.onnx (built-in ONNX exporter, moved to optimum) → guarded import; the export-only DistilBertOnnxConfig is skipped when unavailable (ocr_error/model/config.py)
  • tokenization_utils._is_control / _is_punctuation / _is_whitespace → vendored (ocr_error/tokenizer.py)
  • ModuleUtilsMixin.get_head_mask → vendored onto DistilBertModel (ocr_error/model/encoder.py)
  • Embeddings position ids: derived from arange in forward instead of the persistent=False position_ids buffer. Under 5.x's meta-device init that buffer is never materialized (and isn't in the checkpoint), giving out-of-range indices into position_embeddings.

Silent weight-load failure (common/s3.py): under transformers 5.x, from_pretrained's meta-device path does not materialize checkpoint weights into these custom model classes — it reports a clean load (no missing/unexpected keys, no error) while leaving every parameter at its random init, which collapses model output. S3DownloaderMixin now re-materializes weights from the local single-file model.safetensors after from_pretrained, scoped to transformers>=5 and to nn.Module results (so 4.x and the config/tokenizer/processor classes are untouched). This affected the OCR-error DistilBert and the EfficientViT detector; stock transformers.DistilBertForSequenceClassification loads the same checkpoint fine under 5.x, so this is specific to the vendored classes' interaction with the 5.x loader.
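A rough sketch of the re-materialization step described above, under stated assumptions: the helper names `should_rematerialize` and `rematerialize_weights` are hypothetical (they are not the names used in common/s3.py), and the module check is duck-typed here so the gate can run without importing torch, whereas the PR scopes it to `nn.Module` results.

```python
from pathlib import Path
from typing import Any


def should_rematerialize(transformers_version: str, loaded: Any) -> bool:
    """Gate the workaround: only transformers >= 5, only module-like results."""
    major = int(transformers_version.split(".")[0])
    # Duck-typed stand-in for an isinstance(loaded, torch.nn.Module) check,
    # so this helper does not need torch to be importable.
    return major >= 5 and callable(getattr(loaded, "load_state_dict", None))


def rematerialize_weights(model: Any, checkpoint_dir: str) -> Any:
    """Reload tensors from <checkpoint_dir>/model.safetensors into `model`."""
    # Imported lazily so the gate above stays usable without safetensors/torch.
    from safetensors.torch import load_file

    state_dict = load_file(str(Path(checkpoint_dir) / "model.safetensors"))
    # strict=False tolerates entries that are absent from the file (for example
    # non-persistent buffers); the returned value lists missing_keys and
    # unexpected_keys so the caller can inspect what was skipped.
    return model.load_state_dict(state_dict, strict=False)
```

A caller would run `from_pretrained` as before and then call `rematerialize_weights(model, local_dir)` only when `should_rematerialize(transformers.__version__, model)` is true, which leaves 4.x and the config/tokenizer/processor results untouched.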

Dependencies:

  • huggingface-hub: <1 → <2 (transformers 5.x requires hub ≥ 1.3)
  • requests: added as a direct dependency (imported in common/s3.py and debug/fonts.py; previously satisfied transitively via huggingface-hub 0.x, which hub 1.x dropped)

Testing

Full test suite passes on both transformers 5.12.0 and 4.57.6 (Python 3.11, Pillow 12.2.0, CPU): test_ocr_errors, test_detection, test_layout, test_recognition, test_table_rec. The VLM-backed suites ran against a local llama-server (llama.cpp) backend.

Note: I did not touch uv.lock, so CI (uv sync --frozen) still exercises transformers 4.57. Happy to regenerate the lockfile onto transformers 5.x in this PR, or leave that as a separate step — your call.

🤖 Generated with Claude Code

surya-ocr 0.20.0 declares `transformers>=4.56.1` with no upper bound, so
fresh installs resolve transformers 5.x and break, while CI stays green
against the pinned uv.lock (transformers 4.57). This makes the library
actually work under transformers 5.x. Fixes datalab-to#492, supersedes datalab-to#487.

transformers 5.x removals handled (all guarded so 4.x is unchanged):
- pytorch_utils.find_pruneable_heads_and_indices -> vendored polyfill
  (surya/ocr_error/model/encoder.py)
- transformers.onnx (ONNX exporter, moved to `optimum`) -> guarded import;
  the ONNX-export-only DistilBertOnnxConfig is skipped when absent
- tokenization_utils._is_control/_is_punctuation/_is_whitespace -> vendored
- ModuleUtilsMixin.get_head_mask -> vendored onto DistilBertModel
- Embeddings position_ids: derive from arange in forward instead of a
  persistent=False buffer; under 5.x's meta-device init that buffer is
  never materialized (and is absent from the checkpoint), yielding
  out-of-range indices into position_embeddings

Silent weight-load failure (surya/common/s3.py): transformers 5.x's
meta-device from_pretrained does not materialize checkpoint weights into
these vendored model classes -- it reports a clean load (no missing/
unexpected keys) while leaving every parameter at random init, collapsing
model output. S3DownloaderMixin now re-materializes weights from the local
single-file safetensors after from_pretrained. Scoped to transformers >= 5
and to nn.Module results, so 4.x and the config/tokenizer/processor classes
are untouched. Affected the OCR-error DistilBert and the EfficientViT detector.

Dependencies:
- huggingface-hub: <1 -> <2 (transformers 5.x requires hub >= 1.3)
- requests: added as a direct dep (used in common/s3.py + debug/fonts.py;
  was previously transitive via huggingface-hub 0.x, dropped by hub 1.x)

Verified: full test suite passes on both transformers 5.12.0 and 4.57.6
(ocr_error, detection, layout, recognition, table_rec).
@github-actions

github-actions Bot commented Jun 13, 2026

Contributor

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@gkriegspeedbay

Author

I have read the CLA Document and I hereby sign the CLA

@gkriegspeedbay

Author

For downstream context on why this matters: surya's transitive transformers<5 / pillow<11 constraints (via marker-pdf) block security updates in tools that depend on marker, so these two PRs are the root unblock.

One caveat I verified for the marker side: marker 1.10.2 can't consume a surya 0.20 build directly — it imports surya.common.surya.schema.TaskNames and other 0.17-era internals that 0.20 restructured — so propagating this fix downstream also needs marker's surya-0.20 migration (separate from these PRs). Flagging in case it helps prioritize; happy to help where useful.

@gkriegspeedbay

Author

Heads-up for whoever reviews: this pairs with #519 (relaxes pillow<11 → <13). Together they let downstream consumers (e.g. marker-pdf) move to transformers 5 + Pillow 12; merging only one leaves the other ceiling in place. Glad to combine the two into one PR if that's simpler on your end.

Development

Successfully merging this pull request may close these issues.

Incompatible with transformers 5.x: find_pruneable_heads_and_indices removed
