Support transformers 5.x and huggingface-hub 1.x #520
surya-ocr 0.20.0 declares `transformers>=4.56.1` with no upper bound, so fresh installs resolve transformers 5.x and break, while CI stays green against the pinned uv.lock (transformers 4.57). This makes the library actually work under transformers 5.x. Fixes datalab-to#492, supersedes datalab-to#487.

transformers 5.x removals handled (all guarded so 4.x is unchanged):
- pytorch_utils.find_pruneable_heads_and_indices -> vendored polyfill (surya/ocr_error/model/encoder.py)
- transformers.onnx (ONNX exporter, moved to `optimum`) -> guarded import; the ONNX-export-only DistilBertOnnxConfig is skipped when absent
- tokenization_utils._is_control/_is_punctuation/_is_whitespace -> vendored
- ModuleUtilsMixin.get_head_mask -> vendored onto DistilBertModel
- Embeddings position_ids: derive from arange in forward instead of a persistent=False buffer; under 5.x's meta-device init that buffer is never materialized (and is absent from the checkpoint), yielding out-of-range indices into position_embeddings

Silent weight-load failure (surya/common/s3.py): transformers 5.x's meta-device from_pretrained does not materialize checkpoint weights into these vendored model classes -- it reports a clean load (no missing/unexpected keys) while leaving every parameter at random init, collapsing model output. S3DownloaderMixin now re-materializes weights from the local single-file safetensors after from_pretrained. Scoped to transformers >= 5 and to nn.Module results, so 4.x and the config/tokenizer/processor classes are untouched. Affected the OCR-error DistilBert and the EfficientViT detector.

Dependencies:
- huggingface-hub: <1 -> <2 (transformers 5.x requires hub >= 1.3)
- requests: added as a direct dep (used in common/s3.py + debug/fonts.py; was previously transitive via huggingface-hub 0.x, dropped by hub 1.x)

Verified: full test suite passes on both transformers 5.12.0 and 4.57.6 (ocr_error, detection, layout, recognition, table_rec).
For downstream context on why this matters: surya's transitive One caveat I verified for the marker side: marker 1.10.2 can't consume a surya 0.20 build directly — it imports |
Heads-up for whoever reviews: this pairs with #519 (relaxes |
What
Make surya-ocr work under transformers 5.x (and the huggingface-hub 1.x it pulls in). Fixes #492. Supersedes #487, which patched `surya/common/donut/encoder.py`, a file that no longer exists on `master`.

surya declares `transformers>=4.56.1` with no upper bound, so fresh installs already resolve transformers 5.x and break; CI stays green only because `uv sync --frozen` pins transformers 4.57 in `uv.lock`.

Changes
transformers 5.x removed several symbols surya imports. Each is guarded with `try/except ImportError`, so transformers 4.x behavior is unchanged:

- `pytorch_utils.find_pruneable_heads_and_indices` → vendored polyfill (`ocr_error/model/encoder.py`)
- `transformers.onnx` (built-in ONNX exporter, moved to `optimum`) → guarded import; the export-only `DistilBertOnnxConfig` is skipped when unavailable (`ocr_error/model/config.py`)
- `tokenization_utils._is_control/_is_punctuation/_is_whitespace` → vendored (`ocr_error/tokenizer.py`)
- `ModuleUtilsMixin.get_head_mask` → vendored onto `DistilBertModel` (`ocr_error/model/encoder.py`)
- `Embeddings` position ids: derived from `arange` in `forward` instead of the `persistent=False` `position_ids` buffer. Under 5.x's meta-device init that buffer is never materialized (and isn't in the checkpoint), giving out-of-range indices into `position_embeddings`.

Silent weight-load failure (`common/s3.py`): under transformers 5.x, `from_pretrained`'s meta-device path does not materialize checkpoint weights into these custom model classes. It reports a clean load (no missing/unexpected keys, no error) while leaving every parameter at its random init, which collapses model output. `S3DownloaderMixin` now re-materializes weights from the local single-file `model.safetensors` after `from_pretrained`, scoped to `transformers>=5` and to `nn.Module` results (so 4.x and the config/tokenizer/processor classes are untouched). This affected the OCR-error DistilBert and the EfficientViT detector; stock `transformers.DistilBertForSequenceClassification` loads the same checkpoint fine under 5.x, so this is specific to the vendored classes' interaction with the 5.x loader.

Dependencies:

- `huggingface-hub`: `<1` → `<2` (transformers 5.x requires hub ≥ 1.3)
- `requests`: added as a direct dependency (imported in `common/s3.py` and `debug/fonts.py`; previously satisfied transitively via huggingface-hub 0.x, which hub 1.x dropped)

Testing
Full test suite passes on both transformers 5.12.0 and 4.57.6 (Python 3.11, Pillow 12.2.0, CPU): `test_ocr_errors`, `test_detection`, `test_layout`, `test_recognition`, `test_table_rec`. The VLM-backed suites ran against a local `llama-server` (llama.cpp) backend.

Note: I did not touch `uv.lock`, so CI (`uv sync --frozen`) still exercises transformers 4.57. Happy to regenerate the lockfile onto transformers 5.x in this PR, or leave that as a separate step; your call.

🤖 Generated with Claude Code
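
To illustrate the position-id change described under Changes, here is a toy sketch of deriving position ids from `arange` inside `forward` rather than reading a non-persistent buffer. The class layout, sizes, and argument names are hypothetical and chosen for brevity; this is not the actual `Embeddings` class from `ocr_error/model/encoder.py`.

```python
import torch
from torch import nn


class Embeddings(nn.Module):
    """Toy embedding layer; sizes are illustrative, not surya's real config."""

    def __init__(self, vocab_size=100, dim=16, max_positions=32):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, dim)
        self.position_embeddings = nn.Embedding(max_positions, dim)
        # Intentionally no register_buffer("position_ids", ..., persistent=False):
        # nothing here depends on a buffer being materialized at load time.

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        # Build 0..seq_len-1 on the fly, on the same device as the input.
        position_ids = torch.arange(seq_len, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        return self.word_embeddings(input_ids) + self.position_embeddings(position_ids)


emb = Embeddings()
out = emb(torch.randint(0, 100, (2, 10)))
```

Since the ids are recomputed on every call, they are always in the range `0..seq_len-1` regardless of how the module was initialized, which is the property the PR relies on.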