[python] Support row-level Blob access by XiaoHongbo-Hope · Pull Request #7891 · apache/paimon

XiaoHongbo-Hope · 2026-05-18T13:02:47Z

Purpose

Tests

…b-as-descriptor setting

Revert the read-path changes from 4cf7b23 that always resolved blob descriptors to data. Restore master behavior where blob-as-descriptor flag is respected. Add Blob.from_bytes(data, file_io) as a unified entry point (equivalent to Java's Blob.fromBytes) that auto-dispatches BlobData vs BlobRef based on the bytes content.

- Add Optional[bytes] type annotation for data parameter - Raise ValueError when file_io is None but bytes are BlobDescriptor - Add unit tests for from_bytes covering all branches

- from_bytes(None) returns None (not empty BlobData), matching Java - Add allow_blob_data parameter (Java's 4th arg): when False, always interpret bytes as descriptor - Integrate into read path: _blob_cell_to_data now delegates to Blob.from_bytes instead of duplicating dispatch logic

Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.

JingsongLi

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.

2. OffsetRow.get_blob() when no blob context is set

Will produce an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking when self._file_io is None.

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After refactor, blob = Blob.from_bytes(value, self.file_io) followed by blob.to_data() if blob is not None else None -- at this point value is guaranteed non-None, so the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

to_iterator() now yields rows whose BLOB columns are raw stored bytes (descriptor or inline) rather than eagerly resolved payload bytes; row.get_blob(pos) returns a Blob for descriptor cells and raises on inline cells, matching ColumnarRow.getBlob in Java. RecordBatchReader.read_batch() takes no arguments; file_io is held on the reader and threaded onto the row at iterator construction. Removed read-time descriptor-to-data conversion paths (_convert_descriptor_stored_blob_columns, _blob_cell_to_data, BlobDescriptorConvertReader wrapping). FormatBlobReader emits descriptor bytes regardless of blob_as_descriptor option. to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs on DataFileBatchReader / FormatBlobReader and the OffsetRow with_blob_context shim are kept for one release and will be removed in a follow-up. Tests that asserted on resolved blob payload bytes from to_iterator() / to_arrow() must now resolve descriptor cells via Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see the _resolve_blobs helper added in blob_test.py / blob_table_test.py).

This reverts commit 151fa9c.

Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to match Java RecordReader.readBatch(); inject file_io directly onto OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO). OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value, file_io)), dropping the _blob_field_indices defensive check Java has no counterpart for. to_blob_iterator() becomes a thin alias of to_iterator(). The read-path behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor resolution when the option is false) is preserved — this PR only aligns the row/iterator API shape, not the read-path semantics. Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a thin alias forwarding to set_file_io(file_io) for one release. Existing callers of read.to_iterator() / read.to_arrow() see no behaviour change.

Java has no separate blob iterator on TableRead; one createReader entry point yields rows whose getBlob() works uniformly. After the row API alignment, to_iterator() returns rows with working get_blob(pos) for both descriptor- and inline-stored blobs, so to_blob_iterator() was a no-op alias with no Java counterpart.

…edRow impls Mirror Java DataGetters where getBlob is part of the interface contract: every InternalRow implementation must provide it. BinaryRow delegates to Blob.from_bytes (no file_io needed for in-memory rows), ProjectedRow forwards to the inner row via the index mapping.

New page covering the Python API for blob columns: - Creating a table with a BLOB column - Writing blobs via PyArrow Table - Reading via row.get_blob(pos) — to_data() and new_input_stream() - Lower-level Blob.from_bytes factory Add a back-link from the existing append-table/blob.md (Java/SQL main doc) to the new page for discoverability.

- test_from_bytes_with_descriptor uses tempfile.TemporaryDirectory so a mid-test failure no longer leaks the temp file. - Rename test_to_iterator_unchanged to test_to_iterator_yields_all_rows (the previous name only made sense in the context of this branch). - test_get_blob_streaming now loops every row and asserts content for each, rather than breaking after the first. - New GetBlobMultiColumnTest exercises a table with two BLOB columns and verifies row.get_blob(pos) is independent per position.

- Blob.from_bytes: when allow_blob_data=False and the input is not a BlobDescriptor, raise a clear ValueError instead of falling into deserialize() which would surface a low-level magic-header error. - DataFileBatchReader._blob_cell_to_data: drop the dead `if blob is not None else None` ternary; at that point value is guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return None either.

…et_blob logic-1: BinaryRow.get_blob() was calling Blob.from_bytes() on the field value, but BinaryRow's BLOB cells are parsed by GenericRowDeserializer into Blob instances rather than raw bytes, so the implementation always raised TypeError. Mirror the GenericRow.get_blob() unwrap pattern. security-2: OffsetRow.get_blob(pos) was unconditionally forwarding the cell at pos to Blob.from_bytes(...), which auto-detects descriptor magic and resolves the embedded URI via file_io. An attacker who can write a BLOBDESC-prefixed payload into a VARBINARY column could weaponise a mis-call of row.get_blob(pos) into SSRF on the reader host. Thread blob_field_indices alongside file_io through the reader chain (DataFileBatchReader → wrapper readers → InternalRowWrapperIterator → OffsetRow). OffsetRow.get_blob now rejects positions that are not BLOB fields. OuterProjectionRecordReader remaps the inner blob_field_indices into projected positions (only top-level paths preserve BLOB-ness; nested extractions descend into ROW sub-structures). Adds GetBlobNonBlobColumnSecurityTest covering a VARBINARY column whose contents start with the BLOBDESC magic header — get_blob must refuse without touching the URI.

split_read.py extracts `_blob_field_indices(fields)` helper, replacing three identical set comprehensions. GetBlobNonBlobColumnSecurityTest drops the per-block explanatory comments; the class docstring already states the SSRF threat model.

…ail-closed OffsetRow - BlobDescriptorConvertReader was the only wrapper RecordBatchReader that did not propagate file_io / blob_field_indices from its inner reader, silently disabling the SSRF guard whenever DataEvolutionSplitRead chose this wrapper as the outermost reader. Mirror the propagation pattern used by FilterRecordBatchReader / RowIdFilterRecordBatchReader. - OffsetRow.get_blob's guard was fail-open: rows constructed without set_blob_field_indices() bypassed the check entirely (e.g. the OffsetRow pair inside KeyValue). Switch the default to frozenset() so any row that never opts in rejects every position. Legitimate readers continue to widen via set_blob_field_indices(). - Strengthen GetBlobNonBlobColumnSecurityTest with mock that fails loudly if uri_reader_factory.create is invoked, so a regression that resolves the URI before raising TypeError can no longer pass. - Drop the dead blob_field_indices inheritance loop in DataEvolutionMergeReader: all inner MergeAllBatchReader instances are constructed without blob_field_indices set, so the loop never fired, and the assumed semantics ("inherit from first inner") would be wrong anyway because DataEvolutionMergeReader reshuffles columns via row_offsets/field_offsets.

The attribute was computed from the constructor's `fields` parameter (table-schema order), but the OffsetRow tuple uses the projected / post-mapping schema order, so the two index spaces don't agree. Today every code path wraps DataFileBatchReader with an outer ConcatBatchReader whose blob_field_indices is computed correctly from `self.read_fields` in split_read.py and overrides the inner one — so the bug never fires — but a future caller invoking DataFileBatchReader.read_batch() directly would silently use the wrong index space, either denying legitimate BLOB access or re-opening the SSRF surface depending on column collisions. Drop the attribute; outer wrappers remain the single source of truth. If DataFileBatchReader is ever called without an outer wrapper, the class-level default `blob_field_indices = None` propagates into OffsetRow which keeps its fail-closed `frozenset()`, so get_blob raises TypeError on every position rather than mis-resolving descriptors.

Previous "Reading Blob Data" section recommended new_input_stream() for large blobs without noting that the default blob-as-descriptor=false read path materialises the payload before the Blob is constructed. In that mode new_input_stream() wraps in-memory bytes and offers no real streaming, contrary to the doc's wording. Move streaming guidance into a dedicated "Streaming for Large Blobs" section that: - documents both blob-as-descriptor modes and the BlobData / BlobRef shape difference - explicitly names the OOM risk under default mode - shows the blob-as-descriptor=true schema option to opt into lazy streaming (verified end-to-end against the read path) This mirrors Java's BlobFormatReader semantics — the default mode is also materialised in Java, and lazy streaming there equally requires the descriptor mode.

… name

XiaoHongbo-Hope added 3 commits May 18, 2026 20:52

[python] Always resolve blob to actual data on read regardless of blo…

4cf7b23

…b-as-descriptor setting

[python] Fix Blob.from_bytes type annotation and add tests

a610d97

- Add Optional[bytes] type annotation for data parameter - Raise ValueError when file_io is None but bytes are BlobDescriptor - Add unit tests for from_bytes covering all branches

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00

XiaoHongbo-Hope added 2 commits May 19, 2026 18:09

[python] Fix flake8 lint errors

5a13539

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10

XiaoHongbo-Hope changed the title ~~[python] Support transparent blob resolution on read~~ [python] Add Blob.from_bytes unified API May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes unified API~~ [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes to support interpreting blob bytes as Blob object~~ [python] Support unified blob reads May 19, 2026

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00

XiaoHongbo-Hope changed the title ~~[python] Support unified blob reads~~ [python] Add unified Blob.from_bytes resolver May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add unified Blob.from_bytes resolver~~ [python] Support row-level Blob access May 21, 2026

[python] Support row-level Blob access aligned with Java getBlob

3a2a85d

Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.

JingsongLi reviewed May 23, 2026

View reviewed changes

XiaoHongbo-Hope added 6 commits May 24, 2026 16:20

Revert "[python] Align BLOB read path with Java getBlob semantics"

5983598

This reverts commit 151fa9c.

[python] Remove unused OffsetRow.with_blob_context alias

a848c1f

XiaoHongbo-Hope marked this pull request as ready for review May 24, 2026 12:55

XiaoHongbo-Hope added 7 commits May 24, 2026 21:03

[python] Drop verbose remap comment in OuterProjectionRecordReader

e41934d

XiaoHongbo-Hope added 5 commits May 24, 2026 22:38

[python] Add e2e test pinning BlobDescriptorConvertReader propagation

c66e433

[python] Fix misleading blob-as-descriptor doc phrasing and lazy test…

0ed1cb6

… name

[python] Drop unused file_io plumbing on inner blob readers

24152b2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[python] Support row-level Blob access#7891

[python] Support row-level Blob access#7891
XiaoHongbo-Hope wants to merge 24 commits into
apache:masterfrom
XiaoHongbo-Hope:inline_blob

XiaoHongbo-Hope commented May 18, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

XiaoHongbo-Hope commented May 18, 2026

Purpose

Tests

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Review: [python] Support row-level Blob access

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

2. OffsetRow.get_blob() when no blob context is set

3. Blob.from_bytes with allow_blob_data=False edge case

4. Minor: return type annotation

5. Minor: redundant None check in data_file_batch_reader.py

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants