[python] Support row-level Blob access by XiaoHongbo-Hope · Pull Request #7891 · apache/paimon

XiaoHongbo-Hope · 2026-05-18T13:02:47Z

Purpose

Add InternalRow.get_blob(pos) to pypaimon, aligned with Java InternalRow.getBlob. Reads on BLOB columns return a Blob object (BlobData for inline storage, BlobRef for descriptor storage with lazy URI resolution).

Also adds Blob.from_bytes(data, file_io) factory that auto-dispatches based on the BLOBDESC magic header (mirrors Java Blob.fromBytes).

Tests

GetBlobTest / GetBlobMultiColumnTest — row-level access on inline and descriptor blob storage
GetBlobThroughDescriptorConvertReaderTest — pins propagation through BlobDescriptorConvertReader
GetBlobNonBlobColumnSecurityTest — SSRF defence: non-BLOB columns containing magic-prefixed bytes never resolve a URI
Blob.from_bytes factory unit tests

JingsongLi

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
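
The two problems above, and the suggested alternative, can be illustrated with a minimal sketch. Everything in it is hypothetical: `FakeTable`, `read_mutating`, `read_with_override`, and the option key string are stand-ins for illustration, not pypaimon APIs.

```python
from typing import Dict, Iterator, List, Optional, Tuple

BLOB_AS_DESCRIPTOR = "blob-as-descriptor"  # assumed option key, illustration only


class FakeTable:
    """Hypothetical stand-in for a table that carries shared options."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows
        self.options: Dict[str, bool] = {BLOB_AS_DESCRIPTOR: False}


def read_mutating(table: FakeTable) -> Iterator[int]:
    """Anti-pattern: flip the shared option eagerly, restore it lazily."""
    original = table.options[BLOB_AS_DESCRIPTOR]
    table.options[BLOB_AS_DESCRIPTOR] = True  # visible to every other reader

    def gen() -> Iterator[int]:
        try:
            yield from table.rows
        finally:
            # Runs only when the generator is exhausted or closed.
            table.options[BLOB_AS_DESCRIPTOR] = original

    return gen()


def read_with_override(
    table: FakeTable, overrides: Optional[Dict[str, bool]] = None
) -> Iterator[Tuple[int, bool]]:
    """Safer: merge overrides into a private copy; the table is never touched."""
    effective = {**table.options, **(overrides or {})}
    for row in table.rows:
        yield row, effective[BLOB_AS_DESCRIPTOR]


table = FakeTable([1, 2, 3])
it = read_mutating(table)
first = next(it)                                # consume only one row
leaked = table.options[BLOB_AS_DESCRIPTOR]      # restore has not run yet
it.close()                                      # closing triggers the finally block
restored = table.options[BLOB_AS_DESCRIPTOR]

it2 = read_with_override(table, {BLOB_AS_DESCRIPTOR: True})
row, as_descriptor = next(it2)
untouched = table.options[BLOB_AS_DESCRIPTOR]   # shared state never changed
```

With the override passed as a parameter, a partially consumed iterator cannot leave the table in a modified state, and concurrent readers of the same table instance are unaffected.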

2. OffsetRow.get_blob() when no blob context is set

OffsetRow.get_blob() will raise an AttributeError on NoneType if it is called on a row that was not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking whether self._file_io is None and raising a clear error in that case.
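
A guard of this kind might look like the sketch below. `GuardedRow` is a hypothetical, simplified class rather than the pypaimon `OffsetRow`, and it returns the raw cell where the real method would build a Blob.

```python
from typing import Any, Optional, Sequence


class GuardedRow:
    """Hypothetical, simplified row; not the pypaimon OffsetRow."""

    def __init__(self, values: Sequence[Any], file_io: Optional[Any] = None) -> None:
        self._values = values
        self._file_io = file_io  # None unless a reader supplied blob context

    def get_blob(self, pos: int) -> Any:
        if self._file_io is None:
            # Fail early with an actionable message instead of an
            # AttributeError on NoneType further down the call chain.
            raise NotImplementedError(
                "get_blob() requires blob context (file_io); "
                "this row was created without one"
            )
        # The real method would construct a Blob here; the sketch returns the cell.
        return self._values[pos]


try:
    GuardedRow([b"payload"]).get_blob(0)
except NotImplementedError as exc:
    message = str(exc)
```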

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.
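
A minimal sketch of the explicit check follows. It rests on several assumptions: the `MAGIC` value and the "magic followed by a UTF-8 URI" layout are placeholders rather than the real descriptor format, the `file_io` parameter is omitted, and `BlobData` / `BlobRef` are reduced to simple dataclasses.

```python
from dataclasses import dataclass
from typing import Union

MAGIC = b"BLOBDESC"  # assumed placeholder; the real header layout may differ


@dataclass(frozen=True)
class BlobData:
    data: bytes  # inline payload


@dataclass(frozen=True)
class BlobRef:
    uri: str  # the real implementation resolves this lazily


def from_bytes(data: bytes, allow_blob_data: bool = True) -> Union[BlobData, BlobRef]:
    """Dispatch on the magic prefix; reject raw bytes when inline data is disallowed."""
    is_descriptor = data.startswith(MAGIC)
    if not allow_blob_data and not is_descriptor:
        raise ValueError(
            "expected a serialized blob descriptor, got %d raw bytes" % len(data)
        )
    if is_descriptor:
        # Simplified decoding: everything after the magic is treated as the URI.
        return BlobRef(uri=data[len(MAGIC):].decode("utf-8"))
    return BlobData(data)


inline = from_bytes(b"hello")
ref = from_bytes(MAGIC + b"s3://bucket/key", allow_blob_data=False)
```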

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After the refactor, blob = Blob.from_bytes(value, self.file_io) is followed by blob.to_data() if blob is not None else None. At this point value is guaranteed non-None, so blob cannot be None either and the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

- Blob.from_bytes: when allow_blob_data=False and the input is not a BlobDescriptor, raise a clear ValueError instead of falling into deserialize() which would surface a low-level magic-header error.
- DataFileBatchReader._blob_cell_to_data: drop the dead `if blob is not None else None` ternary; at that point value is guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return None either.

JingsongLi

Review: [python] Support row-level Blob access (Revised)

This revision removes the to_blob_iterator() approach in favor of propagating file_io / blob_field_indices through the reader chain, which is a much cleaner design. Comments below:

2. `Blob.from_bytes()` — dead branch in condition

if not allow_blob_data and not is_descriptor:
    raise ValueError(...)          # ← exits here

if is_descriptor or not allow_blob_data:  # ← "not allow_blob_data" is dead code
    ...

If not allow_blob_data is True, the only way to reach the second if is when is_descriptor is True (otherwise we already raised). So the condition simplifies to:

if is_descriptor:
    ...
return BlobData(data)
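
The equivalence of the two conditions can be checked exhaustively over the four boolean combinations. The helper name `reaches_second_if` is introduced here for illustration only.

```python
from itertools import product


def reaches_second_if(allow_blob_data: bool, is_descriptor: bool) -> bool:
    """True when the first guard does not raise."""
    return not (not allow_blob_data and not is_descriptor)


# Collect every reachable input on which the original and simplified
# conditions would disagree.
mismatches = [
    (allow, desc)
    for allow, desc in product([False, True], repeat=2)
    if reaches_second_if(allow, desc) and (desc or not allow) != desc
]
```

An empty `mismatches` list means the simplified `if is_descriptor:` behaves identically on every input that gets past the guard.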

4. `OffsetRow` post-construction mutation

set_file_io() / set_blob_field_indices() mutate state after __init__, making OffsetRow harder to reason about. These values are known at construction time (from the reader that creates the iterator). Consider passing them through the constructor, or at least through InternalRowWrapperIterator.__init__ → OffsetRow.__init__, rather than setter methods.

This also interacts with the _reused_row pattern — since the same OffsetRow is reused across rows, the blob context is set once and remains valid. But if someone constructs an OffsetRow directly without calling the setters (e.g., in tests or custom code), get_blob() will raise a confusing TypeError("not a BLOB field") when the real issue is missing setup.
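
One possible shape for constructor-supplied context is sketched below. `ContextRow` and its `replace` method are hypothetical and much simpler than the real `OffsetRow`; `file_io` is left out for brevity, and `get_blob` returns the raw cell instead of a Blob. The sketch keeps the reuse pattern and reports missing setup separately from a non-BLOB field.

```python
from typing import FrozenSet, Optional, Sequence


class ContextRow:
    """Hypothetical reusable row whose blob context is fixed at construction."""

    def __init__(
        self, arity: int, blob_field_indices: Optional[FrozenSet[int]] = None
    ) -> None:
        self._arity = arity
        self._blob_field_indices = blob_field_indices  # never mutated afterwards
        self._values: Sequence[object] = ()

    def replace(self, values: Sequence[object]) -> "ContextRow":
        """Point the reused row at the next record; blob context is untouched."""
        self._values = values
        return self

    def get_blob(self, pos: int) -> object:
        if self._blob_field_indices is None:
            # Missing setup is reported as such, not as a wrong field type.
            raise RuntimeError("row was constructed without blob context")
        if pos not in self._blob_field_indices:
            raise TypeError("field %d is not a BLOB field" % pos)
        return self._values[pos]


reused = ContextRow(arity=2, blob_field_indices=frozenset({1}))
blobs = [reused.replace(rec).get_blob(1) for rec in [(1, b"a"), (2, b"b")]]
```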

5. PR description is empty

Please fill in the Purpose section — what motivated the change, and what user-facing API is being added. This helps reviewers and future readers of git history.


XiaoHongbo-Hope · 2026-05-25T07:15:04Z

Thanks, Jingsong, updated

JingsongLi

+1

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00

XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10

XiaoHongbo-Hope changed the title ~~[python] Support transparent blob resolution on read~~ [python] Add Blob.from_bytes unified API May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes unified API~~ [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add Blob.from_bytes to support interpreting blob bytes as Blob object~~ [python] Support unified blob reads May 19, 2026

XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00

XiaoHongbo-Hope changed the title ~~[python] Support unified blob reads~~ [python] Add unified Blob.from_bytes resolver May 19, 2026

XiaoHongbo-Hope changed the title ~~[python] Add unified Blob.from_bytes resolver~~ [python] Support row-level Blob access May 21, 2026

JingsongLi reviewed May 23, 2026

XiaoHongbo-Hope marked this pull request as ready for review May 24, 2026 12:55

JingsongLi reviewed May 25, 2026

XiaoHongbo-Hope force-pushed the inline_blob branch from 5b74c47 to 1bcbbac Compare May 25, 2026 04:52

JingsongLi approved these changes May 25, 2026

JingsongLi merged commit 2369501 into apache:master May 25, 2026
7 checks passed
