Skip to content

[python] Support row-level Blob access#7891

Merged
JingsongLi merged 1 commit into
apache:masterfrom
XiaoHongbo-Hope:inline_blob
May 25, 2026
Merged

[python] Support row-level Blob access#7891
JingsongLi merged 1 commit into
apache:masterfrom
XiaoHongbo-Hope:inline_blob

Conversation

@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor

@XiaoHongbo-Hope XiaoHongbo-Hope commented May 18, 2026

Purpose

Add InternalRow.get_blob(pos) to pypaimon, aligned with Java InternalRow.getBlob. Reads on BLOB columns return a Blob object (BlobData for inline storage, BlobRef for descriptor storage with lazy URI resolution).

Also adds Blob.from_bytes(data, file_io) factory that auto-dispatches based on the BLOBDESC magic header (mirrors Java Blob.fromBytes).

Tests

  • GetBlobTest / GetBlobMultiColumnTest — row-level access on inline and descriptor blob storage
  • GetBlobThroughDescriptorConvertReaderTest — pins propagation through BlobDescriptorConvertReader
  • GetBlobNonBlobColumnSecurityTest — SSRF defence: non-BLOB columns containing magic-prefixed bytes never resolve a URI
  • Blob.from_bytes factory unit tests

@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support transparent blob resolution on read [python] Add Blob.from_bytes unified API May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes unified API [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object [python] Support unified blob reads May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support unified blob reads [python] Add unified Blob.from_bytes resolver May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add unified Blob.from_bytes resolver [python] Support row-level Blob access May 21, 2026
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

  • Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
  • Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.

2. OffsetRow.get_blob() when no blob context is set

Will produce an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking when self._file_io is None.

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After refactor, blob = Blob.from_bytes(value, self.file_io) followed by blob.to_data() if blob is not None else None -- at this point value is guaranteed non-None, so the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 24, 2026 12:55
XiaoHongbo-Hope added a commit to XiaoHongbo-Hope/paimon that referenced this pull request May 24, 2026
- Blob.from_bytes: when allow_blob_data=False and the input is not a
  BlobDescriptor, raise a clear ValueError instead of falling into
  deserialize() which would surface a low-level magic-header error.
- DataFileBatchReader._blob_cell_to_data: drop the dead
  `if blob is not None else None` ternary; at that point value is
  guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return
  None either.
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [python] Support row-level Blob access (Revised)

This revision removes the to_blob_iterator() approach in favor of propagating file_io / blob_field_indices through the reader chain, which is a much cleaner design. Comments below:


2. Blob.from_bytes() — dead branch in condition

if not allow_blob_data and not is_descriptor:
    raise ValueError(...)          # ← exits here

if is_descriptor or not allow_blob_data:  # ← "not allow_blob_data" is dead code
    ...

If not allow_blob_data is True, the only way to reach the second if is when is_descriptor is True (otherwise we already raised). So the condition simplifies to:

if is_descriptor:
    ...
return BlobData(data)

4. OffsetRow post-construction mutation

set_file_io() / set_blob_field_indices() mutate state after __init__, making OffsetRow harder to reason about. These values are known at construction time (from the reader that creates the iterator). Consider passing them through the constructor, or at least through InternalRowWrapperIterator.__init__OffsetRow.__init__, rather than setter methods.

This also interacts with the _reused_row pattern — since the same OffsetRow is reused across rows, the blob context is set once and remains valid. But if someone constructs an OffsetRow directly without calling the setters (e.g., in tests or custom code), get_blob() will raise a confusing TypeError("not a BLOB field") when the real issue is missing setup.


5. PR description is empty

Please fill in the Purpose section — what motivated the change, and what user-facing API is being added. This helps reviewers and future readers of git history.

Add `InternalRow.get_blob(pos)` to pypaimon, aligned with Java `InternalRow.getBlob`. Reads on BLOB columns return a `Blob` object (`BlobData` for inline storage, `BlobRef` for descriptor storage with lazy URI resolution).

Also adds `Blob.from_bytes(data, file_io)` factory that auto-dispatches based on the BLOBDESC magic header (mirrors Java `Blob.fromBytes`).

- `GetBlobTest` / `GetBlobMultiColumnTest` — row-level access on inline and descriptor blob storage
- `GetBlobThroughDescriptorConvertReaderTest` — pins propagation through `BlobDescriptorConvertReader`
- `GetBlobNonBlobColumnSecurityTest` — SSRF defence: non-BLOB columns containing magic-prefixed bytes never resolve a URI
- `Blob.from_bytes` factory unit tests
@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor Author

Review: [python] Support row-level Blob access (Revised)

This revision removes the to_blob_iterator() approach in favor of propagating file_io / blob_field_indices through the reader chain, which is a much cleaner design. Comments below:

2. Blob.from_bytes() — dead branch in condition

if not allow_blob_data and not is_descriptor:
    raise ValueError(...)          # ← exits here

if is_descriptor or not allow_blob_data:  # ← "not allow_blob_data" is dead code
    ...

If not allow_blob_data is True, the only way to reach the second if is when is_descriptor is True (otherwise we already raised). So the condition simplifies to:

if is_descriptor:
    ...
return BlobData(data)

4. OffsetRow post-construction mutation

set_file_io() / set_blob_field_indices() mutate state after __init__, making OffsetRow harder to reason about. These values are known at construction time (from the reader that creates the iterator). Consider passing them through the constructor, or at least through InternalRowWrapperIterator.__init__OffsetRow.__init__, rather than setter methods.

This also interacts with the _reused_row pattern — since the same OffsetRow is reused across rows, the blob context is set once and remains valid. But if someone constructs an OffsetRow directly without calling the setters (e.g., in tests or custom code), get_blob() will raise a confusing TypeError("not a BLOB field") when the real issue is missing setup.

5. PR description is empty

Please fill in the Purpose section — what motivated the change, and what user-facing API is being added. This helps reviewers and future readers of git history.

Thanks, Jingsong, updated

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 2369501 into apache:master May 25, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants