[python] Support row-level Blob access#7891
Conversation
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [python] Support row-level Blob access
Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.
1. Shared mutable state in to_blob_iterator (correctness / thread-safety)
In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:
- Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
- Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.
A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
2. OffsetRow.get_blob() when no blob context is set
Will produce an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking when self._file_io is None.
3. Blob.from_bytes with allow_blob_data=False edge case
When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.
4. Minor: return type annotation
to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].
5. Minor: redundant None check in data_file_batch_reader.py
After refactor, blob = Blob.from_bytes(value, self.file_io) followed by blob.to_data() if blob is not None else None -- at this point value is guaranteed non-None, so the ternary is dead code.
Nice work on the Blob.from_bytes() unification and the lazy-access pattern.
- Blob.from_bytes: when allow_blob_data=False and the input is not a BlobDescriptor, raise a clear ValueError instead of falling into deserialize() which would surface a low-level magic-header error. - DataFileBatchReader._blob_cell_to_data: drop the dead `if blob is not None else None` ternary; at that point value is guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return None either.
There was a problem hiding this comment.
Review: [python] Support row-level Blob access (Revised)
This revision removes the to_blob_iterator() approach in favor of propagating file_io / blob_field_indices through the reader chain, which is a much cleaner design. Comments below:
2. Blob.from_bytes() — dead branch in condition
if not allow_blob_data and not is_descriptor:
raise ValueError(...) # ← exits here
if is_descriptor or not allow_blob_data: # ← "not allow_blob_data" is dead code
...If not allow_blob_data is True, the only way to reach the second if is when is_descriptor is True (otherwise we already raised). So the condition simplifies to:
if is_descriptor:
...
return BlobData(data)4. OffsetRow post-construction mutation
set_file_io() / set_blob_field_indices() mutate state after __init__, making OffsetRow harder to reason about. These values are known at construction time (from the reader that creates the iterator). Consider passing them through the constructor, or at least through InternalRowWrapperIterator.__init__ → OffsetRow.__init__, rather than setter methods.
This also interacts with the _reused_row pattern — since the same OffsetRow is reused across rows, the blob context is set once and remains valid. But if someone constructs an OffsetRow directly without calling the setters (e.g., in tests or custom code), get_blob() will raise a confusing TypeError("not a BLOB field") when the real issue is missing setup.
5. PR description is empty
Please fill in the Purpose section — what motivated the change, and what user-facing API is being added. This helps reviewers and future readers of git history.
Add `InternalRow.get_blob(pos)` to pypaimon, aligned with Java `InternalRow.getBlob`. Reads on BLOB columns return a `Blob` object (`BlobData` for inline storage, `BlobRef` for descriptor storage with lazy URI resolution). Also adds `Blob.from_bytes(data, file_io)` factory that auto-dispatches based on the BLOBDESC magic header (mirrors Java `Blob.fromBytes`). - `GetBlobTest` / `GetBlobMultiColumnTest` — row-level access on inline and descriptor blob storage - `GetBlobThroughDescriptorConvertReaderTest` — pins propagation through `BlobDescriptorConvertReader` - `GetBlobNonBlobColumnSecurityTest` — SSRF defence: non-BLOB columns containing magic-prefixed bytes never resolve a URI - `Blob.from_bytes` factory unit tests
5b74c47 to
1bcbbac
Compare
Thanks, Jingsong, updated |
Purpose
Add
InternalRow.get_blob(pos)to pypaimon, aligned with JavaInternalRow.getBlob. Reads on BLOB columns return aBlobobject (BlobDatafor inline storage,BlobReffor descriptor storage with lazy URI resolution).Also adds
Blob.from_bytes(data, file_io)factory that auto-dispatches based on the BLOBDESC magic header (mirrors JavaBlob.fromBytes).Tests
GetBlobTest/GetBlobMultiColumnTest— row-level access on inline and descriptor blob storageGetBlobThroughDescriptorConvertReaderTest— pins propagation throughBlobDescriptorConvertReaderGetBlobNonBlobColumnSecurityTest— SSRF defence: non-BLOB columns containing magic-prefixed bytes never resolve a URIBlob.from_bytesfactory unit tests