[python] Support row-level Blob access#7891
Conversation
…b-as-descriptor setting
Revert the read-path changes from 4cf7b23 that always resolved blob descriptors to data. Restore master behavior where blob-as-descriptor flag is respected. Add Blob.from_bytes(data, file_io) as a unified entry point (equivalent to Java's Blob.fromBytes) that auto-dispatches BlobData vs BlobRef based on the bytes content.
- Add Optional[bytes] type annotation for data parameter - Raise ValueError when file_io is None but bytes are BlobDescriptor - Add unit tests for from_bytes covering all branches
- from_bytes(None) returns None (not empty BlobData), matching Java - Add allow_blob_data parameter (Java's 4th arg): when False, always interpret bytes as descriptor - Integrate into read path: _blob_cell_to_data now delegates to Blob.from_bytes instead of duplicating dispatch logic
Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead, enabling lazy Blob access with streaming support.
JingsongLi
left a comment
There was a problem hiding this comment.
Review: [python] Support row-level Blob access
Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.
1. Shared mutable state in to_blob_iterator (correctness / thread-safety)
In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:
- Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
- Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.
A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.
2. OffsetRow.get_blob() when no blob context is set
Will produce an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking when self._file_io is None.
3. Blob.from_bytes with allow_blob_data=False edge case
When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.
4. Minor: return type annotation
to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].
5. Minor: redundant None check in data_file_batch_reader.py
After refactor, blob = Blob.from_bytes(value, self.file_io) followed by blob.to_data() if blob is not None else None -- at this point value is guaranteed non-None, so the ternary is dead code.
Nice work on the Blob.from_bytes() unification and the lazy-access pattern.
to_iterator() now yields rows whose BLOB columns are raw stored bytes (descriptor or inline) rather than eagerly resolved payload bytes; row.get_blob(pos) returns a Blob for descriptor cells and raises on inline cells, matching ColumnarRow.getBlob in Java. RecordBatchReader.read_batch() takes no arguments; file_io is held on the reader and threaded onto the row at iterator construction. Removed read-time descriptor-to-data conversion paths (_convert_descriptor_stored_blob_columns, _blob_cell_to_data, BlobDescriptorConvertReader wrapping). FormatBlobReader emits descriptor bytes regardless of blob_as_descriptor option. to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs on DataFileBatchReader / FormatBlobReader and the OffsetRow with_blob_context shim are kept for one release and will be removed in a follow-up. Tests that asserted on resolved blob payload bytes from to_iterator() / to_arrow() must now resolve descriptor cells via Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see the _resolve_blobs helper added in blob_test.py / blob_table_test.py).
This reverts commit 151fa9c.
Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to match Java RecordReader.readBatch(); inject file_io directly onto OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO). OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value, file_io)), dropping the _blob_field_indices defensive check Java has no counterpart for. to_blob_iterator() becomes a thin alias of to_iterator(). The read-path behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor resolution when the option is false) is preserved — this PR only aligns the row/iterator API shape, not the read-path semantics. Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a thin alias forwarding to set_file_io(file_io) for one release. Existing callers of read.to_iterator() / read.to_arrow() see no behaviour change.
Java has no separate blob iterator on TableRead; one createReader entry point yields rows whose getBlob() works uniformly. After the row API alignment, to_iterator() returns rows with working get_blob(pos) for both descriptor- and inline-stored blobs, so to_blob_iterator() was a no-op alias with no Java counterpart.
…edRow impls Mirror Java DataGetters where getBlob is part of the interface contract: every InternalRow implementation must provide it. BinaryRow delegates to Blob.from_bytes (no file_io needed for in-memory rows), ProjectedRow forwards to the inner row via the index mapping.
New page covering the Python API for blob columns: - Creating a table with a BLOB column - Writing blobs via PyArrow Table - Reading via row.get_blob(pos) — to_data() and new_input_stream() - Lower-level Blob.from_bytes factory Add a back-link from the existing append-table/blob.md (Java/SQL main doc) to the new page for discoverability.
- test_from_bytes_with_descriptor uses tempfile.TemporaryDirectory so a mid-test failure no longer leaks the temp file. - Rename test_to_iterator_unchanged to test_to_iterator_yields_all_rows (the previous name only made sense in the context of this branch). - test_get_blob_streaming now loops every row and asserts content for each, rather than breaking after the first. - New GetBlobMultiColumnTest exercises a table with two BLOB columns and verifies row.get_blob(pos) is independent per position.
- Blob.from_bytes: when allow_blob_data=False and the input is not a BlobDescriptor, raise a clear ValueError instead of falling into deserialize() which would surface a low-level magic-header error. - DataFileBatchReader._blob_cell_to_data: drop the dead `if blob is not None else None` ternary; at that point value is guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return None either.
…et_blob logic-1: BinaryRow.get_blob() was calling Blob.from_bytes() on the field value, but BinaryRow's BLOB cells are parsed by GenericRowDeserializer into Blob instances rather than raw bytes, so the implementation always raised TypeError. Mirror the GenericRow.get_blob() unwrap pattern. security-2: OffsetRow.get_blob(pos) was unconditionally forwarding the cell at pos to Blob.from_bytes(...), which auto-detects descriptor magic and resolves the embedded URI via file_io. An attacker who can write a BLOBDESC-prefixed payload into a VARBINARY column could weaponise a mis-call of row.get_blob(pos) into SSRF on the reader host. Thread blob_field_indices alongside file_io through the reader chain (DataFileBatchReader → wrapper readers → InternalRowWrapperIterator → OffsetRow). OffsetRow.get_blob now rejects positions that are not BLOB fields. OuterProjectionRecordReader remaps the inner blob_field_indices into projected positions (only top-level paths preserve BLOB-ness; nested extractions descend into ROW sub-structures). Adds GetBlobNonBlobColumnSecurityTest covering a VARBINARY column whose contents start with the BLOBDESC magic header — get_blob must refuse without touching the URI.
split_read.py extracts `_blob_field_indices(fields)` helper, replacing three identical set comprehensions. GetBlobNonBlobColumnSecurityTest drops the per-block explanatory comments; the class docstring already states the SSRF threat model.
…ail-closed OffsetRow
- BlobDescriptorConvertReader was the only wrapper RecordBatchReader that
did not propagate file_io / blob_field_indices from its inner reader,
silently disabling the SSRF guard whenever DataEvolutionSplitRead chose
this wrapper as the outermost reader. Mirror the propagation pattern
used by FilterRecordBatchReader / RowIdFilterRecordBatchReader.
- OffsetRow.get_blob's guard was fail-open: rows constructed without
set_blob_field_indices() bypassed the check entirely (e.g. the OffsetRow
pair inside KeyValue). Switch the default to frozenset() so any row that
never opts in rejects every position. Legitimate readers continue to
widen via set_blob_field_indices().
- Strengthen GetBlobNonBlobColumnSecurityTest with mock that fails loudly
if uri_reader_factory.create is invoked, so a regression that
resolves the URI before raising TypeError can no longer pass.
- Drop the dead blob_field_indices inheritance loop in
DataEvolutionMergeReader: all inner MergeAllBatchReader instances are
constructed without blob_field_indices set, so the loop never fired,
and the assumed semantics ("inherit from first inner") would be wrong
anyway because DataEvolutionMergeReader reshuffles columns via
row_offsets/field_offsets.
The attribute was computed from the constructor's `fields` parameter (table-schema order), but the OffsetRow tuple uses the projected / post-mapping schema order, so the two index spaces don't agree. Today every code path wraps DataFileBatchReader with an outer ConcatBatchReader whose blob_field_indices is computed correctly from `self.read_fields` in split_read.py and overrides the inner one — so the bug never fires — but a future caller invoking DataFileBatchReader.read_batch() directly would silently use the wrong index space, either denying legitimate BLOB access or re-opening the SSRF surface depending on column collisions. Drop the attribute; outer wrappers remain the single source of truth. If DataFileBatchReader is ever called without an outer wrapper, the class-level default `blob_field_indices = None` propagates into OffsetRow which keeps its fail-closed `frozenset()`, so get_blob raises TypeError on every position rather than mis-resolving descriptors.
Previous "Reading Blob Data" section recommended new_input_stream() for large blobs without noting that the default blob-as-descriptor=false read path materialises the payload before the Blob is constructed. In that mode new_input_stream() wraps in-memory bytes and offers no real streaming, contrary to the doc's wording. Move streaming guidance into a dedicated "Streaming for Large Blobs" section that: - documents both blob-as-descriptor modes and the BlobData / BlobRef shape difference - explicitly names the OOM risk under default mode - shows the blob-as-descriptor=true schema option to opt into lazy streaming (verified end-to-end against the read path) This mirrors Java's BlobFormatReader semantics — the default mode is also materialised in Java, and lazy streaming there equally requires the descriptor mode.
Purpose
Tests