Skip to content

[python] Support row-level Blob access#7891

Open
XiaoHongbo-Hope wants to merge 24 commits into
apache:masterfrom
XiaoHongbo-Hope:inline_blob
Open

[python] Support row-level Blob access#7891
XiaoHongbo-Hope wants to merge 24 commits into
apache:masterfrom
XiaoHongbo-Hope:inline_blob

Conversation

@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor

Purpose

Tests

Revert the read-path changes from 4cf7b23 that always resolved blob
descriptors to data. Restore master behavior where blob-as-descriptor
flag is respected. Add Blob.from_bytes(data, file_io) as a unified
entry point (equivalent to Java's Blob.fromBytes) that auto-dispatches
BlobData vs BlobRef based on the bytes content.
- Add Optional[bytes] type annotation for data parameter
- Raise ValueError when file_io is None but bytes are BlobDescriptor
- Add unit tests for from_bytes covering all branches
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 09:45
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 10:00
- from_bytes(None) returns None (not empty BlobData), matching Java
- Add allow_blob_data parameter (Java's 4th arg): when False, always
  interpret bytes as descriptor
- Integrate into read path: _blob_cell_to_data now delegates to
  Blob.from_bytes instead of duplicating dispatch logic
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 19, 2026 14:10
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support transparent blob resolution on read [python] Add Blob.from_bytes unified API May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes unified API [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add Blob.from_bytes to support interpreting blob bytes as Blob object [python] Support unified blob reads May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft May 19, 2026 16:00
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Support unified blob reads [python] Add unified Blob.from_bytes resolver May 19, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope changed the title [python] Add unified Blob.from_bytes resolver [python] Support row-level Blob access May 21, 2026
Add get_blob(pos) to InternalRow and to_blob_iterator() to TableRead,
enabling lazy Blob access with streaming support.
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [python] Support row-level Blob access

Overall this is a useful addition that enables lazy/streaming blob access at the row level. The Blob.from_bytes() factory and get_blob() API are clean and well-tested.

1. Shared mutable state in to_blob_iterator (correctness / thread-safety)

In table_read.py, the method mutates self.table.options immediately (setting BLOB_AS_DESCRIPTOR = True) but only restores the original value inside the generator's finally block. Two problems:

  • Deferred restoration: Since this is a generator, the finally block only executes when the generator is exhausted or closed. If a caller never fully consumes it, the table option remains mutated indefinitely.
  • Concurrent use: Any other read on the same table instance will see BLOB_AS_DESCRIPTOR = True unexpectedly.

A safer pattern would be to pass the option override as a parameter rather than mutating the shared table options.

2. OffsetRow.get_blob() when no blob context is set

Will produce an AttributeError on NoneType if called on a row not created via to_blob_iterator(). The base class raises a clear NotImplementedError, but the override skips that guard. Consider checking when self._file_io is None.

3. Blob.from_bytes with allow_blob_data=False edge case

When allow_blob_data=False and the input is raw bytes without the blob descriptor magic prefix, the code enters the descriptor-deserialization path which will fail with an opaque error. Raise a ValueError explicitly instead.

4. Minor: return type annotation

to_blob_iterator is annotated as -> Iterator but could be -> Iterator[InternalRow].

5. Minor: redundant None check in data_file_batch_reader.py

After refactor, blob = Blob.from_bytes(value, self.file_io) followed by blob.to_data() if blob is not None else None -- at this point value is guaranteed non-None, so the ternary is dead code.

Nice work on the Blob.from_bytes() unification and the lazy-access pattern.

to_iterator() now yields rows whose BLOB columns are raw stored bytes
(descriptor or inline) rather than eagerly resolved payload bytes;
row.get_blob(pos) returns a Blob for descriptor cells and raises on
inline cells, matching ColumnarRow.getBlob in Java.

RecordBatchReader.read_batch() takes no arguments; file_io is held on
the reader and threaded onto the row at iterator construction.

Removed read-time descriptor-to-data conversion paths
(_convert_descriptor_stored_blob_columns, _blob_cell_to_data,
BlobDescriptorConvertReader wrapping). FormatBlobReader emits descriptor
bytes regardless of blob_as_descriptor option.

to_blob_iterator() is now an alias for to_iterator(); deprecated kwargs
on DataFileBatchReader / FormatBlobReader and the OffsetRow
with_blob_context shim are kept for one release and will be removed in a
follow-up.

Tests that asserted on resolved blob payload bytes from to_iterator() /
to_arrow() must now resolve descriptor cells via
Blob.from_bytes(bytes, file_io, allow_blob_data=False).to_data() (see
the _resolve_blobs helper added in blob_test.py / blob_table_test.py).
Drop the BLOB-specific parameters on RecordBatchReader.read_batch() to
match Java RecordReader.readBatch(); inject file_io directly onto
OffsetRow at iterator construction (mirroring Java ColumnarRow.setFileIO).

OffsetRow.get_blob(pos) becomes a one-liner (Blob.from_bytes(value,
file_io)), dropping the _blob_field_indices defensive check Java has no
counterpart for.

to_blob_iterator() becomes a thin alias of to_iterator(). The read-path
behaviour (BLOB_AS_DESCRIPTOR option consumption, eager descriptor
resolution when the option is false) is preserved — this PR only aligns
the row/iterator API shape, not the read-path semantics.

Compatibility: OffsetRow.with_blob_context(file_io, ...) is kept as a
thin alias forwarding to set_file_io(file_io) for one release. Existing
callers of read.to_iterator() / read.to_arrow() see no behaviour change.
Java has no separate blob iterator on TableRead; one createReader entry
point yields rows whose getBlob() works uniformly. After the row API
alignment, to_iterator() returns rows with working get_blob(pos) for
both descriptor- and inline-stored blobs, so to_blob_iterator() was a
no-op alias with no Java counterpart.
…edRow impls

Mirror Java DataGetters where getBlob is part of the interface contract:
every InternalRow implementation must provide it. BinaryRow delegates to
Blob.from_bytes (no file_io needed for in-memory rows), ProjectedRow
forwards to the inner row via the index mapping.
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review May 24, 2026 12:55
New page covering the Python API for blob columns:
- Creating a table with a BLOB column
- Writing blobs via PyArrow Table
- Reading via row.get_blob(pos) — to_data() and new_input_stream()
- Lower-level Blob.from_bytes factory

Add a back-link from the existing append-table/blob.md (Java/SQL main
doc) to the new page for discoverability.
- test_from_bytes_with_descriptor uses tempfile.TemporaryDirectory so
  a mid-test failure no longer leaks the temp file.
- Rename test_to_iterator_unchanged to test_to_iterator_yields_all_rows
  (the previous name only made sense in the context of this branch).
- test_get_blob_streaming now loops every row and asserts content for
  each, rather than breaking after the first.
- New GetBlobMultiColumnTest exercises a table with two BLOB columns and
  verifies row.get_blob(pos) is independent per position.
- Blob.from_bytes: when allow_blob_data=False and the input is not a
  BlobDescriptor, raise a clear ValueError instead of falling into
  deserialize() which would surface a low-level magic-header error.
- DataFileBatchReader._blob_cell_to_data: drop the dead
  `if blob is not None else None` ternary; at that point value is
  guaranteed non-None and Blob.from_bytes(non-None bytes) cannot return
  None either.
…et_blob

logic-1: BinaryRow.get_blob() was calling Blob.from_bytes() on the field
value, but BinaryRow's BLOB cells are parsed by GenericRowDeserializer
into Blob instances rather than raw bytes, so the implementation always
raised TypeError. Mirror the GenericRow.get_blob() unwrap pattern.

security-2: OffsetRow.get_blob(pos) was unconditionally forwarding the
cell at pos to Blob.from_bytes(...), which auto-detects descriptor magic
and resolves the embedded URI via file_io. An attacker who can write a
BLOBDESC-prefixed payload into a VARBINARY column could weaponise a
mis-call of row.get_blob(pos) into SSRF on the reader host.

Thread blob_field_indices alongside file_io through the reader chain
(DataFileBatchReader → wrapper readers → InternalRowWrapperIterator →
OffsetRow). OffsetRow.get_blob now rejects positions that are not BLOB
fields. OuterProjectionRecordReader remaps the inner blob_field_indices
into projected positions (only top-level paths preserve BLOB-ness;
nested extractions descend into ROW sub-structures).

Adds GetBlobNonBlobColumnSecurityTest covering a VARBINARY column whose
contents start with the BLOBDESC magic header — get_blob must refuse
without touching the URI.
split_read.py extracts `_blob_field_indices(fields)` helper, replacing
three identical set comprehensions.

GetBlobNonBlobColumnSecurityTest drops the per-block explanatory
comments; the class docstring already states the SSRF threat model.
…ail-closed OffsetRow

- BlobDescriptorConvertReader was the only wrapper RecordBatchReader that
  did not propagate file_io / blob_field_indices from its inner reader,
  silently disabling the SSRF guard whenever DataEvolutionSplitRead chose
  this wrapper as the outermost reader. Mirror the propagation pattern
  used by FilterRecordBatchReader / RowIdFilterRecordBatchReader.

- OffsetRow.get_blob's guard was fail-open: rows constructed without
  set_blob_field_indices() bypassed the check entirely (e.g. the OffsetRow
  pair inside KeyValue). Switch the default to frozenset() so any row that
  never opts in rejects every position. Legitimate readers continue to
  widen via set_blob_field_indices().

- Strengthen GetBlobNonBlobColumnSecurityTest with mock that fails loudly
  if uri_reader_factory.create is invoked, so a regression that
  resolves the URI before raising TypeError can no longer pass.

- Drop the dead blob_field_indices inheritance loop in
  DataEvolutionMergeReader: all inner MergeAllBatchReader instances are
  constructed without blob_field_indices set, so the loop never fired,
  and the assumed semantics ("inherit from first inner") would be wrong
  anyway because DataEvolutionMergeReader reshuffles columns via
  row_offsets/field_offsets.
The attribute was computed from the constructor's `fields` parameter
(table-schema order), but the OffsetRow tuple uses the projected /
post-mapping schema order, so the two index spaces don't agree. Today
every code path wraps DataFileBatchReader with an outer ConcatBatchReader
whose blob_field_indices is computed correctly from `self.read_fields`
in split_read.py and overrides the inner one — so the bug never fires —
but a future caller invoking DataFileBatchReader.read_batch() directly
would silently use the wrong index space, either denying legitimate BLOB
access or re-opening the SSRF surface depending on column collisions.

Drop the attribute; outer wrappers remain the single source of truth.
If DataFileBatchReader is ever called without an outer wrapper, the
class-level default `blob_field_indices = None` propagates into
OffsetRow which keeps its fail-closed `frozenset()`, so get_blob raises
TypeError on every position rather than mis-resolving descriptors.
Previous "Reading Blob Data" section recommended new_input_stream() for
large blobs without noting that the default blob-as-descriptor=false
read path materialises the payload before the Blob is constructed. In
that mode new_input_stream() wraps in-memory bytes and offers no real
streaming, contrary to the doc's wording.

Move streaming guidance into a dedicated "Streaming for Large Blobs"
section that:
- documents both blob-as-descriptor modes and the BlobData / BlobRef
  shape difference
- explicitly names the OOM risk under default mode
- shows the blob-as-descriptor=true schema option to opt into lazy
  streaming (verified end-to-end against the read path)

This mirrors Java's BlobFormatReader semantics — the default mode is
also materialised in Java, and lazy streaming there equally requires
the descriptor mode.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants