Open
25 commits
4cf7b23
[python] Always resolve blob to actual data on read regardless of blo…
XiaoHongbo-Hope May 18, 2026
77bc51b
[python] Add Blob.from_bytes unified API and revert broken read behavior
XiaoHongbo-Hope May 19, 2026
a610d97
[python] Fix Blob.from_bytes type annotation and add tests
XiaoHongbo-Hope May 19, 2026
eb3897d
[python] Align Blob.from_bytes with Java Blob.fromBytes semantics
XiaoHongbo-Hope May 19, 2026
5a13539
[python] Fix flake8 lint errors
XiaoHongbo-Hope May 19, 2026
3a2a85d
[python] Support row-level Blob access aligned with Java getBlob
XiaoHongbo-Hope May 21, 2026
151fa9c
[python] Align BLOB read path with Java getBlob semantics
XiaoHongbo-Hope May 24, 2026
5983598
Revert "[python] Align BLOB read path with Java getBlob semantics"
XiaoHongbo-Hope May 24, 2026
a5161cf
[python] Align BLOB row API shape with Java InternalRow.getBlob
XiaoHongbo-Hope May 24, 2026
27560b6
[python] Remove to_blob_iterator() — to_iterator() suffices
XiaoHongbo-Hope May 24, 2026
a4e9ee0
[python] Make InternalRow.get_blob abstract and add BinaryRow/Project…
XiaoHongbo-Hope May 24, 2026
a848c1f
[python] Remove unused OffsetRow.with_blob_context alias
XiaoHongbo-Hope May 24, 2026
fca89c4
[docs] Add pypaimon Blob storage page
XiaoHongbo-Hope May 24, 2026
79765e3
[python] Tighten blob tests: temp file safety, naming, coverage
XiaoHongbo-Hope May 24, 2026
4fe5f96
[python] Address JingsongLi review on PR #7891
XiaoHongbo-Hope May 24, 2026
8358165
[python] Fix BinaryRow.get_blob + validate column type in OffsetRow.g…
XiaoHongbo-Hope May 24, 2026
e41934d
[python] Drop verbose remap comment in OuterProjectionRecordReader
XiaoHongbo-Hope May 24, 2026
c8498f5
[python] DRY blob_field_indices computation and trim test comments
XiaoHongbo-Hope May 24, 2026
cd5f4f7
[python] Close SSRF gaps: BlobDescriptorConvertReader propagation + f…
XiaoHongbo-Hope May 24, 2026
9d91f5c
[python] Remove DataFileBatchReader.blob_field_indices footgun
XiaoHongbo-Hope May 24, 2026
2c3bc23
[docs] Clarify that lazy blob streaming requires blob-as-descriptor=true
XiaoHongbo-Hope May 24, 2026
c66e433
[python] Add e2e test pinning BlobDescriptorConvertReader propagation
XiaoHongbo-Hope May 24, 2026
0ed1cb6
[python] Fix misleading blob-as-descriptor doc phrasing and lazy test…
XiaoHongbo-Hope May 24, 2026
24152b2
[python] Drop unused file_io plumbing on inner blob readers
XiaoHongbo-Hope May 24, 2026
5b74c47
[python] Pass blob context via OffsetRow constructor; simplify Blob.f…
XiaoHongbo-Hope May 25, 2026
2 changes: 2 additions & 0 deletions docs/content/append-table/blob.md
@@ -712,6 +712,8 @@ For these configured fields:
- writes can still start from raw BLOB input
- the field is treated as descriptor-based for operations such as `MERGE INTO`

For the Python equivalent, see [Blob Storage in pypaimon]({{< ref "pypaimon/blob" >}}).

## Limitations

1. **Append Table Only**: Blob type is designed for append-only tables. Primary key tables are not supported.
158 changes: 158 additions & 0 deletions docs/content/pypaimon/blob.md
@@ -0,0 +1,158 @@
---
title: "Blob Storage"
weight: 7
type: docs
aliases:
- /pypaimon/blob.html
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Blob Storage in pypaimon

For Paimon's Blob storage concepts (storage modes, table options, SQL usage,
Java API), see [Blob Storage]({{< ref "append-table/blob" >}}).

This page covers the Python API for reading and writing BLOB columns.

## Creating a Table

A BLOB column maps to PyArrow `large_binary()`. The table must enable
`row-tracking.enabled` and `data-evolution.enabled`.

```python
from pypaimon import CatalogFactory, Schema
import pyarrow as pa

catalog = CatalogFactory.create({'warehouse': '/tmp/paimon-warehouse'})
catalog.create_database('my_db', True)

pa_schema = pa.schema([
    ('id', pa.int32()),
    ('name', pa.string()),
    ('image', pa.large_binary()),
])
schema = Schema.from_pyarrow_schema(
    pa_schema,
    options={
        'row-tracking.enabled': 'true',
        'data-evolution.enabled': 'true',
    },
)
catalog.create_table('my_db.image_table', schema, True)
```

## Writing Blob Data

Pass raw bytes for the blob column in a PyArrow Table; pypaimon writes them
to dedicated `.blob` files automatically.

```python
table = catalog.get_table('my_db.image_table')
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()

with open('cat.jpg', 'rb') as f1, open('dog.jpg', 'rb') as f2:
    writer.write_arrow(pa.Table.from_pydict({
        'id': [1, 2],
        'name': ['cat', 'dog'],
        'image': [f1.read(), f2.read()],
    }, schema=pa_schema))

write_builder.new_commit().commit(writer.prepare_commit())
writer.close()
```

## Reading Blob Data

Use `row.get_blob(pos)` to access blob columns. It returns a `Blob` object
regardless of how the blob is stored.

```python
read_builder = table.new_read_builder()
splits = read_builder.new_scan().plan().splits()
read = read_builder.new_read()

for row in read.to_iterator(splits):
    blob = row.get_blob(2)  # position 2 is the 'image' column
    if blob is None:
        continue
    data = blob.to_data()
```

## Streaming for Large Blobs

`blob.new_input_stream()` returns a file-like object. Whether it is
genuinely lazy depends on how the table is configured:

- Default mode (`blob-as-descriptor=false`): the read path materialises
  the payload before it reaches `row.get_blob(pos)`. The returned `Blob`
  is a `BlobData`, and `new_input_stream()` only wraps the in-memory
  bytes, so this is not true streaming. Large blobs can still exhaust
  memory.
- Descriptor mode (`blob-as-descriptor=true`): the read path preserves
  the descriptor. The returned `Blob` is a `BlobRef`, and
  `new_input_stream()` opens the underlying file on demand.

This mirrors Java's `BlobFormatReader` semantics.

For genuine on-demand streaming of large blobs (videos, model weights),
configure `blob-as-descriptor=true` before reading:

```python
schema = Schema.from_pyarrow_schema(
    pa_schema,
    options={
        'row-tracking.enabled': 'true',
        'data-evolution.enabled': 'true',
        'blob-as-descriptor': 'true',
    },
)
# Reads of this table return BlobRef whose new_input_stream() is lazy.
for row in read.to_iterator(splits):
    blob = row.get_blob(2)
    if blob is None:  # a null blob has no stream to open
        continue
    with blob.new_input_stream() as stream:
        chunk = stream.read(1024)
```
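
With a lazy stream, a blob can be copied out in fixed-size chunks so that only one chunk is held in memory at a time. The sketch below assumes only that the object returned by `new_input_stream()` supports `read(n)` and returns empty bytes at end of stream, as ordinary Python file-like objects do. `copy_stream` and its parameters are hypothetical names, not part of pypaimon, and an `io.BytesIO` stands in for the blob stream so the snippet runs on its own.

```python
import io


def copy_stream(stream, sink, chunk_size=1024 * 1024):
    """Copy a file-like stream into sink chunk by chunk; return bytes copied."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes signals end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for blob.new_input_stream(); any readable binary stream works.
source = io.BytesIO(b'x' * 2500)
sink = io.BytesIO()
copied = copy_stream(source, sink, chunk_size=1024)
```

In default mode the same loop still runs, but the payload is already in memory, so chunking gives no memory benefit there.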

## Lower-level: `Blob.from_bytes`

When you already have raw or descriptor bytes (for example from a custom
source) and want to wrap them as a `Blob`, use the factory:

```python
from pypaimon.table.row.blob import Blob

# Inline bytes → BlobData (no file_io required)
blob = Blob.from_bytes(b'hello')

# Descriptor bytes → BlobRef (lazy; requires file_io to resolve the URI)
file_io = table.file_io
blob = Blob.from_bytes(descriptor_bytes, file_io)

data = blob.to_data()
```

The factory auto-dispatches based on the bytes content (BLOBDESC magic
header). This mirrors Java's `Blob.fromBytes(...)`.
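
The dispatch idea can be illustrated without pypaimon. The following is a simplified sketch, not the library's implementation: the `MAGIC` value, the byte layout after it, the class names, and the `resolver` callback are all assumptions made for illustration, and the real descriptor format may differ.

```python
MAGIC = b'BLOBDESC'  # assumed marker; the real header layout may differ


class InlineBlob:
    """Holds the payload itself (plays the role of BlobData)."""

    def __init__(self, data):
        self._data = data

    def to_data(self):
        return self._data


class LazyBlob:
    """Holds only a reference; resolves it on demand (plays the role of BlobRef)."""

    def __init__(self, reference, resolver):
        self._reference = reference
        self._resolver = resolver

    def to_data(self):
        return self._resolver(self._reference)


def blob_from_bytes(raw, resolver=None):
    """Return a lazy blob for descriptor bytes, an inline blob otherwise."""
    if raw.startswith(MAGIC):
        if resolver is None:
            # Assumed behaviour for this sketch: a descriptor needs a resolver.
            raise ValueError("descriptor bytes need a resolver")
        return LazyBlob(raw[len(MAGIC):], resolver)
    return InlineBlob(raw)


# A dict lookup stands in for resolving a URI through file_io.
store = {b'file-1': b'payload'}
inline = blob_from_bytes(b'hello')
lazy = blob_from_bytes(MAGIC + b'file-1', store.__getitem__)
```

Callers only see `to_data()`, so they do not need to know which storage form they received.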

## See Also

- [Blob Storage]({{< ref "append-table/blob" >}}) — concept, storage modes,
SQL/Java API
- [Data Evolution]({{< ref "pypaimon/data-evolution" >}}) — required for
blob tables
@@ -28,6 +28,8 @@ def __init__(self, inner: RecordBatchReader, table):
         self._inner = inner
         self._table = table
         self._descriptor_fields = CoreOptions.blob_descriptor_fields(table.options)
+        self.file_io = inner.file_io
+        self.blob_field_indices = inner.blob_field_indices

     def read_arrow_batch(self) -> Optional[RecordBatch]:
         import pyarrow
4 changes: 3 additions & 1 deletion paimon-python/pypaimon/read/reader/concat_batch_reader.py
@@ -29,9 +29,11 @@

 class ConcatBatchReader(RecordBatchReader):

-    def __init__(self, reader_suppliers: List[Callable]):
+    def __init__(self, reader_suppliers: List[Callable], file_io=None, blob_field_indices=None):
         self.queue: collections.deque[Callable] = collections.deque(reader_suppliers)
         self.current_reader: Optional[RecordBatchReader] = None
+        self.file_io = file_io
+        self.blob_field_indices = blob_field_indices

     def read_arrow_batch(self) -> Optional[RecordBatch]:
         while True:
23 changes: 2 additions & 21 deletions paimon-python/pypaimon/read/reader/data_file_batch_reader.py
@@ -25,7 +25,7 @@
 from pypaimon.read.reader.format_blob_reader import FormatBlobReader
 from pypaimon.read.reader.iface.record_batch_reader import RecordBatchReader
 from pypaimon.schema.data_types import DataField, PyarrowFieldParser
-from pypaimon.table.row.blob import Blob, BlobDescriptor
+from pypaimon.table.row.blob import Blob
 from pypaimon.table.special_fields import SpecialFields


@@ -178,28 +178,9 @@ def _blob_cell_to_data(self, value):
         value = self._normalize_blob_cell(value)
         if value is None:
             return None

-        if not isinstance(value, bytes):
-            return value

-        descriptor = self._deserialize_descriptor_or_none(value)
-        if descriptor is None:
-            return value

-        try:
-            uri_reader = self.file_io.uri_reader_factory.create(descriptor.uri)
-            blob = Blob.from_descriptor(uri_reader, descriptor)
-            return blob.to_data()
-        except Exception as e:
-            raise RuntimeError(
-                "Failed to read blob bytes from descriptor URI while converting blob value."
-            ) from e

-    @staticmethod
-    def _deserialize_descriptor_or_none(raw: bytes):
-        if not BlobDescriptor.is_blob_descriptor(raw):
-            return None
-        return BlobDescriptor.deserialize(raw)
+        return Blob.from_bytes(value, self.file_io).to_data()

     def _assign_row_tracking(self, record_batch: RecordBatch) -> RecordBatch:
         """Assign row tracking meta fields (_ROW_ID and _SEQUENCE_NUMBER)."""
@@ -46,6 +46,8 @@ def __init__(
         self.predicate = predicate
         self.field_names = field_names
         self.schema_fields = schema_fields
+        self.file_io = reader.file_io
+        self.blob_field_indices = reader.blob_field_indices

     def read_arrow_batch(self) -> Optional[pa.RecordBatch]:
         while True:
13 changes: 10 additions & 3 deletions paimon-python/pypaimon/read/reader/iface/record_batch_reader.py
@@ -34,6 +34,9 @@ class RecordBatchReader(RecordReader):
     The reader that reads the pyarrow batches of records.
     """

+    file_io = None
+    blob_field_indices = None

     @abstractmethod
     def read_arrow_batch(self) -> Optional[RecordBatch]:
         """
@@ -61,13 +64,17 @@ def read_batch(self) -> Optional[RecordIterator[InternalRow]]:
         df = self.read_next_df()
         if df is None:
             return None
-        return InternalRowWrapperIterator(df.iter_rows(), df.width)
+        return InternalRowWrapperIterator(
+            df.iter_rows(), df.width, self.file_io, self.blob_field_indices)


 class InternalRowWrapperIterator(RecordIterator[InternalRow]):
-    def __init__(self, iterator: Iterator[tuple], width: int):
+    def __init__(self, iterator: Iterator[tuple], width: int,
+                 file_io=None, blob_field_indices=None):
         self._iterator = iterator
-        self._reused_row = OffsetRow(None, 0, width)
+        self._reused_row = OffsetRow(None, 0, width,
+                                     file_io=file_io,
+                                     blob_field_indices=blob_field_indices)

     def next(self) -> Optional[InternalRow]:
         row_tuple = next(self._iterator, None)
@@ -41,6 +41,8 @@ def __init__(
         inner: RecordReader[InternalRow],
         inner_top_names: List[str],
         name_paths: List[List[str]],
+        file_io=None,
+        blob_field_indices=None,
     ):
         if not name_paths:
             raise ValueError("name_paths must be non-empty")
@@ -58,12 +60,22 @@ def __init__(
             self._specs.append(_PathSpec(name_to_top_idx[top_name], list(path[1:])))
         self._inner = inner
         self._flat_arity = len(name_paths)
+        self._file_io = file_io
+        self._blob_field_indices = None
+        if blob_field_indices is not None:
+            self._blob_field_indices = {
+                proj_pos
+                for proj_pos, spec in enumerate(self._specs)
+                if not spec.sub_names and spec.top_idx in blob_field_indices
+            }

     def read_batch(self) -> Optional[RecordIterator[InternalRow]]:
         inner_batch = self._inner.read_batch()
         if inner_batch is None:
             return None
-        return _OuterProjectionIterator(inner_batch, self._specs, self._flat_arity)
+        return _OuterProjectionIterator(
+            inner_batch, self._specs, self._flat_arity, self._file_io,
+            self._blob_field_indices)

     def close(self) -> None:
         self._inner.close()
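
The index remapping in this hunk translates blob positions from the inner row layout to the projected layout. A standalone sketch of that set comprehension follows; `PathSpec` is a hypothetical stand-in for `_PathSpec` that assumes only the `top_idx` and `sub_names` attributes used above, and `remap_blob_indices` is an illustrative name.

```python
from collections import namedtuple

# Hypothetical stand-in for _PathSpec: top-level index plus nested sub-path.
PathSpec = namedtuple('PathSpec', ['top_idx', 'sub_names'])


def remap_blob_indices(specs, blob_field_indices):
    """Map inner blob positions to positions in the projected row."""
    if blob_field_indices is None:
        return None
    return {
        proj_pos
        for proj_pos, spec in enumerate(specs)
        # Only a whole top-level column can be a blob; nested paths are skipped.
        if not spec.sub_names and spec.top_idx in blob_field_indices
    }


# Projection reorders columns: inner column 2 becomes projected position 0.
specs = [PathSpec(2, []), PathSpec(0, []), PathSpec(1, ['city'])]
remapped = remap_blob_indices(specs, {1, 2})
```

Here inner column 1 is in the blob set but is projected through a nested path, so only projected position 0 (inner column 2) is reported as a blob.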
@@ -77,11 +89,15 @@ def __init__(
         inner: RecordIterator[InternalRow],
         specs: List["_PathSpec"],
         flat_arity: int,
+        file_io=None,
+        blob_field_indices=None,
     ):
         self._inner = inner
         self._specs = specs
         self._flat_arity = flat_arity
-        self._reused_row = OffsetRow(None, 0, flat_arity)
+        self._reused_row = OffsetRow(None, 0, flat_arity,
+                                     file_io=file_io,
+                                     blob_field_indices=blob_field_indices)

     def next(self) -> Optional[InternalRow]:
         inner_row = self._inner.next()
@@ -32,6 +32,8 @@ def __init__(self, reader: RecordBatchReader, first_row_id: int, row_id_ranges:
         self.reader = reader
         self.current_row_id = first_row_id
         self.row_id_ranges = row_id_ranges
+        self.file_io = reader.file_io
+        self.blob_field_indices = reader.blob_field_indices

     def read_arrow_batch(self) -> Optional[RecordBatch]:
         while True:
20 changes: 16 additions & 4 deletions paimon-python/pypaimon/read/split_read.py
@@ -68,6 +68,11 @@
 _COMPRESS_EXTENSIONS = frozenset(['gz', 'bz2', 'deflate', 'snappy', 'lz4', 'zst'])


+def _blob_field_indices(fields: List[DataField]) -> set:
+    return {i for i, f in enumerate(fields)
+            if hasattr(f.type, 'type') and f.type.type == 'BLOB'}


 def format_identifier(file_name):
     idx = file_name.rfind('.')
     assert idx != -1, "%s is not a legal file name." % file_name
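
To see what the new helper computes, the same comprehension can be run against stand-in objects. This is a sketch under assumptions: `SimpleNamespace` instances play the role of `DataField` and its type object, and the field names and non-BLOB type strings are made up for illustration. The `hasattr` guard is exercised by a field whose type has no `type` attribute.

```python
from types import SimpleNamespace


def blob_field_indices(fields):
    """Return positions of fields whose type reports itself as 'BLOB'."""
    return {i for i, f in enumerate(fields)
            if hasattr(f.type, 'type') and f.type.type == 'BLOB'}


# Stand-ins for DataField; only .type (and its optional .type) is consulted.
fields = [
    SimpleNamespace(name='id', type=SimpleNamespace(type='INT')),
    SimpleNamespace(name='tags', type=SimpleNamespace(element='STRING')),
    SimpleNamespace(name='image', type=SimpleNamespace(type='BLOB')),
]
indices = blob_field_indices(fields)
```

The result is a set of positions in the order of the list passed in, which is why callers that project or slice the fields pass the already-sliced list.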
@@ -556,7 +561,9 @@ def create_reader(self) -> RecordReader:
         if not data_readers:
             return EmptyFileRecordReader()

-        concat_reader = ConcatBatchReader(data_readers)
+        concat_reader = ConcatBatchReader(
+            data_readers, file_io=self.table.file_io,
+            blob_field_indices=_blob_field_indices(self.read_fields))
         # if the table is appendonly table, we don't need extra filter, all predicates has pushed down
         if self.table.is_primary_key_table and self.predicate_for_reader:
             return FilterRecordReader(concat_reader, self.predicate_for_reader)
@@ -630,9 +637,12 @@ def create_reader(self) -> RecordReader:
         if self.outer_extract_name_paths:
             from pypaimon.read.reader.outer_projection_record_reader import \
                 OuterProjectionRecordReader
-            inner_top_names = [f.name for f in self.read_fields[-self.value_arity:]]
+            inner_value_fields = self.read_fields[-self.value_arity:]
             reader = OuterProjectionRecordReader(
-                reader, inner_top_names, self.outer_extract_name_paths)
+                reader, [f.name for f in inner_value_fields],
+                self.outer_extract_name_paths,
+                file_io=self.table.file_io,
+                blob_field_indices=_blob_field_indices(inner_value_fields))
         if self.limit is not None:
             from pypaimon.read.reader.limited_record_reader import \
                 LimitedRecordReader
@@ -686,7 +696,9 @@ def create_reader(self) -> RecordReader:
                     lambda files=need_merge_files: self._create_union_reader(files)
                 )

-        merge_reader = ConcatBatchReader(suppliers)
+        merge_reader = ConcatBatchReader(
+            suppliers, file_io=self.table.file_io,
+            blob_field_indices=_blob_field_indices(self.read_fields))
         if self.predicate_for_reader is not None:
             reader = FilterRecordBatchReader(
                 merge_reader,
9 changes: 9 additions & 0 deletions paimon-python/pypaimon/table/row/binary_row.py
@@ -51,6 +51,15 @@ def get_field(self, index: int) -> Any:
self.arity),
index, self.fields[index].type)

+    def get_blob(self, pos: int):
+        from pypaimon.table.row.blob import Blob
+        value = self.get_field(pos)
+        if value is None:
+            return None
+        if isinstance(value, Blob):
+            return value
+        raise TypeError(f"Cannot get Blob from {type(value)} at position {pos}")

     def get_row_kind(self) -> RowKind:
         return self.row_kind
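
The contract of `get_blob` added here (null passes through, a `Blob` is returned as is, anything else is rejected) can be exercised in isolation. The sketch below uses a hypothetical `TupleRow` and a minimal `Blob` class as stand-ins; neither is the pypaimon class, and only the three-way branch is taken from the hunk.

```python
class Blob:
    """Minimal stand-in for pypaimon's Blob."""

    def __init__(self, data):
        self._data = data

    def to_data(self):
        return self._data


class TupleRow:
    """Hypothetical row backed by a plain sequence of values."""

    def __init__(self, values):
        self._values = values

    def get_field(self, pos):
        return self._values[pos]

    def get_blob(self, pos):
        value = self.get_field(pos)
        if value is None:
            return None  # null blob column
        if isinstance(value, Blob):
            return value
        raise TypeError(f"Cannot get Blob from {type(value)} at position {pos}")


row = TupleRow([1, None, Blob(b'abc')])
payload = row.get_blob(2).to_data()
missing = row.get_blob(1)
try:
    row.get_blob(0)  # an int column is not a blob
    rejected = False
except TypeError:
    rejected = True
```

Raising on a non-blob value surfaces a wrong column position immediately instead of handing back an unexpected type.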
