Skip to content

[core] introduce Placeholder for Blob File Format#7889

Merged
JingsongLi merged 4 commits into
apache:masterfrom
steFaiz:placeholder_blob
May 26, 2026
Merged

[core] introduce Placeholder for Blob File Format#7889
JingsongLi merged 4 commits into
apache:masterfrom
steFaiz:placeholder_blob

Conversation

@steFaiz
Copy link
Copy Markdown
Contributor

@steFaiz steFaiz commented May 18, 2026

Purpose

This is the first part of #7881
Including:

  1. Bump Blob File Format to V2, introducing a PlaceHolder Blob.
  2. Introduce a fallbackReader for blob to skip placeholders. This is a two-level abstraction:
    a. At first, all data files will be divided according to max_seq_num
    b. within each group, create a sequential reader to logically concat files and fill missing gaps. For example: If the full row range of normal files is [0, 100], but some group only have one file with range [20, 80], the output is: [0, 19] -> filled with placeholders; [20, 80] -> records from files; [81, 100] -> filled with placeholders.
    c. create readers for each group, and read the blob from the max group whose value is NOT a placeholder.

The mechanism can be illustrated as below:
image

Tests

ITCase and Unit tests

@steFaiz steFaiz marked this pull request as draft May 18, 2026 11:17
@steFaiz steFaiz changed the title [core] introduce Placeholder for Blob File Format [wip][core] introduce Placeholder for Blob File Format May 18, 2026
@steFaiz steFaiz marked this pull request as ready for review May 19, 2026 06:19
@steFaiz steFaiz changed the title [wip][core] introduce Placeholder for Blob File Format [core] introduce Placeholder for Blob File Format May 19, 2026
* The placeholder blob, mainly for blob update in data-evolution. It should never be exposed to
* users.
*/
Blob PLACE_HOLDER =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange, maybe just use NULL as place holder?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your advise! But in #7125 we supports storing nulls in blob file. I'm not clear how to distinguish placeholders and native NULLs if so.

From the semantics, NULLs are exposed to users, users know that they store some nulls. But placeholders are fully internal used, users should never be aware about them. If users set some rows as nulls, we may fallback those rows to earlier versions, this is not expected in our design.

Could you please give me some advise?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can consider using row number in blob to determine how to merge? You can just return valid blobs with row number.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row number is actually the primary key.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you not only need this class for reading, but also for writing. If you skip these elements, the changes will be significant.

I thin you can just introduce a BlobPlaceHolder implements Blob, Serializable for this, use instance of is better.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll modify my code!

@steFaiz steFaiz force-pushed the placeholder_blob branch 2 times, most recently from 355a05b to b2cc640 Compare May 22, 2026 15:34
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Comments for PR #7889

Overall this is a well-designed change with thorough tests. The sequence-group fallback mechanism for blob placeholders is a solid approach. A few observations:

Issues

1. Typo in comment (DataEvolutionSplitRead.java)

} else if (bunch instanceof BlobFileBunch) {
    // for blob funch, fallback on placeholders

→ "funch" should be "bunch"

2. Potential resource leak in BlobFallbackRecordReader.readBatch()

for (int i = 0; i < groupReaders.size(); i++) {
    RecordIterator<InternalRow> iterator = groupReaders.get(i).readBatch();
    if (iterator == null) {
        return null;  // ← iterators[0..i-1] are not released
    }
    iterators[i] = iterator;
}

If the k-th reader returns null, the already-obtained iterators [0, k-1] will never have releaseBatch() called. Should release them before returning null.

3. Memory pressure from ForceSingleBatchReader wrapping all group readers

Each sequence group is fully materialized into memory via ForceSingleBatchReader. When there are many sequence groups with large row ranges, this could cause significant memory pressure, especially when blob-as-descriptor is disabled and actual blob data is loaded. The TODO comment acknowledges this - just want to confirm this is acceptable for the first iteration.

4. Singleton placeholder row reuse in BlobSequenceGroupRecordReader

private InternalRow placeHolderRow() {
    if (placeholderRow == null) {
        GenericRow row = new GenericRow(readRowType.getFieldCount());
        row.setField(blobIndex, BlobPlaceholder.INSTANCE);
        placeholderRow = row;
    }
    return placeholderRow;
}

This returns the same mutable GenericRow instance for all placeholder positions. It works here because ForceSingleBatchReader copies the data, but it's fragile - a future caller that holds references to returned rows would see aliased data. Consider adding a comment noting this intentional reuse.

5. BlobFileBunch doesn't validate schemaId across files (by design?)

VectorFileBunch.add() enforces file.schemaId() == files.get(0).schemaId(), but BlobFileBunch.add() only checks writeCols. This seems intentional since blob files from different sequences naturally have different schemas, but worth confirming.

6. DataEvolutionFileReader contract relaxation

- checkArgument(readers != null && readers.length > 1, "Readers should be more than 1");
+ checkArgument(readers != null && readers.length >= 1, "should not pass empty readers.");

This relaxes the precondition for all callers of DataEvolutionFileReader, not just the blob path. Is there a case where a single reader is passed from non-blob paths? If not, consider keeping > 1 for non-blob scenarios or documenting when single-reader is expected.

Minor / Style

  • In BlobFallbackRecordReaderTest, the ReadResult.add() method silently treats placeholder rows differently (counts them vs collecting rowIds). This is fine for testing but the test would be more explicit if assertions on placeholder count were always paired with total row count assertions.

  • The fixedBlobBytes helper in BlobUpdateTest allocates 2 * 1024 * 124 bytes (≈248KB). Was 2 * 1024 * 1024 (2MB) intended? Or is this intentionally smaller to keep tests fast?

Positive

  • The separation of SpecialFieldBunch into BlobFileBunch and VectorFileBunch is a good refactoring - they have fundamentally different semantics now.
  • The backward-compatible version 1 reading test (testReadLegacyVersionOneBlobFile) is a nice addition.
  • The BlobSequenceGroupRecordReader javadoc with ASCII art is very helpful for understanding the complex layout.
  • Python-side changes properly mirror the Java changes with appropriate error handling for unsupported placeholder in read_arrow_batch.

@steFaiz
Copy link
Copy Markdown
Contributor Author

steFaiz commented May 25, 2026

@JingsongLi Thanks! 1 & 2 are addressed. Left some explaination for the rest:

  1. Memory pressure from ForceSingleBatchReader wrapping all group readers

Yes, this is a problem, and I've mark this as TODO in my original implementation. I think it can be solved by:

  1. introducing a discard method to discard iterator's next record. In blob, we can just move to the next blob.
  2. reading all blobs as descriptor. But this may break the optimization introduced in [format] read blobs as bytes when blob-as-descriptor is false #6989

I'm pleasant to continuously optimize this, but I think can be introduced in future PRs.

  1. Singleton placeholder row reuse in BlobSequenceGroupRecordReader

This is safe, because Placeholder will never be read, all method callings will throw an error

  1. BlobFileBunch doesn't validate schemaId across files (by design?)

This is intentionally fixed in #7618

  1. DataEvolutionFileReader contract relaxation

This is for only-blob projection situations. If only one blob field is selected, the datafiles num in each blob bunch is still
'> 1' because of update. So the code path will still go through the createUnionReader.
Refactoring the whole code is too noisy in this PR. And this relaxation do not affect correctness of efficiency. It could be optimized in separated PRs.

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I went through the latest patch again, focusing on the placeholder encoding/decoding path and the data-evolution fallback reader.

The previous major concerns look addressed or acceptable for this iteration:

  • BlobFormatWriter now encodes null and placeholder distinctly, and BlobFormatReader maps placeholders back to the internal sentinel without exposing blob data accessors.
  • BlobFallbackRecordReader now keeps blob sequence groups aligned and releases the normal returned iterators through releaseBatch().
  • The fallback behavior handles row-range pushdown and overlapping sequence groups in the added tests.
  • The DataEvolutionFileReader one-reader relaxation makes sense for blob-only projections.

I also ran a local compile check on the patched core modules:

mvn -pl paimon-core -am -DskipTests -Dspotless.check.skip=true -Drat.skip=true -Dcheckstyle.skip=true -Dscala-2.12 -DskipITs -Dfast compile

It passes locally.

I only have two non-blocking follow-ups:

  1. The known memory/IO overhead in BlobFallbackRecordReader is still worth optimizing later, because stale rows are fully advanced even when a newer non-placeholder row has already been found.
  2. There are still a few small wording/typo issues in comments/messages (latestFistRowId, "offset the that row", "could not be null"), but they do not affect correctness.

LGTM from my side.

public class BlobFormatWriter implements FileAwareFormatWriter {

public static final byte VERSION = 1;
public static final byte VERSION = 2;
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to keep current version. This change seems tolerable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An old version BlobReader will throws an unexpected error when reading file with placeholders.
The BlobFileMeta check nulls by -1:

public boolean isNull(int i) {
    return blobLengths[i] == -1;
}

So in BlobFormatReader it goes to the normal read code path:

if (fileMeta.isNull(currentPosition)) {
    blob = null;
} else {
    // offsets and length are invalid.
    long offset = fileMeta.blobOffset(currentPosition) + 4;
    long length = fileMeta.blobLength(currentPosition) - 16;
    if (in != null) {
        blob = Blob.fromData(readInlineBlob(in, offset, length));
    } else {
        blob = Blob.fromFile(fileIO, filePathString, offset, length);
    }
}

The error msg is:

Invalid blob at index 2: offset=3, length=-18
java.io.IOException: Invalid blob at index 2: offset=3, length=-18

I think we should tell users that they are reading V2 data files rather than just this vague error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it can throw exceptions without producing incorrect data, it's okay

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrading all engines to read Blob is a huge undertaking, do not upgrade versions easily

// for blob bunch, fallback on placeholders
int blobIndex = findBlobFieldIndex(readRowType);
checkArgument(blobIndex >= 0, "Blob bunch read type should contain a blob field.");
return new BlobFallbackRecordReader(
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a fast bunch.files().size() == 1 channel here? When there is no need to merge, it should be a simple code path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've add a shortcut that if all blob files belong to a same max_seq group, just use the concat reader to sequenatially read them

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @steFaiz for update. Left two comments.

@steFaiz steFaiz force-pushed the placeholder_blob branch from ae4578d to 573f2d1 Compare May 26, 2026 12:18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@JingsongLi
Copy link
Copy Markdown
Contributor

+1

@JingsongLi JingsongLi merged commit 5ce4e01 into apache:master May 26, 2026
15 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants