[Core] Support to order data by columns in append only writer by FangYongs · Pull Request #7886 · apache/paimon

FangYongs · 2026-05-18T10:44:41Z

Purpose

Order data by specific columns in single file which is written by append only writer

Tests

AppendOnlyWriterTest#testSortedBufferedSinkWriter

Close #7885

FangYongs · 2026-05-18T10:45:19Z

@Aitozi @shidayang Have a look when you're free

Aitozi · 2026-05-19T01:24:49Z

+                coreOptions.clusteringIncrementalEnabled()
+                        && coreOptions.clusteringIncrementalOptimizeWrite()
+                        && coreOptions.clusteringIncrementalMode()
+                                == CoreOptions.ClusteringIncrementalMode.LOCAL_SORT


The LOCAL_SORT is described as Task-Level sorting, but what we have actually implemented is File-Level sorting.

Do we need to introduce a mode similar to "file_local" to represent this specific granularity of File-Level sorting functionality?

/** * Sort rows only within each compaction task (no global shuffle). Every output file is * internally ordered by the clustering columns, which is sufficient for per-file Parquet * lookup optimizations. */ LOCAL_SORT( "local-sort", "Sort rows only within each compaction task without global shuffle. Every output file is internally ordered.");

@JingsongLi What do you think of LOCAL_SORT? In our previous discussion, this was meant for local sorting in a single file. However, judging from the current situation, it is used for data sorting at the task level.

I feel it's okay. Local sorting can just do its best to sort, whether at the task level or file level. We can add comments to explain.

shidayang · 2026-05-21T12:44:46Z

-                maxDiskSize,
-                spillCompression);
+    private SinkWriter<InternalRow> createSinkWriter(boolean useWriteBuffer, boolean spillable) {
+        if (useWriteBuffer) {


When sortEnabled is enabled, should we throw an error if useWriteBuffer is false ? Otherwise, it is hard for users to notice this behavior.

JingsongLi

Review

The overall idea — enabling local sort within each data file for incremental clustering in the append-only writer — is reasonable. However there are several design and correctness concerns.

Design Issues

1. SortedBufferedSinkWriter flushes ALL buffered rows into a single RollingFileWriter call.

public List<DataFileMeta> flush() throws IOException {
    RollingFileWriter<T, DataFileMeta> writer = writerSupplier.get();
    MutableObjectIterator<BinaryRow> sorted = sortBuffer.sortedIterator();
    while ((reuse = sorted.next(row)) != null) {
        writeToWriter(writer, reuse);
    }
    writer.close();
    flushedFiles.addAll(writer.result());
    sortBuffer.clear();
}

The RollingFileWriter will roll files based on target file size. But the sorted data is written sequentially — it will be split at arbitrary points based on byte size, NOT on sort-key boundaries. This means after rolling, the value ranges in adjacent files may overlap. The sort is only meaningful within each individual output file, which is the stated goal. So this is correct but worth documenting.

2. Passing the entire CoreOptions to AppendOnlyWriter and SortedBufferedSinkWriter is heavy-handed.

The writer now stores CoreOptions coreOptions as a field. It only uses it to extract clustering columns, mode, and a few sort config values. Consider passing just the needed parameters (clustering columns, sort config) instead of the entire options bag. This avoids coupling the writer to all of CoreOptions and makes testing easier.

3. The sortEnabled condition requires all four checks simultaneously.

this.sortEnabled =
    coreOptions.clusteringIncrementalEnabled()
        && coreOptions.clusteringIncrementalOptimizeWrite()
        && coreOptions.clusteringIncrementalMode() == LOCAL_SORT
        && !clusteringColumns.isEmpty();

This is very specific — users must set exactly the right combination. The option name clustering.incremental.optimize-write is a separate boolean on top of clustering.incremental=true + clustering.incremental.mode=local-sort. That's 3 options to enable a single behavior. Would a simpler option like write.sort-by-columns (independent of clustering) be more user-friendly?

4. instanceof chain in writeBundle is fragile.

if (sinkWriter instanceof BufferedSinkWriter) {
    ((BufferedSinkWriter<InternalRow>) sinkWriter).writeBundle(bundle);
} else if (sinkWriter instanceof SortedBufferedSinkWriter) {
    ((SortedBufferedSinkWriter<InternalRow>) sinkWriter).writeBundle(bundle);
} else {
    ((DirectSinkWriter<?>) sinkWriter).writeBundle(bundle);
}

Since SortedBufferedSinkWriter extends BaseBufferedSinkWriter and BufferedSinkWriter extends BaseBufferedSinkWriter, and writeBundle is defined on BaseBufferedSinkWriter, you could just check instanceof BaseBufferedSinkWriter. Or better: make writeBundle part of the SinkWriter<T> interface with a default implementation, eliminating the instanceof entirely.

Correctness Issues

5. SortedBufferedSinkWriter.write() doesn't return a meaningful value.

public boolean write(T data) throws IOException {
    return sortBuffer.write(toRow.apply(data));
}

SortBuffer.write() returns false when the buffer is full and cannot accept more data. When this returns false, the caller (AppendOnlyWriter.write()) should trigger a flush. But looking at AppendOnlyWriter.write():

public void write(InternalRow rowData) throws Exception {
    boolean success = sinkWriter.write(rowData);
    if (!success) {
        flush(false, false);
        success = sinkWriter.write(rowData);
        if (!success) {
            flush(false, true);
        }
    }
}

This flush-and-retry logic works correctly for BufferedSinkWriter (which clears the write buffer). For SortedBufferedSinkWriter, flush() calls sortBuffer.clear() — so after flush, the retry write should succeed. Good.

6. setMemoryPool resolves clustering column indices by name — no validation.

int[] keyFields = clusteringColumns.stream()
    .mapToInt(rowType.getFieldNames()::indexOf).toArray();

If a clustering column name doesn't exist in rowType, indexOf returns -1. This will cause ArrayIndexOutOfBoundsException later when the comparator tries to access field -1. Should validate and throw a clear error if a clustering column is missing from the schema.

7. The sortBuffer is initialized in setMemoryPool, but write() may be called before setMemoryPool.

If write() is called before setMemoryPool(), sortBuffer is null → NPE. The BufferedSinkWriter has the same pattern (write buffer created in setMemoryPool), so this is consistent but risky. At minimum, add a null check with a clear error.

Minor

BaseBufferedSinkWriter is a good refactoring to reduce duplication
Removing two comment lines (// cleanup code that might throw another exception, // reuse writeBuffer) is fine — they were noise
The test covers the basic happy path but doesn't test: buffer spill to disk, write more data than memory allows, clustering column not in schema (negative test)

[apache#7885] Support to order data by keys in append only writer

debadb4

FangYongs changed the title ~~[#7885] Support to order data by columns in append only writer~~ [Core] Support to order data by columns in append only writer May 18, 2026

Aitozi reviewed May 19, 2026

View reviewed changes

shidayang reviewed May 21, 2026

View reviewed changes

JingsongLi reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Core] Support to order data by columns in append only writer#7886

[Core] Support to order data by columns in append only writer#7886
FangYongs wants to merge 1 commit into
apache:masterfrom
FangYongs:local-sort-for-append-table

FangYongs commented May 18, 2026 •

edited

Loading

Uh oh!

FangYongs commented May 18, 2026

Uh oh!

Aitozi May 19, 2026

Uh oh!

FangYongs May 19, 2026

Uh oh!

JingsongLi May 21, 2026

Uh oh!

shidayang May 21, 2026

Uh oh!

JingsongLi left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

FangYongs commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Tests

Uh oh!

FangYongs commented May 18, 2026

Uh oh!

Aitozi May 19, 2026

Choose a reason for hiding this comment

Uh oh!

FangYongs May 19, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi May 21, 2026

Choose a reason for hiding this comment

Uh oh!

shidayang May 21, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Review

Design Issues

Correctness Issues

Minor

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

FangYongs commented May 18, 2026 •

edited

Loading