[Core] Support to order data by columns in append only writer#7886
[Core] Support to order data by columns in append only writer#7886FangYongs wants to merge 1 commit into
Conversation
|
@Aitozi @shidayang Have a look when you're free |
| coreOptions.clusteringIncrementalEnabled() | ||
| && coreOptions.clusteringIncrementalOptimizeWrite() | ||
| && coreOptions.clusteringIncrementalMode() | ||
| == CoreOptions.ClusteringIncrementalMode.LOCAL_SORT |
There was a problem hiding this comment.
The LOCAL_SORT is described as Task-Level sorting, but what we have actually implemented is File-Level sorting.
Do we need to introduce a mode similar to "file_local" to represent this specific granularity of File-Level sorting functionality?
/**
* Sort rows only within each compaction task (no global shuffle). Every output file is
* internally ordered by the clustering columns, which is sufficient for per-file Parquet
* lookup optimizations.
*/
LOCAL_SORT(
"local-sort",
"Sort rows only within each compaction task without global shuffle. Every output file is internally ordered.");
There was a problem hiding this comment.
@JingsongLi What do you think of LOCAL_SORT? In our previous discussion, this was meant for local sorting in a single file. However, judging from the current situation, it is used for data sorting at the task level.
There was a problem hiding this comment.
I feel it's okay. Local sorting can just do its best to sort, whether at the task level or file level. We can add comments to explain.
| maxDiskSize, | ||
| spillCompression); | ||
| private SinkWriter<InternalRow> createSinkWriter(boolean useWriteBuffer, boolean spillable) { | ||
| if (useWriteBuffer) { |
There was a problem hiding this comment.
When sortEnabled is enabled, should we throw an error if useWriteBuffer is false ? Otherwise, it is hard for users to notice this behavior.
JingsongLi
left a comment
There was a problem hiding this comment.
Review
The overall idea — enabling local sort within each data file for incremental clustering in the append-only writer — is reasonable. However there are several design and correctness concerns.
Design Issues
1. SortedBufferedSinkWriter flushes ALL buffered rows into a single RollingFileWriter call.
public List<DataFileMeta> flush() throws IOException {
RollingFileWriter<T, DataFileMeta> writer = writerSupplier.get();
MutableObjectIterator<BinaryRow> sorted = sortBuffer.sortedIterator();
while ((reuse = sorted.next(row)) != null) {
writeToWriter(writer, reuse);
}
writer.close();
flushedFiles.addAll(writer.result());
sortBuffer.clear();
}The RollingFileWriter will roll files based on target file size. But the sorted data is written sequentially — it will be split at arbitrary points based on byte size, NOT on sort-key boundaries. This means after rolling, the value ranges in adjacent files may overlap. The sort is only meaningful within each individual output file, which is the stated goal. So this is correct but worth documenting.
2. Passing the entire CoreOptions to AppendOnlyWriter and SortedBufferedSinkWriter is heavy-handed.
The writer now stores CoreOptions coreOptions as a field. It only uses it to extract clustering columns, mode, and a few sort config values. Consider passing just the needed parameters (clustering columns, sort config) instead of the entire options bag. This avoids coupling the writer to all of CoreOptions and makes testing easier.
3. The sortEnabled condition requires all four checks simultaneously.
this.sortEnabled =
coreOptions.clusteringIncrementalEnabled()
&& coreOptions.clusteringIncrementalOptimizeWrite()
&& coreOptions.clusteringIncrementalMode() == LOCAL_SORT
&& !clusteringColumns.isEmpty();This is very specific — users must set exactly the right combination. The option name clustering.incremental.optimize-write is a separate boolean on top of clustering.incremental=true + clustering.incremental.mode=local-sort. That's 3 options to enable a single behavior. Would a simpler option like write.sort-by-columns (independent of clustering) be more user-friendly?
4. instanceof chain in writeBundle is fragile.
if (sinkWriter instanceof BufferedSinkWriter) {
((BufferedSinkWriter<InternalRow>) sinkWriter).writeBundle(bundle);
} else if (sinkWriter instanceof SortedBufferedSinkWriter) {
((SortedBufferedSinkWriter<InternalRow>) sinkWriter).writeBundle(bundle);
} else {
((DirectSinkWriter<?>) sinkWriter).writeBundle(bundle);
}Since SortedBufferedSinkWriter extends BaseBufferedSinkWriter and BufferedSinkWriter extends BaseBufferedSinkWriter, and writeBundle is defined on BaseBufferedSinkWriter, you could just check instanceof BaseBufferedSinkWriter. Or better: make writeBundle part of the SinkWriter<T> interface with a default implementation, eliminating the instanceof entirely.
Correctness Issues
5. SortedBufferedSinkWriter.write() doesn't return a meaningful value.
public boolean write(T data) throws IOException {
return sortBuffer.write(toRow.apply(data));
}SortBuffer.write() returns false when the buffer is full and cannot accept more data. When this returns false, the caller (AppendOnlyWriter.write()) should trigger a flush. But looking at AppendOnlyWriter.write():
public void write(InternalRow rowData) throws Exception {
boolean success = sinkWriter.write(rowData);
if (!success) {
flush(false, false);
success = sinkWriter.write(rowData);
if (!success) {
flush(false, true);
}
}
}This flush-and-retry logic works correctly for BufferedSinkWriter (which clears the write buffer). For SortedBufferedSinkWriter, flush() calls sortBuffer.clear() — so after flush, the retry write should succeed. Good.
6. setMemoryPool resolves clustering column indices by name — no validation.
int[] keyFields = clusteringColumns.stream()
.mapToInt(rowType.getFieldNames()::indexOf).toArray();If a clustering column name doesn't exist in rowType, indexOf returns -1. This will cause ArrayIndexOutOfBoundsException later when the comparator tries to access field -1. Should validate and throw a clear error if a clustering column is missing from the schema.
7. The sortBuffer is initialized in setMemoryPool, but write() may be called before setMemoryPool.
If write() is called before setMemoryPool(), sortBuffer is null → NPE. The BufferedSinkWriter has the same pattern (write buffer created in setMemoryPool), so this is consistent but risky. At minimum, add a null check with a clear error.
Minor
BaseBufferedSinkWriteris a good refactoring to reduce duplication- Removing two comment lines (
// cleanup code that might throw another exception,// reuse writeBuffer) is fine — they were noise - The test covers the basic happy path but doesn't test: buffer spill to disk, write more data than memory allows, clustering column not in schema (negative test)
Purpose
Order data by specific columns in single file which is written by append only writer
Tests
AppendOnlyWriterTest#testSortedBufferedSinkWriter
Close #7885