[improvement](fe) Skip zero-length files in FileSplitter to avoid sending empty splits to BE #62482
kaka11chen wants to merge 1 commit into apache:master
…ding empty splits to BE

Issue Number: close #xxx

Problem Summary:

Zero-length files (length=0) on external storage (HDFS/S3) can produce empty splits that are assigned to BE, wasting RPC and scheduling resources. FileSplitter.splitFile() previously created splits even for length=0 files (specifically for non-splittable/compressed files). These splits carry no data but still consume a scan range slot and backend scheduling overhead.

This does NOT affect:
- Paimon JNI splits (length=0 but data in Paimon split object)
- Hudi log-only splits (length=0 but data in delta logs)
- Iceberg splits (create their own split objects)

These formats bypass FileSplitter and create splits directly.
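The guard described above can be illustrated with a minimal sketch. This is not the actual `FileSplitter.splitFile()` implementation: the class `ZeroLengthSplitSketch`, the `Split` type, and the four-argument signature are hypothetical stand-ins chosen for illustration, and the real method takes other parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a file splitter; not the Doris FileSplitter.
final class ZeroLengthSplitSketch {

    /** Hypothetical stand-in for a split: a byte range inside one file. */
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    static List<Split> splitFile(String path, long length, long splitSize, boolean splittable) {
        List<Split> result = new ArrayList<>();
        if (length <= 0) {
            // Zero-length files contain no data, so no split is produced at all.
            return result;
        }
        if (!splittable) {
            // Non-splittable (e.g. compressed) files become one split covering the whole file.
            result.add(new Split(path, 0, length));
            return result;
        }
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        // Splittable files are cut into ranges of at most splitSize bytes.
        for (long start = 0; start < length; start += splitSize) {
            result.add(new Split(path, start, Math.min(splitSize, length - start)));
        }
        return result;
    }
}
```

Without the first check, the non-splittable branch in this sketch would emit a single split of length 0, which is the kind of empty split the PR aims to avoid.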
Pull request overview
This PR prevents generation of empty scan splits for zero-length external files (e.g., HDFS/S3), reducing unnecessary BE scheduling/RPC overhead.
Changes:
- Add an early return in `FileSplitter.splitFile()` to skip `length <= 0` files.
- Add a unit test covering splittable/non-splittable and null block-location cases for zero-length files, including ensuring the initial-split counter isn’t consumed.
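To make the "initial-split counter isn’t consumed" property concrete, here is a rough sketch under stated assumptions. The class `InitialSplitQuotaSketch`, its fields, and the rule that the first few splits use a smaller size are all hypothetical; the PR page does not show how the real counter works. The point is only that a guard placed before any counter access leaves the quota untouched for empty files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of an "initial split" quota; not the Doris implementation.
final class InitialSplitQuotaSketch {
    private final AtomicInteger remainingInitialSplits;
    private final long initialSplitSize;
    private final long maxSplitSize;

    InitialSplitQuotaSketch(int initialSplits, long initialSplitSize, long maxSplitSize) {
        if (initialSplitSize <= 0 || maxSplitSize <= 0) {
            throw new IllegalArgumentException("split sizes must be positive");
        }
        this.remainingInitialSplits = new AtomicInteger(initialSplits);
        this.initialSplitSize = initialSplitSize;
        this.maxSplitSize = maxSplitSize;
    }

    int remainingInitialSplits() {
        return remainingInitialSplits.get();
    }

    /** Returns the lengths of the splits that a file of the given length would produce. */
    List<Long> splitLengths(long length) {
        List<Long> lengths = new ArrayList<>();
        if (length <= 0) {
            // Return before touching the counter so empty files do not use up quota.
            return lengths;
        }
        long remaining = length;
        while (remaining > 0) {
            long target;
            if (remainingInitialSplits.get() > 0) {
                // Assumed rule: early splits use the smaller size while quota remains.
                remainingInitialSplits.decrementAndGet();
                target = initialSplitSize;
            } else {
                target = maxSplitSize;
            }
            long len = Math.min(target, remaining);
            lengths.add(len);
            remaining -= len;
        }
        return lengths;
    }
}
```

A test in the spirit of the one described would call `splitLengths(0)` and then check that `remainingInitialSplits()` still returns its starting value.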
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| fe/fe-core/src/main/java/org/apache/doris/datasource/FileSplitter.java | Skips zero-length files to avoid creating empty splits. |
| fe/fe-core/src/test/java/org/apache/doris/datasource/FileSplitterTest.java | Adds coverage ensuring zero-length files produce no splits and don’t consume initial split quota. |

            throws IOException {
        if (length <= 0) {
            // Zero-length files contain no data; skip to avoid sending empty splits to BE.
            return Lists.newArrayList();
The early return for zero-length files currently allocates a new mutable list on every call. Since this is a hot path when scanning many files, consider returning Collections.emptyList() (or ImmutableList.of()) to avoid unnecessary allocations and signal immutability.
Suggested change:

    - return Lists.newArrayList();
    + return ImmutableList.of();
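One trade-off behind this suggestion is that an immutable empty list rejects later modification, which matters if any caller appends to the returned list. The sketch below uses only JDK classes to show the difference; `EmptyResultSketch` and its method names are hypothetical. Guava's `ImmutableList.of()` likewise throws `UnsupportedOperationException` on `add`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper contrasting the two ways of returning "no splits".
final class EmptyResultSketch {

    /** Allocates a new mutable list on every call, like Lists.newArrayList(). */
    static List<String> freshEmpty() {
        return new ArrayList<>();
    }

    /** Returns an immutable empty list without a new ArrayList allocation. */
    static List<String> sharedEmpty() {
        return Collections.emptyList();
    }

    /** Reports whether the given list accepts an add() call. */
    static boolean acceptsAdd(List<String> list) {
        try {
            list.add("x");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }
}
```

Whether the immutable variant is safe here depends on how callers of `splitFile()` use the result, which this page does not show.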
run buildall
What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Zero-length files (length=0) on external storage (HDFS/S3) can produce empty splits that are assigned to BE, wasting RPC and scheduling resources. FileSplitter.splitFile() previously created splits even for length=0 files (specifically for non-splittable/compressed files). These splits carry no data but still consume a scan range slot and backend scheduling overhead.

This does NOT affect:
- Paimon JNI splits (length=0 but data in Paimon split object)
- Hudi log-only splits (length=0 but data in delta logs)
- Iceberg splits (create their own split objects)

These formats bypass FileSplitter and create splits directly.

Release note

None

Check List (For Author)
- Test
- Behavior changed:
- Does this need documentation?

Check List (For Reviewer who merge this PR)