
[improvement](fe) Skip zero-length files in FileSplitter to avoid sending empty splits to BE#62482

Open
kaka11chen wants to merge 1 commit into apache:master from kaka11chen:skip_zero_size_split


@kaka11chen kaka11chen commented Apr 14, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

Zero-length files (length=0) on external storage (HDFS/S3) can produce empty splits that are assigned to BE, wasting RPC and scheduling resources. FileSplitter.splitFile() previously created splits even for length=0 files (specifically for non-splittable/compressed files). These splits carry no data but still consume a scan range slot and backend scheduling overhead.

This does NOT affect:

  • Paimon JNI splits (length=0 but data in Paimon split object)
  • Hudi log-only splits (length=0 but data in delta logs)
  • Iceberg splits (create their own split objects)
These formats bypass FileSplitter and create their splits directly.

None
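
To make the overhead described in the summary concrete, here is a minimal sketch. It is not the Doris implementation: `EmptySplitDemo`, `FileEntry`, `Split`, and `planNonSplittable` are hypothetical names, and the sketch assumes that a non-splittable file maps to exactly one split covering the whole file, so a length=0 file yields one empty split unless it is filtered out.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptySplitDemo {
    /** Hypothetical stand-in for a file returned by a directory listing. */
    public static final class FileEntry {
        public final String path;
        public final long length;

        public FileEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for a planned scan range. */
    public static final class Split {
        public final String path;
        public final long start;
        public final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Assumed model: one split per non-splittable file, spanning [0, length).
    public static List<Split> planNonSplittable(List<FileEntry> files, boolean skipEmpty) {
        List<Split> splits = new ArrayList<>();
        for (FileEntry f : files) {
            if (skipEmpty && f.length <= 0) {
                continue; // no data to read, so no scan range is planned
            }
            splits.add(new Split(f.path, 0, f.length));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
                new FileEntry("part-0.gz", 1024),
                new FileEntry("part-1.gz", 0),
                new FileEntry("part-2.gz", 0));
        System.out.println(planNonSplittable(files, false).size()); // 3
        System.out.println(planNonSplittable(files, true).size());  // 1
    }
}
```

In this toy listing, two of the three planned splits carry no data when empty files are not filtered.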

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings April 14, 2026 06:43

Thearas commented Apr 14, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?


Copilot AI left a comment

Pull request overview

This PR prevents generation of empty scan splits for zero-length external files (e.g., HDFS/S3), reducing unnecessary BE scheduling/RPC overhead.

Changes:

  • Add an early return in FileSplitter.splitFile() to skip length <= 0 files.
  • Add a unit test covering splittable/non-splittable and null block-location cases for zero-length files, including ensuring the initial-split counter isn’t consumed.
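
The two changes listed above can be sketched as follows. This is an illustrative model, not the code in the PR: `ZeroLengthGuardSketch`, `Splitter`, and the way the initial-split quota is consumed are assumptions made for the example, and the null block-location case from the real test is not modeled. The point it shows is that a guard placed before any bookkeeping returns no splits and leaves the initial-split counter untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class ZeroLengthGuardSketch {
    /** Hypothetical split: a byte range within one file. */
    public static final class Split {
        public final long start;
        public final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Hypothetical splitter; the first few splits use a smaller size (assumed model). */
    public static final class Splitter {
        private final long initialSplitSize;
        private final long maxSplitSize;
        private int remainingInitialSplits;

        public Splitter(long initialSplitSize, long maxSplitSize, int initialSplitQuota) {
            this.initialSplitSize = initialSplitSize;
            this.maxSplitSize = maxSplitSize;
            this.remainingInitialSplits = initialSplitQuota;
        }

        public int remainingInitialSplits() {
            return remainingInitialSplits;
        }

        public List<Split> splitFile(long length, boolean splittable) {
            List<Split> result = new ArrayList<>();
            if (length <= 0) {
                // Guard runs before any quota bookkeeping, so empty files cost nothing.
                return result;
            }
            if (!splittable) {
                // A non-splittable (e.g. compressed) file becomes a single split.
                result.add(new Split(0, length));
                return result;
            }
            long offset = 0;
            while (offset < length) {
                long target = maxSplitSize;
                if (remainingInitialSplits > 0) {
                    target = initialSplitSize;
                    remainingInitialSplits--;
                }
                long size = Math.min(target, length - offset);
                result.add(new Split(offset, size));
                offset += size;
            }
            return result;
        }
    }

    public static void main(String[] args) {
        Splitter splitter = new Splitter(4, 8, 2);
        System.out.println(splitter.splitFile(0, true).size());   // 0
        System.out.println(splitter.splitFile(0, false).size());  // 0
        System.out.println(splitter.remainingInitialSplits());    // 2
        System.out.println(splitter.splitFile(20, true).size());  // 4
        System.out.println(splitter.remainingInitialSplits());    // 0
    }
}
```

Checking the counter after the zero-length calls mirrors the property the unit test is described as covering.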

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/datasource/FileSplitter.java | Skips zero-length files to avoid creating empty splits. |
| fe/fe-core/src/test/java/org/apache/doris/datasource/FileSplitterTest.java | Adds coverage ensuring zero-length files produce no splits and don’t consume initial split quota. |

        throws IOException {
    if (length <= 0) {
        // Zero-length files contain no data; skip to avoid sending empty splits to BE.
        return Lists.newArrayList();
Copilot AI Apr 14, 2026

The early return for zero-length files currently allocates a new mutable list on every call. Since this is a hot path when scanning many files, consider returning Collections.emptyList() (or ImmutableList.of()) to avoid unnecessary allocations and signal immutability.

Suggested change
- return Lists.newArrayList();
+ return ImmutableList.of();
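
The trade-off behind this suggestion can be illustrated with JDK classes alone. The sketch below is an assumption-laden stand-in: `new ArrayList<>()` approximates Guava's `Lists.newArrayList()`, `Collections.emptyList()` stands in for the immutable option, and `EmptyListChoice` and `callerCanAppend` are hypothetical names. It shows that an immutable empty list avoids a fresh allocation but rejects any caller that appends to the returned list, so the callers of `splitFile()` would need checking before adopting the change.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EmptyListChoice {
    // Allocates a new mutable list on every call.
    public static List<String> freshMutable() {
        return new ArrayList<>();
    }

    // Returns an immutable empty list; no per-call list allocation is required.
    public static List<String> sharedImmutable() {
        return Collections.emptyList();
    }

    // Mimics a caller that appends to whatever list it was handed.
    public static boolean callerCanAppend(List<String> splits) {
        try {
            splits.add("extra-split");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(callerCanAppend(freshMutable()));    // true
        System.out.println(callerCanAppend(sharedImmutable())); // false
    }
}
```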

@kaka11chen

run buildall
