Skip to content

[CELEBORN-2327] Add active-slot weight to load-aware placement#3685

Open
sunchao wants to merge 2 commits into
apache:mainfrom
sunchao:dev/chao/codex/active-slots-placement-oss
Open

[CELEBORN-2327] Add active-slot weight to load-aware placement#3685
sunchao wants to merge 2 commits into
apache:mainfrom
sunchao:dev/chao/codex/active-slots-placement-oss

Conversation

@sunchao
Copy link
Copy Markdown
Member

@sunchao sunchao commented May 13, 2026

Why are the changes needed?

Celeborn load-aware slot placement currently orders candidate disks using flush and fetch timing only. That can still keep assigning new partitions onto disks that already carry a large amount of reserved active-slot pressure, which makes placement skew worse under overlapping shuffle-heavy workloads. CELEBORN-2327 tracks this gap.

What changes were proposed in this PR?

  • Add an optional, default-off celeborn.master.slot.assign.loadAware.activeSlotsWeight config.
  • Include activeSlots * activeSlotsWeight in load-aware disk ordering.
  • Thread the new config through the master allocation path.
  • Document the new tuning knob and update the slot-allocation developer docs.
  • Add a regression test showing that, when configured, lower-active-slot disks are preferred over otherwise equivalent disks.

How was this PR tested?

  • UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite
  • ./build/mvn -pl master -am -Dtest=SlotsAllocatorSuiteJ -DwildcardSuites=org.apache.celeborn.NoSuchSuite -DfailIfNoTests=false test
  • ./build/mvn -pl master -am -DskipTests test-compile

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new optional celeborn.master.slot.assign.loadAware.activeSlotsWeight config (default 0.0) so load-aware slot placement can also account for each disk's currently-reserved active slots when sorting candidate disks, helping reduce placement skew under overlapping shuffle-heavy workloads.

Changes:

  • Define and thread the new activeSlotsWeight config from CelebornConf through Master.scala into SlotsAllocator.offerSlotsLoadAware / placeDisksToGroups, where it is added to the existing flush/fetch-time sort key.
  • Document the new tuning knob in docs/configuration/master.md and update docs/developers/slotsallocation.md to reflect the updated ordering formula.
  • Add a regression test in SlotsAllocatorSuiteJ that verifies, with activeSlotsWeight=1 and flush/fetch weights 0, the lower-active-slot worker is preferred over an otherwise equivalent overloaded worker.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Adds MASTER_SLOT_ASSIGN_LOADAWARE_ACTIVE_SLOTS_WEIGHT config entry and accessor.
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala Reads new config and passes it into offerSlotsLoadAware.
master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java Adds activeSlotsWeight parameter and incorporates activeSlots * activeSlotsWeight into disk comparator.
master/src/test/java/org/apache/celeborn/service/deploy/master/SlotsAllocatorSuiteJ.java Adds regression test and threads new parameter through existing helper call sites.
docs/configuration/master.md Documents new config row.
docs/developers/slotsallocation.md Updates ordering formula description.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for contribution.

Comment thread docs/developers/slotsallocation.md
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
@SteNicholas SteNicholas requested a review from pan3793 May 15, 2026 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants