[CELEBORN-2330] Fix HA master bootstrap redirect handling #3691

Draft
sunchao wants to merge 5 commits into apache:main from sunchao:dev/chao/codex/port-pr70-to-oss-main

@sunchao sunchao commented May 17, 2026

Why are the changes needed?

In HA mode, clients can hit a failover window where the master they contact is no longer the leader and returns a MasterNotLeaderException with a suggested leader address. The client-facing symptom is an RPC failure surfaced as:

CelebornException: Exception thrown in awaitResult

The redirect signal is still present underneath that wrapper, but the existing bootstrap and retry logic does not consistently preserve and follow it. In particular:

  • bootstrap-time redirects can be treated like generic connection failures instead of explicit leader hints
  • suggested leaders can themselves redirect again or fail setup
  • after such failures, the client may not continue cleanly to the remaining configured masters

The consequence is that a client can fail to establish a master connection during HA leader transitions even when a reachable leader or another configured master is available. That turns a recoverable redirect/failover event into an avoidable client-visible failure and makes rolling upgrades or leader changes noisier than necessary.

What changes were proposed in this PR?

This port brings the HA redirect handling fix from openai/celeborn#70 onto upstream main.

When a master tells the client which leader to use, the client now keeps that redirect information even if it is wrapped inside another exception, and it actively follows the suggested leader during bootstrap and failover. If that suggested leader points to another leader, the client can continue along that redirect chain instead of giving up too early.
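As a rough illustration of keeping the redirect hint when it is wrapped inside another exception, the sketch below walks the cause chain looking for the not-leader signal. It is not the code from this PR: `NotLeaderSignal`, `RedirectUnwrapper`, the `suggestedLeader` field, and the depth limit are hypothetical stand-ins, and the real `MasterNotLeaderException` may expose its leader hint differently.

```java
import java.util.Optional;

// Hypothetical stand-in for MasterNotLeaderException; the field name is an assumption.
final class NotLeaderSignal extends RuntimeException {
    final String suggestedLeader; // null or empty when no leader is known

    NotLeaderSignal(String suggestedLeader) {
        super("not leader, suggested leader: " + suggestedLeader);
        this.suggestedLeader = suggestedLeader;
    }
}

final class RedirectUnwrapper {
    // Arbitrary bound so a malformed, self-referencing cause chain cannot loop forever.
    private static final int MAX_DEPTH = 16;

    /** Returns the suggested leader found anywhere in the cause chain, if any. */
    static Optional<String> findSuggestedLeader(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof NotLeaderSignal) {
                String leader = ((NotLeaderSignal) current).suggestedLeader;
                // A not-leader signal without an address carries no usable hint.
                return (leader == null || leader.isEmpty())
                        ? Optional.empty()
                        : Optional.of(leader);
            }
            current = current.getCause();
        }
        return Optional.empty();
    }
}
```

With this shape, a generic wrapper such as "Exception thrown in awaitResult" still yields the leader hint as long as the original signal is reachable through `getCause()`.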

If the redirect path is no longer useful, for example because the suggested leader cannot be reached, no leader is currently presented, or redirects start looping, the client falls back to the remaining configured masters and keeps searching for a usable endpoint. The retry scan also advances correctly after endpoint setup fails, so one bad redirect does not prevent the client from trying the next viable master.
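The scan described above (follow redirects, stop on cycles or dead ends, then continue with the remaining configured masters) can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's implementation: `LeaderScan`, `Attempt`, and `Connector` are hypothetical names, and a real client would also deal with timeouts, retry limits, and RPC endpoint setup.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class LeaderScan {
    /** Outcome of one connection attempt (hypothetical shape). */
    static final class Attempt {
        final boolean connected;
        final String redirect; // suggested leader address, or null if none was given

        private Attempt(boolean connected, String redirect) {
            this.connected = connected;
            this.redirect = redirect;
        }

        static Attempt ok() { return new Attempt(true, null); }
        static Attempt failed() { return new Attempt(false, null); }
        static Attempt redirectTo(String leader) { return new Attempt(false, leader); }
    }

    /** Hypothetical hook that tries to set up an endpoint for one address. */
    interface Connector {
        Attempt tryConnect(String address);
    }

    /** Returns the first address that accepts the connection, if any. */
    static Optional<String> findLeader(List<String> configuredMasters, Connector connector) {
        Set<String> tried = new HashSet<>();
        for (String master : configuredMasters) {
            String candidate = master;
            // Follow the redirect chain; tried.add() returns false for an address
            // that was already attempted, which ends both cycles and repeat visits.
            while (candidate != null && tried.add(candidate)) {
                Attempt attempt = connector.tryConnect(candidate);
                if (attempt.connected) {
                    return Optional.of(candidate);
                }
                candidate = attempt.redirect; // null means no usable hint
            }
            // Dead end or cycle: fall through to the next configured master.
        }
        return Optional.empty();
    }
}
```

One design choice in this sketch is that the visited set is shared across the whole scan, so a configured master that was already reached through a redirect is not contacted a second time.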

The PR also adds focused HA tests that cover the bootstrap redirect cases, chained redirects, redirect cycles, and fallback to configured masters.

How was this PR tested?

  • build/mvn test -pl common -am -Dtest=MasterClientSuiteJ -DwildcardSuites=org.apache.celeborn.common.client.__NoSuchSuite__

@sunchao sunchao changed the title Fix HA master bootstrap redirect handling [CELEBORN-2330] Fix HA master bootstrap redirect handling May 17, 2026