Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection by Copilot · Pull Request #38 · epugh/solr

Copilot · 2026-05-02T14:08:31Z

Replace waitForLiveNodes poll loop in ZkFailoverTest with waitForState calls (removed unused ZkStateReader and KeeperException imports)
Replace 10 s waitForActiveCollection in ZkShardTermsRecoveryTest (@BeforeClass and @Before) with waitForState(clusterShape(...)) (uses default 45 s timeout)
Replace waitForActiveCollection(60, TimeUnit.SECONDS, ...) in LeaderElectionIntegrationTest with waitForState(clusterShape(...)) for consistency; removed now-unused TimeUnit import

Original prompt

Open a PR in epugh/solr targeting main to fix CI flakiness in org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection.

Implement in solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java:

Increase cluster.waitForActiveCollection timeout from 10 seconds to 60 seconds.
After cluster.expireZkSession(jetty), replace the ad-hoc polling loops and the fixed Thread.sleep with waitForState(...) predicates:
- wait until leader for shard1 is not the expired-session jetty node
- wait until the expired-session node is present in liveNodes again (or otherwise provably active) before stopping other nodes
Replace the final polling loop waiting for the original node to regain leadership with waitForState asserting leader nodeName equals the original jetty node.

Keep changes minimal and limited to test stabilization; preserve test intent and cleanup.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job is failing because a single flaky SolrCloud integration test fails:

Failing test: org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection
Seen in job log: :solr:core:test (FAILURE): 5562 test(s), 1 failure(s) with the failing test called out explicitly.
Workflow just runs ./gradlew test in .github/workflows/tests-via-crave.yml (commit 0d018395511ef6e6dc31ae1ace2d8ef14af183b9), so fixing or stabilizing that test is the right solution.

There are also repeated warnings/errors about:
java.util.ServiceConfigurationError: Locale provider adapter "CLDR" cannot be instantiated and policy/URLPermission messages. Those look noisy, but they are not what fails the build; the build fails due to the test failure.

Root cause (most likely)

testSimpleSliceLeaderElection() in solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java relies on timing-sensitive leader changes with very short waits, and it uses ad-hoc polling loops with fixed sleep:

It waits only 10 seconds for the collection to become active (cluster.waitForActiveCollection(..., 10, TimeUnit.SECONDS, ...)).
It then stops leaders and waits for state transitions (fine), but the later “client timeout / session expire” portion uses:
- a manual for loop with Thread.sleep(100) up to 6 seconds (60 * 100ms) for leader change, and
- then another fixed Thread.sleep(TICK_TIME * 2 + 100) hoping the old leader reconnects,
- then another loop up to 32 seconds to see the “first leader” become leader again.

On slower/loaded CI (especially under constrained CPU), those fixed durations are often insufficient, causing nondeterministic failures.
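
The bounded-loop pattern described above can be sketched in plain Java. `BoundedPollSketch` and its method names are hypothetical and use no Solr classes; the sketch only illustrates that a loop with a fixed budget (attempts times sleep) simply ends when the budget runs out, so a slow state change shows up later as a different failure.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the test's ad-hoc loops; not Solr code.
final class BoundedPollSketch {

  // Checks the condition at most `attempts` times, sleeping between checks.
  // Returns whether the condition was observed before the budget ran out.
  static boolean boundedPoll(int attempts, long sleepMillis, BooleanSupplier condition) {
    for (int i = 0; i < attempts; i++) {
      if (condition.getAsBoolean()) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return false;
  }

  // Builds a condition that becomes true on the n-th check, simulating a
  // state change that takes a variable amount of time to happen.
  static BooleanSupplier trueOnCheck(int n) {
    AtomicInteger checks = new AtomicInteger();
    return () -> checks.incrementAndGet() >= n;
  }

  // Worst-case time budget of such a loop, in milliseconds.
  static long budgetMillis(int attempts, long sleepMillis) {
    return attempts * sleepMillis;
  }
}
```

With the figures quoted above, `budgetMillis(60, 100)` is 6000 ms, so a leader change that takes longer than about six seconds is missed by the first loop.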

Targeted fix: make the test deterministic by using `waitForState` (and longer timeouts)

Replace the timing loops and fixed sleeps with Solr’s waitForState(...) (which the test already uses earlier), and increase the “active collection” timeout.
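
As a rough illustration of the idea, here is a minimal predicate-based wait in plain Java. `WaitSketch.waitFor` is a hypothetical helper written for this example; it is not Solr's waitForState and makes no claim about how Solr implements it. It shows the two properties the fix relies on: the wait returns as soon as the condition holds, and it fails with a descriptive message if the deadline passes.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; this is not Solr's waitForState.
final class WaitSketch {

  // Polls the condition until it holds or the timeout elapses.
  // On timeout, throws with the caller's message so the failure is explicit.
  static void waitFor(String message, Duration timeout, BooleanSupplier condition)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(message);
      }
      Thread.sleep(10);
    }
  }

  // Demo: the condition becomes true on the `readyAfter`-th check;
  // returns how many checks were made before the wait returned.
  static int checksUntilReady(int readyAfter) {
    AtomicInteger checks = new AtomicInteger();
    try {
      waitFor("condition never became true", Duration.ofSeconds(5),
          () -> checks.incrementAndGet() >= readyAfter);
    } catch (InterruptedException | TimeoutException e) {
      throw new IllegalStateException(e);
    }
    return checks.get();
  }

  // Demo: a condition that never holds; returns the timeout message.
  static String timeoutMessage(Duration timeout) {
    try {
      waitFor("Expected condition within " + timeout.toMillis() + " ms", timeout, () -> false);
      return null;
    } catch (TimeoutException e) {
      return e.getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }
}
```

The generous timeout only costs time when something is actually wrong; in the normal case the wait ends on the first check that succeeds.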

Suggested patch (code-level)

In solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java (ref 0d018395511ef6e6dc31ae1ace2d8ef14af183b9):

Give the cluster more time to become active:

- cluster.waitForActiveCollection(collection, 10, TimeUnit.SECONDS, 2, 6);
+ cluster.waitForActiveCollection(collection, 60, TimeUnit.SECONDS, 2, 6);

Replace the first polling loop after expireZkSession(jetty) with waitForState:

- for (int i = 0; i < 60; i++) { // wait till leader is changed
-   if (jetty != getRunner(getLeader(collection))) {
-     break;
-   }
-   Thread.sleep(100);
- }
+ waitForState(
+   "Expected leader to move away after expiring zk session",
+   collection,
+   c -> {
+     var leader = c.getLeader("shard1");
+     return leader != null && !jetty.getNodeName().equals(leader.getNodeName());
+   }
+ );

Avoid the fixed sleep intended to “ensure the first leader reconnected”. Instead, explicitly wait until that node is live/visible again before proceeding to stop others. The exact predicate depends on what waitForState provides, but a common pattern is: wait until the node appears in live nodes again (or the replica state is ACTIVE for that node). For example (conceptually):

waitForState(
  "Expected expired session node to rejoin live nodes",
  collection,
  (liveNodes, c) -> liveNodes.contains(jetty.getNodeName())
);

If liveNodes contains node names in a different format than JettySolrRunner#getNodeName(), adapt accordingly (or check replica nodeName presence in cluster state).
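
The fallback mentioned above (checking replica node names in the cluster state) can be sketched with stand-in types. `LiveNodeSketch`, `ReplicaInfo`, and `nodeIsBack` are hypothetical names for this example and do not use Solr's DocCollection or Replica classes; the node-name strings are illustrative values only.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model types; Solr's own cluster-state classes are not used here.
final class LiveNodeSketch {

  // Minimal stand-in for a replica entry: which node hosts it, and whether it is active.
  static final class ReplicaInfo {
    final String nodeName;
    final boolean active;

    ReplicaInfo(String nodeName, boolean active) {
      this.nodeName = nodeName;
      this.active = active;
    }
  }

  // True when the node is listed as live AND hosts at least one active replica.
  static boolean nodeIsBack(Set<String> liveNodes, List<ReplicaInfo> replicas, String nodeName) {
    if (!liveNodes.contains(nodeName)) {
      return false;
    }
    for (ReplicaInfo replica : replicas) {
      if (replica.nodeName.equals(nodeName) && replica.active) {
        return true;
      }
    }
    return false;
  }
}
```

Requiring both conditions is the stricter reading of "provably active"; whether the real test needs both is a judgment call for the test author.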

Replace the final polling loop with waitForState and a clear assertion:

- for (int i = 0; i < 320; i++) { // wait till leader is changed
-   try {
-     if (jetty == getRunner(...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*

…erElection
- Increase cluster.waitForActiveCollection timeout from 10s to 60s
- Replace ad-hoc polling loop after expireZkSession with waitForState (waits until leader moves away from the expired-session node)
- Replace Thread.sleep with waitForState for node rejoining live nodes
- Replace final polling loop + assertEquals with waitForState (waits until original node becomes leader again)

Agent-Logs-Url: https://github.com/epugh/solr/sessions/1eab1dea-7bf6-4911-93ff-03f3c6614cfd
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>


Initial plan

afa14b7

Copilot AI assigned Copilot and epugh May 2, 2026

Copilot started work on behalf of epugh May 2, 2026 14:08

Copilot AI changed the title ~~[WIP] Fix CI flakiness in LeaderElectionIntegrationTest~~ Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection May 2, 2026

Copilot finished work on behalf of epugh May 2, 2026 14:13

Copilot AI requested a review from epugh May 2, 2026 14:13

Copilot started work on behalf of epugh May 2, 2026 14:17

Copilot finished work on behalf of epugh May 2, 2026 14:20

Copilot started work on behalf of epugh May 2, 2026 14:21

Copilot AI and others added 2 commits May 2, 2026 14:28

Replace Thread.sleep poll loops with waitForState in ZkFailoverTest a…

3264541

…nd ZkShardTermsRecoveryTest Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281 Co-authored-by: epugh <22395+epugh@users.noreply.github.com>

Replace waitForActiveCollection(60s) with waitForState(clusterShape) …

bf5d71c

…in LeaderElectionIntegrationTest Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281 Co-authored-by: epugh <22395+epugh@users.noreply.github.com>

Copilot finished work on behalf of epugh May 2, 2026 14:29

fix precommit

bc5d06f

github-actions Bot added tests cat:cloud labels May 5, 2026

Copilot wants to merge 5 commits into main from copilot/fix-ci-flakiness-leaderelection-test
