Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection #38

Draft
Copilot wants to merge 5 commits into main from copilot/fix-ci-flakiness-leaderelection-test

Copilot AI commented May 2, 2026

  • Replace waitForLiveNodes poll loop in ZkFailoverTest with waitForState calls (removed unused ZkStateReader and KeeperException imports)
  • Replace 10 s waitForActiveCollection in ZkShardTermsRecoveryTest (@BeforeClass and @Before) with waitForState(clusterShape(...)) (uses default 45 s timeout)
  • Replace waitForActiveCollection(60, TimeUnit.SECONDS, ...) in LeaderElectionIntegrationTest with waitForState(clusterShape(...)) for consistency; removed now-unused TimeUnit import
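As a rough illustration of what a shape predicate such as clusterShape(2, 6) checks, here is a simplified, self-contained model. The class name ShapePredicate, the method shape, and the map-of-replica-states representation are hypothetical stand-ins, not Solr's API; the real predicate works on Solr's collection state and also takes live nodes into account.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class ShapePredicate {
  /**
   * Hypothetical model of a cluster-shape check.
   * Input: shard name -> state of each replica ("active", "recovering", "down").
   * Passes only when the shard count matches and the number of active replicas matches.
   */
  public static Predicate<Map<String, List<String>>> shape(int expectedShards, int expectedActiveReplicas) {
    return shards -> {
      if (shards == null || shards.size() != expectedShards) {
        return false;
      }
      long active = shards.values().stream()
          .flatMap(List::stream)
          .filter("active"::equals)
          .count();
      return active == expectedActiveReplicas;
    };
  }

  public static void main(String[] args) {
    Map<String, List<String>> state = Map.of(
        "shard1", List.of("active", "active", "recovering"),
        "shard2", List.of("active", "active", "active"));
    // One replica is still recovering, so a 2-shard / 6-replica shape is not reached yet.
    System.out.println(shape(2, 6).test(state));
  }
}
```

A wait built on such a predicate has no fixed delay of its own: it finishes as soon as the shape is reached, and the timeout only bounds the failure case.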
Original prompt

Open a PR in epugh/solr targeting main to fix CI flakiness in org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection.

Implement in solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java:

  • Increase cluster.waitForActiveCollection timeout from 10 seconds to 60 seconds.
  • Replace the ad-hoc polling loops and fixed Thread.sleep calls that follow cluster.expireZkSession(jetty) with waitForState(...) predicates:
    • wait until leader for shard1 is not the expired-session jetty node
    • wait until the expired-session node is present in liveNodes again (or otherwise provably active) before stopping other nodes
  • Replace the final polling loop waiting for the original node to regain leadership with waitForState asserting leader nodeName equals the original jetty node.

Keep changes minimal and limited to test stabilization; preserve test intent and cleanup.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: The job fails because of a single flaky SolrCloud integration test:

  • Failing test: org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection
  • Seen in job log: :solr:core:test (FAILURE): 5562 test(s), 1 failure(s) with the failing test called out explicitly.
  • The workflow just runs ./gradlew test in .github/workflows/tests-via-crave.yml (commit 0d018395511ef6e6dc31ae1ace2d8ef14af183b9), so fixing or stabilizing that test is the right solution.

There are also repeated warnings/errors about:
java.util.ServiceConfigurationError: Locale provider adapter "CLDR" cannot be instantiated and policy/URLPermission messages. Those look noisy, but they are not what fails the build; the build fails due to the test failure.

Root cause (most likely)

testSimpleSliceLeaderElection() in solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java relies on timing-sensitive leader changes with very short waits, and it uses ad-hoc polling loops with fixed sleep:

  • It waits only 10 seconds for the collection to become active (cluster.waitForActiveCollection(..., 10, TimeUnit.SECONDS, ...)).
  • It then stops leaders and waits for state transitions (fine), but the later “client timeout / session expire” portion uses:
    • a manual for loop with Thread.sleep(100) up to 6 seconds (60 * 100ms) for leader change, and
    • then another fixed Thread.sleep(TICK_TIME * 2 + 100) hoping the old leader reconnects,
    • then another loop up to 32 seconds to see the “first leader” become leader again.

On slower/loaded CI (especially under constrained CPU), those fixed durations are often insufficient, causing nondeterministic failures.
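One further weakness of a bounded for loop with Thread.sleep is that it falls through silently when the budget runs out, so the failure only surfaces later at an unrelated assertion. A minimal sketch of a deadline-based alternative is shown here; DeadlineWait and awaitCondition are hypothetical names for illustration, not part of Solr or its test framework.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class DeadlineWait {
  /**
   * Polls the condition until it holds or the deadline passes.
   * Throws on timeout instead of falling through, so the failure is reported where it happens.
   */
  public static void awaitCondition(String message, long timeoutMs, long pollMs, BooleanSupplier condition)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(message + " (gave up after " + timeoutMs + " ms)");
      }
      Thread.sleep(pollMs);
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    // Stand-in for "the leader has changed": becomes true roughly 200 ms after start.
    BooleanSupplier leaderChanged = () -> System.nanoTime() - start > 200_000_000L;
    awaitCondition("leader did not change", 5_000, 20, leaderChanged);
    System.out.println("condition met");
  }
}
```

The timeout can be generous because it is only paid when the condition never becomes true; on a fast machine the call returns within one poll interval of the condition holding.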

Targeted fix: make the test deterministic by using waitForState (and longer timeouts)

Replace the timing loops and fixed sleeps with Solr’s waitForState(...) (which the test already uses earlier), and increase the “active collection” timeout.
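To make the idea behind a state-predicate wait concrete, the following is a small model in plain Java: a waiter blocks until a predicate accepts the current state, is woken on every state update, and fails with a descriptive timeout otherwise. WatchedState and its methods are hypothetical illustrations under those assumptions; this is not Solr's implementation of waitForState.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

public class WatchedState<T> {
  private T current;

  public WatchedState(T initial) {
    this.current = initial;
  }

  /** Publishes a new state and wakes every waiting thread so it can re-check its predicate. */
  public synchronized void update(T next) {
    current = next;
    notifyAll();
  }

  /** Blocks until the predicate accepts the current state, or throws once the timeout elapses. */
  public synchronized T waitForState(String message, long timeout, TimeUnit unit, Predicate<T> predicate)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!predicate.test(current)) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        throw new TimeoutException(message + "; last state: " + current);
      }
      // Timed Object.wait on this monitor: returns after an update or when time runs out.
      TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
    }
    return current;
  }

  public static void main(String[] args) throws Exception {
    WatchedState<String> leader = new WatchedState<>("node1");
    Thread mover = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      leader.update("node2"); // simulated leader change
    });
    mover.start();
    String seen = leader.waitForState("leader did not move", 5, TimeUnit.SECONDS, l -> !"node1".equals(l));
    System.out.println("leader is now " + seen);
    mover.join();
  }
}
```

Because the waiter reacts to updates rather than sleeping for a guessed duration, the test's speed follows the cluster's actual behaviour, and a timeout message can include the last observed state for debugging.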

Suggested patch (code-level)

In solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java (ref 0d018395511ef6e6dc31ae1ace2d8ef14af183b9):

  1. Give the cluster more time to become active:
- cluster.waitForActiveCollection(collection, 10, TimeUnit.SECONDS, 2, 6);
+ cluster.waitForActiveCollection(collection, 60, TimeUnit.SECONDS, 2, 6);
  2. Replace the first polling loop after expireZkSession(jetty) with waitForState:
- for (int i = 0; i < 60; i++) { // wait till leader is changed
-   if (jetty != getRunner(getLeader(collection))) {
-     break;
-   }
-   Thread.sleep(100);
- }
+ waitForState(
+   "Expected leader to move away after expiring zk session",
+   collection,
+   c -> {
+     var leader = c.getLeader("shard1");
+     return leader != null && !jetty.getNodeName().equals(leader.getNodeName());
+   }
+ );
  3. Avoid the fixed sleep intended to “ensure the first leader reconnected”. Instead, explicitly wait until that node is live/visible again before stopping the other nodes. The exact predicate depends on what waitForState provides, but a common pattern is to wait until the node appears in live nodes again (or until the replica on that node is ACTIVE). For example (conceptually):
waitForState(
  "Expected expired session node to rejoin live nodes",
  collection,
  (liveNodes, c) -> liveNodes.contains(jetty.getNodeName())
);

If liveNodes contains node names in a different format than JettySolrRunner#getNodeName(), adapt accordingly (or check replica nodeName presence in cluster state).

  4. Replace the final polling loop with waitForState and a clear assertion:
- for (int i = 0; i < 320; i++) { // wait till leader is changed
-   try {
-     if (jetty == getRunner(...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*

…erElection

- Increase cluster.waitForActiveCollection timeout from 10s to 60s
- Replace ad-hoc polling loop after expireZkSession with waitForState
  (waits until leader moves away from the expired-session node)
- Replace Thread.sleep with waitForState for node rejoining live nodes
- Replace final polling loop + assertEquals with waitForState
  (waits until original node becomes leader again)

Agent-Logs-Url: https://github.com/epugh/solr/sessions/1eab1dea-7bf6-4911-93ff-03f3c6614cfd

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix CI flakiness in LeaderElectionIntegrationTest" to "Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection" May 2, 2026
Copilot finished work on behalf of epugh May 2, 2026 14:13
Copilot AI requested a review from epugh May 2, 2026 14:13
Copilot finished work on behalf of epugh May 2, 2026 14:20
Copilot AI and others added 2 commits May 2, 2026 14:28
…nd ZkShardTermsRecoveryTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
…in LeaderElectionIntegrationTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
Copilot finished work on behalf of epugh May 2, 2026 14:29