Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection#38
Draft
Copilot wants to merge 5 commits into
…erElection
- Increase cluster.waitForActiveCollection timeout from 10s to 60s
- Replace ad-hoc polling loop after expireZkSession with waitForState (waits until leader moves away from the expired-session node)
- Replace Thread.sleep with waitForState for node rejoining live nodes
- Replace final polling loop + assertEquals with waitForState (waits until original node becomes leader again)

Agent-Logs-Url: https://github.com/epugh/solr/sessions/1eab1dea-7bf6-4911-93ff-03f3c6614cfd
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Fix CI flakiness in LeaderElectionIntegrationTest" to "Fix CI flakiness in LeaderElectionIntegrationTest.testSimpleSliceLeaderElection" on May 2, 2026
…nd ZkShardTermsRecoveryTest
Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>

…in LeaderElectionIntegrationTest
Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
- Replace `waitForLiveNodes` poll loop in `ZkFailoverTest` with `waitForState` calls (removed unused `ZkStateReader` and `KeeperException` imports)
- Replace `waitForActiveCollection` in `ZkShardTermsRecoveryTest` (`@BeforeClass` and `@Before`) with `waitForState(clusterShape(...))` (uses default 45 s timeout)
- Replace `waitForActiveCollection(60, TimeUnit.SECONDS, ...)` in `LeaderElectionIntegrationTest` with `waitForState(clusterShape(...))` for consistency; removed now-unused `TimeUnit` import

Original prompt
Open a PR in `epugh/solr` targeting `main` to fix CI flakiness in `org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection`.

Implement in `solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java`:
- Increase `cluster.waitForActiveCollection` timeout from 10 seconds to 60 seconds.
- Replace `Thread.sleep` after `cluster.expireZkSession(jetty)` with `waitForState(...)` predicates.
- Use `waitForState` asserting leader nodeName equals the original jetty node.

Keep changes minimal and limited to test stabilization; preserve test intent and cleanup.
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing because a single flaky SolrCloud integration test fails:
- `org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection`
- The log reports `:solr:core:test (FAILURE): 5562 test(s), 1 failure(s)` with the failing test called out explicitly.
- The job runs `./gradlew test` in `.github/workflows/tests-via-crave.yml` (commit `0d018395511ef6e6dc31ae1ace2d8ef14af183b9`), so fixing or stabilizing that test is the right solution.

There are also repeated warnings/errors about `java.util.ServiceConfigurationError: Locale provider adapter "CLDR" cannot be instantiated` and policy/URLPermission messages. Those look noisy, but they are not what fails the build; the build fails due to the test failure.

Root cause (most likely)
`testSimpleSliceLeaderElection()` in `solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java` relies on timing-sensitive leader changes with very short waits, and it uses ad-hoc polling loops with fixed sleeps:
- a 10-second wait for the active collection (`cluster.waitForActiveCollection(..., 10, TimeUnit.SECONDS, ...)`),
- a `for` loop with `Thread.sleep(100)` that waits up to 6 seconds (60 * 100ms) for the leader change, and
- `Thread.sleep(TICK_TIME * 2 + 100)`, hoping the old leader reconnects.

On slower or loaded CI (especially under constrained CPU), those fixed durations are often insufficient, causing nondeterministic failures.
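To make the contrast concrete, here is a minimal, self-contained sketch of a condition-based wait with a generous deadline. It is not Solr's `waitForState` implementation: the `waitFor` helper, the `WaitForConditionSketch` class, the node names, and the simulated 300 ms election delay are all hypothetical and exist only to illustrate the pattern.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class WaitForConditionSketch {

  /**
   * Hypothetical helper: re-reads the state until the predicate matches,
   * returning as soon as it does, and fails with a descriptive message
   * only after the whole timeout has elapsed.
   */
  public static <T> T waitFor(
      String message, Supplier<T> state, Predicate<T> predicate, Duration timeout)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      T current = state.get();
      if (predicate.test(current)) {
        return current; // condition met: no time wasted on a fast machine
      }
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(message + "; last observed state: " + current);
      }
      Thread.sleep(50); // short poll interval; the timeout is the real budget
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated shard leader; "node1" and "node2" are made-up names.
    AtomicReference<String> leader = new AtomicReference<>("node1");

    // Simulated election that completes after an arbitrary delay.
    Thread election = new Thread(() -> {
      try {
        Thread.sleep(300);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      leader.set("node2");
    });
    election.start();

    // Wait on the condition itself, with a timeout far above the expected delay.
    String newLeader = waitFor(
        "leader should move away from node1",
        leader::get,
        current -> !"node1".equals(current),
        Duration.ofSeconds(10));
    System.out.println("new leader: " + newLeader);
    election.join();
  }
}
```

A long timeout costs nothing when the condition is met quickly, because the wait returns on the first successful check; it only matters on a slow runner, which is where the fixed 6-second loop runs out.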
Targeted fix: make the test deterministic by using `waitForState` (and longer timeouts)

Replace the timing loops and fixed sleeps with Solr’s `waitForState(...)` (which the test already uses earlier), and increase the “active collection” timeout.

Suggested patch (code-level)

In `solr/core/src/test/org/apache/solr/cloud/LeaderElectionIntegrationTest.java` (ref `0d018395511ef6e6dc31ae1ace2d8ef14af183b9`):
- Replace the polling loop after `expireZkSession(jetty)` with `waitForState`:
- … `waitForState` provides, but a common pattern is: wait until the node appears in live nodes again (or the replica state is ACTIVE for that node). For example (conceptually):

  If `liveNodes` contains node names in a different format than `JettySolrRunner#getNodeName()`, adapt accordingly (or check replica nodeName presence in cluster state).
- Replace the final polling loop with `waitForState` and a clear assertion:
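The three waits described above (leader moves away, node rejoins live nodes, original node leads again) can be sketched as plain predicates. This is a self-contained illustration under stated assumptions, not the patch itself: `ClusterView` is a hypothetical in-memory stand-in for Solr's live-node set and collection state, and the node names are invented. In the real test, equivalent checks would be passed to `waitForState(...)`.

```java
import java.util.Set;
import java.util.function.Predicate;

public class LeaderElectionWaitsSketch {

  /** Hypothetical stand-in for cluster state: live node names plus the shard leader's node name. */
  public static final class ClusterView {
    final Set<String> liveNodes;
    final String leaderNodeName; // null while no leader is elected

    public ClusterView(Set<String> liveNodes, String leaderNodeName) {
      this.liveNodes = liveNodes;
      this.leaderNodeName = leaderNodeName;
    }
  }

  /** Wait 1: after the session expires, a leader exists and it is some other node. */
  public static Predicate<ClusterView> leaderMovedAwayFrom(String nodeName) {
    return view -> view.leaderNodeName != null && !view.leaderNodeName.equals(nodeName);
  }

  /** Wait 2: the expired node shows up in live nodes again (replaces the fixed sleep). */
  public static Predicate<ClusterView> nodeIsLive(String nodeName) {
    return view -> view.liveNodes.contains(nodeName);
  }

  /** Wait 3: the original node is the leader again (replaces the final loop and assertion). */
  public static Predicate<ClusterView> leaderIs(String nodeName) {
    return view -> nodeName.equals(view.leaderNodeName);
  }

  public static void main(String[] args) {
    // Invented node names, written in the same format on both sides of each comparison.
    String original = "127.0.0.1:8983_solr";
    String other = "127.0.0.1:7574_solr";

    ClusterView afterExpiry = new ClusterView(Set.of(other), other);
    ClusterView afterRejoin = new ClusterView(Set.of(original, other), other);
    ClusterView afterReelection = new ClusterView(Set.of(original, other), original);

    System.out.println("leader moved away: " + leaderMovedAwayFrom(original).test(afterExpiry));
    System.out.println("node live before rejoin: " + nodeIsLive(original).test(afterExpiry));
    System.out.println("node live after rejoin: " + nodeIsLive(original).test(afterRejoin));
    System.out.println("original leads again: " + leaderIs(original).test(afterReelection));
  }
}
```

Keeping each phase as its own named predicate means a timeout message can say which phase stalled, which is harder to tell from a single loop followed by an `assertEquals`.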