HDDS-13482. Intermittent failure in TestContainerStateMachineFailures #10397

Draft
chungen0126 wants to merge 8 commits into apache:master from chungen0126:HDDS-13482

chungen0126 commented May 31, 2026

What changes were proposed in this pull request?

Summary

Fix intermittent failures in TestContainerStateMachine#testApplyTransactionFailure, TestContainerStateMachine#testContainerStateMachineRestartWithDNChangePipeline, testWriteStateMachineDataIdempotencyWithClosedContainer, and testApplyTransactionIdempotencyWithClosedContainer.

Changes

For testWriteStateMachineDataIdempotencyWithClosedContainer:

The flakiness stemmed from a race between a retried write operation and a close container request. The test expects idempotency for identical data, but it failed intermittently because the initial write and the retried write contained different data.

  • Case A (Success): If close container executes first, no error occurs.
  • Case B (Failure): If the retry write executes before the close container, a mismatch occurs between the written data "hello" and the committed metadata. While the container successfully closes, it is later marked as "unhealthy" by the scanner due to a checksum mismatch.

Fix: Updated the test to ensure data consistency during retries or adjusted the timing expectations to handle the race condition correctly.
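
The two interleavings above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ModelChunk` and `RetryRaceDemo` are hypothetical names, CRC32 stands in for whatever checksum the container scanner actually uses, and "original" is a placeholder for the initial payload. None of it is Ozone code.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical in-memory model of one chunk; none of this is Ozone code. */
final class ModelChunk {
  private byte[] data = new byte[0];
  private long committedChecksum = checksum(data);
  private boolean closed;

  static long checksum(byte[] bytes) {
    CRC32 crc = new CRC32();
    crc.update(bytes, 0, bytes.length);
    return crc.getValue();
  }

  /** Initial write: stores the data and commits its checksum as metadata. */
  void writeAndCommit(byte[] bytes) {
    data = bytes.clone();
    committedChecksum = checksum(bytes);
  }

  /** Retried write: overwrites the data only; ignored once the container is closed. */
  void retryWrite(byte[] bytes) {
    if (closed) {
      return;
    }
    data = bytes.clone();
  }

  void close() {
    closed = true;
  }

  /** Stand-in for the scanner: healthy only if the data still matches the committed checksum. */
  boolean passesScan() {
    return checksum(data) == committedChecksum;
  }
}

final class RetryRaceDemo {
  static byte[] bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Runs one interleaving and reports whether the scan passes afterwards. */
  static boolean run(String first, String retry, boolean closeBeforeRetry) {
    ModelChunk chunk = new ModelChunk();
    chunk.writeAndCommit(bytes(first));
    if (closeBeforeRetry) {
      chunk.close();            // Case A: close wins the race
    }
    chunk.retryWrite(bytes(retry));
    chunk.close();              // Case B: close happens after the retry
    return chunk.passesScan();
  }

  public static void main(String[] args) {
    System.out.println("Case A, different data: " + run("original", "hello", true));
    System.out.println("Case B, different data: " + run("original", "hello", false));
    System.out.println("Case B, identical data: " + run("hello", "hello", false));
  }
}
```

In this model, only the combination of "retry before close" and "different data" fails the scan, which is why making the retry carry the same data as the initial write removes the dependence on ordering.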

For testContainerStateMachineFailures

testContainerStateMachineFailures causes failures in subsequent tests, including testContainerStateMachineRestartWithDNChangePipeline. This occurs because the test triggers a Ratis storage reset that invalidates existing pipelines. Since these pipelines are closed passively via client-side retries rather than by the ScrubbingService, they remain in the PipelineManager, causing subsequent tests to erroneously select these stale pipelines and fail on them.

Fix: Make testContainerStateMachineFailures run at the end of the class.

For testApplyTransactionFailure

Intermittent failures in this test happen because the initial takeSnapshot call does not guarantee that the snapshot is the final one before the container data deletion. Any notifyTermIndexUpdated event occurring after that point triggers a new snapshot, leading to snapshot inconsistency.

Fix: Refactored the test. By capturing the snapshot after the container data is deleted, we ensure that the snapshot is the last one before the deletion. Subsequent transactions and snapshot operations are then applied to verify that these actions do not alter the existing, consistent snapshot.

For testApplyTransactionIdempotencyWithClosedContainer

Fix flaky testApplyTransactionFailure by asserting BlockCommitSequenceId

Relying on Ratis snapshot files to verify that an unhealthy container accepts no new writes is flaky, as snapshots can be triggered asynchronously by internal Ratis events (like notifyTermIndexUpdated).

This fix removes the unreliable snapshot file comparison and instead asserts that the container's BlockCommitSequenceId remains unchanged after transitioning to the UNHEALTHY state. This deterministically proves that no new transactions were applied to the data layer.
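
The difference between the two assertion styles can be sketched with a toy model. Everything here is an assumption made for illustration: `ModelContainer` and `BcsIdAssertionDemo` are hypothetical classes, the snapshot counter merely stands in for asynchronously written Ratis snapshot files, and the real test operates on the actual container data rather than on a model like this.

```java
/** Hypothetical model of a container replica; not Ozone code. */
final class ModelContainer {
  enum State { OPEN, UNHEALTHY }

  private State state = State.OPEN;
  private long blockCommitSequenceId;
  private int snapshotsTaken;

  /** Applies a transaction only while OPEN; the BCSID advances to the log index. */
  boolean applyTransaction(long logIndex) {
    if (state != State.OPEN) {
      return false;
    }
    blockCommitSequenceId = logIndex;
    return true;
  }

  void markUnhealthy() {
    state = State.UNHEALTHY;
  }

  /** Stand-in for an internal event that may trigger a snapshot at any time. */
  void notifyTermIndexUpdated() {
    snapshotsTaken++;
  }

  long getBlockCommitSequenceId() {
    return blockCommitSequenceId;
  }

  int getSnapshotsTaken() {
    return snapshotsTaken;
  }
}

final class BcsIdAssertionDemo {
  /** Returns {BCSID before, BCSID after, snapshots taken in between}. */
  static long[] run(int asyncEvents) {
    ModelContainer container = new ModelContainer();
    container.applyTransaction(1);
    container.applyTransaction(2);
    container.markUnhealthy();

    long bcsIdBefore = container.getBlockCommitSequenceId();
    int snapshotsBefore = container.getSnapshotsTaken();

    // An unpredictable number of background events may fire here.
    for (int i = 0; i < asyncEvents; i++) {
      container.notifyTermIndexUpdated();
    }
    container.applyTransaction(3); // rejected: container is UNHEALTHY

    return new long[] {
        bcsIdBefore,
        container.getBlockCommitSequenceId(),
        container.getSnapshotsTaken() - snapshotsBefore
    };
  }
}
```

In this sketch the snapshot delta varies with the number of background events, while the BCSID comparison gives the same answer regardless, which is the property the new assertion relies on.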

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13482
https://issues.apache.org/jira/browse/HDDS-12215
https://issues.apache.org/jira/browse/HDDS-14962
https://issues.apache.org/jira/browse/HDDS-6115

How was this patch tested?

Before changes: TestContainerStateMachine failed 22 times in 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26375145366

After changes: TestContainerStateMachine passed all 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26791609635
