HDDS-13482. Intermittent failure in TestContainerStateMachineFailures by chungen0126 · Pull Request #10397 · apache/ozone

chungen0126 · 2026-05-31T07:29:09Z

What changes were proposed in this pull request?

Summary

Fix intermittent failures in TestContainerStateMachine#testApplyTransactionFailure, TestContainerStateMachine#testContainerStateMachineRestartWithDNChangePipeline, testWriteStateMachineDataIdempotencyWithClosedContainer, and testApplyTransactionIdempotencyWithClosedContainer.

Changes

For `testWriteStateMachineDataIdempotencyWithClosedContainer`:

The flakiness stemmed from a race between a retried write and a close container request. The test expects a retry with identical data to be idempotent, but it failed intermittently because the initial write and the retried write contained different data.

Case A (Success): If close container executes first, no error occurs.
Case B (Failure): If the retry write executes before the close container, the written data "hello" no longer matches the committed metadata. The container still closes successfully, but the scanner later marks it as "unhealthy" because of a checksum mismatch.

Fix: Updated the test to ensure data consistency during retries or adjusted the timing expectations to handle the race condition correctly.
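The mismatch in Case B can be illustrated with a minimal, self-contained sketch. It is not Ozone code: the class and method names are hypothetical, and CRC32 is only an assumed stand-in for whatever checksum the container metadata actually records.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical model of a chunk whose checksum is committed once,
// while a retried write may later overwrite the bytes on disk.
class RetryWriteChecksumSketch {

  // CRC32 is an assumption; the real checksum type may differ.
  static long crc(String data) {
    CRC32 crc = new CRC32();
    crc.update(data.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  // Returns true if a scanner comparing on-disk bytes against the
  // committed checksum would still consider the chunk healthy.
  public static boolean healthyAfterRetry(String initialData, String retryData) {
    long committedChecksum = crc(initialData); // recorded for the initial write
    String onDisk = retryData;                 // the retry overwrites the bytes
    return crc(onDisk) == committedChecksum;
  }

  public static void main(String[] args) {
    // Retry with identical data: idempotent, checksum still matches.
    System.out.println(healthyAfterRetry("world", "world"));
    // Retry with different data: checksum no longer matches.
    System.out.println(healthyAfterRetry("world", "hello"));
  }
}
```

In this model the retry is only harmless when it carries the same bytes as the initial write, which is the idempotency the test is meant to exercise.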

For testContainerStateMachineFailures

testContainerStateMachineFailures causes failures in subsequent tests, including testContainerStateMachineRestartWithDNChangePipeline. This occurs because the test triggers a Ratis storage reset that invalidates existing pipelines. Since these pipelines are closed passively via client-side retries rather than by the ScrubbingService, they remain in the PipelineManager, causing subsequent tests to erroneously select these stale pipelines and fail.

Fix: Make testContainerStateMachineFailures run at the end of the class.

For testApplyTransactionFailure

Intermittent failures in this test happen because the initial takeSnapshot call does not guarantee that the snapshot is the final one before the container data deletion. Any notifyTermIndexUpdated event occurring after that point triggers a new snapshot, leading to snapshot inconsistency.

Fix: Refactored the test. By capturing the snapshot after the container data is deleted, we ensure that the snapshot is the last one before the deletion. Subsequent transactions and snapshot operations are then applied to verify that these actions do not alter the existing, consistent snapshot.
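The ordering problem can be sketched with a toy model. This is not the Ozone ContainerStateMachine: the class below is hypothetical, the "background" transaction and snapshot are called explicitly to stand in for asynchronous Ratis events, and it assumes that a machine marked unhealthy refuses further snapshots.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotBaselineSketch {

  // Toy stand-in for a state machine (hypothetical, for illustration only).
  static final class ToyStateMachine {
    private long appliedIndex;
    private boolean healthy = true;
    private final List<Long> snapshots = new ArrayList<>();

    // Applies one transaction; a failed apply marks the machine unhealthy.
    void apply(boolean containerDataPresent) {
      if (!healthy) {
        return;
      }
      if (!containerDataPresent) {
        healthy = false;
        return;
      }
      appliedIndex++;
    }

    // Records a snapshot at the current index; an unhealthy machine refuses.
    void takeSnapshot() {
      if (healthy) {
        snapshots.add(appliedIndex);
      }
    }

    long latestSnapshot() {
      return snapshots.get(snapshots.size() - 1);
    }
  }

  // Old ordering: the baseline is captured before a background snapshot.
  public static boolean baselineBeforeDeletionIsStable() {
    ToyStateMachine sm = new ToyStateMachine();
    sm.apply(true);
    sm.takeSnapshot();
    long baseline = sm.latestSnapshot();
    sm.apply(true);      // background transaction
    sm.takeSnapshot();   // background snapshot replaces the baseline
    sm.apply(false);     // container data deleted: apply fails
    sm.takeSnapshot();   // refused, machine is unhealthy
    return baseline == sm.latestSnapshot();
  }

  // New ordering: the baseline is captured after the deletion point.
  public static boolean baselineAfterDeletionIsStable() {
    ToyStateMachine sm = new ToyStateMachine();
    sm.apply(true);
    sm.takeSnapshot();
    sm.apply(true);      // background transaction
    sm.takeSnapshot();   // background snapshot
    long baseline = sm.latestSnapshot(); // captured after data deletion
    sm.apply(false);     // apply fails
    sm.takeSnapshot();   // refused
    return baseline == sm.latestSnapshot();
  }

  public static void main(String[] args) {
    System.out.println(baselineBeforeDeletionIsStable());
    System.out.println(baselineAfterDeletionIsStable());
  }
}
```

In the model, only the baseline taken after the deletion point is guaranteed to be the last snapshot, so only that comparison is deterministic.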

For testApplyTransactionIdempotencyWithClosedContainer

Fix flaky testApplyTransactionFailure by asserting BlockCommitSequenceId

Relying on Ratis snapshot files to verify that an unhealthy container accepts no new writes is flaky, as snapshots can be triggered asynchronously by internal Ratis events (like notifyTermIndexUpdated).

This fix removes the unreliable snapshot file comparison and instead asserts that the container's BlockCommitSequenceId remains unchanged after transitioning to the UNHEALTHY state. This deterministically proves that no new transactions were applied to the data layer.
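The shape of that assertion can be sketched as follows. The container class here is a hypothetical toy, not the Ozone container implementation, and it assumes that a commit against an UNHEALTHY container is rejected without touching the sequence id.

```java
class BcsIdAssertionSketch {

  enum State { OPEN, UNHEALTHY }

  // Toy container that tracks only its state and BlockCommitSequenceId.
  static final class ToyContainer {
    private State state = State.OPEN;
    private long blockCommitSequenceId;

    // Commits a block only while the container is not UNHEALTHY.
    boolean commitBlock(long bcsId) {
      if (state == State.UNHEALTHY) {
        return false;
      }
      blockCommitSequenceId = Math.max(blockCommitSequenceId, bcsId);
      return true;
    }

    void markUnhealthy() {
      state = State.UNHEALTHY;
    }

    long getBlockCommitSequenceId() {
      return blockCommitSequenceId;
    }
  }

  // Returns the sequence id observed before and after a rejected commit.
  public static long[] bcsIdBeforeAndAfterRejectedCommit() {
    ToyContainer container = new ToyContainer();
    container.commitBlock(5);
    container.markUnhealthy();
    long before = container.getBlockCommitSequenceId();
    container.commitBlock(6); // rejected: container is UNHEALTHY
    long after = container.getBlockCommitSequenceId();
    return new long[] {before, after};
  }

  public static void main(String[] args) {
    long[] ids = bcsIdBeforeAndAfterRejectedCommit();
    System.out.println(ids[0] == ids[1]);
  }
}
```

Comparing a value owned by the data layer avoids any dependence on when snapshot files happen to be written.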

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13482
https://issues.apache.org/jira/browse/HDDS-12215
https://issues.apache.org/jira/browse/HDDS-14962
https://issues.apache.org/jira/browse/HDDS-6115

How was this patch tested?

Before changes: TestContainerStateMachine failed 22 times in 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26375145366

After changes: TestContainerStateMachine passed all 20 * 10 iterations. https://github.com/chungen0126/ozone/actions/runs/26791609635

chungen0126 added 8 commits May 31, 2026 09:21

HDDS-13482. Intermittent failure in TestContainerStateMachineFailures

63b20e0

fix testApplyTransactionIdempotencyWithClosedContainer

6cdd22b

fix checkstyle

9550d08

rewrite testApplyTransactionFailure

dc41163

fix testApplyTransactionFailure

19765cb

fix checkstyle

a1e6bd9

refactor testApplyTransactionFailure

43c6589

fix testApplyTransactionFailure

2b2a6a6

chungen0126 wants to merge 8 commits into apache:master from chungen0126:HDDS-13482
