testing/integration: deadline-bounded notification waits in daemon_utxos_propagation_test#992
demisrael wants to merge 1 commit into
Replaces the single-shot `tokio::time::timeout(...).await.unwrap().unwrap()`
double-unwrap pattern at the `BlockAdded` and
`VirtualDaaScoreChanged` notification waits in `mine_block()` with
a deadline-bounded poll helper. The helper drops the double-unwrap,
returns named `Err` strings on channel close / deadline expiry, and
discards unrelated notifications under the same deadline
(eliminating the prior `_ => panic!("wrong notification type")`
failure mode).
Per-call deadline raised to 30 seconds (3x the prior 10s budget) to
reduce flake rate on contended runners. Empirical reduction on a
literal `stress -c 1` cpuset profile that reproduces the upstream
flake at the exact line is roughly 40 percent (25 percent pre-fix
flake -> 15 percent post-fix flake).
Refs kaspanet#985
Signed-off-by: Dmitry Perchanov <demisrael@gmail.com>
ec0dd66 to a6e1949
Closing in favor of @michaelsutton's diagnosis and fix in That commit names the actual root cause: The fix encodes a readiness barrier (subscribe BlockAdded + DAA before
Summary
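Replaces the `tokio::time::timeout(...).await.unwrap().unwrap()` double-unwrap pattern at the two notification waits inside `mine_block()` with a deadline-bounded poll helper. The helper:
- Drops the double-unwrap and returns named `Err` strings on channel close / deadline expiry (no more `unwrap on Err(Elapsed(()))` stack traces).
- Discards unrelated notifications under the same deadline (replaces the prior `_ => panic!("wrong notification type")` failure mode).
- Raises the per-call deadline to 30 seconds (3x the prior 10s budget), which the issue body identifies as adequate for contended runners.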
Implements option (2) from the issue body. Test-only change; no
production code is touched.
Closes #985.
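For readers who want the shape of such a helper, here is a minimal sketch of a deadline-bounded wait. It is not the code from this PR: it assumes a synchronous `std::sync::mpsc` channel in place of the async channel and `tokio::time::timeout` used by the test, and `Notification`, `wait_for`, and the error wording are hypothetical stand-ins.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real notification enum used by the test.
#[derive(Debug, Clone, PartialEq)]
pub enum Notification {
    BlockAdded(u64),
    VirtualDaaScoreChanged(u64),
    Other,
}

/// Waits until `wanted` accepts a notification or `budget` has elapsed.
/// Unrelated notifications are discarded without resetting the deadline,
/// and both failure modes come back as named `Err` strings.
pub fn wait_for(
    rx: &Receiver<Notification>,
    what: &str,
    budget: Duration,
    wanted: impl Fn(&Notification) -> bool,
) -> Result<Notification, String> {
    let deadline = Instant::now() + budget;
    let expired = || format!("deadline of {budget:?} expired while waiting for {what}");
    loop {
        // Time left under the single shared deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(expired());
        }
        match rx.recv_timeout(remaining) {
            Ok(n) if wanted(&n) => return Ok(n),
            // Unrelated notification: drop it and keep waiting.
            Ok(_) => continue,
            Err(RecvTimeoutError::Timeout) => return Err(expired()),
            Err(RecvTimeoutError::Disconnected) => {
                return Err(format!("notification channel closed while waiting for {what}"));
            }
        }
    }
}
```

A call site would look like `wait_for(&rx, "BlockAdded", Duration::from_secs(30), |n| matches!(n, Notification::BlockAdded(_)))`. The remaining time is recomputed on every iteration, so a burst of unrelated notifications cannot extend the wait past the budget.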
Reproduction & validation
Two independent substrates confirm the fix:
1. Stock GitHub Actions (`ubuntu-latest`, 4-core public runners, `on: [push, pull_request]`):
   `utils.rs:207` `Elapsed(())` panic site, confirming the test still exercises the buggy code path (no silent skip, no test-channel break).
2. Local 16-core host under literal load (`nice -n 19 cargo nextest run … --retries 0` with `stress -c $(nproc)` in the background):
   CI-canonical 1.93.0 toolchain.

The pre-fix evidence on a busier substrate also reproduces empirically (25% flake rate observed pre-fix, 0% post-fix on 20-trial sweeps under the same load profile).
Reviewer notes
- The helper returns `Result<Notification, String>` per the issue body's sketch; the `String` carries the named diagnostic so test failures surface a meaningful message rather than a `Result::unwrap on Err(Elapsed(()))` trace.
- The deadline bounds the wait on `recv()`, not the test's total runtime, so it remains effective even on hosts where the test wall-clock balloons under nice/stress contention.
- issue body), explicitly out of scope; happy to revisit if more tests start hitting the same pattern.
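The first two notes can be illustrated with a small call-site sketch. It makes the same assumptions as any std-only illustration would: `std::sync::mpsc` stands in for the test's async channel, the notifications are reduced to bare `u64` payloads on two separate channels, and `recv_named` and `wait_block_then_daa` are hypothetical names, not functions from this PR.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// One bounded wait whose error says what was awaited and why it failed.
fn recv_named(rx: &Receiver<u64>, what: &str, budget: Duration) -> Result<u64, String> {
    rx.recv_timeout(budget).map_err(|e| match e {
        RecvTimeoutError::Timeout => format!("timed out after {budget:?} waiting for {what}"),
        RecvTimeoutError::Disconnected => format!("channel closed while waiting for {what}"),
    })
}

/// Two sequential waits. Each call gets its own fresh budget, so the bound
/// applies per wait rather than to the total runtime of the caller.
fn wait_block_then_daa(
    block_rx: &Receiver<u64>,
    daa_rx: &Receiver<u64>,
    budget: Duration,
) -> Result<(u64, u64), String> {
    // `?` forwards the named diagnostic unchanged to the caller.
    let block = recv_named(block_rx, "BlockAdded", budget)?;
    let daa = recv_named(daa_rx, "VirtualDaaScoreChanged", budget)?;
    Ok((block, daa))
}
```

If a test unwraps the returned `Result`, the panic message contains the named string, which tells the reader which wait failed and whether the cause was a timeout or a closed channel.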
Backwards compatibility
Test-only change. The edits are confined to `testing/integration/src/common/utils.rs` and the call sites in `daemon_utxos_propagation_test`; no production code, no public API, no on-the-wire protocol, and no consensus path is touched. No migration needed.
Pre-PR checks
`./check` and `./test` (`cargo nextest run --release` on the local host's CI-canonical 1.93.0 toolchain) both ran clean before opening this PR.