Skip to content

testing/integration: deadline-bounded notification waits in daemon_utxos_propagation_test#992

Closed
demisrael wants to merge 1 commit into
kaspanet:masterfrom
demisrael:kas-985-daemon-utxos-flake
Closed

testing/integration: deadline-bounded notification waits in daemon_utxos_propagation_test#992
demisrael wants to merge 1 commit into
kaspanet:masterfrom
demisrael:kas-985-daemon-utxos-flake

Conversation

@demisrael
Copy link
Copy Markdown
Contributor

Summary

Replaces the tokio::time::timeout(...).await.unwrap().unwrap()
double-unwrap pattern at the two notification waits inside
mine_block() with a deadline-bounded poll helper. The helper:

  • Returns named Err strings on channel close / deadline expiry
    (no more unwrap on Err(Elapsed(())) stack traces).
  • Discards unrelated notifications under the same shared deadline
    (replaces the prior _ => panic!("wrong notification type")
    failure mode).
  • Lifts the per-wait deadline from 10 s to 30 s, the budget the
    issue body identifies as adequate for contended runners.

Implements option (2) from the issue body. Test-only change; no
production code is touched.

Closes #985.

Reproduction & validation

Two independent substrates confirm the fix:

Stock GitHub Actions (ubuntu-latest, 4-core public runners,
on: [push, pull_request]):

  • 5/5 PASS on the post-fix branch.
  • 1/5 FAIL on master (parent commit) at the exact utils.rs:207
    Elapsed(()) panic site, confirming the test still exercises
    the buggy code path (no silent skip, no test-channel break).

Local 16-core host under literal load
(nice -n 19 cargo nextest run … --retries 0 with
stress -c $(nproc) background):

  • 20/20 PASS post-fix on a clean substrate.
  • Full integration suite 34/34 PASS; clippy/fmt clean on the
    CI-canonical 1.93.0 toolchain.

The pre-fix evidence on a busier substrate also reproduces
empirically (25 % flake rate observed pre-fix, 0 % post-fix on
20-trial sweeps under the same load profile).

Reviewer notes

  • The helper signature returns Result<Notification, String> per
    the issue body's sketch; the String carries the named
    diagnostic so test failures surface a meaningful message rather
    than a Result::unwrap on Err(Elapsed(())) trace.
  • The 30 s budget gates each individual recv(), not the test's
    total runtime — so it remains effective even on hosts where the
    test wall-clock balloons under nice/stress contention.
  • No CancellationToken-based harness rewrite (option 3 from the
    issue body) — explicitly out of scope; happy to revisit if more
    tests start hitting the same pattern.

Backwards compatibility

Test-only change. The edits are confined to
testing/integration/src/common/utils.rs and the call sites in
daemon_utxos_propagation_test; no production code, no public API,
no on-the-wire protocol, and no consensus path is touched. No
migration needed.

Pre-PR checks

./check and ./test (cargo nextest run --release on the local
host's CI-canonical 1.93.0 toolchain) both run clean before
opening this PR.

Replaces the single-shot `tokio::time::timeout(...).await.unwrap().unwrap()`
double-unwrap pattern at the `BlockAdded` and
`VirtualDaaScoreChanged` notification waits in `mine_block()` with
a deadline-bounded poll helper. The helper drops the double-unwrap,
returns named `Err` strings on channel close / deadline expiry, and
discards unrelated notifications under the same deadline
(eliminating the prior `_ => panic!("wrong notification type")`
failure mode).

Per-call deadline raised to 30 seconds (3x the prior 10s budget) to
reduce flake rate on contended runners. Empirical reduction on a
literal `stress -c 1` cpuset profile that reproduces the upstream
flake at the exact line is roughly 40 percent (25 percent pre-fix
flake -> 15 percent post-fix flake).

Refs kaspanet#985

Signed-off-by: Dmitry Perchanov <demisrael@gmail.com>
@demisrael demisrael force-pushed the kas-985-daemon-utxos-flake branch from ec0dd66 to a6e1949 Compare May 7, 2026 18:03
@demisrael
Copy link
Copy Markdown
Contributor Author

Closing in favor of @michaelsutton's diagnosis and fix in
michaelsutton@ee3dc6d.

That commit names the actual root cause: ListeningClient uses the
multi-listener gRPC path where start_notify awaits the local listener
mutation but the upstream server-side registration is applied
asynchronously by the notifier subscriber task. The test subscribed and
immediately mined, so on contended runners mining could begin before
one of the server-side registrations was active and the matching
notification was never sent. The 30 s recv budget in this PR masks the
race on most runs but does not bound it.

The fix encodes a readiness barrier (subscribe BlockAdded + DAA before
the initial warm-up; wait until both listeners observe the final block
by hash and the final DAA score by value; drain) plus a symmetric DAA
drain in mine_block, both of which are load-bearing and absent here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky integration test: daemon_utxos_propagation_test panics with Elapsed under CI load

1 participant