rt: don't notify when pushing task to empty LIFO slot#8069
rt: don't notify when pushing task to empty LIFO slot#8069
Conversation
|
However, I'm not 100% sure that this change is correct, given #8065 (comment) (which is what originally prompted me to remove the I worry that it would be possible to re-introduce the deadlock that can occur when a task is placed in the LIFO slot and the worker thread is immediately blocked by the task that notified it, because the blocked worker won't have any further reasons to unpark another worker while it is blocked. However, the tokio/tokio/tests/rt_threaded.rs Lines 733 to 743 in b010b5d I wonder if removing that task from the test makes it pass on 1.52.1 but hang on this branch. Let's find out.. |
Yup. If we remove diff --git a/tokio/tests/rt_threaded.rs b/tokio/tests/rt_threaded.rs
index 133ba4c3..ad00abaa 100644
--- a/tokio/tests/rt_threaded.rs
+++ b/tokio/tests/rt_threaded.rs
@@ -730,18 +730,6 @@ fn lifo_stealable() {
.unwrap();
rt.block_on(async {
- // Keep the runtime busy so that the workers that might steal the
- // blocked task don't all park themselves forever.
- //
- // Since this task will always be woken by whichever worker is holding
- // the time driver, rather than a worker that's executing tasks, it
- // shouldn't ever kick the victim task out of its worker's LIFO slot.
- let churn = tokio::spawn(async move {
- loop {
- tokio::time::sleep(Duration::from_millis(4)).await;
- }
- });
-
let victim_task_joined = tokio::spawn(async move {
println!("[victim] task started");
task_started_tx.send(()).unwrap();
@@ -791,7 +779,6 @@ fn lifo_stealable() {
// Before possibly panicking, make sure that we wake up the blocker task
// so that it doesn't stop the runtime from shutting down.
block_thread_tx.send(()).unwrap();
- churn.abort();
result
.expect("task in LIFO slot should complete within 30 seconds")
.expect("task in LIFO slot should not panic");...then the test still passes on eliza@hekate ~/Code/tokio $ git co master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
eliza@hekate ~/Code/tokio $ git rev-parse HEAD
b010b5ddafb316e5d540de1736f5f7bfde42ec39
eliza@hekate ~/Code/tokio $ cargo nextest run lifo_stealable
Compiling examples v0.0.0 (/home/eliza/Code/tokio/examples)
Compiling tokio-util v0.7.18 (/home/eliza/Code/tokio/tokio-util)
Compiling tokio v1.52.1 (/home/eliza/Code/tokio/tokio)
Finished `test` profile [unoptimized + debuginfo] target(s) in 2.62s
────────────
Nextest run ID 004702f1-2de3-42e8-b18c-69606f942b2f with nextest profile: default
Starting 1 test across 235 binaries (1717 tests skipped)
PASS [ 0.003s] tokio::rt_threaded lifo_stealable
────────────
Summary [ 0.005s] 1 test run: 1 passed, 1717 skipped
eliza@hekate ~/Code/tokio $ git checkout eliza/should-notify-again
M tokio/tests/rt_threaded.rs
Switched to branch 'eliza/should-notify-again'
Your branch is up to date with 'origin/eliza/should-notify-again'.
eliza@hekate ~/Code/tokio $ git rev-parse HEAD
bc62e2d94cb84d556af7d6a707405baa1058e832
eliza@hekate ~/Code/tokio $ cargo nextest run lifo_stealable
Compiling tokio v1.52.1 (/home/eliza/Code/tokio/tokio)
Compiling tokio-stream v0.1.18 (/home/eliza/Code/tokio/tokio-stream)
Compiling tokio-util v0.7.18 (/home/eliza/Code/tokio/tokio-util)
Compiling tests-integration v0.1.0 (/home/eliza/Code/tokio/tests-integration)
Compiling tokio-test v0.4.5 (/home/eliza/Code/tokio/tokio-test)
Compiling tokio-macros v2.7.0 (/home/eliza/Code/tokio/tokio-macros)
Compiling stress-test v0.1.0 (/home/eliza/Code/tokio/stress-test)
Compiling examples v0.0.0 (/home/eliza/Code/tokio/examples)
Finished `test` profile [unoptimized + debuginfo] target(s) in 3.77s
────────────
Nextest run ID 67fc6e9f-99cd-48d7-bddd-582cce200425 with nextest profile: default
Starting 1 test across 235 binaries (1717 tests skipped)
FAIL [ 30.020s] tokio::rt_threaded lifo_stealable
stdout ───
running 1 test
[victim] task started
[victim] task waiting for wakeup...
[main] victim slot task start acked!
[main] blocker task spawned
[blocker] sending wakeup
[blocker] blocking the worker thread...
[main] result: Err(Elapsed(()))
[blocker] done
[victim] task running after wakeup
test lifo_stealable ... FAILED
failures:
failures:
lifo_stealable
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 28 filtered out; finished in 30.02s
stderr ───
thread 'lifo_stealable' (3775832) panicked at tokio/tests/rt_threaded.rs:783:14:
task in LIFO slot should complete within 30 seconds: Elapsed(())
stack backtrace:
0: __rustc::rust_begin_unwind
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/std/src/panicking.rs:689:5
1: core::panicking::panic_fmt
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/core/src/panicking.rs:80:14
2: core::result::unwrap_failed
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/core/src/result.rs:1867:5
3: core::result::Result<T,E>::expect
at /home/eliza/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1185:23
4: rt_threaded::lifo_stealable::{{closure}}
at ./tests/rt_threaded.rs:783:14
5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at ./src/runtime/park.rs:284:71
6: tokio::task::coop::with_budget
at ./src/task/coop/mod.rs:167:5
7: tokio::task::coop::budget
at ./src/task/coop/mod.rs:133:5
8: tokio::runtime::park::CachedParkThread::block_on
at ./src/runtime/park.rs:284:31
9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at ./src/runtime/context/blocking.rs:66:14
10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at ./src/runtime/scheduler/multi_thread/mod.rs:92:22
11: tokio::runtime::context::runtime::enter_runtime
at ./src/runtime/context/runtime.rs:65:16
12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at ./src/runtime/scheduler/multi_thread/mod.rs:91:9
13: tokio::runtime::runtime::Runtime::block_on_inner
at ./src/runtime/runtime.rs:373:50
14: tokio::runtime::runtime::Runtime::block_on
at ./src/runtime/runtime.rs:345:18
15: rt_threaded::lifo_stealable
at ./tests/rt_threaded.rs:732:8
16: rt_threaded::lifo_stealable::{{closure}}
at ./tests/rt_threaded.rs:701:20
17: core::ops::function::FnOnce::call_once
at /home/eliza/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
18: <fn() -> core::result::Result<(), alloc::string::String> as core::ops::function::FnOnce<()>>::call_once
at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Cancelling due to test failure:
────────────
Summary [ 30.023s] 1 test run: 0 passed, 1 failed, 1717 skipped
FAIL [ 30.020s] tokio::rt_threaded lifo_stealable
error: test run failed
eliza@hekate ~/Code/tokio $ Unfortunately, I think that this means we should not move forwards with putting back a Sadly, this means we have to accept the performance impact of additional notifications as part of the cost of preventing deadlocks here. Personally, I feel that this is appropriate, because I don't think we should consider a performance decrease from code that behaved incorrectly to be a "regression". In the long term, I think we might be able to reduce the amount of cross-thread notifications here by changing the logic for notifying a worker when pushing tasks to the queue so that it is less eager about waking workers in general. At present we always notify another worker if there are no workers currently in searching and any workers are parked: We might consider changing the "lifo slot only" case to notify only if all other workers are parked, or something, so that we will notify here if it is necessary to prevent a deadlock, but not if there are no other currently searching workers. But I think we'll want to think carefully about that. @Darksonn, what do you think? |
Now, we don't rely on the presence of another task to prevent workers from parking in the test. This way, it validates that we are correctly unparking other workers. See #8069 (comment)
Now, we don't rely on the presence of another task to prevent workers from parking in the test. This way, it validates that we are correctly unparking other workers. See #8069 (comment)
Adds a loom test that reproduces a stranded-task race in the `any_lifo` global-bit + bounded-timeout-park protocol introduced on top of tokio-rs#8069. When the `ANY_LIFO` bit is already set by a prior pusher in the same wave (the optimization's normal operating state), two `SeqCst` operations on distinct memory locations can interleave such that: - worker B (running `steal_stranded_lifo` after a timed park) reads the bit as set, then scans peer LIFO cells and observes worker W's cell as empty (legal under SC across two locations because B's cell-load can sit before W's cell-store in the SC total order even when the bit-load physically precedes the cell-store); - worker W's `push_lifo` populates its cell, then `put_lifo` reads `prev=true` (because the bit was already set) and W suppresses its notify; - worker B's scan having found nothing, B calls `clear_lifo` and parks. The cell now holds a task with no remaining mechanism to discover it until either another push starts a new wave or the 100 ms `LIFO_EXCLUSIVITY_TIMEOUT` fires. The test lives at `runtime::tests::loom_multi_thread::lifo_bit_race` and exercises real `Idle` and `queue::Local` / `queue::Steal` from the patch. Three tests: - `lifo_bit_race_pre_set` (FAIL): bit pre-set, race reachable. - `control_no_preset_is_safe` (pass): bit starts cleared so W is forced to notify -- if this ever fails, the harness is broken. - `control_always_notify_is_safe` (pass): `if !put_lifo()` removed -- if this ever fails, the harness is broken. Visibility bumps required to construct `Idle` and call its bit-management methods from `runtime::tests::loom_multi_thread`: - `multi_thread/mod.rs`: `mod idle;` -> `pub(crate) mod idle;` - `idle.rs`: `pub(super)` -> `pub(crate)` on `Idle`, `Synced`, `Idle::new`, `put_lifo`, `clear_lifo`, `should_attempt_lifo_steal`. Run with: RUSTFLAGS="--cfg loom -C debug-assertions" \ cargo test --release --lib --features full \ runtime::tests::loom_multi_thread::lifo_bit_race The `-C debug-assertions` is required by the existing `compile_error!("these tests require debug assertions to be enabled")` guard in `runtime/tests/mod.rs`. Made-with: Cursor
Adds a loom test that reproduces a stranded-task race in the `any_lifo` global-bit + bounded-timeout-park protocol introduced on top of tokio-rs#8069.
Adds a loom test that reproduces a stranded-task race in the `any_lifo` global-bit + bounded-timeout-park protocol introduced on top of tokio-rs#8069. Also run rustfmt --edition 2021
Currently, the `rt_threaded::lifo_stealable` test I added in #7431 spawns an additional task which sleeps on a 4ms timer in a loop. This ensures that no worker remains permanently parked. This was added because it was necessary to stop the LIFO slot deadlock from occurring prior to changes in the logic for determining whether to notify another worker, which is what @Darksonn was referring to in [this comment][1]. Removing the `churn()` test makes the test actually validate that another worker is notified to steal the LIFO task, and that the changes from #7431 will *always* prevent a LIFO slot deadlock, regardless of the behavior of other tasks on the runtime. See also [this comment][2] for further discussion. [1]: #7431 (comment) [2]: #8069 (comment)
|
I think it is basically correct that you must issue these wakeups to avoid issues where a blocking task prevents execution of the task in the lifo slot. We could perhaps consider entirely different solutions where a second thread wakes up periodically to check if the running worker thread is making progress. Of course that's a bigger runtime change. |
See #8065
This change un-does the increase in notifications described in #8065. Running the repro from #8065 (comment), I see voluntary context switches back to 1.50.0 counts (note the
cargo buildoutput saying that this is a local Tokio checkout):