
rt: shard the multi-thread inject queue to reduce remote spawn contention#7973

Open
alex wants to merge 6 commits into tokio-rs:master from alex:shard-remote-lock

Conversation

@alex
Contributor

@alex alex commented Mar 13, 2026

The multi-threaded scheduler's inject queue was protected by a single global mutex (shared with idle coordination state). Every remote task spawn — any spawn from outside a worker thread — acquired this lock, serializing concurrent spawners and limiting throughput.

This change introduces inject::Sharded, which splits the inject queue into up to 8 independent shards, each an existing Shared/Synced pair with its own mutex and cache-line padding.

Design (see the sketch after this list):

  • Push: each thread is assigned a home shard on first push (via a global counter) and sticks with it. This keeps consecutive pushes from one thread cache-local while spreading distinct threads across distinct locks.
  • Pop: workers rotate through shards starting at their own index, skipping empty shards via per-shard atomic length. pop_n drains from one shard at a time to keep critical sections bounded.
  • Shard count: capped at 8 (and 1 under loom). Contention drops off steeply past a handful of shards, and is_empty()/len() scan all shards in the worker hot loop.
  • is_closed: a single Release atomic set after all shards are closed, so the shutdown check stays lock-free.
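For illustration, here is a minimal sketch of that shape in plain std Rust -- not tokio's actual code, which uses its own task type, loom-aware sync primitives, and cache-line padding:

use std::cell::Cell;
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Power of two, so `& (MAX_SHARDS - 1)` works as a cheap modulo.
const MAX_SHARDS: usize = 8;
const UNASSIGNED: usize = usize::MAX;

// Global counter handing out home shards to pushing threads round-robin.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Sticky home shard for this thread, assigned on first push.
    static HOME_SHARD: Cell<usize> = Cell::new(UNASSIGNED);
}

type Task = Box<dyn FnOnce() + Send>;

struct Shard {
    queue: Mutex<VecDeque<Task>>,
    // Cached length lets poppers skip empty shards without locking.
    len: AtomicUsize,
}

pub struct Sharded {
    shards: Vec<Shard>,
}

impl Sharded {
    pub fn new() -> Self {
        let shards = (0..MAX_SHARDS)
            .map(|_| Shard {
                queue: Mutex::new(VecDeque::new()),
                len: AtomicUsize::new(0),
            })
            .collect();
        Sharded { shards }
    }

    pub fn push(&self, task: Task) {
        let idx = HOME_SHARD.with(|cell| {
            if cell.get() == UNASSIGNED {
                // Spread distinct threads across distinct locks; after this,
                // the thread stays hot on one mutex and list tail.
                cell.set(NEXT_SHARD.fetch_add(1, Ordering::Relaxed) & (MAX_SHARDS - 1));
            }
            cell.get()
        });
        let shard = &self.shards[idx];
        let mut queue = shard.queue.lock().unwrap();
        queue.push_back(task);
        shard.len.store(queue.len(), Ordering::Release);
    }

    /// Pop one task, rotating through shards starting at the worker's index.
    pub fn pop(&self, worker_idx: usize) -> Option<Task> {
        for i in 0..MAX_SHARDS {
            let shard = &self.shards[(worker_idx + i) & (MAX_SHARDS - 1)];
            // Fast path: skip empty shards without taking the lock.
            if shard.len.load(Ordering::Acquire) == 0 {
                continue;
            }
            let mut queue = shard.queue.lock().unwrap();
            let task = queue.pop_front();
            shard.len.store(queue.len(), Ordering::Release);
            if task.is_some() {
                return task;
            }
        }
        None
    }

    pub fn is_empty(&self) -> bool {
        // Scans every shard; part of why the shard count stays small.
        self.shards.iter().all(|s| s.len.load(Ordering::Acquire) == 0)
    }
}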

Random shard selection via context::thread_rng_n (as used in #7757 for the blocking pool) was measured and found to be 20-33% slower on remote_spawn at 8+ threads. The inject workload is a tight loop of trivial pushes where producer-side cache locality dominates: with RNG, a hot thread bounces between shard cache lines on every push; with sticky assignment it stays hot on one mutex and list tail. RNG did win slightly (5-9%) on single-producer benchmarks where spreading tasks lets workers pop in parallel, but not enough to offset the regression at scale.
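For contrast, the rejected per-push random pick would replace the sticky lookup in the sketch above with something like the following (illustrative only; tokio's real helper is context::thread_rng_n, approximated here with a thread-local xorshift):

use std::cell::Cell;

const MAX_SHARDS: usize = 8; // power of two, as in the sketch above

// Rejected alternative: a fresh shard per push. This spreads tasks for
// poppers but bounces a hot producer across shard cache lines.
fn random_shard_idx() -> usize {
    thread_local! {
        static STATE: Cell<u32> = Cell::new(0x9E37_79B9);
    }
    STATE.with(|s| {
        // xorshift32: a cheap stand-in for tokio's internal thread RNG.
        let mut x = s.get();
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        s.set(x);
        (x as usize) & (MAX_SHARDS - 1)
    })
}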

The inject state is removed from the global Synced mutex, which now only guards idle coordination. This also helps the single-threaded path since remote pushes no longer contend with worker park/unpark.

Results on remote_spawn benchmark (12,800 no-op tasks, N spawner threads, 64-core box):

threads    before     after     improvement
1           9.38 ms   7.33 ms   -22%
2          14.94 ms   6.64 ms   -56%
4          23.69 ms   5.34 ms   -77%
8          34.81 ms   4.69 ms   -87%
16         32.33 ms   4.54 ms   -86%
32         30.37 ms   4.73 ms   -84%
64         26.59 ms   5.34 ms   -80%

rt_multi_threaded benchmarks: spawn_many_local -8%, spawn_many_remote_idle -7%, yield_many -1%, rest neutral.

Developed in conjunction with Claude.

@github-actions github-actions Bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR labels Mar 13, 2026
@alex alex force-pushed the shard-remote-lock branch 3 times, most recently from adfa19f to de52c66 Compare March 13, 2026 23:59
@ADD-SP
Member

ADD-SP commented Mar 14, 2026

Is the inject queue still a FIFO queue?

@ADD-SP ADD-SP added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime T-performance Topic: performance and benchmarks labels Mar 14, 2026
@alex
Contributor Author

alex commented Mar 14, 2026

Approximately -- each of the queue shards is FIFO, but nothing attempts to ensure ordering cross-shard.

@alex alex force-pushed the shard-remote-lock branch from de52c66 to ca7604a Compare March 31, 2026 03:28
Member

@Darksonn Darksonn left a comment


In general this looks quite reasonable. Let's do it.

Comment thread tokio/src/runtime/scheduler/inject/sharded.rs Outdated
Comment thread tokio/src/runtime/scheduler/inject/sharded.rs
Comment thread tokio/src/runtime/scheduler/inject/sharded.rs Outdated
@alex alex force-pushed the shard-remote-lock branch 2 times, most recently from 71aadec to ba043df Compare April 2, 2026 11:34
Comment on lines +63 to +67
/// Home shard index for the sharded inject queue. External threads
/// pushing tasks are assigned a shard on first push and stick with it
/// for cache locality. `INJECT_SHARD_UNASSIGNED` means not yet assigned.
#[cfg(feature = "rt-multi-thread")]
inject_push_shard: Cell<usize>,
Member


How does this interact with block_in_place? Let's say thread A is assigned shard 4, and then it invokes block_in_place and stops being a worker thread. One of the blocking threads becomes a worker thread to replace it. Then it might pick a different shard from 4, right? I guess this means we can end up with multiple workers using the same shard.

Member


I guess this is the wrong question ... this is for when you push to a runtime from outside it, so you're not talking about your own shard.

Contributor Author


Yes, but that's true either way -- our max number of shards is 8, and you can have way more workers than that. A worker does not uniquely own its shard.

Member


Ok, I guess it's okay then. Should we periodically pick a new shard? Perhaps after every 100 spawns? That way, if a program has two threads that continuously spawn a lot on the same shard, then eventually they pick a new shard and stop contending.

Contributor Author


Hmm, I suppose we could do that for the pathological case. 100 spawns might be too few though? If you set it too low you risk having a single thread bounce around and just dirty up everyone else's cache lines.

My preference would be to land this without the adaptive behavior and do that as a follow-up -- even in the pathological case of two threads that happen to land on the same shard, this should still Pareto-dominate being unsharded.
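A sketch of what that follow-up could look like, reusing the shapes from the sketch above (the 1024 threshold is a placeholder, not a measured value):

use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_SHARDS: usize = 8;
const UNASSIGNED: usize = usize::MAX;
const REPICK_EVERY: usize = 1024; // guess; the right value needs measuring

static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    static HOME_SHARD: Cell<usize> = Cell::new(UNASSIGNED);
    static PUSH_COUNT: Cell<usize> = Cell::new(0);
}

// Like the sticky lookup, but re-assign the home shard every REPICK_EVERY
// pushes so two colliding threads eventually drift apart.
fn home_shard_with_repick() -> usize {
    let n = PUSH_COUNT.with(|c| {
        let n = c.get() + 1;
        c.set(n);
        n
    });
    HOME_SHARD.with(|cell| {
        if cell.get() == UNASSIGNED || n % REPICK_EVERY == 0 {
            cell.set(NEXT_SHARD.fetch_add(1, Ordering::Relaxed) & (MAX_SHARDS - 1));
        }
        cell.get()
    })
}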

Member


I'm raising the concern as a liveness issue, which per this guarantee makes it a correctness concern.

> how did you end up with an item in B queue but that worker never woken up?

The code for waking up a worker after pushing to the queue is here:

// Otherwise, use the inject queue.
self.push_remote_task(task);
self.notify_parked_remote();

There is no relationship between which shard the item was pushed to and which worker is woken up. All this code ensures is that, after pushing an item, if all workers are idle, then a worker is woken up. But it might not be the same worker as where the item was pushed, and if there is already a non-idle worker searching for work, no wakeup occurs.

Member


See also #8029

Member


I've been thinking more about this, and I've come to the conclusion that it is okay as-is. As a perf matter, it could be beneficial to give notify_parked_remote() a hint about which worker it should prefer to wake up, but not a blocker.
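For concreteness, a shard-aware wakeup might look roughly like this; everything here (parked_worker_near, the stub types) is made up for illustration and is not tokio's API:

// Stub types for the sketch; real tokio state is very different.
struct Worker;
impl Worker {
    fn unpark(&self) { /* wake the underlying thread */ }
}

struct Idle;
impl Idle {
    // Made-up helper: find a parked worker whose index is near `shard_idx`.
    fn parked_worker_near(&self, _shard_idx: usize) -> Option<Worker> {
        None
    }
}

struct Shared {
    idle: Idle,
}

impl Shared {
    fn notify_parked_remote(&self) { /* today's behavior: wake any parked worker */ }

    // Hypothetical shard-aware wakeup: after pushing to `shard_idx`, prefer
    // a worker whose pop rotation starts at that shard.
    fn notify_parked_remote_with_hint(&self, shard_idx: usize) {
        match self.idle.parked_worker_near(shard_idx) {
            Some(worker) => worker.unpark(),
            None => self.notify_parked_remote(),
        }
    }
}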

Member


I wonder if we can easily produce a test case for this scenario, perhaps by having one worker execute a future that just spawns a big pile of tasks in a loop without yielding? That's not a totally contrived scenario as you might imagine an accept loop or something under load where there's basically always new connections to handle without having to wait for a long period of time.
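A rough shape for such a test, using only the public tokio API (hypothetical, not from the PR; the counts and timeouts are arbitrary):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;

#[test]
fn busy_spawner_does_not_starve_other_tasks() {
    const N: usize = 10_000;
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();
    let done = Arc::new(AtomicUsize::new(0));
    rt.block_on(async {
        // The "accept loop" stand-in: one task spawns a pile of no-op
        // tasks in a tight loop without ever yielding.
        let spawner_done = done.clone();
        tokio::spawn(async move {
            for _ in 0..N {
                let done = spawner_done.clone();
                tokio::spawn(async move {
                    done.fetch_add(1, Ordering::Relaxed);
                });
            }
        });
        // Fail if the spawned tasks don't all run within the deadline.
        tokio::time::timeout(Duration::from_secs(10), async {
            while done.load(Ordering::Relaxed) < N {
                tokio::time::sleep(Duration::from_millis(5)).await;
            }
        })
        .await
        .expect("spawned tasks appear starved");
    });
}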

Member


The reason I'm no longer worried about it is that this scenario will cause all worker threads to wake up, and once all worker threads are alive, you can't have this kind of starvation.

Comment thread tokio/src/runtime/scheduler/multi_thread/worker.rs Outdated
@Darksonn
Member

Please rebase or merge master to avoid conflicts with the LIFO slot changes.

@alex alex force-pushed the shard-remote-lock branch from e5ab8fb to d2cc4fe Compare April 12, 2026 17:31
Member

@Darksonn Darksonn left a comment


LGTM

@Darksonn
Member

Ah, looks like loom timed out?

Comment thread tokio/src/runtime/scheduler/inject/sharded.rs Outdated
Comment thread tokio/src/runtime/scheduler/inject/sharded.rs
Comment thread tokio/src/runtime/scheduler/inject/sharded.rs
Comment thread tokio/src/runtime/scheduler/inject/sharded.rs
Comment on lines +105 to +110
for shard in self.shards.iter() {
    if !shard.shared.is_empty() {
        return false;
    }
}
true
Member


take it or leave it, but this could also be written as:

Suggested change
for shard in self.shards.iter() {
    if !shard.shared.is_empty() {
        return false;
    }
}
true
self.shards.iter().all(|shard| shard.shared.is_empty())

alex and others added 5 commits April 13, 2026 21:10
rt: shard the multi-thread inject queue to reduce remote spawn contention

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sharded::pop already checks each shard's emptiness as a fast path
before locking, so the outer is_empty scan was iterating all shards
twice for no benefit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix incorrect claim that a closed shard implies all shards are closed
(close() operates shard-by-shard), and document that MAX_SHARDS must
be a power of two.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alex alex force-pushed the shard-remote-lock branch from 6970571 to 8913d32 Compare April 14, 2026 01:15
@alex
Contributor Author

alex commented Apr 14, 2026

Is there a recommended way to debug the loom tests timing out? Looks like even in normal operation they take several hours :-( I'm assuming the likely cause of timeouts is accidentally making the state space too large? (I was hoping that limiting to 1 shard on loom would prevent that, but perhaps not.)

@Darksonn
Member

Limiting it to one shard is not necessarily enough. Any operation involved in concurrency (such as atomics or taking/releasing mutexes) that happens while other threads exist is a location where preemption can happen in loom, so as you introduce more mutex lock/unlocks during the part of the test with more than one thread, that increases the search space.
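As a toy illustration of the blow-up (not from the PR): even this two-atomic model gives loom several interleavings to explore, and every additional atomic access or lock operation multiplies them.

#[cfg(loom)]
#[test]
fn two_preemption_points() {
    loom::model(|| {
        use loom::sync::atomic::{AtomicUsize, Ordering};
        use loom::sync::Arc;
        use loom::thread;

        let a = Arc::new(AtomicUsize::new(0));
        let a2 = a.clone();
        let t = thread::spawn(move || {
            a2.fetch_add(1, Ordering::SeqCst); // preemption point 1
        });
        a.fetch_add(1, Ordering::SeqCst); // preemption point 2
        t.join().unwrap();
        assert_eq!(a.load(Ordering::SeqCst), 2);
    });
}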

@alex
Contributor Author

alex commented Apr 14, 2026

Ah yeah. Same question though, what's the best way to debug it :-)

@Darksonn
Member

If there's an actual bug that it's catching, I'll do it with printlns to understand the specific interleaving it fails on. I'm not sure what to do when it's just too many interleavings. Probably the test has to be simplified.

this may also marginally help performance for real code by removing duplicate checks for shared emptiness under locks
@alex
Contributor Author

alex commented Apr 15, 2026

Looks like the latest change was sufficient, hopefully it looks reasonable.

Member

@hawkw hawkw left a comment


Given that the previous change to use a sharded queue in spawn_blocking (#7757) introduced a regression that caused programs using spawn_blocking to hang (#8056), and that change ultimately had to be reverted (in #8057), I think we should be cautious about moving forwards with this.

In particular, I feel like --- as a substantial change to runtime internals that potentially affects all uses of the multi-threaded runtime, and which introduces a complex new concurrent data structure involving unsafe code --- this is the type of change which should really be introduced as an opt-in tokio::runtime::Builder setting which requires tokio_unstable. This way, we can allow users to start testing this in production without running the risk of introducing regressions which block users from picking up new tokio releases. We've taken this approach in the past for changes such as the alternative timer implementation (#7467) and eager I/O driver handoff (#8010)1, and I think it would be appropriate to do something similar here, as well.

Footnotes

  1. The latter of which is a much smaller change!

@alex
Contributor Author

alex commented Apr 16, 2026

On my TODO list to figure out how practical this is this weekend.

@hawkw
Member

hawkw commented Apr 16, 2026

> On my TODO list to figure out how practical this is this weekend.

Great! If it's possible, I think that starting out as an opt-in experimental feature can be a useful way to get big improvements to runtime internals like this one merged faster and to start trying them out in production.

@alex
Contributor Author

alex commented Apr 16, 2026

The "how [practical] this is" was specifically making this opt in.

@alex
Contributor Author

alex commented Apr 19, 2026

Ok, spent some time this morning looking at a "pre refactor" (pre-factor?) to make it easy for us to support both the shared lock and sharded lock queues. Unfortunately, I'm extremely unhappy with the results -- the current diff is +340/-219 (just for the refactor), basically as big as this patch. And there's a bunch of annoying design pieces.

What I want is an enum Inject { Shared(SharedInject) }, where SharedInject has the Mutex and the inject::Shared, and all the APIs from SharedInject are exposed on Inject by matching on all self-variants. When we go to add Sharded, that should be the only thing that needs to change.
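Sketched out, assuming simplified stand-in types rather than tokio's real ones:

use std::collections::VecDeque;
use std::sync::Mutex;

type Task = Box<dyn FnOnce() + Send>;

// Stand-in for the existing single-queue implementation, owning its Mutex.
struct SharedInject {
    queue: Mutex<VecDeque<Task>>,
}

// The wrapper: today only one variant; adding a Sharded(...) variant later
// should be the only change needed.
enum Inject {
    Shared(SharedInject),
}

impl Inject {
    fn push(&self, task: Task) {
        match self {
            Inject::Shared(inner) => inner.queue.lock().unwrap().push_back(task),
        }
    }

    fn pop(&self) -> Option<Task> {
        match self {
            Inject::Shared(inner) => inner.queue.lock().unwrap().pop_front(),
        }
    }
}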

This leads to splitting off the Mutex that's currently covering both idle and inject (in multi_threaded::Shared) -- I think this is ok, but it's probably the most significant change.

The other obvious change is that trace_multi_thread takes a lock for each pop, rather than one lock for the whole time. I think this could probably be fixed with a new pop_all() (the current PR has this same design).

I also ended up with two implementations of Pop, one for where the MutexGuard is owned and another for where it's borrowed.

All this to say: I think this can be done, but the PR is much larger than I'd like. If anyone has ideas for a better architecture, let me know and I can mull on it. Otherwise I think I'll make #8068 my priority and come back to this once that's done.

@alex
Contributor Author

alex commented Apr 19, 2026

(I'm also happy to put it up as a draft PR if it's of interest to anyone.)

@hawkw
Member

hawkw commented Apr 19, 2026

So, very naively, I had kind of hoped we would be able to do this by making the number of shards a parameter that's provided when the sharded inject queue is constructed, and having the runtime builder either construct it with 1 or 8 shards depending on whether the sharded queue is enabled. But, this only works if the behavior of the queue with a single shard is more or less equivalent to a single queue. It sounds like this may not actually be the case, though, because of splitting the mutex currently guarding both idle and inject into two separate mutices around the inject queue and idle worker state. This would, unfortunately, mean that once this is released, programs which do not enable the sharded inject queue still get different behavior relative to the current Tokio version, since these mutexen are separate...

I actually think that we should probably be at least a little concerned about separating the inject queue and idle state mutex. Off the top of my head, I can't think of any code that relies on the assumption that these two pieces of shared state are not modified concurrently, but I think we'll have to look closely to make sure that's the case.
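The shard-count-as-parameter idea described above might wire up like this (the flag, constructor, and types are all made up for illustration):

use std::collections::VecDeque;
use std::sync::Mutex;

type Task = Box<dyn FnOnce() + Send>;

struct Sharded {
    shards: Vec<Mutex<VecDeque<Task>>>,
}

impl Sharded {
    // Shard count chosen at construction rather than fixed at compile time.
    fn with_shards(n: usize) -> Self {
        Sharded {
            shards: (0..n).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }
}

// Hypothetical builder wiring: an unstable opt-in picks the shard count,
// so the default (1 shard) stays close to today's single queue -- modulo
// the mutex split discussed above.
fn build_inject_queue(unstable_sharded_inject: bool) -> Sharded {
    let n = if unstable_sharded_inject { 8 } else { 1 };
    Sharded::with_shards(n)
}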

@alex
Contributor Author

alex commented Apr 19, 2026

Yeah, I think that's the right summary.

Maybe there's an even more incremental change of just using different mutexes for idle and inject in the existing shared structure? It's a tiny diff, and it isolates what's probably the biggest semantic risk of the migration?
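Sketched with placeholder types (not tokio's real definitions), that incremental change would look something like:

use std::sync::Mutex;

// Placeholder state types for the sketch.
struct IdleState;
struct InjectState;

// Today: one lock guards both, so remote pushes contend with park/unpark.
struct SharedBefore {
    synced: Mutex<(IdleState, InjectState)>,
}

// The incremental change: two independent locks in the same structure.
// Remote pushes only ever touch `inject`; park/unpark only touches `idle`.
struct SharedAfter {
    idle: Mutex<IdleState>,
    inject: Mutex<InjectState>,
}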


Labels

A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR T-performance Topic: performance and benchmarks



5 participants