rt: shard the multi-thread inject queue to reduce remote spawn contention#7973
alex wants to merge 6 commits into tokio-rs:master
Conversation
Is the inject queue still a FIFO queue?

Approximately -- each of the queue shards is FIFO, but nothing attempts to ensure ordering cross-shard.
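The per-shard FIFO behavior can be illustrated with a toy model (plain `VecDeque`s, not tokio's actual `Shared`/`Synced` lists): each shard preserves its own push order, but draining one shard at a time, as `pop_n` does, can reorder items relative to the global push sequence.

```rust
use std::collections::VecDeque;

// Toy sketch of a sharded FIFO: FIFO within a shard, no cross-shard order.
struct ShardedFifo {
    shards: Vec<VecDeque<u32>>,
    next_pop: usize,
}

impl ShardedFifo {
    fn new(n: usize) -> Self {
        Self { shards: vec![VecDeque::new(); n], next_pop: 0 }
    }

    fn push(&mut self, shard: usize, item: u32) {
        self.shards[shard].push_back(item);
    }

    fn pop(&mut self) -> Option<u32> {
        // Drain the current shard before advancing to the next one.
        for _ in 0..self.shards.len() {
            if let Some(item) = self.shards[self.next_pop].pop_front() {
                return Some(item);
            }
            self.next_pop = (self.next_pop + 1) % self.shards.len();
        }
        None
    }
}

fn main() {
    let mut q = ShardedFifo::new(2);
    // Global push order is 1, 2, 3, but 2 lands on the other shard.
    q.push(0, 1);
    q.push(1, 2);
    q.push(0, 3);
    // Shard 0 drains first, so pops come out 1, 3, 2: each shard is
    // FIFO, yet the global order is not preserved.
    assert_eq!(q.pop(), Some(1));
    assert_eq!(q.pop(), Some(3));
    assert_eq!(q.pop(), Some(2));
    assert_eq!(q.pop(), None);
}
```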
Darksonn left a comment:
In general this looks quite reasonable. Let's do it.
71aadec to
ba043df
Compare
/// Home shard index for the sharded inject queue. External threads
/// pushing tasks are assigned a shard on first push and stick with it
/// for cache locality. `INJECT_SHARD_UNASSIGNED` means not yet assigned.
#[cfg(feature = "rt-multi-thread")]
inject_push_shard: Cell<usize>,
How does this interact with block_in_place? Let's say thread A is assigned shard 4, and then it invokes block_in_place and stops being a worker thread. One of the blocking threads becomes a worker thread to replace it. Then it might pick a shard different from 4, right? I guess this means we can end up with multiple workers using the same shard.
I guess this is the wrong question ... this is for when you push to a runtime from outside it, so you're not talking about your own shard.
Yes, but that's true either way -- our max number of shards is 8, and you can have way more workers than that. A worker does not uniquely own its shard.
Ok, I guess it's okay then. Should we periodically pick a new shard? Perhaps after every 100 spawns? That way, if a program has two threads that continuously spawn a lot on the same shard, then eventually they pick a new shard and stop contending.
Hmm, I suppose we could do that for the pathological case. 100 spawns might be too few though? If you set it too low you risk having a single thread bounce around and just dirty up everyone else's cache lines.
My preference would be to land this without the adaptive behavior and do it as a follow-up -- even for the pathological case of two threads that happen to land on the same shard, this should still Pareto-dominate the unsharded queue.
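The periodic-reassignment idea discussed above is not implemented in this PR, but a sketch of it might look like the following; the shard count, the `REASSIGN_EVERY` interval, and all names here are hypothetical.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const NUM_SHARDS: usize = 8;
// Hypothetical interval; the thread discusses 100 as possibly too low.
const REASSIGN_EVERY: usize = 1024;

// Global round-robin counter handing out home shards.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    static HOME_SHARD: Cell<usize> = Cell::new(usize::MAX); // MAX = unassigned
    static PUSH_COUNT: Cell<usize> = Cell::new(0);
}

/// Shard this thread should push to. Sticky for cache locality, but
/// re-picked every REASSIGN_EVERY pushes so that two threads which
/// collide on the same shard eventually separate.
fn push_shard() -> usize {
    let count = PUSH_COUNT.with(|c| {
        let n = c.get() + 1;
        c.set(n);
        n
    });
    HOME_SHARD.with(|s| {
        if s.get() == usize::MAX || count % REASSIGN_EVERY == 0 {
            s.set(NEXT_SHARD.fetch_add(1, Ordering::Relaxed) % NUM_SHARDS);
        }
        s.get()
    })
}

fn main() {
    let first = push_shard();
    assert!(first < NUM_SHARDS);
    // Sticky until the reassignment interval elapses.
    for _ in 0..10 {
        assert_eq!(push_shard(), first);
    }
}
```

A larger interval trades slower escape from a pathological collision for fewer cache-line migrations, which is the tension discussed above.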
I'm raising the concern as a liveness issue, which per this guarantee makes it a correctness concern.
> how did you end up with an item in B's queue but that worker was never woken up?
The code for waking up a worker after pushing to the queue is here:
tokio/tokio/src/runtime/scheduler/multi_thread/worker.rs, lines 1287 to 1289 at e5ab8fb
There is no relationship between which shard the item was pushed to and which worker is woken up. All this code ensures is that, after pushing an item, if all workers are idle, a worker is woken up. But it might not be the worker whose shard received the item, and if there is already a non-idle worker searching for work, no wakeup occurs at all.
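A simplified model of the wakeup policy described above -- wake a parked worker only when nobody is already searching -- might look like this; the `Notifier` type and its counters are illustrative stand-ins, not tokio's actual idle/searching bookkeeping.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the scheduler's idle state.
struct Notifier {
    num_searching: AtomicUsize, // workers actively looking for work
    num_parked: AtomicUsize,    // workers asleep
}

impl Notifier {
    /// After pushing a task: wake one parked worker, but only if no
    /// worker is already searching (a searcher would find the item
    /// anyway). Which worker wakes is independent of which shard
    /// received the task.
    fn should_notify_after_push(&self) -> bool {
        if self.num_searching.load(Ordering::SeqCst) > 0 {
            return false;
        }
        self.num_parked.load(Ordering::SeqCst) > 0
    }
}

fn main() {
    let n = Notifier {
        num_searching: AtomicUsize::new(1),
        num_parked: AtomicUsize::new(3),
    };
    assert!(!n.should_notify_after_push()); // a searcher will pick it up
    n.num_searching.store(0, Ordering::SeqCst);
    assert!(n.should_notify_after_push()); // everyone idle: wake one
}
```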
I've been thinking more about this, and I've come to the conclusion that it is okay as-is. As a perf matter, it could be beneficial to give notify_parked_remote() a hint about which worker it should prefer to wake up, but that's not a blocker.
I wonder if we can easily produce a test case for this scenario, perhaps by having one worker execute a future that just spawns a big pile of tasks in a loop without yielding? That's not a totally contrived scenario as you might imagine an accept loop or something under load where there's basically always new connections to handle without having to wait for a long period of time.
The reason I'm no longer worried about it is that this scenario will cause all worker threads to wake up, and once all worker threads are alive, you can't have this kind of starvation.
Please rebase or merge master to avoid conflicts with the LIFO slot changes.
Ah, looks like loom timed out?
for shard in self.shards.iter() {
    if !shard.shared.is_empty() {
        return false;
    }
}
true
take it or leave it, but this could also be written as:
self.shards.iter().all(|shard| shard.shared.is_empty())
rt: shard the multi-thread inject queue to reduce remote spawn contention

The multi-threaded scheduler's inject queue was protected by a single global mutex (shared with idle coordination state). Every remote task spawn — any spawn from outside a worker thread — acquired this lock, serializing concurrent spawners and limiting throughput.

This change introduces `inject::Sharded`, which splits the inject queue into up to 8 independent shards, each an existing `Shared`/`Synced` pair with its own mutex and cache-line padding.

Design:
- Push: each thread is assigned a home shard on first push (via a global counter) and sticks with it. This keeps consecutive pushes from one thread cache-local while spreading distinct threads across distinct locks.
- Pop: workers rotate through shards starting at their own index, skipping empty shards via a per-shard atomic length. pop_n drains from one shard at a time to keep critical sections bounded.
- Shard count: capped at 8 (and 1 under loom). Contention drops off steeply past a handful of shards, and is_empty()/len() scan all shards in the worker hot loop.
- is_closed: a single Release atomic set after all shards are closed, so the shutdown check stays lock-free.

Random shard selection via context::thread_rng_n (as used in tokio-rs#7757 for the blocking pool) was measured and found to be 20-33% slower on remote_spawn at 8+ threads. The inject workload is a tight loop of trivial pushes where producer-side cache locality dominates: with RNG, a hot thread bounces between shard cache lines on every push; with sticky assignment it stays hot on one mutex and list tail. RNG did win slightly (5-9%) on single-producer benchmarks where spreading tasks lets workers pop in parallel, but not enough to offset the regression at scale.

The inject state is removed from the global Synced mutex, which now only guards idle coordination. This also helps the single-threaded path since remote pushes no longer contend with worker park/unpark.
Results on remote_spawn benchmark (12,800 no-op tasks, N spawner threads, 64-core box):

threads   before      after     improvement
1          9.38 ms    7.33 ms   -22%
2         14.94 ms    6.64 ms   -56%
4         23.69 ms    5.34 ms   -77%
8         34.81 ms    4.69 ms   -87%
16        32.33 ms    4.54 ms   -86%
32        30.37 ms    4.73 ms   -84%
64        26.59 ms    5.34 ms   -80%

rt_multi_threaded benchmarks: spawn_many_local -8%, spawn_many_remote_idle -7%, yield_many -1%, rest neutral.

Developed in conjunction with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
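The producer-side contention the benchmark exercises can be modeled with a scaled-down toy (std threads and mutexes only; a hypothetical harness, not the tokio remote_spawn benchmark): N spawner threads push into either one global lock or one lock per shard, with sticky thread-to-shard assignment.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// With num_shards = 1 every producer fights over one lock; with
// num_shards = num_threads the sticky assignment gives each producer
// its own lock and contention disappears.
fn run(num_threads: usize, num_shards: usize, pushes: usize) -> usize {
    let shards: Arc<Vec<Mutex<VecDeque<usize>>>> =
        Arc::new((0..num_shards).map(|_| Mutex::new(VecDeque::new())).collect());
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                // Sticky assignment: thread t always uses shard t % num_shards.
                let shard = t % num_shards;
                for i in 0..pushes {
                    shards[shard].lock().unwrap().push_back(i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Every push lands somewhere regardless of sharding.
    shards.iter().map(|s| s.lock().unwrap().len()).sum()
}

fn main() {
    assert_eq!(run(4, 1, 100), 400); // single contended lock
    assert_eq!(run(4, 4, 100), 400); // one lock per producer
}
```

Timing the two calls with larger counts reproduces (in miniature) the gap the table above shows between the unsharded and sharded queues.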
Sharded::pop already checks each shard's emptiness as a fast path before locking, so the outer is_empty scan was iterating all shards twice for no benefit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix incorrect claim that a closed shard implies all shards are closed (close() operates shard-by-shard), and document that MAX_SHARDS must be a power of two. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
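One common reason for requiring a power-of-two shard count (an assumption here; the commit message doesn't state the motivation) is that reducing an arbitrary counter or worker index to a shard index becomes a single bitmask instead of a division:

```rust
const MAX_SHARDS: usize = 8; // documented as requiring a power of two

/// Reduce an arbitrary index to a shard index. With a power-of-two
/// count this is a single AND; an arbitrary count would need the
/// (slower) modulo operator.
fn shard_index(raw: usize) -> usize {
    debug_assert!(MAX_SHARDS.is_power_of_two());
    raw & (MAX_SHARDS - 1)
}

fn main() {
    assert_eq!(shard_index(0), 0);
    assert_eq!(shard_index(9), 1); // 9 % 8 == 9 & 7 == 1
    assert_eq!(shard_index(usize::MAX), MAX_SHARDS - 1);
}
```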
Is there a recommended way to debug loom test timing? Looks like even in normal operation they take several hours :-( I'm assuming the likely cause of timeouts is accidentally making the state space too large? (I was hoping that limiting to 1 shard on loom would prevent that, but perhaps not)
Limiting it to one shard is not necessarily enough. Any operation involved in concurrency (such as atomics or taking/releasing mutexes) that happens while other threads exist is a location where preemption can happen in loom, so as you introduce more mutex lock/unlocks during the part of the test with more than one thread, that increases the search space.
Ah yeah. Same question though, what's the best way to debug it :-)
If there's an actual bug that it's catching, I'll do it with printlns to understand the specific interleaving it fails on. I'm not sure when it's just too many interleavings. Probably the test has to be simplified.
this may also marginally help performance for real code by removing duplicate checks for shared emptiness under locks
Looks like the latest change was sufficient, hopefully it looks reasonable.
hawkw left a comment:
Given that the previous change to use a sharded queue in spawn_blocking (#7757) introduced a regression that caused programs using spawn_blocking to hang (#8056), and that change ultimately had to be reverted (in #8057), I think we should be cautious about moving forwards with this.
In particular, I feel like --- as a substantial change to runtime internals that potentially affects all uses of the multi-threaded runtime, and which introduces a complex new concurrent data structure involving unsafe code --- this is the type of change which should really be introduced as an opt-in tokio::runtime::Builder setting which requires tokio_unstable. This way, we can allow users to start testing this in production without running the risk of introducing regressions which block users from picking up new tokio releases. We've taken this approach in the past for changes such as the alternative timer implementation (#7467) and eager I/O driver handoff (#8010) [1], and I think it would be appropriate to do something similar here, as well.
Footnotes
[1] The latter of which is a much smaller change!
On my TODO list to figure out how practical this is this weekend.
Great! If it's possible, I think that starting out as an opt-in experimental feature can be a useful way to get big improvements to runtime internals like this one merged faster and start trying them out in production |
The "how practical this is" part was specifically about making this opt-in.
Ok, spent some time this morning looking at a "pre refactor" (pre-factor?) to make it easy for us to support both the shared-lock and sharded-lock queues. Unfortunately, I'm extremely unhappy with the results -- the current diff is +340/-219 (just for the refactor), basically as big as this patch. And there's a bunch of annoying design pieces: it leads to splitting off the Mutex that's currently covering both the inject queue and the idle state, and I also ended up with 2x implementations of the inject queue. All this to say: I think this can be done, but the PR is much larger than I'd like. If anyone has ideas for a better architecture, let me know and I can mull on it. Otherwise I think I'll make #8068 my priority and come back to this once that's done.
(I'm also happy to put it up as a draft PR if it's of interest to anyone.)
So, very naively, I had kind of hoped we would be able to do this by making the number of shards a parameter that's provided when the sharded inject queue is constructed, and having the runtime builder construct it with either 1 or 8 shards depending on whether the sharded queue is enabled. But this only works if the behavior of the queue with a single shard is more or less equivalent to a single queue. It sounds like that may not actually be the case, though, because of splitting the mutex currently guarding both the inject queue and the idle state. I actually think that we should probably be at least a little concerned about separating the inject queue and idle state mutex. Off the top of my head, I can't think of any code that relies on the assumption that these two pieces of shared state are not modified concurrently, but I think we'll have to look closely to make sure that's the case.
Yeah, I think that's the right summary. Maybe there's an even more incremental change of just using different mutexes for idle and inject in the existing shared structure? It's a tiny diff and should cover the biggest semantic risk of the migration.
The multi-threaded scheduler's inject queue was protected by a single global mutex (shared with idle coordination state). Every remote task spawn — any spawn from outside a worker thread — acquired this lock, serializing concurrent spawners and limiting throughput.
This change introduces `inject::Sharded`, which splits the inject queue into up to 8 independent shards, each an existing `Shared`/`Synced` pair with its own mutex and cache-line padding.

Design:
- Push: each thread is assigned a home shard on first push (via a global counter) and sticks with it. This keeps consecutive pushes from one thread cache-local while spreading distinct threads across distinct locks.
- Pop: workers rotate through shards starting at their own index, skipping empty shards via a per-shard atomic length. pop_n drains from one shard at a time to keep critical sections bounded.
- Shard count: capped at 8 (and 1 under loom). Contention drops off steeply past a handful of shards, and is_empty()/len() scan all shards in the worker hot loop.
- is_closed: a single Release atomic set after all shards are closed, so the shutdown check stays lock-free.
Random shard selection via context::thread_rng_n (as used in #7757 for the blocking pool) was measured and found to be 20-33% slower on remote_spawn at 8+ threads. The inject workload is a tight loop of trivial pushes where producer-side cache locality dominates: with RNG, a hot thread bounces between shard cache lines on every push; with sticky assignment it stays hot on one mutex and list tail. RNG did win slightly (5-9%) on single-producer benchmarks where spreading tasks lets workers pop in parallel, but not enough to offset the regression at scale.
The inject state is removed from the global Synced mutex, which now only guards idle coordination. This also helps the single-threaded path since remote pushes no longer contend with worker park/unpark.
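The pop side of the sharded queue can be sketched as follows, with assumed shapes (a `Mutex<VecDeque>` plus an atomic length per shard) rather than tokio's actual types: a worker rotates through shards starting at its own index and skips empty shards without locking.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Assumed shapes for illustration; the real Shared/Synced types differ.
struct Shard {
    len: AtomicUsize,            // lock-free emptiness check
    queue: Mutex<VecDeque<u64>>, // the actual task list
}

struct Sharded {
    shards: Vec<Shard>,
}

impl Sharded {
    fn new(n: usize) -> Self {
        let shards = (0..n)
            .map(|_| Shard { len: AtomicUsize::new(0), queue: Mutex::new(VecDeque::new()) })
            .collect();
        Self { shards }
    }

    fn push(&self, shard: usize, task: u64) {
        self.shards[shard].queue.lock().unwrap().push_back(task);
        self.shards[shard].len.fetch_add(1, Ordering::Release);
    }

    /// Rotate through shards starting at the worker's own index,
    /// skipping shards whose atomic length says they are empty.
    fn pop(&self, worker: usize) -> Option<u64> {
        let n = self.shards.len();
        for i in 0..n {
            let shard = &self.shards[(worker + i) % n];
            if shard.len.load(Ordering::Acquire) == 0 {
                continue; // fast path: no lock taken
            }
            if let Some(task) = shard.queue.lock().unwrap().pop_front() {
                shard.len.fetch_sub(1, Ordering::Release);
                return Some(task);
            }
        }
        None
    }
}

fn main() {
    let q = Sharded::new(4);
    q.push(2, 7);
    // Worker 0 scans shards 0 and 1, then finds the task on shard 2.
    assert_eq!(q.pop(0), Some(7));
    assert_eq!(q.pop(0), None);
}
```

Starting each worker at its own index fans workers out across shards, so concurrent pops tend to land on different locks.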
Results on remote_spawn benchmark (12,800 no-op tasks, N spawner threads, 64-core box):
threads before after improvement
1 9.38 ms 7.33 ms -22%
2 14.94 ms 6.64 ms -56%
4 23.69 ms 5.34 ms -77%
8 34.81 ms 4.69 ms -87%
16 32.33 ms 4.54 ms -86%
32 30.37 ms 4.73 ms -84%
64 26.59 ms 5.34 ms -80%
rt_multi_threaded benchmarks: spawn_many_local -8%, spawn_many_remote_idle -7%, yield_many -1%, rest neutral.
Developed in conjunction with Claude.