metrics: add task schedule latency metric #7986
robholt wants to merge 13 commits into tokio-rs:master
Conversation
This new metric tracks the amount of time between when a task is scheduled and when it is polled, also known as queue delay. This duration is recorded in a histogram, just like the poll time metric. This metric is useful for implementing queue management algorithms in systems using tokio. For example, it could be used to implement a generic HTTP load shedder using the CoDel algorithm.
```rust
// Check for a task in the LIFO slot
let task = match core.lifo_slot.take() {
    Some(task) => task,
```
Should `core.stats.start_poll(task.get_scheduled_at());` be called here too?
I considered this, but it would also change the behavior of the poll time metric. I do think that would be more correct, but wanted to avoid changing too many things in one PR. I'm happy to make the change though if you think that's fine!
```rust
/// The last time this task was scheduled. Used to measure schedule latency.
///
/// Only enabled when the target supports 64-bit atomics because the metric
/// that uses this field also requires 64-bit atomics.
#[cfg(all(tokio_unstable, target_has_atomic = "64"))]
pub(super) scheduled_at: UnsafeCell<Option<Instant>>,
```
This field adds 16 bytes to the header, bringing it up to 48 bytes in total (56 if tracing is enabled too).
This is causing the header_lte_cache_line test (at the bottom of this file) to fail on 32-bit targets. Restricting the field to targets that support 64-bit atomics did not work because some 32-bit targets do support 64-bit atomics.
Maybe sticking this behind a feature (like `tracing`) would be best?
I'm okay with changing that test to be <= 64 to accommodate 32-bit targets, they are not that important anyway.
Regardless, the Instant struct is pretty large and space is at a premium in the task header. I don't think we want to dedicate 16 bytes just so we can have nanosecond precision even if it takes a billion years for it to get scheduled. Could we store a u32 of milliseconds since the runtime was created, or something like that? I'm pretty sure we already store a sort of "epoch" instant in the tokio::runtime::time part of the driver.
That's a good idea, I updated the PR to use this "epoch" approach. I used a u64 though because the struct is already 8-byte aligned on all targets because of the 64-bit owner_id field, so a u32 just wastes 4 bytes. Using a u64 also means we can keep the nanosecond precision, since u64::MAX in nanoseconds is equal to ~584 years.
That brings the size of the struct up to 40 bytes now.
And you're right, the time driver does track its own startup time. I opted not to use it, however, because the time driver is not enabled by default, so relying on it would mean users must enable the time driver in order to use this schedule latency tracking.
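The epoch approach discussed above can be sketched roughly as follows. This is an illustrative standalone version, not tokio's actual code; the `SchedulerEpoch` name and method names are assumptions:

```rust
use std::time::Instant;

/// Sketch of the "epoch" approach: instead of a 16-byte `Instant` per
/// task, store schedule times as a u64 nanosecond offset from a
/// per-scheduler start instant. u64::MAX nanoseconds is roughly 584
/// years, so overflow is not a practical concern.
struct SchedulerEpoch {
    started_at: Instant,
}

impl SchedulerEpoch {
    fn new() -> Self {
        Self {
            started_at: Instant::now(),
        }
    }

    /// Compress "now" into a u64 nanosecond offset from the epoch.
    /// This is what would be stamped into the task header on schedule.
    fn now_nanos(&self) -> u64 {
        self.started_at.elapsed().as_nanos() as u64
    }

    /// Schedule latency: nanoseconds elapsed since the task was stamped.
    fn latency_nanos(&self, scheduled_at: u64) -> u64 {
        self.now_nanos().saturating_sub(scheduled_at)
    }
}
```

Since only an 8-byte offset is stored per task, the header stays 8-byte aligned without the padding a `u32` would waste.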
```rust
impl<S> Notified<S> {
    pub(crate) fn task_meta<'meta>(&self) -> crate::runtime::TaskMeta<'meta> {
        self.0.task_meta()
```
Can we please expose this field in the task meta? That would really help observability tools, because it means that during `on_poll_start` we can unconditionally know what the scheduling delay was.
I think it would be best to do that in a follow-up PR. I agree it would be very useful, but it'll require some refactoring of the way TaskMeta is constructed.
```rust
/// Startup time of this scheduler.
///
/// This instant is used as the basis of task `scheduled_at` measurements.
#[cfg(all(tokio_unstable, target_pointer_width = "64"))]
started_at: Instant,
```
This PR is using these cfgs all over the place. Can we please reduce them a bit? You can define a module that contains a ScheduleLatencyInstant which is either empty or contains an Instant depending on the configuration, and then code actually using it can stop having cfgs. Similarly for the u64 in the task header.
Makes sense, the cfgs did get a bit out of control after the 64-bit restriction.
I've updated the PR to remove almost all of the cfgs and centralized the logic into a new schedule_latency module. The size of the task Header remains the same as the previous version on both 64-bit and 32-bit targets.
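The pattern suggested above can be sketched like this. Module and type names follow the suggestion (`schedule_latency`, `ScheduleLatencyInstant`), but the bodies are illustrative, not the PR's actual code:

```rust
// Centralizing the cfg logic: the module exports a type that is either
// a real timestamp or a zero-sized no-op depending on the
// configuration, so call sites need no cfgs at all.

#[cfg(all(tokio_unstable, target_has_atomic = "64"))]
mod schedule_latency {
    use std::time::Instant;

    /// Real timestamp: wraps an `Instant`.
    #[derive(Clone, Copy)]
    pub struct ScheduleLatencyInstant(Instant);

    impl ScheduleLatencyInstant {
        pub fn now() -> Self {
            Self(Instant::now())
        }
        pub fn elapsed_nanos(&self) -> u64 {
            self.0.elapsed().as_nanos() as u64
        }
    }
}

#[cfg(not(all(tokio_unstable, target_has_atomic = "64")))]
mod schedule_latency {
    /// Zero-sized stand-in; the methods compile away to nothing.
    #[derive(Clone, Copy)]
    pub struct ScheduleLatencyInstant;

    impl ScheduleLatencyInstant {
        pub fn now() -> Self {
            Self
        }
        pub fn elapsed_nanos(&self) -> u64 {
            0
        }
    }
}
```

Callers write `ScheduleLatencyInstant::now()` unconditionally; on unsupported configurations the type is zero-sized, so it adds no space to the task header.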
Motivation
Measuring the amount of time a unit of work spends waiting in a queue, also known as "queue delay" or "sojourn time", is an essential component of the CoDel active queue management algorithm (and any algorithm based on it). CoDel is particularly popular because it requires no configuration or tuning to specific workloads, it maintains low latency and high throughput, and relies on a single parameter (queue delay) so it is relatively simple to implement.
For example, an HTTP server built on tokio could use this metric to power an overload protection layer that uses the CoDel algorithm to reject requests and prevent a latency spiral while still serving close to the maximum possible request volume, without the need for specific workload tuning.
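A minimal CoDel-style shedding decision driven by queue delay could look like the sketch below. The constants and structure are illustrative (not from tokio or this PR); a real implementation would also shrink the interval while shedding persists:

```rust
use std::time::{Duration, Instant};

/// Sketch of the core CoDel test: shed only when queue delay has
/// stayed above TARGET for a full INTERVAL, not on a single spike.
struct CoDel {
    target: Duration,               // acceptable queue delay (TARGET)
    interval: Duration,             // how long delay must stay high (INTERVAL)
    first_above: Option<Instant>,   // when delay first exceeded target
}

impl CoDel {
    fn new() -> Self {
        Self {
            target: Duration::from_millis(5),
            interval: Duration::from_millis(100),
            first_above: None,
        }
    }

    /// Given the observed queue delay for a unit of work, decide
    /// whether to shed it.
    fn should_shed(&mut self, queue_delay: Duration, now: Instant) -> bool {
        if queue_delay < self.target {
            // Delay dropped below target: reset the tracking window.
            self.first_above = None;
            return false;
        }
        match self.first_above {
            None => {
                self.first_above = Some(now);
                false
            }
            // Shed once the delay has been above target for a full interval.
            Some(t) => now.duration_since(t) >= self.interval,
        }
    }
}
```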
There is some precedent for a task scheduler to expose this metric. The Go runtime exposes it as `/sched/latencies:seconds`, and the Linux kernel's `schedstat` tracks process scheduling latency per-CPU and per-process.

Solution
Measuring this requires keeping track of the time each task enters the queue in order to calculate the elapsed duration when the task is finally polled. I stored this time in the task header because it's hot data that's updated each time the task is scheduled. This does add some overhead, so I put the metric behind `tokio_unstable`, just like the poll time histogram metric.
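The stamp-on-schedule, record-on-poll pattern described here can be sketched in standalone form. All names (`Histogram`, `Task`, `schedule`, `start_poll`) are illustrative assumptions, not tokio's internals:

```rust
use std::time::{Duration, Instant};

/// Toy linear-bucket histogram standing in for tokio's metric histogram.
struct Histogram {
    buckets: Vec<u64>,       // counts per bucket
    bucket_width: Duration,  // width of each linear bucket
}

impl Histogram {
    fn new(num_buckets: usize, bucket_width: Duration) -> Self {
        Self {
            buckets: vec![0; num_buckets],
            bucket_width,
        }
    }

    fn record(&mut self, d: Duration) {
        let idx = (d.as_nanos() / self.bucket_width.as_nanos()) as usize;
        let idx = idx.min(self.buckets.len() - 1); // clamp overflow to last bucket
        self.buckets[idx] += 1;
    }
}

struct Task {
    scheduled_at: Option<Instant>, // stamped each time the task is woken
}

impl Task {
    /// Called when the task is scheduled (enters the run queue).
    fn schedule(&mut self) {
        self.scheduled_at = Some(Instant::now());
    }

    /// Called when the worker begins polling the task: the elapsed time
    /// since the stamp is the queue delay, recorded into the histogram.
    fn start_poll(&mut self, hist: &mut Histogram) {
        if let Some(t) = self.scheduled_at.take() {
            hist.record(t.elapsed());
        }
    }
}
```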