fix(recent-block-cache): Reduce fetch fan out and preserve negative cache entries.#4421
fix(recent-block-cache): Reduce fetch fan out and preserve negative cache entries.#4421metalurgical wants to merge 15 commits into
Conversation
…che entries Use fetch_values instead of join_all to reduce fan out. Preserve negative cache entries instead of removing empty entries.
There was a problem hiding this comment.
Code Review
This pull request refactors the RecentBlockCache by removing the request_sharing dependency and simplifying the fetching logic into a single retry loop within fetch_inner_many. It also corrects a bug in the garbage collection process where negative cache entries (empty results) were being prematurely evicted, and adds a regression test to verify their persistence. Review feedback identified that the new retry implementation introduces unnecessary latency on the final failure, loses original error context by using anyhow::bail!, and violates repository guidelines regarding detailed error reporting for batch fetches.
| for _ in 0..=self.maximum_retries { | ||
| match self.fetcher.fetch_values(keys.clone(), block).await { | ||
| Ok(values) => return Ok(values), | ||
| Err(err) => tracing::warn!("retrying fetch because error: {:?}", err), | ||
| } | ||
| .boxed() | ||
| }); | ||
| shared.await.context("could not fetch liquidity") | ||
| tokio::time::sleep(self.delay_between_retries).await; | ||
| } | ||
| anyhow::bail!("could not fetch liquidity"); |
There was a problem hiding this comment.
The current retry logic in fetch_inner_many has several issues:
- Unnecessary Latency: It performs a
tokio::time::sleepeven after the final retry attempt fails, adding redundant delay to the overall request time. - Error Context Loss: By using
anyhow::bail!, the original RPC error (e.g., timeout, node error) is discarded. Using.context()preserves the root cause for better troubleshooting. - Batch Error Detail: According to repository guidelines for batch fetches, the implementation should explicitly indicate which items failed and provide error details for each, rather than returning a single error for the entire batch. Please ensure the response structure reflects individual item statuses.
let mut keys = keys;
for i in 0..=self.maximum_retries {
let current_keys = if i == self.maximum_retries {
keys
} else {
keys.clone()
};
match self.fetcher.fetch_values(current_keys, block).await {
Ok(values) => return Ok(values),
Err(err) => {
if i == self.maximum_retries {
return Err(anyhow::Error::from(err).context("could not fetch liquidity"));
}
tracing::warn!(?err, "retrying fetch");
tokio::time::sleep(self.delay_between_retries).await;
}
}
}
unreachable!()References
- When fetching a batch of items where individual fetches can fail, do not silently ignore errors. The API response should explicitly indicate which items failed and provide error details for each failure.
There was a problem hiding this comment.
Point 1 and 2 have been corrected.
Point 3 is a trait redesign, Result<Vec<V>> will need to become HashMap<K, Result<Vec<V>> in multiple places for this.
There was a problem hiding this comment.
Code Review
This pull request refactors the RecentBlockCache to implement batch fetching and simplified retry logic, removing the dependency on individual request sharing. It also fixes a bug in the cache's garbage collection logic that caused negative cache entries to be prematurely evicted and adds a regression test. A high-severity recommendation was made to include an early return for empty key sets in fetch_inner_many to avoid unnecessary RPC calls.
| let retries = self.maximum_retries; | ||
| let delay = self.delay_between_retries; | ||
| let fetcher = self.fetcher.clone(); | ||
| let shared = self.requests.shared_or_else((key, block), |entry| { |
There was a problem hiding this comment.
I don't understand why you're removing the request sharing, it's also not covered in the description
There was a problem hiding this comment.
Previously it was doing individual fetches for each key, each going through request sharing. Now it does a single fetch for the key set. Request sharing no longer appears necessary here. Do you agree or do you see another reason to keep it?
| *values = vec![]; | ||
| for ((block, key), values) in self.entries.iter().rev() { | ||
| if !cached_keys.insert((*key).clone()) { | ||
| entries_to_remove.push((*block, (*key).clone())); |
There was a problem hiding this comment.
What's the motivation for this change? I'm not seeing how the other fan-out change makes this necessary. Are those 2 unrelated changes bundled together?
There was a problem hiding this comment.
Yes, two different changes in the same file.
|
|
||
| async fn fetch_inner_many(&self, keys: HashSet<K>, block: Block) -> Result<Vec<V>> { | ||
| let fetched = | ||
| futures::future::join_all(keys.iter().map(|key| self.fetch_inner(key.clone(), block))) |
There was a problem hiding this comment.
From the PR description it's not really clear to me what the desired outcome of this change is. Which metric would you expect to see improved based on this change so that we can verify somehow that it works as desired.
I imagine due to doing 1 bulk request with n items instead of n individual requests there should be a lot less congestion on the mutex around the cache so lower latency which reduces latency. Is that the idea for this PR?
I wonder if this could backfire since we are now doing a TON of synchronous work before yielding the thread back to the runtime.
There was a problem hiding this comment.
The metric that will improve here is the call count and the related overhead per call, especially on a cold cache.
Your concern at the end applies equally to the old code as well as this change, not a regression here.
Description
Reduce fetch fan out and preserve negative cache entries.
Changes
fetch_values(keys, block)directly instead of expanding them into individual key requestsHow to test
cargo nextest run