Skip to content

[Store] Add BatchEvict candidate-selection benchmark#2584

Open
jacklin78911-collab wants to merge 5 commits into
kvcache-ai:mainfrom
jacklin78911-collab:add-batch-evict-benchmark
Open

[Store] Add BatchEvict candidate-selection benchmark#2584
jacklin78911-collab wants to merge 5 commits into
kvcache-ai:mainfrom
jacklin78911-collab:add-batch-evict-benchmark

Conversation

@jacklin78911-collab

Copy link
Copy Markdown

Description

Adds a benchmark-only gtest that drives the real MasterService::BatchEvict path at controlled metadata scale, to make the eviction-cycle cost and lock-hold behavior from #2560 reproducible and reviewable.

It changes no production behavior. The only non-test change is adding BatchEvictBenchTest as a test friend of MasterService (matching the existing test-friend pattern) so the benchmark can populate metadata and take snapshot_mutex_ from the test.

The benchmark has two parts:

  1. RealBatchEvictScales (always on): end-to-end BatchEvict wall-clock time vs object count. Default scales are small (10k, 100k) to stay cheap in CI; set MOONCAKE_EVICT_BENCH_LARGE=1 to also run 1M.

  2. SingleWaiterSnapshotMutexProbe (opt-in via MOONCAKE_EVICT_BENCH_LOCK_PROBE=1): per trial, a single unique_lock waiter on snapshot_mutex_ arrives shortly after a BatchEvict cycle begins; the test records how long it waits to acquire, and reports p50/p95/max over N trials.

The workload is synthetic (single tenant, all objects expired, all no-pin, one completed memory replica each). It is meant as a reproducible baseline for the scan/lock-hold cost, not a production-latency claim.

Module

  • Mooncake Store (mooncake-store)

Type of Change

  • Other (benchmark / test only)

How Has This Been Tested?

  • cmake --build build --target batch_evict_bench_test -j$(nproc)
  • ./build/mooncake-store/tests/batch_evict_bench_test
  • MOONCAKE_EVICT_BENCH_LOCK_PROBE=1 MOONCAKE_EVICT_BENCH_LOCK_OBJECTS=10000 MOONCAKE_EVICT_BENCH_LOCK_TRIALS=3 ./build/mooncake-store/tests/batch_evict_bench_test
  • clang-format-20 applied.

Relates to #2560.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using clang-format before submitting.
  • I have added tests to prove my changes are effective.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new benchmark-only test, BatchEvictBenchTest, to evaluate the candidate-selection cost of MasterService::BatchEvict at scale. It also adds the necessary friend class declarations in MasterService and updates the CMake configuration. The reviewer suggested a performance optimization in the benchmark helper function PercentileValue to pass the vector by reference instead of by value, preventing redundant copies.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

return static_cast<size_t>(parsed);
}

static uint64_t PercentileValue(std::vector<uint64_t> values,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Passing std::vector<uint64_t> by value to PercentileValue causes the vector to be copied on every call. Since PercentileValue is called multiple times on the same vectors (total_us and wait_us) in the benchmark, we can avoid these redundant copies by passing the vector by non-const reference. Sorting the vector in-place on the first call is perfectly fine and avoids any copy overhead.

Suggested change
static uint64_t PercentileValue(std::vector<uint64_t> values,
static uint64_t PercentileValue(std::vector<uint64_t>& values,

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, applied — passing by reference now. The two vectors are local and discarded after use, so in-place sorting is fine and avoids the copies.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 61.02564% with 76 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/tests/batch_evict_bench_test.cpp 61.02% 76 Missing ⚠️

📢 Thoughts on this report? Let us know!

@yokinoshitayoki

yokinoshitayoki commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Sorry to ask, but can we add result from this benchmark here?

@yokinoshitayoki

yokinoshitayoki commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Before changing the production data structure, I think it would be useful to break down like:

  • total BatchEvict time
  • snapshot_mutex_ shared-lock hold time
  • per-shard metadata RW-lock hold max/sum
  • data from this benchmark

We might need more detailed data to trade off what kind of an optimization we should make.

@jacklin78911-collab

jacklin78911-collab commented Jun 24, 2026

Copy link
Copy Markdown
Author

Thanks @yokinoshitayoki. Yes — here is what the benchmark in this PR produces on my current build.

Running RealBatchEvictScales as-is on current main on my AutoDL host with RelWithDebInfo:

num_objects,total_us,objects_before,objects_after,evicted_count,freed_bytes
10000,11450,10000,5000,5000,5120000
100000,123848,100000,50000,50000,51200000

The lock probe also runs at the small scale:

num_objects,trials,batch_evict_total_p50_us,unique_lock_wait_p50_us,unique_lock_wait_p95_us,unique_lock_wait_max_us
10000,3,8329,7145,7223,7223

For the larger 1M-object case, my earlier investigation for #2560 on ef0312f8 measured end-to-end BatchEvict at roughly 0.95–1.0s, with some run-to-run spread on the shared instance. A 30-trial lock probe at 1M objects gave unique-lock wait p50 around 1.00s, p95 around 1.21s, and max around 1.36s, so the waiter was blocked for essentially the full eviction cycle.

A separately instrumented build, not included in this PR, attributed about 73–75% of the cycle to candidate scan/traversal, about 0.1% to nth_element, and less than 0.5% to the actual eviction work. That is why this PR only keeps the benchmark-side end-to-end timing and lock-wait probe, without adding production-path instrumentation.

I can also run the 1M case on this exact PR build and paste the output if that would be useful.

On the per-shard metadata RW-lock hold time and the direct snapshot_mutex_ shared-lock hold time: agreed, those would be useful for deciding the optimization direction. I can take that as a follow-up measurement and post the numbers separately, so this PR can stay focused on the minimal baseline benchmark.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants