[Store] Add BatchEvict candidate-selection benchmark by jacklin78911-collab · Pull Request #2584 · kvcache-ai/Mooncake

jacklin78911-collab · 2026-06-23T17:26:01Z

Description

Adds a benchmark-only gtest that drives the real MasterService::BatchEvict path at controlled metadata scale, to make the eviction-cycle cost and lock-hold behavior from #2560 reproducible and reviewable.

It changes no production behavior. The only non-test change is adding BatchEvictBenchTest as a test friend of MasterService (matching the existing test-friend pattern) so the benchmark can populate metadata and take snapshot_mutex_ from the test.

The benchmark has two parts:

RealBatchEvictScales (always on): end-to-end BatchEvict wall-clock time vs object count. Default scales are small (10k, 100k) to stay cheap in CI; set MOONCAKE_EVICT_BENCH_LARGE=1 to also run 1M.
SingleWaiterSnapshotMutexProbe (opt-in via MOONCAKE_EVICT_BENCH_LOCK_PROBE=1): per trial, a single unique_lock waiter on snapshot_mutex_ arrives shortly after a BatchEvict cycle begins; the test records how long it waits to acquire, and reports p50/p95/max over N trials.

The workload is synthetic (single tenant, all objects expired, all no-pin, one completed memory replica each). It is meant as a reproducible baseline for the scan/lock-hold cost, not a production-latency claim.

Module

Mooncake Store (mooncake-store)

Type of Change

Other (benchmark / test only)

How Has This Been Tested?

cmake --build build --target batch_evict_bench_test -j$(nproc)
./build/mooncake-store/tests/batch_evict_bench_test
MOONCAKE_EVICT_BENCH_LOCK_PROBE=1 MOONCAKE_EVICT_BENCH_LOCK_OBJECTS=10000 MOONCAKE_EVICT_BENCH_LOCK_TRIALS=3 ./build/mooncake-store/tests/batch_evict_bench_test
clang-format-20 applied.

Relates to #2560.

Checklist

I have performed a self-review of my own code.
I have formatted my own code using clang-format before submitting.
I have added tests to prove my changes are effective.

Add BatchEvictBenchTest friend declaration

gemini-code-assist

Code Review

This pull request introduces a new benchmark-only test, BatchEvictBenchTest, to evaluate the candidate-selection cost of MasterService::BatchEvict at scale. It also adds the necessary friend class declarations in MasterService and updates the CMake configuration. The reviewer suggested a performance optimization in the benchmark helper function PercentileValue to pass the vector by reference instead of by value, preventing redundant copies.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-23T17:27:58Z

+        return static_cast<size_t>(parsed);
+    }
+
+    static uint64_t PercentileValue(std::vector<uint64_t> values,


Passing std::vector<uint64_t> by value to PercentileValue causes the vector to be copied on every call. Since PercentileValue is called multiple times on the same vectors (total_us and wait_us) in the benchmark, we can avoid these redundant copies by passing the vector by non-const reference. Sorting the vector in-place on the first call is perfectly fine and avoids any copy overhead.

Suggested change

static uint64_t PercentileValue(std::vector<uint64_t> values,

static uint64_t PercentileValue(std::vector<uint64_t>& values,

Thanks, applied — passing by reference now. The two vectors are local and discarded after use, so in-place sorting is fine and avoids the copies.

some changes

codecov-commenter · 2026-06-24T01:50:17Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 61.02564% with 76 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mooncake-store/tests/batch_evict_bench_test.cpp	61.02%	76 Missing ⚠️

📢 Thoughts on this report? Let us know!

yokinoshitayoki · 2026-06-24T02:28:45Z

Sorry to ask, but can we add result from this benchmark here?

yokinoshitayoki · 2026-06-24T02:40:55Z

Before changing the production data structure, I think it would be useful to break down like:

total BatchEvict time
snapshot_mutex_ shared-lock hold time
per-shard metadata RW-lock hold max/sum
data from this benchmark

We might need more detailed data to trade off what kind of an optimization we should make.

jacklin78911-collab · 2026-06-24T02:46:25Z

Thanks @yokinoshitayoki. Yes — here is what the benchmark in this PR produces on my current build.

Running RealBatchEvictScales as-is on current main on my AutoDL host with RelWithDebInfo:

num_objects,total_us,objects_before,objects_after,evicted_count,freed_bytes
10000,11450,10000,5000,5000,5120000
100000,123848,100000,50000,50000,51200000

The lock probe also runs at the small scale:

num_objects,trials,batch_evict_total_p50_us,unique_lock_wait_p50_us,unique_lock_wait_p95_us,unique_lock_wait_max_us
10000,3,8329,7145,7223,7223

For the larger 1M-object case, my earlier investigation for #2560 on ef0312f8 measured end-to-end BatchEvict at roughly 0.95–1.0s, with some run-to-run spread on the shared instance. A 30-trial lock probe at 1M objects gave unique-lock wait p50 around 1.00s, p95 around 1.21s, and max around 1.36s, so the waiter was blocked for essentially the full eviction cycle.

A separately instrumented build, not included in this PR, attributed about 73–75% of the cycle to candidate scan/traversal, about 0.1% to nth_element, and less than 0.5% to the actual eviction work. That is why this PR only keeps the benchmark-side end-to-end timing and lock-wait probe, without adding production-path instrumentation.

I can also run the 1M case on this exact PR build and paste the output if that would be useful.

On the per-shard metadata RW-lock hold time and the direct snapshot_mutex_ shared-lock hold time: agreed, those would be useful for deciding the optimization direction. I can take that as a follow-up measurement and post the numbers separately, so this PR can stay focused on the minimal baseline benchmark.

jacklin78911-collab added 3 commits June 24, 2026 01:08

Add files via upload

bc2c663

Update master_service.h

aaab1ae

Add BatchEvictBenchTest friend declaration

Update CMakeLists.txt

fb6c07a

jacklin78911-collab requested review from XucSh, YiXR, stmatengss and ykwd as code owners June 23, 2026 17:26

github-actions Bot added run-ci Store labels Jun 23, 2026

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

Update batch_evict_bench_test.cpp

ea4cce0

some changes

Update batch_evict_bench_test.cpp

b8dff17

jacklin78911-collab mentioned this pull request Jun 25, 2026

[Performance]: BatchEvict full-metadata scan dominates eviction-cycle time at 1M objects (and holds snapshot_mutex_ for that duration) #2560

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Store] Add BatchEvict candidate-selection benchmark#2584

[Store] Add BatchEvict candidate-selection benchmark#2584
jacklin78911-collab wants to merge 5 commits into
kvcache-ai:mainfrom
jacklin78911-collab:add-batch-evict-benchmark

jacklin78911-collab commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Uh oh!

jacklin78911-collab Jun 24, 2026

Uh oh!

codecov-commenter commented Jun 24, 2026

Uh oh!

yokinoshitayoki commented Jun 24, 2026 •

edited

Loading

Uh oh!

yokinoshitayoki commented Jun 24, 2026 •

edited

Loading

Uh oh!

jacklin78911-collab commented Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	static uint64_t PercentileValue(std::vector<uint64_t> values,
	static uint64_t PercentileValue(std::vector<uint64_t>& values,

Uh oh!

Conversation

jacklin78911-collab commented Jun 23, 2026

Description

Module

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

jacklin78911-collab Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 24, 2026

Codecov Report

Uh oh!

yokinoshitayoki commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yokinoshitayoki commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jacklin78911-collab commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yokinoshitayoki commented Jun 24, 2026 •

edited

Loading

yokinoshitayoki commented Jun 24, 2026 •

edited

Loading

jacklin78911-collab commented Jun 24, 2026 •

edited

Loading