[Feature]Prefetch SSD-Only Objects to DRAM on Exist by huangdong2022 · Pull Request #2646 · kvcache-ai/Mooncake

huangdong2022 · 2026-06-27T03:51:35Z

Description

Implements SSD prefetch-on-exist for Mooncake Store (RFC #2213): when is_exist / batch_is_exist is called with ExistOptions.prefetch_to_memory=true, asynchronously promote SSD-only keys (LOCAL_DISK, no MEMORY) back to DRAM, so later get() can hit DRAM instead of SSD.

Core changes:

Dedicated prefetch RPC path (GetReplicaListForPrefetch, BatchGetReplicaListForPrefetch, RegisterPrefetchTask) — no lease/sketch/promotion-on-hit queue side effects.
Client triggerSsdPrefetch: chunked batch query (128 keys/chunk), pipelined register+promote, bounded prefetch_pool_ (4 threads), PrefetchThrottle (dedup TTL + DRAM-pressure cooldown).
Cross-node holder delegation via prefetch_offload_object RPC.
Get-side optional wait (ssd_get_wait_ms, default 10ms) with [GET-SRC] / [PREFETCH-OUTCOME] logging.
NotifyPromotionSuccess(from_prefetch=true) grants normal KV lease.
Bug fixes: tenant-scoped staging key in PrefetchKeys; BatchOffload commits local index before NotifyOffloadSuccess.
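
The chunked batch query mentioned above (128 keys per chunk) can be illustrated with a minimal sketch. ChunkKeys is a hypothetical helper name, not a function from this PR; only the chunk size of 128 comes from the description.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): split `keys` into consecutive
// chunks of at most `chunk_size` keys, preserving the original order.
std::vector<std::vector<std::string>> ChunkKeys(
    const std::vector<std::string>& keys, std::size_t chunk_size = 128) {
    std::vector<std::vector<std::string>> chunks;
    if (chunk_size == 0) return chunks;  // guard against an invalid chunk size
    for (std::size_t begin = 0; begin < keys.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, keys.size());
        const auto first = keys.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = keys.begin() + static_cast<std::ptrdiff_t>(end);
        chunks.emplace_back(first, last);
    }
    return chunks;
}
```

Each chunk would then be sent as one batch query, so a single large batch_is_exist call does not turn into one oversized RPC.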

Python/C API: ExistOptions.prefetch_to_memory; setup() adds ssd_prefetch_* / ssd_get_wait_ms.

Related: RFC #2213, PR #2071. Validated with vLLM-Ascend KV pool (HBM/DRAM/SSD).

Module

Type of Change

How Has This Been Tested?

Test commands:

# Mooncake Python integration test (requires running master + SSD offload env)
export MOONCAKE_OFFLOAD_FILE_STORAGE_PATH=/path/to/offload
export MOONCAKE_OFFLOAD_BUCKET_KEYS_LIMIT=10
export MOONCAKE_OFFLOAD_BUCKET_SIZE_LIMIT_BYTES=10485760
python -m unittest mooncake-wheel.tests.test_prefetch_on_exist.TestPrefetchOnExist -v

# Optional: cross-node case (opt-in)
# export MC_TEST_CROSS_NODE=1 NODE_A_HOSTNAME=... NODE_B_HOSTNAME=...

Manual integration (vLLM-Ascend + Mooncake master, SSD offload enabled):

End-to-end prefix-cache workload (80×32K): cold run → warm SSD → re-run; TTFT and Prefill improved on re-run.
GPQA accuracy run: no INVALID_KEY / get failures after B10 fix.

Test results:

Unit tests pass
Integration tests pass (if applicable)
Manual testing done (describe below)

Highlights:

test_prefetch_on_exist: is_exist / batch_is_exist with prefetch_to_memory=true promotes LOCAL_DISK-only keys to MEMORY; post-prefetch get does not hit SSD offload RPC path.
B10 fix: concurrent get INVALID_KEY eliminated under offload+prefetch load.
vLLM-Ascend re-run after SSD warm-up: TTFT −220ms (−3.2%), Prefill +162 t/s (+3.4%) on 80×32K workload.

Checklist

I have performed a self-review of my own code
I have formatted my code using ./scripts/code_format.sh
I have run pre-commit run --all-files and all hooks pass
I have updated the documentation (if applicable)
I have added tests to prove my changes are effective
For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

No AI tools were used
AI tools were used (specify below)

AI tools (Cursor/Claude) assisted with design doc, log analysis, test updates, and PR description drafting. All changes reviewed by the submitter.

Signed-off-by: h30027576 <huangdong51@huawei.com>

gemini-code-assist

Code Review

This pull request implements the SSD Prefetch-on-Exist feature, enabling asynchronous, best-effort promotion of SSD-only keys to DRAM during the cache-probing phase to reduce TTFT. It introduces throttling mechanisms, bounded thread pools, get-side waiting, cross-node delegation, and corresponding Python bindings and tests. The review feedback identifies several critical improvement opportunities: a bug where failed prefetch task registration leaves throttle states stuck in a triggered state, a duplicate header include, a flaw in the throttle reservation logic that prematurely erases active entries, and a performance concern regarding high-frequency synchronous RPC polling of the master service during get-side waiting.

gemini-code-assist · 2026-06-27T03:53:32Z

+        auto register_result = client->RegisterPrefetchTask(local_keys[i]);
+        if (!register_result) {
+            VLOG(1) << "SSD prefetch: RegisterPrefetchTask failed for"
+                    << "key=" << local_keys[i]
+                    << ", error=" << register_result.error();
+            continue;
+        }


If RegisterPrefetchTask fails, the key is skipped and never passed to file_storage->PrefetchKeys. As a result, on_key_done is never invoked for this key, leaving its throttle state stuck in kTriggered until the TTL expires. This prevents subsequent prefetch attempts and causes any get-side waiting threads to block until they time out.

Please invoke on_key_done(local_keys[i], false) before continuing the loop.

        auto register_result = client->RegisterPrefetchTask(local_keys[i]);
        if (!register_result) {
            VLOG(1) << "SSD prefetch: RegisterPrefetchTask failed for"
                    << "key=" << local_keys[i]
                    << ", error=" << register_result.error();
            if (on_key_done) {
                on_key_done(local_keys[i], false);
            }
            continue;
        }
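
The point of the suggestion is that every key must reach the completion callback on every exit path. Below is a self-contained sketch of that pattern, separate from the PR code. RegisterOrReportFailure, RegisterFn, and KeyDoneCallback are hypothetical names, and the register step is modeled as a plain function returning bool rather than the real RegisterPrefetchTask result type.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
using KeyDoneCallback = std::function<void(const std::string&, bool)>;
using RegisterFn = std::function<bool(const std::string&)>;

// Returns the keys whose registration succeeded. Every key that fails
// registration is reported through on_key_done(key, false), so no key is
// left without a completion notification.
std::vector<std::string> RegisterOrReportFailure(
    const std::vector<std::string>& keys, const RegisterFn& register_task,
    const KeyDoneCallback& on_key_done) {
    std::vector<std::string> registered;
    for (const auto& key : keys) {
        if (!register_task(key)) {
            if (on_key_done) on_key_done(key, false);  // release throttle state
            continue;
        }
        registered.push_back(key);
    }
    return registered;
}
```

Only the returned keys would go on to the promotion step; the failed ones have already been closed out.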

gemini-code-assist · 2026-06-27T03:53:32Z

 #include "storage_backend.h"
 #include "pinned_buffer_pool.h"

+#include <functional>


The <functional> header is included twice in this file (at line 3 and line 10). Please remove the duplicate include at line 10.

gemini-code-assist · 2026-06-27T03:53:32Z

+        for (auto it = entries_.begin(); it != entries_.end();) {
+            const int64_t last_ms = it->second.completed_ms >= 0
+                                        ? it->second.completed_ms
+                                        : it->second.trigger_ms;
+            if (now - last_ms > ttl_ms) {
+                it = entries_.erase(it);
+            } else {
+                ++it;
+            }
+        }


In PrefetchThrottle::reserve, entries are erased from entries_ solely based on the elapsed time since trigger_ms (if completed_ms is -1) exceeding ttl_ms.

If a prefetch task takes longer than ttl_ms to complete (or if ttl_ms is configured to be very short), an active entry in the kTriggered or kInFlight state can be prematurely erased. This would:
1. Allow duplicate prefetch tasks to be triggered for the same key (defeating the deduplication logic).
2. Cause any concurrent waitForCompletion calls for that key to fail immediately because the entry is no longer found in entries_.

Consider only erasing entries that are in a terminal state (such as kCompleted, kFailed, or kAlreadyResident).

        for (auto it = entries_.begin(); it != entries_.end();) {
            const int64_t last_ms = it->second.completed_ms >= 0
                                        ? it->second.completed_ms
                                        : it->second.trigger_ms;
            const bool is_terminal =
                it->second.state == State::kCompleted ||
                it->second.state == State::kFailed ||
                it->second.state == State::kAlreadyResident;
            if (is_terminal && now - last_ms > ttl_ms) {
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
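
To show why terminal-only eviction preserves deduplication, here is a minimal standalone sketch. ThrottleSketch, Reserve, and Finish are hypothetical names and do not reproduce the real PrefetchThrottle interface; only the state names and the trigger_ms / completed_ms fields follow the excerpt above. Time is passed in explicitly so the behavior is deterministic.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

enum class State { kTriggered, kInFlight, kCompleted, kFailed, kAlreadyResident };

struct Entry {
    State state = State::kTriggered;
    std::int64_t trigger_ms = 0;
    std::int64_t completed_ms = -1;  // -1 means "not finished yet"
};

// Hypothetical simplified throttle: one entry per key, evicted only after
// it reached a terminal state AND its TTL has elapsed.
class ThrottleSketch {
public:
    explicit ThrottleSketch(std::int64_t ttl_ms) : ttl_ms_(ttl_ms) {}

    // Returns true if the caller should start a prefetch for `key`.
    bool Reserve(const std::string& key, std::int64_t now) {
        EvictExpired(now);
        auto result = entries_.emplace(key, Entry{});
        if (!result.second) return false;  // already tracked: deduplicated
        result.first->second.trigger_ms = now;
        return true;
    }

    // Marks `key` as finished; unknown keys are ignored.
    void Finish(const std::string& key, bool success, std::int64_t now) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return;
        it->second.state = success ? State::kCompleted : State::kFailed;
        it->second.completed_ms = now;
    }

private:
    static bool IsTerminal(State s) {
        return s == State::kCompleted || s == State::kFailed ||
               s == State::kAlreadyResident;
    }

    void EvictExpired(std::int64_t now) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            const Entry& e = it->second;
            const std::int64_t last_ms =
                e.completed_ms >= 0 ? e.completed_ms : e.trigger_ms;
            if (IsTerminal(e.state) && now - last_ms > ttl_ms_) {
                it = entries_.erase(it);  // erase returns the next iterator
            } else {
                ++it;
            }
        }
    }

    std::int64_t ttl_ms_;
    std::unordered_map<std::string, Entry> entries_;
};
```

With this rule, a key that is still in flight after the TTL stays reserved, so a second Reserve for it is rejected. One trade-off of this sketch: an entry that never reaches a terminal state is never evicted, which is why the missing on_key_done call flagged in the first comment matters.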

gemini-code-assist · 2026-06-27T03:53:32Z

+            } else {
+                prefetch_wait_mode = "master";
+                const int64_t deadline =
+                    PrefetchThrottle::NowMs() + ssd_get_wait_ms_;
+                while (PrefetchThrottle::NowMs() < deadline) {
+                    if (auto qr = TryRefreshBestMemoryReplica(
+                            client_.get(), key, local_endpoints)) {
+                        refreshed_qr.emplace(std::move(*qr));
+                        best_replica = SelectBestReplica(refreshed_qr->replicas,
+                                                         local_endpoints);
+                        prefetch_done_ms = PrefetchThrottle::NowMs();
+                        break;
+                    }
+                    std::this_thread::sleep_for(
+                        std::chrono::milliseconds(kPollMs));
+                }
+            }


In the master wait mode, the client polls the master via TryRefreshBestMemoryReplica (which performs a synchronous master RPC) every 1 ms (kPollMs = 1) up to ssd_get_wait_ms_ (default 10 ms) per key.

Since this loop runs sequentially for each key in the batch, if a batch contains multiple SSD-only keys that are not yet promoted, this can lead to:
1. A massive flood of synchronous RPC queries to the master, potentially overwhelming the master service under high concurrency.
2. Significant accumulation of latency on the client side (e.g., 10 keys * 10 ms = 100 ms delay).

Consider increasing the poll interval for the master query path (e.g., to 2-5 ms) or batching/limiting the master queries to avoid overwhelming the master.
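
One way to reason about the cost is to count probes per key for a given interval. The sketch below is a generic deadline-bounded poll loop, not the PR's implementation. PollUntil and PollResult are hypothetical names, the probe stands in for a master query such as TryRefreshBestMemoryReplica, and the clock and sleep functions are injected so the loop can be exercised without real time passing.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

struct PollResult {
    bool ready;  // true if the probe succeeded before the deadline
    int probes;  // number of probe calls issued (one "RPC" each)
};

// Hypothetical helper: call `probe` until it succeeds or `deadline_ms`
// (an absolute time on the injected clock) is reached, sleeping
// `poll_interval_ms` between attempts.
PollResult PollUntil(const std::function<bool()>& probe,
                     const std::function<std::int64_t()>& now_ms,
                     const std::function<void(std::int64_t)>& sleep_ms,
                     std::int64_t deadline_ms,
                     std::int64_t poll_interval_ms) {
    PollResult result{false, 0};
    // A non-positive interval would spin forever on a fake clock.
    poll_interval_ms = std::max<std::int64_t>(poll_interval_ms, 1);
    while (now_ms() < deadline_ms) {
        ++result.probes;
        if (probe()) {
            result.ready = true;
            break;
        }
        const std::int64_t remaining = deadline_ms - now_ms();
        if (remaining <= 0) break;
        sleep_ms(std::min(poll_interval_ms, remaining));  // never overshoot
    }
    return result;
}
```

Under these assumptions, a 10 ms wait issues up to 10 probes per key at a 1 ms interval and 2 probes at a 5 ms interval. Passing the same absolute deadline_ms for every key in a batch would also bound the total wait by one wait budget instead of one budget per key.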

huangdong2022 added 2 commits June 27, 2026 10:49

[Feature]Prefetch SSD-Only Objects to DRAM on Exist

d52fb53

Signed-off-by: h30027576 <huangdong51@huawei.com>

Add UT

46db15c

Signed-off-by: h30027576 <huangdong51@huawei.com>

huangdong2022 requested review from ShangmingCai, XucSh, YiXR, stmatengss and ykwd as code owners June 27, 2026 03:51

github-actions Bot added documentation run-ci Store Installation Integration Tests labels Jun 27, 2026

huangdong2022 marked this pull request as draft June 27, 2026 03:52

gemini-code-assist Bot reviewed Jun 27, 2026

huangdong2022 wants to merge 2 commits into kvcache-ai:main from huangdong2022:main_prefetch_pr
