[Feature] Prefetch SSD-Only Objects to DRAM on Exist #2646
Signed-off-by: h30027576 <huangdong51@huawei.com>
Code Review
This pull request implements the SSD Prefetch-on-Exist feature, enabling asynchronous, best-effort promotion of SSD-only keys to DRAM during the cache-probing phase to reduce TTFT. It introduces throttling mechanisms, bounded thread pools, get-side waiting, cross-node delegation, and corresponding Python bindings and tests. The review feedback identifies several critical improvement opportunities: a bug where failed prefetch task registration leaves throttle states stuck in a triggered state, a duplicate header include, a flaw in the throttle reservation logic that prematurely erases active entries, and a performance concern regarding high-frequency synchronous RPC polling of the master service during get-side waiting.
auto register_result = client->RegisterPrefetchTask(local_keys[i]);
if (!register_result) {
    VLOG(1) << "SSD prefetch: RegisterPrefetchTask failed for"
            << "key=" << local_keys[i]
            << ", error=" << register_result.error();
    continue;
}
If RegisterPrefetchTask fails, the key is skipped and never passed to file_storage->PrefetchKeys. As a result, on_key_done is never invoked for this key, leaving its throttle state stuck in kTriggered until the TTL expires. This prevents subsequent prefetch attempts and causes any get-side waiting threads to block until they time out.

Please invoke on_key_done(local_keys[i], false) before continuing the loop.
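The failure path described above can be illustrated in isolation. This is a minimal sketch, not the PR's code: `RegisterOrRelease`, `RegisterFn`, and `DoneFn` are hypothetical stand-ins for the real registration call and completion callback. It only shows that every key reaches `on_key_done` when registration fails, so no key is left in the triggered state.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real registration call and completion
// callback; the actual Mooncake signatures may differ.
using RegisterFn = std::function<bool(const std::string&)>;
using DoneFn = std::function<void(const std::string&, bool)>;

// Returns the keys that registered successfully and should go on to promotion.
// Keys that fail registration are reported through on_key_done immediately,
// so their throttle entry does not stay in the triggered state.
std::vector<std::string> RegisterOrRelease(const std::vector<std::string>& keys,
                                           const RegisterFn& register_task,
                                           const DoneFn& on_key_done) {
    assert(register_task && "register_task must be callable");
    std::vector<std::string> registered;
    for (const auto& key : keys) {
        if (!register_task(key)) {
            if (on_key_done) {
                on_key_done(key, /*success=*/false);
            }
            continue;
        }
        registered.push_back(key);
    }
    return registered;
}
```

Only the keys returned by the helper would be handed to the promotion step; the failed ones have already been released.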
auto register_result = client->RegisterPrefetchTask(local_keys[i]);
if (!register_result) {
    VLOG(1) << "SSD prefetch: RegisterPrefetchTask failed for "
            << "key=" << local_keys[i]
            << ", error=" << register_result.error();
    // Report the failure so the throttle entry leaves kTriggered.
    if (on_key_done) {
        on_key_done(local_keys[i], false);
    }
    continue;
}

#include "storage_backend.h"
#include "pinned_buffer_pool.h"

#include <functional>
for (auto it = entries_.begin(); it != entries_.end();) {
    const int64_t last_ms = it->second.completed_ms >= 0
                                ? it->second.completed_ms
                                : it->second.trigger_ms;
    if (now - last_ms > ttl_ms) {
        it = entries_.erase(it);
    } else {
        ++it;
    }
}
In PrefetchThrottle::reserve, entries are erased from entries_ solely based on the elapsed time since trigger_ms (if completed_ms is -1) exceeding ttl_ms.

If a prefetch task takes longer than ttl_ms to complete (or if ttl_ms is configured to be very short), an active entry in the kTriggered or kInFlight state can be prematurely erased. This would:
1. Allow duplicate prefetch tasks to be triggered for the same key (defeating the deduplication logic).
2. Cause any concurrent waitForCompletion calls for that key to fail immediately because the entry is no longer found in entries_.

Consider only erasing entries that are in a terminal state (such as kCompleted, kFailed, or kAlreadyResident).
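The eviction rule proposed in this comment can be tried out with a small, self-contained model. This is a sketch under assumptions: `ThrottleSketch`, `Entry`, `Reserve`, `MarkDone`, and `IsTerminal` are hypothetical names, and only the state names are taken from the comment. Time is passed in explicitly so the behaviour is easy to check.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, simplified model of the throttle. State names follow the
// review comment; everything else is an assumption made for illustration.
enum class State { kTriggered, kInFlight, kCompleted, kFailed, kAlreadyResident };

struct Entry {
    State state = State::kTriggered;
    std::int64_t trigger_ms = 0;
    std::int64_t completed_ms = -1;  // -1 until a terminal state is reached
};

inline bool IsTerminal(State s) {
    return s == State::kCompleted || s == State::kFailed ||
           s == State::kAlreadyResident;
}

class ThrottleSketch {
public:
    explicit ThrottleSketch(std::int64_t ttl_ms) : ttl_ms_(ttl_ms) {
        assert(ttl_ms >= 0);
    }

    // Returns true if the caller may start a prefetch for `key` at time `now`.
    bool Reserve(const std::string& key, std::int64_t now) {
        EvictExpired(now);
        auto [it, inserted] = entries_.try_emplace(key);
        if (inserted) it->second.trigger_ms = now;
        return inserted;
    }

    // Moves an existing entry to a terminal state and records when.
    void MarkDone(const std::string& key, bool success, std::int64_t now) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return;
        it->second.state = success ? State::kCompleted : State::kFailed;
        it->second.completed_ms = now;
    }

private:
    // Only terminal entries older than the TTL are erased; active entries
    // stay in the map so deduplication and waiters keep working.
    void EvictExpired(std::int64_t now) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            const Entry& e = it->second;
            const std::int64_t last_ms =
                e.completed_ms >= 0 ? e.completed_ms : e.trigger_ms;
            if (IsTerminal(e.state) && now - last_ms > ttl_ms_) {
                it = entries_.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::int64_t ttl_ms_;
    std::unordered_map<std::string, Entry> entries_;
};
```

One consequence of this rule is that an entry whose task never completes is never evicted in this sketch, so a real implementation would probably still want a separate, much longer hard limit for active entries.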
for (auto it = entries_.begin(); it != entries_.end();) {
    const int64_t last_ms = it->second.completed_ms >= 0
                                ? it->second.completed_ms
                                : it->second.trigger_ms;
    const bool is_terminal = it->second.state == State::kCompleted ||
                             it->second.state == State::kFailed ||
                             it->second.state == State::kAlreadyResident;
    if (is_terminal && now - last_ms > ttl_ms) {
        it = entries_.erase(it);
    } else {
        ++it;
    }
}

} else {
    prefetch_wait_mode = "master";
    const int64_t deadline =
        PrefetchThrottle::NowMs() + ssd_get_wait_ms_;
    while (PrefetchThrottle::NowMs() < deadline) {
        if (auto qr = TryRefreshBestMemoryReplica(
                client_.get(), key, local_endpoints)) {
            refreshed_qr.emplace(std::move(*qr));
            best_replica = SelectBestReplica(refreshed_qr->replicas,
                                             local_endpoints);
            prefetch_done_ms = PrefetchThrottle::NowMs();
            break;
        }
        std::this_thread::sleep_for(
            std::chrono::milliseconds(kPollMs));
    }
}
In the master wait mode, the client polls the master via TryRefreshBestMemoryReplica (which performs a synchronous master RPC) every 1 ms (kPollMs = 1) up to ssd_get_wait_ms_ (default 10 ms) per key.

Since this loop runs sequentially for each key in the batch, if a batch contains multiple SSD-only keys that are not yet promoted, this can lead to:
1. A massive flood of synchronous RPC queries to the master, potentially overwhelming the master service under high concurrency.
2. Significant accumulation of latency on the client side (e.g., 10 keys * 10 ms = 100 ms delay).

Consider increasing the poll interval for the master query path (e.g., to 2-5 ms) or batching/limiting the master queries to avoid overwhelming the master.
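One possible shape for a gentler poll loop is sketched below. It is an illustration, not the PR's code: `PollWithBackoff` and its parameters are hypothetical, and `probe` stands in for the synchronous master query. The sleep interval doubles up to a cap, so a full wait issues noticeably fewer queries than a fixed 1 ms poll.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls `probe` until it returns true or `deadline` passes. The sleep between
// probes starts at `initial` and doubles up to `max_interval`. If `probes` is
// non-null, it receives the number of probe calls that were made.
inline bool PollWithBackoff(const std::function<bool()>& probe,
                            std::chrono::steady_clock::time_point deadline,
                            std::chrono::milliseconds initial,
                            std::chrono::milliseconds max_interval,
                            int* probes = nullptr) {
    assert(initial.count() > 0 && max_interval >= initial);
    using Ms = std::chrono::milliseconds;
    Ms interval = initial;
    int count = 0;
    bool ok = false;
    while (true) {
        ++count;
        if (probe()) {
            ok = true;
            break;
        }
        const auto now = std::chrono::steady_clock::now();
        if (now >= deadline) break;
        // Cap the sleep by the time left (at least 1 ms), so the loop
        // overshoots the deadline by about 1 ms at most.
        const Ms remaining = std::chrono::duration_cast<Ms>(deadline - now);
        std::this_thread::sleep_for(
            std::min(interval, std::max(remaining, Ms(1))));
        interval = std::min<Ms>(interval * 2, max_interval);
    }
    if (probes) *probes = count;
    return ok;
}
```

Computing `deadline` once before iterating over the keys of a batch, and passing the same value to every call, would bound the total wait for the whole batch at roughly the configured wait time instead of multiplying it by the number of keys.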
Description
Implements SSD prefetch-on-exist for Mooncake Store (RFC #2213): when is_exist/batch_is_exist is called with ExistOptions.prefetch_to_memory=true, asynchronously promote SSD-only keys (LOCAL_DISK, no MEMORY) back to DRAM, so a later get() can hit DRAM instead of SSD.
Core changes:
GetReplicaListForPrefetch, BatchGetReplicaListForPrefetch, RegisterPrefetchTask): no lease/sketch/promotion-on-hit queue side effects.
triggerSsdPrefetch: chunked batch query (128 keys/chunk), pipelined register+promote, bounded prefetch_pool_ (4 threads), PrefetchThrottle (dedup TTL + DRAM-pressure cooldown).
prefetch_offload_object RPC.
ssd_get_wait_ms, default 10ms) with [GET-SRC]/[PREFETCH-OUTCOME] logging.
NotifyPromotionSuccess(from_prefetch=true) grants normal KV lease.
PrefetchKeys; BatchOffload commits local index before NotifyOffloadSuccess.
Python/C API:
ExistOptions.prefetch_to_memory; setup() adds ssd_prefetch_*/ssd_get_wait_ms.
Related: RFC #2213, PR #2071. Validated with vLLM-Ascend KV pool (HBM/DRAM/SSD).
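The chunked batch query mentioned under the core changes (128 keys per chunk) amounts to splitting the key list before querying. A minimal sketch, assuming a plain vector of string keys; `ChunkKeys` and `kChunkSize` are hypothetical names, and only the value 128 comes from the description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Chunk size stated in the PR description; the constant name is hypothetical.
constexpr std::size_t kChunkSize = 128;

// Splits `keys` into consecutive chunks of at most `chunk_size` elements,
// preserving order. The last chunk may be shorter.
std::vector<std::vector<std::string>> ChunkKeys(
    const std::vector<std::string>& keys,
    std::size_t chunk_size = kChunkSize) {
    assert(chunk_size > 0);
    std::vector<std::vector<std::string>> chunks;
    for (std::size_t begin = 0; begin < keys.size(); begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, keys.size());
        chunks.emplace_back(
            keys.begin() + static_cast<std::ptrdiff_t>(begin),
            keys.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return chunks;
}
```

Each chunk would then be sent as one batch query, which keeps any single request bounded regardless of how many keys the caller probes at once.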
Module
mooncake-transfer-engine
mooncake-store
mooncake-ep
mooncake-pg
mooncake-integration
mooncake-p2p-store
mooncake-wheel
mooncake-common
mooncake-rl
Type of Change
How Has This Been Tested?
Test commands:
Manual integration (vLLM-Ascend + Mooncake master, SSD offload enabled):
INVALID_KEY / get failures after B10 fix.
Test results:
Highlights:
test_prefetch_on_exist: is_exist/batch_is_exist with prefetch_to_memory=true promotes LOCAL_DISK-only keys to MEMORY; post-prefetch get does not hit SSD offload RPC path.
INVALID_KEY eliminated under offload+prefetch load.
Checklist
./scripts/code_format.sh
pre-commit run --all-files and all hooks pass
AI Assistance Disclosure
AI tools (Cursor/Claude) assisted with design doc, log analysis, test updates, and PR description drafting. All changes reviewed by the submitter.