[Store] Optimize put performence by reduce memcpy and pinned memory#2598
[Store] Optimize put performence by reduce memcpy and pinned memory#2598zxpdemonio wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request optimizes data transfer and memory staging by deferring staging to the transfer layer, introducing zero-copy batch upsert methods, executing local memory copies inline on the calling thread to eliminate thread synchronization overhead, and pinning memory regions with cudaHostRegister for DMA-backed GPU transfers. The review feedback focuses on further optimizing and safeguarding these changes, including adding a fast-path check for zero-sized copies in MemcpySafe, querying device pointer status once outside the loop in submitMemcpyOperation, reusing the split_into_slices helper to simplify RDMA staging, and guarding memory pinning to avoid registering zero-sized local buffers.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Remove the unconditional alloc + memcpy staging from all 6 Pattern A *_internal functions (put, put_parts, put_batch, upsert, upsert_parts, upsert_batch). Slices are now created directly from source data via split_into_slices(ptr, size), deferring staging to ensureRegisteredForRDMA only when the RDMA path encounters unregistered memory. Key changes: - real_client.cpp: 6 *_internal functions skip staging, use direct slices - transfer_task.cpp: ensureRegisteredForRDMA stages unregistered slices with single contiguous alloc; selectStrategy dispatches LOCAL_MEMCPY (inline, GPU-safe) vs TRANSFER_ENGINE; kMaxSliceSize splitting applied to both registered and staged slices - gpu_staging_utils.h: MemcpySafe, CopyAuto, IsDevicePointer utilities; async batch copy support (CUDA Driver API / MUSA Runtime API) - transfer_engine: isLocalMemoryRegistered query for registration check - client_service: setStagingAllocator to pass allocator to TransferSubmitter - store_py.cpp: batch_put/upsert_tensor use multi_buffers API to bypass integration-layer staging - batch_upsert_from_multi_buffers: new API mirroring batch_put variant Performance: LOCAL_MEMCPY path reduces from 2 copies to 1 for all Pattern A callers. RDMA with registered buffers also saves 1 copy. RDMA with unregistered buffers unchanged (staging moves to submit layer). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pin staging and segment buffers by default so that GPU→host copies use direct DMA instead of CUDA's internal staging through a temporary pinned buffer. This roughly doubles GPU→CPU memcpy bandwidth on PCIe. Opt out with MC_STORE_PIN_MEMORY=0 if pinning conflicts with other GPU workloads or exceeds the locked-page limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the MC_STORE_ASYNC_STAGING path and its CUDA/MUSA async-copy helpers. Benchmarking showed no stable benefit for GPU→RDMA staging: single put_from was 0.97x-1.01x and small batch cases regressed up to 0.93x, while the synchronous MemcpySafe path remains simpler and stable. Pinned host memory remains enabled by default for GPU builds, which provides the material GPU→CPU gain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Handle zero-sized copies safely and reduce redundant GPU pointer queries in the memcpy submit path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9fb1021 to
49078b3
Compare
Cap Store-managed cudaHostRegister usage with MC_STORE_PIN_MEMORY_MAX_BYTES and keep formatting changes targeted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
c808b7c to
dc769ca
Compare
Validate full registered ranges before RDMA fast paths and clean up pinned host registrations on teardown/failure paths so zero-copy writes fail closed without leaking resources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Apply the clang-format-20 layout expected by CI for the zero-copy write and pinned-memory changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Format pinned-memory logging with clang-format 20.1.8 to match the CI formatter exactly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use the default host-register flag value directly so CUDA-alike builds such as MUSA do not require a cudaHostRegisterDefault macro mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@gemini-code-assist please review again |
There was a problem hiding this comment.
Code Review
This pull request introduces zero-copy optimizations and performance enhancements to the Mooncake store and transfer engine. Key changes include deferring staging to the transfer layer, executing memcpy operations inline on the calling thread to eliminate synchronization overhead, and implementing GPU-safe memory copying and host memory pinning. The review feedback highlights critical issues where modifying the active GPU device via SetDevice without restoring it can pollute the calling thread's state. Additionally, defining static tracking variables inside inline functions in a header file may lead to duplicate instances across shared library boundaries, and there is a minor parameter name mismatch between the declaration and definition of batch_upsert_from_multi_buffers.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Restore the caller's active GPU device after inline GPU copies and move pinned-memory tracking state into a single translation unit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Serialize pinned-memory registration state transitions and restore GPU device state in remaining copy paths so setup and transfer failure paths do not leak caller-visible state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoid requiring a staging allocator for TCP-only transfer-engine writes, while keeping RDMA registration staging for non-TCP transports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoid routing CXL writes through RDMA staging by checking for an RDMA transport directly instead of treating every non-TCP path as RDMA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Check each target buffer protocol before RDMA staging so CXL replicas do not inherit RDMA-only requirements when RDMA is also installed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep copy-style put and upsert APIs failing when no client buffer is configured, while leaving explicit external-buffer write paths unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| auto group_ids_error = | ||
| ValidateGroupIdsForBatchConfig(config, keys.size(), "put"); | ||
| if (!group_ids_error.empty()) return group_ids_error; | ||
|
|
||
| std::vector<int> results(keys.size(), 0); | ||
| { |
There was a problem hiding this comment.
Extending batch_write_tensor_impl is better for readability.
|
We recommend you trying to minimize code modifications as possible, or spiliting it to multiple PRs or commits. |
Motivation
The store write path currently performs unnecessary staging copies for many high-level write APIs.
High-level write APIs (
put,put_parts,put_batch,put_tensor,upsert,upsert_parts,upsert_batch,upsert_tensor) previously always copied user data into a Store-managed staging buffer before submitting the transfer:This has three issues:
std::memcpy.This PR removes the unconditional high-level write staging copy and makes GPU→CPU writes a first-class fast path.
Scenario
This optimization targets write-heavy GPU/CPU store workloads, especially:
The most important case is GPU→CPU local write. On NVIDIA A10, pinned host memory improves 512MB GPU→CPU write bandwidth from ~8–9 GB/s to ~25 GB/s.
Description
Remove unconditional staging in high-level write internals
This PR updates the six high-level write internal functions:
put_internalput_parts_internalput_batch_internalupsert_internalupsert_parts_internalupsert_batch_internalBefore:
After:
The transfer layer now decides whether staging is actually required.
For local memcpy writes, this reduces the write path from two copies to one copy.
For RDMA writes with already-registered buffers, this also removes one host memcpy.
For RDMA writes with unregistered buffers, the total copy count remains the same, but staging is now centralized in the transfer submit layer.
Add RDMA-safe staging in the transfer layer
The transfer submit layer now checks whether write slices are registered with the transfer engine.
If all slices are already registered:
If any slice is not registered:
The staging path:
BufferHandles alive until transfer completionMemcpySafe()so GPU pointers are handled by the proper accelerator runtimekMaxSliceSizeThis preserves RDMA correctness while avoiding unnecessary staging for local and registered-buffer paths.
Enable pinned host memory by default for GPU builds
For GPU builds, Store now tries to pin Store-managed host memory with
cudaHostRegisterby default.Pinned memory is applied to two Store-owned buffer classes:
Client staging buffer
RealClientteardown.Mounted segment buffers
This is a best-effort acceleration path:
cudaHostRegisterfails, Store keeps running and falls back to pageable host memory for that buffer.cudaHostUnregisterfor those tracked regions.Opt out:
MC_STORE_PIN_MEMORY=0 # or MC_STORE_PIN_MEMORY=falseOptionally cap total Store-managed pinned memory:
MC_STORE_PIN_MEMORY_MAX_BYTES=8589934592 # 8GBUnset or
0means no explicit Store-side cap.Pinned memory has important side effects: pinned pages cannot be swapped or reclaimed by the OS, large registrations have setup cost, and pinning a whole large segment can increase system memory pressure. The cap is therefore a safety valve for deployments that use large segments or share the machine with other workloads.
This PR intentionally implements the minimal safe policy: default-on best-effort pinning for Store-managed buffers, plus an opt-out and a total pinned-memory cap. Future work should replace whole-buffer pinning with a registration cache that pins page-aligned hot ranges on demand and evicts them with an LRU/lazy-unpin policy. That would reduce page-locked memory pressure for very large segments while preserving the direct GPU→segment fast path for frequently written ranges.
Python tensor batch path uses multi-buffer writes
batch_put_tensorandbatch_upsert_tensornow pass tensor metadata and tensor data as separate buffers through the multi-buffer write APIs.This avoids the integration-layer staging copy that previously happened before calling the lower-level Store write APIs.
Basic Usage
No API change is required.
Existing write APIs automatically benefit:
Pinned host memory is enabled by default for GPU builds.
To disable pinned memory:
export MC_STORE_PIN_MEMORY=0To cap total Store-managed pinned memory:
If the cap is exceeded or
cudaHostRegisterfails, Store keeps running and uses pageable host memory for that buffer.Module
mooncake-transfer-engine)mooncake-store)mooncake-integration)Type of Change
Performance
Benchmarks were run on an NVIDIA A10 test machine.
Copy count reduction
The affected APIs are the high-level write APIs that accept user data directly, such as
put,put_parts,put_batch,put_tensor,upsert,upsert_parts,upsert_batch, andupsert_tensor.Before this PR, these APIs always copied user data into a Store-managed staging buffer before submitting the transfer. This PR removes that unconditional staging copy and lets the transfer layer stage only when RDMA correctness requires it.
put,put_tensor,upsert, etc.)std::memcpyon user data, which is unsafe for GPU pointers. The new path uses accelerator-aware copy in the transfer layer.put_from,batch_put_from, etc.) + RDMA + unregistered user bufferput_from,batch_put_from, etc.) + registered/local bufferbatch_put_tensor/batch_upsert_tensorlocal memcpyMeasured copy-reduction benefit
The main measured before/after comparison is 512MB CPU
put_tensorwith local memcpy. This is the case where the old path paid one extra host memcpy and the new path removes it.put_tensorThis saved time corresponds to removing one extra host memcpy from the high-level local write path.
For reference, the direct-slice API baseline after this change was:
put_fromMC_STORE_MEMCPY=1, 512MBput_tensorMC_STORE_MEMCPY=1, 512MBput_fromThese rows are not before/after numbers. They are sanity checks showing that, after this PR, high-level writes no longer pay the old unconditional staging copy and are comparable to direct-slice writes under the same transfer strategy.
Pinned memory benefit
GPU→CPU local write, pinned enabled vs disabled:
put_tensorput_tensorput_tensorput_fromput_fromput_fromPinned host memory makes GPU→CPU copies DMA directly into Store buffers instead of going through CUDA’s internal pageable-memory staging path.
How Has This Been Tested?
Build commands:
cmake --build build --target store -j$(nproc)Also built on the GPU/RDMA test machine after syncing this branch:
cmake --build build --target store -j$(nproc)Test commands:
Test results:
client_integration_testput,put_parts,upsert,upsert_parts,upsert_batch,batch_upsert_from,batch_put_from_multi_buffersclient_tcp_local_memcpy_testpybind_client_testUpsertand batch upsert casesCTest summary:
Performance validation
The performance numbers above were collected with temporary ad-hoc benchmark scripts on the test machine. These scripts are not part of this PR; they were used only to validate this optimization.
The benchmarks covered two comparisons:
put_tensor) against direct-slice baseline (put_from) under local memcpy and RDMA+MC_STORE_MEMCPY=1.MC_STORE_PIN_MEMORY=0.The test machine used:
erdma_0192.168.22.70192.168.22.70:8090192.168.22.70:50060Note: on this test machine, PyTorch's CUDA runtime was preloaded for the benchmark process to avoid a CUDA runtime symbol conflict when importing both PyTorch and the Mooncake Python extension:
Checklist
pre-commit run --all-filesand all hooks passAI Assistance Disclosure
Claude Opus 4.6 assisted with code generation, code review, benchmark analysis, and PR description drafting. All changes were reviewed and validated by me。