[Store] Optimize put performance by reducing memcpy and using pinned memory #2598

Open: zxpdemonio wants to merge 15 commits into kvcache-ai:main from openanolis:cruz/optimize_put

Conversation

@zxpdemonio zxpdemonio commented Jun 24, 2026

Collaborator

Motivation

The store write path currently performs unnecessary staging copies for many high-level write APIs.

High-level write APIs (put, put_parts, put_batch, put_tensor, upsert, upsert_parts, upsert_batch, upsert_tensor) previously always copied user data into a Store-managed staging buffer before submitting the transfer:

user buffer
  -> memcpy to client staging buffer
  -> memcpy/RDMA to target segment

This has three issues:

  1. Local writes pay one extra host memcpy.
  2. GPU tensor writes are not handled safely in the old high-level write path because device pointers must not be copied with std::memcpy.
  3. GPU→CPU copies use pageable host memory unless users manually enable pinning, which limits bandwidth.

This PR removes the unconditional high-level write staging copy and makes GPU→CPU writes a first-class fast path.

Scenario

This optimization targets write-heavy GPU/CPU store workloads, especially:

  • GPU tensor writes to local Store segments.
  • CPU tensor/object writes through high-level APIs.
  • RDMA writes where user buffers may or may not already be registered.
  • Tensor batch writes from Python integration APIs.

The most important case is GPU→CPU local write. On NVIDIA A10, pinned host memory improves 512MB GPU→CPU write bandwidth from ~8–9 GB/s to ~25 GB/s.

Description

Remove unconditional staging in high-level write internals

This PR updates the six high-level write internal functions:

  • put_internal
  • put_parts_internal
  • put_batch_internal
  • upsert_internal
  • upsert_parts_internal
  • upsert_batch_internal

Before:

allocate client staging buffer
memcpy user data -> staging buffer
split staging buffer into slices
submit slices

After:

split user buffer directly into slices
submit slices

The transfer layer now decides whether staging is actually required.

For local memcpy writes, this reduces the write path from two copies to one copy.

For RDMA writes with already-registered buffers, this also removes one host memcpy.

For RDMA writes with unregistered buffers, the total copy count remains the same, but staging is now centralized in the transfer submit layer.
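The "after" path can be illustrated with a minimal Python sketch. The helper name split_into_slices appears in this PR's commit messages, but the signature, the use of memoryview objects as slices, and the value of MAX_SLICE_SIZE below are assumptions made for illustration; the real implementation is C++ and may differ.

```python
# Illustrative sketch only: zero-copy slicing of a user buffer.
# MAX_SLICE_SIZE is a hypothetical stand-in for the store's kMaxSliceSize.
MAX_SLICE_SIZE = 4


def split_into_slices(buf, max_slice_size=MAX_SLICE_SIZE):
    """Return memoryview slices that alias `buf` instead of copying it."""
    view = memoryview(buf).cast("B")  # flat byte view, no copy
    return [view[off:off + max_slice_size]
            for off in range(0, len(view), max_slice_size)]


user_buffer = bytearray(b"0123456789")
slices = split_into_slices(user_buffer)

# The slices alias the user buffer: a later write to the buffer is visible
# through the slices, which shows that no staging copy was made.
user_buffer[0] = ord("X")
print([bytes(s) for s in slices])  # [b'X123', b'4567', b'89']
```

Because the slices only reference the caller's memory, the caller has to keep the buffer valid until the write completes.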

Add RDMA-safe staging in the transfer layer

The transfer submit layer now checks whether write slices are registered with the transfer engine.

If all slices are already registered:

registered user buffer
  -> RDMA WRITE

If any slice is not registered:

user buffer
  -> MemcpySafe() to registered staging buffer
  -> RDMA WRITE

The staging path:

  • batches all unregistered slices into one contiguous staging allocation
  • keeps staging BufferHandles alive until transfer completion
  • uses MemcpySafe() so GPU pointers are handled by the proper accelerator runtime
  • splits both staged and already-registered slices by kMaxSliceSize

This preserves RDMA correctness while avoiding unnecessary staging for local and registered-buffer paths.
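The staging decision described above can be sketched roughly as follows. Every name in this block is hypothetical: is_registered stands in for the transfer engine's registration query, a plain slice assignment stands in for MemcpySafe(), and a bytearray stands in for the registered staging allocation.

```python
# Illustrative sketch only: stage unregistered slices into one contiguous
# staging buffer before an RDMA write. All names here are hypothetical.

def is_registered(buf, registered_ids):
    """Stand-in for a transfer-engine registration query."""
    return id(buf) in registered_ids


def prepare_rdma_sources(slices, registered_ids):
    """Return (sources, staging); staging is None if nothing was staged."""
    unregistered = [s for s in slices if not is_registered(s, registered_ids)]
    if not unregistered:
        return list(slices), None  # fast path: RDMA directly from user buffers

    # One contiguous allocation sized for every unregistered slice.
    staging = bytearray(sum(len(s) for s in unregistered))
    view = memoryview(staging)
    sources, offset = [], 0
    for s in slices:
        if is_registered(s, registered_ids):
            sources.append(s)  # already registered: use as-is
            continue
        view[offset:offset + len(s)] = s  # plain copy; stand-in for MemcpySafe()
        sources.append(view[offset:offset + len(s)])
        offset += len(s)
    # The caller must keep `staging` alive until the transfer completes.
    return sources, staging


a, b, c = bytearray(b"aaaa"), bytearray(b"bb"), bytearray(b"cccccc")
registered = {id(a)}  # pretend only `a` is registered
sources, staging = prepare_rdma_sources([a, b, c], registered)
print(bytes(staging))  # b'bbcccccc'
```

In this sketch the registered slice is passed through untouched, while the two unregistered slices share a single staging allocation.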

Enable pinned host memory by default for GPU builds

For GPU builds, Store now tries to pin Store-managed host memory with cudaHostRegister by default.

Pinned memory is applied to two Store-owned buffer classes:

  1. Client staging buffer

    • Used when RDMA writes need a registered source buffer but the user buffer is not already registered.
    • Pinning this buffer improves GPU→CPU staging copies before RDMA WRITE.
    • The buffer is unpinned during RealClient teardown.
  2. Mounted segment buffers

    • Used as the local Store target for memory-backed objects.
    • Pinning these buffers improves GPU→CPU local writes because CUDA can DMA directly into the Store segment instead of using its internal pageable-memory staging path.
    • Segment buffers are unpinned on unmount. If mounting the segment to the master fails after local registration, Store now also unpins and unregisters the buffer on that failure path.

This is a best-effort acceleration path:

  • If cudaHostRegister fails, Store keeps running and falls back to pageable host memory for that buffer.
  • If the configured pinned-memory budget is exceeded, Store skips pinning that buffer and keeps running.
  • Correctness does not depend on pinned memory; only GPU→CPU copy performance does.
  • Store tracks only successfully pinned regions and only calls cudaHostUnregister for those tracked regions.
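The best-effort policy above can be sketched as a small tracker. This is an illustration under assumptions, not the Store's code: the PinTracker class and its method names are hypothetical, and the register and unregister callables are injected stand-ins for cudaHostRegister and cudaHostUnregister so the sketch runs without a GPU.

```python
# Illustrative sketch only: best-effort pinning with a total byte budget.
# `register` stands in for cudaHostRegister and `unregister` for
# cudaHostUnregister; both are injected so the sketch runs without a GPU.

class PinTracker:
    def __init__(self, register, unregister, max_bytes=0):
        self._register = register
        self._unregister = unregister
        self._max_bytes = max_bytes  # 0 means no explicit cap
        self._pinned = {}            # region key -> size, successes only

    def try_pin(self, key, size):
        """Return True if pinned; False means 'use pageable memory'."""
        if size == 0:
            return False  # nothing to pin
        used = sum(self._pinned.values())
        if self._max_bytes and used + size > self._max_bytes:
            return False  # over budget: skip pinning, keep running
        if not self._register(key, size):
            return False  # registration failed: keep running
        self._pinned[key] = size
        return True

    def unpin(self, key):
        """Unregister only regions that were successfully pinned."""
        if self._pinned.pop(key, None) is not None:
            self._unregister(key)


calls = []
tracker = PinTracker(register=lambda key, size: key != "bad",
                     unregister=calls.append, max_bytes=100)
results = [tracker.try_pin("staging", 60),   # fits in the budget
           tracker.try_pin("segment", 60),   # would exceed the 100-byte cap
           tracker.try_pin("bad", 10)]       # simulated registration failure
tracker.unpin("segment")  # never pinned, so no unregister call
tracker.unpin("staging")
print(results, calls)  # [True, False, False] ['staging']
```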

Opt out:

MC_STORE_PIN_MEMORY=0
# or
MC_STORE_PIN_MEMORY=false

Optionally cap total Store-managed pinned memory:

MC_STORE_PIN_MEMORY_MAX_BYTES=8589934592  # 8 GiB (8 * 1024^3 bytes)

Unset or 0 means no explicit Store-side cap.
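As a rough illustration of these settings, the following sketch parses the two environment variables named above. The function names are hypothetical, and the handling of letter case, surrounding whitespace, and unparsable values is an assumption; only "0", "false", unset, and the 0-means-no-cap rule come from this description.

```python
import os


def pin_memory_enabled(env=os.environ):
    """Pinning is on unless MC_STORE_PIN_MEMORY is 0 or false."""
    value = env.get("MC_STORE_PIN_MEMORY")
    if value is None:
        return True  # default-on for GPU builds
    # Assumption: comparison ignores case and surrounding whitespace.
    return value.strip().lower() not in ("0", "false")


def pin_memory_max_bytes(env=os.environ):
    """Return the pinned-memory cap in bytes; 0 means no explicit cap."""
    raw = env.get("MC_STORE_PIN_MEMORY_MAX_BYTES", "0")
    try:
        value = int(raw)
    except ValueError:
        return 0  # assumption: treat unparsable values as "no cap"
    return max(value, 0)


print(pin_memory_enabled({"MC_STORE_PIN_MEMORY": "false"}))  # False
print(pin_memory_max_bytes({"MC_STORE_PIN_MEMORY_MAX_BYTES": "8589934592"}))  # 8589934592
```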

Pinned memory has important side effects: pinned pages cannot be swapped or reclaimed by the OS, large registrations have setup cost, and pinning a whole large segment can increase system memory pressure. The cap is therefore a safety valve for deployments that use large segments or share the machine with other workloads.

This PR intentionally implements the minimal safe policy: default-on best-effort pinning for Store-managed buffers, plus an opt-out and a total pinned-memory cap. Future work should replace whole-buffer pinning with a registration cache that pins page-aligned hot ranges on demand and evicts them with an LRU/lazy-unpin policy. That would reduce page-locked memory pressure for very large segments while preserving the direct GPU→segment fast path for frequently written ranges.

Python tensor batch path uses multi-buffer writes

batch_put_tensor and batch_upsert_tensor now pass tensor metadata and tensor data as separate buffers through the multi-buffer write APIs.

This avoids the integration-layer staging copy that previously happened before calling the lower-level Store write APIs.
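The idea of passing metadata and data as separate buffers can be sketched as follows. The metadata layout (rank followed by each dimension) and the helper name tensor_buffers are hypothetical; the actual tensor metadata format used by the integration layer is not described here.

```python
import struct


def tensor_buffers(shape, data):
    """Return [metadata, data] buffers for one tensor without joining them."""
    # Hypothetical metadata layout: uint32 rank, then each dimension as uint64.
    metadata = struct.pack(f"<I{len(shape)}Q", len(shape), *shape)
    return [metadata, memoryview(data)]  # data is referenced, not copied


payload = bytearray(range(12))  # stand-in for a 3x4 tensor of bytes
buffers = tensor_buffers((3, 4), payload)
print(len(buffers[0]), len(buffers[1]))  # 20 12
```

Concatenating the two buffers into one staging area would copy the whole tensor; handing them over as a list keeps the tensor data in place.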

Basic Usage

No API change is required.

Existing write APIs automatically benefit:

store.put(key, data)
store.put_tensor(key, tensor)
store.batch_put_tensor(keys, tensors)
store.upsert_tensor(key, tensor)

Pinned host memory is enabled by default for GPU builds.

To disable pinned memory:

export MC_STORE_PIN_MEMORY=0

To cap total Store-managed pinned memory:

export MC_STORE_PIN_MEMORY_MAX_BYTES=8589934592  # 8 GiB (8 * 1024^3 bytes)

If the cap is exceeded or cudaHostRegister fails, Store keeps running and uses pageable host memory for that buffer.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Integration (mooncake-integration)
  • Docs

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement

Performance

Benchmarks were run on an NVIDIA A10 test machine.

Copy count reduction

The affected APIs are the high-level write APIs that accept user data directly, such as put, put_parts, put_batch, put_tensor, upsert, upsert_parts, upsert_batch, and upsert_tensor.

Before this PR, these APIs always copied user data into a Store-managed staging buffer before submitting the transfer. This PR removes that unconditional staging copy and lets the transfer layer stage only when RDMA correctness requires it.

| API / scenario | Before | After | Copy reduction | Notes |
| --- | --- | --- | --- | --- |
| High-level write APIs + local memcpy (put, put_tensor, upsert, etc.) | 2 copies | 1 copy | -1 copy | User buffer is copied directly into the local Store segment. |
| High-level write APIs + RDMA + registered user buffer | 2 copies | 1 copy | -1 copy | Registered user buffer can be used directly as the RDMA source. |
| High-level write APIs + RDMA + unregistered user buffer | 2 copies | 2 copies | 0 | Staging is still required, but it is now centralized in the transfer layer. |
| High-level write APIs + GPU source buffer | Crash-prone | 1 or 2 copies | N/A | The old path used std::memcpy on user data, which is unsafe for GPU pointers. The new path uses accelerator-aware copy in the transfer layer. |
| Direct-slice APIs (put_from, batch_put_from, etc.) + RDMA + unregistered user buffer | Crash-prone | 2 copies | N/A | The old direct-slice RDMA path assumed source buffers were already registered. The new transfer-layer staging path handles unregistered buffers safely. |
| Direct-slice APIs (put_from, batch_put_from, etc.) + registered/local buffer | 1 copy | 1 copy | 0 | These APIs already submit user slices directly. |
| Python batch_put_tensor / batch_upsert_tensor local memcpy | 2 copies | 1 copy | -1 copy | Metadata and tensor data are now passed as separate buffers through multi-buffer writes. |

Measured copy-reduction benefit

The main measured before/after comparison is 512MB CPU put_tensor with local memcpy. This is the case where the old path paid one extra host memcpy and the new path removes it.

| Scenario | Path | Before latency | After latency | Improvement | After bandwidth |
| --- | --- | --- | --- | --- | --- |
| Local memcpy, 512MB | put_tensor | ~112 ms | 73.01 ms | ~1.53x faster, ~39 ms saved | 7.35 GB/s |

This saved time corresponds to removing one extra host memcpy from the high-level local write path.
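The figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes that "512MB" means 512 MiB and that GB/s means 10^9 bytes per second; those assumptions are not stated in the description, but they reproduce the reported values.

```python
SIZE_BYTES = 512 * 1024 * 1024  # assumes "512MB" means 512 MiB


def bandwidth_gb_per_s(latency_ms, size_bytes=SIZE_BYTES):
    """Bandwidth in GB/s (10^9 bytes per second) for one transfer."""
    return size_bytes / (latency_ms / 1000.0) / 1e9


before_ms, after_ms = 112.0, 73.01
print(round(bandwidth_gb_per_s(after_ms), 2))  # 7.35
print(round(before_ms / after_ms, 2))          # 1.53
print(round(before_ms - after_ms, 2))          # 38.99
```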

For reference, these are the measurements taken after this change, including the direct-slice API baseline:

| Scenario | Path | Latency | Bandwidth | Notes |
| --- | --- | --- | --- | --- |
| Local memcpy, 512MB | put_from | 38.23 ms | 14.04 GB/s | Existing direct-slice API; not the optimized target in this PR. |
| RDMA setup with MC_STORE_MEMCPY=1, 512MB | put_tensor | 76.90 ms | 6.98 GB/s | High-level write API after removing unconditional staging. |
| RDMA setup with MC_STORE_MEMCPY=1, 512MB | put_from | 74.70 ms | 7.19 GB/s | Existing direct-slice API under the same memcpy-forced setup. |

These rows are not before/after numbers. They are sanity checks showing that, after this PR, high-level writes no longer pay the old unconditional staging copy and are comparable to direct-slice writes under the same transfer strategy.

Pinned memory benefit

GPU→CPU local write, pinned enabled vs disabled:

| Path | Size | Pinned OFF | Pinned ON | Speedup |
| --- | --- | --- | --- | --- |
| GPU put_tensor | 64MB | 8.26 GB/s | 25.10 GB/s | 3.04x |
| GPU put_tensor | 256MB | 9.32 GB/s | 25.76 GB/s | 2.76x |
| GPU put_tensor | 512MB | 9.03 GB/s | 24.84 GB/s | 2.75x |
| GPU put_from | 64MB | 8.11 GB/s | 25.29 GB/s | 3.12x |
| GPU put_from | 256MB | 8.25 GB/s | 25.82 GB/s | 3.13x |
| GPU put_from | 512MB | 8.20 GB/s | 25.91 GB/s | 3.16x |

Pinned host memory makes GPU→CPU copies DMA directly into Store buffers instead of going through CUDA’s internal pageable-memory staging path.

How Has This Been Tested?

Build commands:

cmake --build build --target store -j$(nproc)

Also built on the GPU/RDMA test machine after syncing this branch:

cmake --build build --target store -j$(nproc)

Test commands:

cmake --build build --target client_integration_test client_tcp_local_memcpy_test pybind_client_test -j$(nproc)
ctest --test-dir build -R 'client_integration_test|client_tcp_local_memcpy_test|pybind_client_test' --output-on-failure

Test results:

  • Unit tests pass
  • Integration tests pass
  • Manual performance testing done on GPU/RDMA test machine

| Test | Result | Coverage |
| --- | --- | --- |
| client_integration_test | Passed | put, put_parts, upsert, upsert_parts, upsert_batch, batch_upsert_from, batch_put_from_multi_buffers |
| client_tcp_local_memcpy_test | Passed | TCP local memcpy behavior |
| pybind_client_test | Passed | PyClient write/update paths including Upsert and batch upsert cases |

CTest summary:

100% tests passed, 0 tests failed out of 3
Total Test time (real) = 379.76 sec

Performance validation

The performance numbers above were collected with temporary ad-hoc benchmark scripts on the test machine. These scripts are not part of this PR; they were used only to validate this optimization.

The benchmarks covered two comparisons:

  1. copy reduction: compare high-level write API (put_tensor) against direct-slice baseline (put_from) under local memcpy and RDMA+MC_STORE_MEMCPY=1.
  2. pinned host memory: compare GPU→CPU local writes with default pinned memory against MC_STORE_PIN_MEMORY=0.

The test machine used:

| Item | Value |
| --- | --- |
| GPU | NVIDIA A10 |
| RDMA device | erdma_0 |
| Host | 192.168.22.70 |
| Master HTTP | 192.168.22.70:8090 |
| Master RPC | 192.168.22.70:50060 |
| Tensor sizes | 64MB, 256MB, 512MB |

Note: on this test machine, PyTorch's CUDA runtime was preloaded for the benchmark process to avoid a CUDA runtime symbol conflict when importing both PyTorch and the Mooncake Python extension:

LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12

Checklist

  • I have performed a self-review of my own code
  • I have formatted the changed code
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • AI tools were used (specify below)

Claude Opus 4.6 assisted with code generation, code review, benchmark analysis, and PR description drafting. All changes were reviewed and validated by me.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request optimizes data transfer and memory staging by deferring staging to the transfer layer, introducing zero-copy batch upsert methods, executing local memory copies inline on the calling thread to eliminate thread synchronization overhead, and pinning memory regions with cudaHostRegister for DMA-backed GPU transfers. The review feedback focuses on further optimizing and safeguarding these changes, including adding a fast-path check for zero-sized copies in MemcpySafe, querying device pointer status once outside the loop in submitMemcpyOperation, reusing the split_into_slices helper to simplify RDMA staging, and guarding memory pinning to avoid registering zero-sized local buffers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread mooncake-store/include/gpu_staging_utils.h
Comment thread mooncake-store/src/transfer_task.cpp
Comment thread mooncake-store/src/transfer_task.cpp Outdated
Comment thread mooncake-store/src/real_client.cpp Outdated
zxpdemonio and others added 3 commits June 24, 2026 08:09
Remove the unconditional alloc + memcpy staging from all 6 Pattern A
*_internal functions (put, put_parts, put_batch, upsert, upsert_parts,
upsert_batch). Slices are now created directly from source data via
split_into_slices(ptr, size), deferring staging to ensureRegisteredForRDMA
only when the RDMA path encounters unregistered memory.

Key changes:
- real_client.cpp: 6 *_internal functions skip staging, use direct slices
- transfer_task.cpp: ensureRegisteredForRDMA stages unregistered slices
  with single contiguous alloc; selectStrategy dispatches LOCAL_MEMCPY
  (inline, GPU-safe) vs TRANSFER_ENGINE; kMaxSliceSize splitting applied
  to both registered and staged slices
- gpu_staging_utils.h: MemcpySafe, CopyAuto, IsDevicePointer utilities;
  async batch copy support (CUDA Driver API / MUSA Runtime API)
- transfer_engine: isLocalMemoryRegistered query for registration check
- client_service: setStagingAllocator to pass allocator to TransferSubmitter
- store_py.cpp: batch_put/upsert_tensor use multi_buffers API to bypass
  integration-layer staging
- batch_upsert_from_multi_buffers: new API mirroring batch_put variant

Performance: LOCAL_MEMCPY path reduces from 2 copies to 1 for all
Pattern A callers. RDMA with registered buffers also saves 1 copy.
RDMA with unregistered buffers unchanged (staging moves to submit layer).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pin staging and segment buffers by default so that GPU→host copies
use direct DMA instead of CUDA's internal staging through a temporary
pinned buffer. This roughly doubles GPU→CPU memcpy bandwidth on PCIe.

Opt out with MC_STORE_PIN_MEMORY=0 if pinning conflicts with other
GPU workloads or exceeds the locked-page limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove the MC_STORE_ASYNC_STAGING path and its CUDA/MUSA async-copy helpers.
Benchmarking showed no stable benefit for GPU→RDMA staging: single put_from
was 0.97x-1.01x and small batch cases regressed up to 0.93x, while the
synchronous MemcpySafe path remains simpler and stable.

Pinned host memory remains enabled by default for GPU builds, which provides
the material GPU→CPU gain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zxpdemonio zxpdemonio changed the title Cruz/optimize put [Store] Optimize put performence by reduce memcpy and pinned memroy Jun 24, 2026
@zxpdemonio zxpdemonio changed the title [Store] Optimize put performence by reduce memcpy and pinned memroy [Store] Optimize put performence by reduce memcpy and pinned memory Jun 24, 2026
@zxpdemonio zxpdemonio linked an issue Jun 24, 2026 that may be closed by this pull request
@zxpdemonio zxpdemonio added this to the 2479 milestone Jun 24, 2026
Handle zero-sized copies safely and reduce redundant GPU pointer queries in the memcpy submit path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Cap Store-managed cudaHostRegister usage with MC_STORE_PIN_MEMORY_MAX_BYTES and keep formatting changes targeted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zxpdemonio and others added 4 commits June 24, 2026 09:59
Validate full registered ranges before RDMA fast paths and clean up pinned host registrations on teardown/failure paths so zero-copy writes fail closed without leaking resources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Apply the clang-format-20 layout expected by CI for the zero-copy write and pinned-memory changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Format pinned-memory logging with clang-format 20.1.8 to match the CI formatter exactly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use the default host-register flag value directly so CUDA-alike builds such as MUSA do not require a cudaHostRegisterDefault macro mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zxpdemonio

Collaborator Author

@gemini-code-assist please review again

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces zero-copy optimizations and performance enhancements to the Mooncake store and transfer engine. Key changes include deferring staging to the transfer layer, executing memcpy operations inline on the calling thread to eliminate synchronization overhead, and implementing GPU-safe memory copying and host memory pinning. The review feedback highlights critical issues where modifying the active GPU device via SetDevice without restoring it can pollute the calling thread's state. Additionally, defining static tracking variables inside inline functions in a header file may lead to duplicate instances across shared library boundaries, and there is a minor parameter name mismatch between the declaration and definition of batch_upsert_from_multi_buffers.
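The device-state concern raised in this review can be illustrated with a small save-and-restore guard. This is a generic sketch, not the code in this PR: FakeRuntime and device_guard are hypothetical names, and the fake runtime only models a per-thread "current device" value so the example runs without a GPU.

```python
from contextlib import contextmanager


class FakeRuntime:
    """Hypothetical stand-in for a GPU runtime's current-device state."""

    def __init__(self):
        self.current = 0

    def get_device(self):
        return self.current

    def set_device(self, device):
        self.current = device


@contextmanager
def device_guard(runtime, device):
    """Switch to `device` for the block, then restore the previous device."""
    previous = runtime.get_device()
    runtime.set_device(device)
    try:
        yield
    finally:
        runtime.set_device(previous)  # restore even if the copy raises


runtime = FakeRuntime()
with device_guard(runtime, 3):
    inside = runtime.get_device()  # the copy would run here
after = runtime.get_device()
print(inside, after)  # 3 0
```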

Comment thread mooncake-store/include/gpu_staging_utils.h Outdated
Comment thread mooncake-store/src/transfer_task.cpp Outdated
Comment thread mooncake-store/include/gpu_staging_utils.h Outdated
Comment thread mooncake-store/src/real_client.cpp Outdated
Restore the caller's active GPU device after inline GPU copies and move pinned-memory tracking state into a single translation unit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zxpdemonio and others added 3 commits June 24, 2026 12:30
Serialize pinned-memory registration state transitions and restore GPU device state in remaining copy paths so setup and transfer failure paths do not leak caller-visible state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Avoid requiring a staging allocator for TCP-only transfer-engine writes, while keeping RDMA registration staging for non-TCP transports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Avoid routing CXL writes through RDMA staging by checking for an RDMA transport directly instead of treating every non-TCP path as RDMA.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 25, 2026

zxpdemonio and others added 2 commits June 25, 2026 05:33
Check each target buffer protocol before RDMA staging so CXL replicas do not inherit RDMA-only requirements when RDMA is also installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep copy-style put and upsert APIs failing when no client buffer is configured, while leaving explicit external-buffer write paths unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +802 to +807
auto group_ids_error =
    ValidateGroupIdsForBatchConfig(config, keys.size(), "put");
if (!group_ids_error.empty()) return group_ids_error;

std::vector<int> results(keys.size(), 0);
{

Collaborator

Extending batch_write_tensor_impl is better for readability.

@alogfans

Collaborator

We recommend minimizing code modifications as much as possible, or splitting this change into multiple PRs or commits.

Development

Successfully merging this pull request may close these issues.

[RFC]: Zero-Copy Read/Write: Eliminate Redundant Data Copies

3 participants