[Store] Optimize put performence by reduce memcpy and pinned memory by zxpdemonio · Pull Request #2598 · kvcache-ai/Mooncake

zxpdemonio · 2026-06-24T08:05:17Z

Motivation

The store write path currently performs unnecessary staging copies for many high-level write APIs.

High-level write APIs (put, put_parts, put_batch, put_tensor, upsert, upsert_parts, upsert_batch, upsert_tensor) previously always copied user data into a Store-managed staging buffer before submitting the transfer:

user buffer
  -> memcpy to client staging buffer
  -> memcpy/RDMA to target segment

This has three issues:

Local writes pay one extra host memcpy.
GPU tensor writes are not handled safely in the old high-level write path because device pointers must not be copied with std::memcpy.
GPU→CPU copies use pageable host memory unless users manually enable pinning, which limits bandwidth.

This PR removes the unconditional high-level write staging copy and makes GPU→CPU writes a first-class fast path.

Scenario

This optimization targets write-heavy GPU/CPU store workloads, especially:

GPU tensor writes to local Store segments.
CPU tensor/object writes through high-level APIs.
RDMA writes where user buffers may or may not already be registered.
Tensor batch writes from Python integration APIs.

The most important case is GPU→CPU local write. On NVIDIA A10, pinned host memory improves 512MB GPU→CPU write bandwidth from ~8–9 GB/s to ~25 GB/s.

Description

Remove unconditional staging in high-level write internals

This PR updates the six high-level write internal functions:

put_internal
put_parts_internal
put_batch_internal
upsert_internal
upsert_parts_internal
upsert_batch_internal

Before:

allocate client staging buffer
memcpy user data -> staging buffer
split staging buffer into slices
submit slices

After:

split user buffer directly into slices
submit slices

The transfer layer now decides whether staging is actually required.

For local memcpy writes, this reduces the write path from two copies to one copy.

For RDMA writes with already-registered buffers, this also removes one host memcpy.

For RDMA writes with unregistered buffers, the total copy count remains the same, but staging is now centralized in the transfer submit layer.
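
The before/after shapes above can be modelled in a few lines of Python. This is only an illustrative sketch, not the C++ in real_client.cpp: `put_before`, `put_after`, and the tiny `MAX_SLICE_SIZE` are hypothetical names and values, and `split_into_slices` here only mimics the idea of the helper mentioned in the commit message.

```python
# Illustrative sketch only: names and the slice size are assumptions,
# not Mooncake's actual implementation.
MAX_SLICE_SIZE = 4  # hypothetical stand-in for kMaxSliceSize (tiny for the demo)


def split_into_slices(buf, max_slice=MAX_SLICE_SIZE):
    """Return zero-copy views over buf, each at most max_slice bytes long."""
    view = memoryview(buf)
    return [view[off:off + max_slice] for off in range(0, len(view), max_slice)]


def put_before(user_data):
    """Old shape: copy into a staging buffer, then slice the staging buffer."""
    staging = bytearray(len(user_data))
    staging[:] = user_data                 # the extra host memcpy
    return split_into_slices(staging), 1   # (slices, staging copies paid)


def put_after(user_data):
    """New shape: slice the caller's buffer directly, with no staging copy."""
    return split_into_slices(user_data), 0


data = bytearray(b"0123456789")
old_slices, old_copies = put_before(data)
new_slices, new_copies = put_after(data)

# Both shapes produce the same slice layout.
assert [bytes(s) for s in old_slices] == [bytes(s) for s in new_slices]

# Only the new shape aliases the caller's memory.
data[0] = ord("X")
print(bytes(new_slices[0]), bytes(old_slices[0]), old_copies, new_copies)
# prints: b'X123' b'0123' 1 0
```

Because the new slices alias the caller's buffer, the caller presumably has to keep that buffer alive and unmodified until the write completes.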

Add RDMA-safe staging in the transfer layer

The transfer submit layer now checks whether write slices are registered with the transfer engine.

If all slices are already registered:

registered user buffer
  -> RDMA WRITE

If any slice is not registered:

user buffer
  -> MemcpySafe() to registered staging buffer
  -> RDMA WRITE

The staging path:

batches all unregistered slices into one contiguous staging allocation
keeps staging BufferHandles alive until transfer completion
uses MemcpySafe() so GPU pointers are handled by the proper accelerator runtime
splits both staged and already-registered slices by kMaxSliceSize

This preserves RDMA correctness while avoiding unnecessary staging for local and registered-buffer paths.
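
The staging decision described above can be sketched roughly as follows. This is a simplified Python model rather than the code in transfer_task.cpp: `Slice`, `ensure_registered`, the `registered` flag, and the small `MAX_SLICE_SIZE` are illustrative assumptions, and a plain buffer copy stands in for MemcpySafe().

```python
from dataclasses import dataclass

MAX_SLICE_SIZE = 8  # hypothetical stand-in for kMaxSliceSize


@dataclass
class Slice:
    data: memoryview
    registered: bool  # stand-in for a transfer-engine registration query


def split(view, max_slice=MAX_SLICE_SIZE):
    """Split one view into zero-copy pieces of at most max_slice bytes."""
    return [view[o:o + max_slice] for o in range(0, len(view), max_slice)]


def ensure_registered(slices):
    """Return (rdma_sources, staging); staging must outlive the transfer."""
    pending = [s for s in slices if not s.registered]
    if not pending:
        # Fast path: every slice is already a valid RDMA source.
        return [piece for s in slices for piece in split(s.data)], None

    # One contiguous staging allocation for all unregistered slices.
    staging = bytearray(sum(len(s.data) for s in pending))
    staged_view = memoryview(staging)
    sources, offset = [], 0
    for s in slices:
        if s.registered:
            sources.extend(split(s.data))
        else:
            n = len(s.data)
            staged_view[offset:offset + n] = s.data  # stand-in for MemcpySafe()
            sources.extend(split(staged_view[offset:offset + n]))
            offset += n
    return sources, staging


registered = Slice(memoryview(b"A" * 10), True)
unregistered = Slice(memoryview(b"b" * 12), False)
sources, staging = ensure_registered([registered, unregistered])
print([len(s) for s in sources], len(staging))  # prints: [8, 2, 8, 4] 12
```

Returning the staging buffer alongside the sources mirrors the requirement that staging handles stay alive until the transfer completes.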

Enable pinned host memory by default for GPU builds

For GPU builds, Store now tries to pin Store-managed host memory with cudaHostRegister by default.

Pinned memory is applied to two Store-owned buffer classes:

Client staging buffer
- Used when RDMA writes need a registered source buffer but the user buffer is not already registered.
- Pinning this buffer improves GPU→CPU staging copies before RDMA WRITE.
- The buffer is unpinned during RealClient teardown.
Mounted segment buffers
- Used as the local Store target for memory-backed objects.
- Pinning these buffers improves GPU→CPU local writes because CUDA can DMA directly into the Store segment instead of using its internal pageable-memory staging path.
- Segment buffers are unpinned on unmount. If mounting the segment to the master fails after local registration, Store now also unpins and unregisters the buffer on that failure path.

This is a best-effort acceleration path:

If cudaHostRegister fails, Store keeps running and falls back to pageable host memory for that buffer.
If the configured pinned-memory budget is exceeded, Store skips pinning that buffer and keeps running.
Correctness does not depend on pinned memory; only GPU→CPU copy performance does.
Store tracks only successfully pinned regions and only calls cudaHostUnregister for those tracked regions.
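
A minimal sketch of this best-effort policy, assuming a tracker object that owns the budget. `PinnedRegionTracker` and its methods are hypothetical names, and the injected `register` / `unregister` callables stand in for cudaHostRegister / cudaHostUnregister so the sketch runs without a GPU.

```python
import threading


class PinnedRegionTracker:
    """Best-effort pinning with an optional byte budget (0 means no cap)."""

    def __init__(self, register, unregister, max_bytes=0):
        self._register = register      # stand-in for cudaHostRegister
        self._unregister = unregister  # stand-in for cudaHostUnregister
        self._max_bytes = max_bytes
        self._pinned = {}              # addr -> size, successful pins only
        self._lock = threading.Lock()  # serialize state transitions

    def try_pin(self, addr, size):
        """Return True if the region is now pinned, False if it stays pageable."""
        with self._lock:
            if size == 0:
                return False           # nothing to pin
            used = sum(self._pinned.values())
            if self._max_bytes and used + size > self._max_bytes:
                return False           # over budget: skip pinning, keep running
            if not self._register(addr, size):
                return False           # registration failed: keep running
            self._pinned[addr] = size
            return True

    def unpin(self, addr):
        """Unregister only regions that were successfully pinned earlier."""
        with self._lock:
            if self._pinned.pop(addr, None) is not None:
                self._unregister(addr)


unregistered_addrs = []
tracker = PinnedRegionTracker(
    register=lambda addr, size: addr != 0x3000,  # simulate one failing region
    unregister=unregistered_addrs.append,
    max_bytes=100,
)
results = [
    tracker.try_pin(0x1000, 60),  # fits within the cap
    tracker.try_pin(0x2000, 60),  # would exceed the cap
    tracker.try_pin(0x3000, 10),  # registration fails
]
tracker.unpin(0x2000)  # never pinned, so no unregister call
tracker.unpin(0x1000)
print(results, unregistered_addrs)  # prints: [True, False, False] [4096]
```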

Opt out:

MC_STORE_PIN_MEMORY=0
# or
MC_STORE_PIN_MEMORY=false

Optionally cap total Store-managed pinned memory:

MC_STORE_PIN_MEMORY_MAX_BYTES=8589934592  # 8GB

Unset or 0 means no explicit Store-side cap.
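
For illustration, the two settings could be interpreted along these lines. This is a hedged Python sketch, not the Store's actual parser: the function names are hypothetical, and the case-insensitive matching, whitespace stripping, and handling of other values are assumptions beyond what is stated above.

```python
import os


def pin_memory_enabled(env=os.environ):
    """Default on; only the documented opt-out values disable pinning here."""
    value = env.get("MC_STORE_PIN_MEMORY")
    if value is None:
        return True
    # Assumption: matching is case-insensitive and ignores surrounding spaces.
    return value.strip().lower() not in ("0", "false")


def pin_memory_max_bytes(env=os.environ):
    """Return the cap in bytes, or None when unset or 0 (no Store-side cap)."""
    raw = env.get("MC_STORE_PIN_MEMORY_MAX_BYTES", "").strip()
    if not raw:
        return None
    cap = int(raw)  # assumption: a plain decimal byte count
    return cap if cap > 0 else None


example = {"MC_STORE_PIN_MEMORY_MAX_BYTES": "8589934592"}
print(pin_memory_enabled(example), pin_memory_max_bytes(example) == 8 * 1024**3)
# prints: True True
```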

Pinned memory has important side effects: pinned pages cannot be swapped or reclaimed by the OS, large registrations have setup cost, and pinning a whole large segment can increase system memory pressure. The cap is therefore a safety valve for deployments that use large segments or share the machine with other workloads.

This PR intentionally implements the minimal safe policy: default-on best-effort pinning for Store-managed buffers, plus an opt-out and a total pinned-memory cap. Future work should replace whole-buffer pinning with a registration cache that pins page-aligned hot ranges on demand and evicts them with an LRU/lazy-unpin policy. That would reduce page-locked memory pressure for very large segments while preserving the direct GPU→segment fast path for frequently written ranges.

Python tensor batch path uses multi-buffer writes

batch_put_tensor and batch_upsert_tensor now pass tensor metadata and tensor data as separate buffers through the multi-buffer write APIs.

This avoids the integration-layer staging copy that previously happened before calling the lower-level Store write APIs.
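
The idea can be illustrated with a small Python sketch. The metadata layout, `encode_metadata`, and both helper functions are hypothetical; the real tensor header format and the multi-buffer API signatures are not shown in this description.

```python
import struct


def encode_metadata(shape, dtype_code):
    """Hypothetical header: dtype code, rank, then each dimension (little-endian)."""
    return struct.pack(f"<II{len(shape)}Q", dtype_code, len(shape), *shape)


def as_single_buffer(metadata, tensor_bytes):
    """Old shape: concatenate header and payload, copying the whole payload."""
    return metadata + bytes(tensor_bytes)


def as_multi_buffers(metadata, tensor_bytes):
    """New shape: hand over header and payload as separate zero-copy views."""
    return [memoryview(metadata), memoryview(tensor_bytes)]


payload = bytearray(range(16))           # stand-in for tensor storage
meta = encode_metadata((4, 4), dtype_code=1)
parts = as_multi_buffers(meta, payload)

# The bytes written are identical either way.
assert b"".join(parts) == as_single_buffer(meta, payload)
# The payload view still refers to the original tensor storage.
assert parts[1].obj is payload
print(len(meta), [len(p) for p in parts])  # prints: 24 [24, 16]
```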

Basic Usage

No API change is required.

Existing write APIs automatically benefit:

store.put(key, data)
store.put_tensor(key, tensor)
store.batch_put_tensor(keys, tensors)
store.upsert_tensor(key, tensor)

Pinned host memory is enabled by default for GPU builds.

To disable pinned memory:

export MC_STORE_PIN_MEMORY=0

To cap total Store-managed pinned memory:

export MC_STORE_PIN_MEMORY_MAX_BYTES=8589934592  # 8GB

If the cap is exceeded or cudaHostRegister fails, Store keeps running and uses pageable host memory for that buffer.

Module

Transfer Engine (mooncake-transfer-engine)
Mooncake Store (mooncake-store)
Integration (mooncake-integration)
Docs

Type of Change

Performance

Benchmarks were run on an NVIDIA A10 test machine.

Copy count reduction

The affected APIs are the high-level write APIs that accept user data directly, such as put, put_parts, put_batch, put_tensor, upsert, upsert_parts, upsert_batch, and upsert_tensor.

Before this PR, these APIs always copied user data into a Store-managed staging buffer before submitting the transfer. This PR removes that unconditional staging copy and lets the transfer layer stage only when RDMA correctness requires it.

| API / scenario | Before | After | Copy reduction | Notes |
|---|---|---|---|---|
| High-level write APIs + local memcpy (`put`, `put_tensor`, `upsert`, etc.) | 2 copies | 1 copy | -1 copy | User buffer is copied directly into the local Store segment. |
| High-level write APIs + RDMA + registered user buffer | 2 copies | 1 copy | -1 copy | Registered user buffer can be used directly as the RDMA source. |
| High-level write APIs + RDMA + unregistered user buffer | 2 copies | 2 copies | 0 | Staging is still required, but it is now centralized in the transfer layer. |
| High-level write APIs + GPU source buffer | Crash-prone | 1 or 2 copies | N/A | The old path used `std::memcpy` on user data, which is unsafe for GPU pointers. The new path uses accelerator-aware copy in the transfer layer. |
| Direct-slice APIs (`put_from`, `batch_put_from`, etc.) + RDMA + unregistered user buffer | Crash-prone | 2 copies | N/A | The old direct-slice RDMA path assumed source buffers were already registered. The new transfer-layer staging path handles unregistered buffers safely. |
| Direct-slice APIs (`put_from`, `batch_put_from`, etc.) + registered/local buffer | 1 copy | 1 copy | 0 | These APIs already submit user slices directly. |
| Python `batch_put_tensor` / `batch_upsert_tensor` local memcpy | 2 copies | 1 copy | -1 copy | Metadata and tensor data are now passed as separate buffers through multi-buffer writes. |

Measured copy-reduction benefit

The main measured before/after comparison is 512MB CPU put_tensor with local memcpy. This is the case where the old path paid one extra host memcpy and the new path removes it.

| Scenario | Path | Before latency | After latency | Improvement | After bandwidth |
|---|---|---|---|---|---|
| Local memcpy, 512MB | `put_tensor` | ~112 ms | 73.01 ms | ~1.53x faster, ~39 ms saved | 7.35 GB/s |

This saved time corresponds to removing one extra host memcpy from the high-level local write path.

For reference, the direct-slice API baseline after this change was:

| Scenario | Path | Latency | Bandwidth | Notes |
|---|---|---|---|---|
| Local memcpy, 512MB | `put_from` | 38.23 ms | 14.04 GB/s | Existing direct-slice API; not the optimized target in this PR. |
| RDMA setup with `MC_STORE_MEMCPY=1`, 512MB | `put_tensor` | 76.90 ms | 6.98 GB/s | High-level write API after removing unconditional staging. |
| RDMA setup with `MC_STORE_MEMCPY=1`, 512MB | `put_from` | 74.70 ms | 7.19 GB/s | Existing direct-slice API under the same memcpy-forced setup. |

These rows are not before/after numbers. They are sanity checks showing that, after this PR, high-level writes no longer pay the old unconditional staging copy and are comparable to direct-slice writes under the same transfer strategy.
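
As a cross-check, the reported bandwidths can be recomputed from the latencies. The short Python sketch below assumes that "512MB" means 512 MiB and that GB/s means 10^9 bytes per second; under those assumptions the arithmetic reproduces the figures above.

```python
MIB = 1024 ** 2
SIZE_BYTES = 512 * MIB  # assumption: "512MB" in the tables means 512 MiB


def bandwidth_gb_s(latency_ms, size_bytes=SIZE_BYTES):
    """Bandwidth in GB/s (10**9 bytes per second) for one transfer."""
    return size_bytes / (latency_ms / 1000.0) / 1e9


# Latencies in milliseconds, copied from the tables above.
latencies_ms = {
    "put_tensor, local memcpy": 73.01,
    "put_from, local memcpy": 38.23,
    "put_tensor, MC_STORE_MEMCPY=1": 76.90,
    "put_from, MC_STORE_MEMCPY=1": 74.70,
}
for name, ms in latencies_ms.items():
    print(f"{name}: {bandwidth_gb_s(ms):.2f} GB/s")

before_ms, after_ms = 112.0, 73.01
print(f"speedup {before_ms / after_ms:.2f}x, saved {before_ms - after_ms:.0f} ms")
# last line prints: speedup 1.53x, saved 39 ms
```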

Pinned memory benefit

GPU→CPU local write, pinned enabled vs disabled:

| Path | Size | Pinned OFF | Pinned ON | Speedup |
|---|---|---|---|---|
| GPU `put_tensor` | 64MB | 8.26 GB/s | 25.10 GB/s | 3.04x |
| GPU `put_tensor` | 256MB | 9.32 GB/s | 25.76 GB/s | 2.76x |
| GPU `put_tensor` | 512MB | 9.03 GB/s | 24.84 GB/s | 2.75x |
| GPU `put_from` | 64MB | 8.11 GB/s | 25.29 GB/s | 3.12x |
| GPU `put_from` | 256MB | 8.25 GB/s | 25.82 GB/s | 3.13x |
| GPU `put_from` | 512MB | 8.20 GB/s | 25.91 GB/s | 3.16x |

Pinned host memory makes GPU→CPU copies DMA directly into Store buffers instead of going through CUDA’s internal pageable-memory staging path.
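
The speedup column is simply the ratio of the two bandwidth columns. A short Python check, using only the numbers reported in the table above:

```python
# (path, size, pinned OFF GB/s, pinned ON GB/s), copied from the table above
rows = [
    ("put_tensor", "64MB", 8.26, 25.10),
    ("put_tensor", "256MB", 9.32, 25.76),
    ("put_tensor", "512MB", 9.03, 24.84),
    ("put_from", "64MB", 8.11, 25.29),
    ("put_from", "256MB", 8.25, 25.82),
    ("put_from", "512MB", 8.20, 25.91),
]

speedups = [on / off for _, _, off, on in rows]
for (path, size, _, _), speedup in zip(rows, speedups):
    print(f"GPU {path} {size}: {speedup:.2f}x")

print(f"range: {min(speedups):.2f}x to {max(speedups):.2f}x")
# last line prints: range: 2.75x to 3.16x
```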

How Has This Been Tested?

Build commands:

cmake --build build --target store -j$(nproc)

Also built on the GPU/RDMA test machine after syncing this branch:

cmake --build build --target store -j$(nproc)

Test commands:

cmake --build build --target client_integration_test client_tcp_local_memcpy_test pybind_client_test -j$(nproc)
ctest --test-dir build -R 'client_integration_test|client_tcp_local_memcpy_test|pybind_client_test' --output-on-failure

Test results:

Unit tests pass
Integration tests pass
Manual performance testing done on GPU/RDMA test machine

| Test | Result | Coverage |
|---|---|---|
| `client_integration_test` | Passed | `put`, `put_parts`, `upsert`, `upsert_parts`, `upsert_batch`, `batch_upsert_from`, `batch_put_from_multi_buffers` |
| `client_tcp_local_memcpy_test` | Passed | TCP local memcpy behavior |
| `pybind_client_test` | Passed | PyClient write/update paths including `Upsert` and batch upsert cases |

CTest summary:

100% tests passed, 0 tests failed out of 3
Total Test time (real) = 379.76 sec

Performance validation

The performance numbers above were collected with temporary ad-hoc benchmark scripts on the test machine. These scripts are not part of this PR; they were used only to validate this optimization.

The benchmarks covered two comparisons:

copy reduction: compare high-level write API (put_tensor) against direct-slice baseline (put_from) under local memcpy and RDMA+MC_STORE_MEMCPY=1.
pinned host memory: compare GPU→CPU local writes with default pinned memory against MC_STORE_PIN_MEMORY=0.

The test machine used:

| Item | Value |
|---|---|
| GPU | NVIDIA A10 |
| RDMA device | `erdma_0` |
| Host | `192.168.22.70` |
| Master HTTP | `192.168.22.70:8090` |
| Master RPC | `192.168.22.70:50060` |
| Tensor sizes | 64MB, 256MB, 512MB |

Note: on this test machine, PyTorch's CUDA runtime was preloaded for the benchmark process to avoid a CUDA runtime symbol conflict when importing both PyTorch and the Mooncake Python extension:

LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.12

Checklist

I have performed a self-review of my own code
I have formatted the changed code
I have run pre-commit run --all-files and all hooks pass
I have updated the documentation (if applicable)
I have added tests to prove my changes are effective
For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

AI tools were used (specify below)

Claude Opus 4.6 assisted with code generation, code review, benchmark analysis, and PR description drafting. All changes were reviewed and validated by me.

gemini-code-assist

Code Review

This pull request optimizes data transfer and memory staging by deferring staging to the transfer layer, introducing zero-copy batch upsert methods, executing local memory copies inline on the calling thread to eliminate thread synchronization overhead, and pinning memory regions with cudaHostRegister for DMA-backed GPU transfers. The review feedback focuses on further optimizing and safeguarding these changes, including adding a fast-path check for zero-sized copies in MemcpySafe, querying device pointer status once outside the loop in submitMemcpyOperation, reusing the split_into_slices helper to simplify RDMA staging, and guarding memory pinning to avoid registering zero-sized local buffers.

Remove the unconditional alloc + memcpy staging from all 6 Pattern A *_internal functions (put, put_parts, put_batch, upsert, upsert_parts, upsert_batch). Slices are now created directly from source data via split_into_slices(ptr, size), deferring staging to ensureRegisteredForRDMA only when the RDMA path encounters unregistered memory.

Key changes:
- real_client.cpp: 6 *_internal functions skip staging, use direct slices
- transfer_task.cpp: ensureRegisteredForRDMA stages unregistered slices with single contiguous alloc; selectStrategy dispatches LOCAL_MEMCPY (inline, GPU-safe) vs TRANSFER_ENGINE; kMaxSliceSize splitting applied to both registered and staged slices
- gpu_staging_utils.h: MemcpySafe, CopyAuto, IsDevicePointer utilities; async batch copy support (CUDA Driver API / MUSA Runtime API)
- transfer_engine: isLocalMemoryRegistered query for registration check
- client_service: setStagingAllocator to pass allocator to TransferSubmitter
- store_py.cpp: batch_put/upsert_tensor use multi_buffers API to bypass integration-layer staging
- batch_upsert_from_multi_buffers: new API mirroring batch_put variant

Performance: LOCAL_MEMCPY path reduces from 2 copies to 1 for all Pattern A callers. RDMA with registered buffers also saves 1 copy. RDMA with unregistered buffers unchanged (staging moves to submit layer).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pin staging and segment buffers by default so that GPU→host copies use direct DMA instead of CUDA's internal staging through a temporary pinned buffer. This roughly doubles GPU→CPU memcpy bandwidth on PCIe. Opt out with MC_STORE_PIN_MEMORY=0 if pinning conflicts with other GPU workloads or exceeds the locked-page limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove the MC_STORE_ASYNC_STAGING path and its CUDA/MUSA async-copy helpers. Benchmarking showed no stable benefit for GPU→RDMA staging: single put_from was 0.97x-1.01x and small batch cases regressed up to 0.93x, while the synchronous MemcpySafe path remains simpler and stable. Pinned host memory remains enabled by default for GPU builds, which provides the material GPU→CPU gain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio · 2026-06-24T11:57:35Z

@gemini-code-assist please review again

gemini-code-assist

Code Review

This pull request introduces zero-copy optimizations and performance enhancements to the Mooncake store and transfer engine. Key changes include deferring staging to the transfer layer, executing memcpy operations inline on the calling thread to eliminate synchronization overhead, and implementing GPU-safe memory copying and host memory pinning. The review feedback highlights critical issues where modifying the active GPU device via SetDevice without restoring it can pollute the calling thread's state. Additionally, defining static tracking variables inside inline functions in a header file may lead to duplicate instances across shared library boundaries, and there is a minor parameter name mismatch between the declaration and definition of batch_upsert_from_multi_buffers.

codecov-commenter · 2026-06-25T04:25:41Z

Codecov Report

❌ Patch coverage is 31.10465% with 237 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| mooncake-store/src/transfer_task.cpp | 32.72% | 74 Missing ⚠️ |
| mooncake-integration/store/store_py.cpp | 0.00% | 50 Missing ⚠️ |
| mooncake-store/src/real_client.cpp | 38.75% | 49 Missing ⚠️ |
| mooncake-store/include/gpu_staging_utils.h | 39.13% | 42 Missing ⚠️ |
| ...ncake-transfer-engine/src/transfer_engine_impl.cpp | 0.00% | 8 Missing ⚠️ |
| mooncake-store/src/client_service.cpp | 46.15% | 7 Missing ⚠️ |
| mooncake-store/include/pyclient.h | 0.00% | 2 Missing ⚠️ |
| mooncake-store/include/transfer_task.h | 0.00% | 2 Missing ⚠️ |
| mooncake-transfer-engine/src/transfer_engine.cpp | 0.00% | 2 Missing ⚠️ |
| ...ooncake-integration/store/store_py_parallel_read.h | 0.00% | 1 Missing ⚠️ |

alogfans · 2026-06-26T08:17:53Z

+        auto group_ids_error =
+            ValidateGroupIdsForBatchConfig(config, keys.size(), "put");
+        if (!group_ids_error.empty()) return group_ids_error;
+
+        std::vector<int> results(keys.size(), 0);
+        {


Extending batch_write_tensor_impl is better for readability.

alogfans · 2026-06-26T08:20:17Z

We recommend minimizing code modifications as much as possible, or splitting this into multiple PRs or commits.

zxpdemonio requested review from XucSh, YiXR, alogfans, chestnut-Q, doujiang24, stmatengss and ykwd as code owners June 24, 2026 08:05

gemini-code-assist Bot reviewed Jun 24, 2026

Comment thread mooncake-store/include/gpu_staging_utils.h

Comment thread mooncake-store/src/transfer_task.cpp

Comment thread mooncake-store/src/transfer_task.cpp Outdated

Comment thread mooncake-store/src/real_client.cpp Outdated

github-actions Bot added run-ci Store Transfer Engine Integration labels Jun 24, 2026

zxpdemonio and others added 3 commits June 24, 2026 08:09

zxpdemonio changed the title ~~Cruz/optimize put~~ [Store] Optimize put performence by reduce memcpy and pinned memroy Jun 24, 2026

zxpdemonio changed the title ~~[Store] Optimize put performence by reduce memcpy and pinned memroy~~ [Store] Optimize put performence by reduce memcpy and pinned memory Jun 24, 2026

zxpdemonio linked an issue Jun 24, 2026 that may be closed by this pull request

[RFC]: Zero-Copy Read/Write: Eliminate Redundant Data Copies #2479

Open

zxpdemonio added this to the 2479 milestone Jun 24, 2026

[Store] Address zero-copy write review comments

49078b3

Handle zero-sized copies safely and reduce redundant GPU pointer queries in the memcpy submit path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio force-pushed the cruz/optimize_put branch from 9fb1021 to 49078b3 Compare June 24, 2026 08:24

[Store] Add bounded pinned host memory

dc769ca

Cap Store-managed cudaHostRegister usage with MC_STORE_PIN_MEMORY_MAX_BYTES and keep formatting changes targeted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio force-pushed the cruz/optimize_put branch from c808b7c to dc769ca Compare June 24, 2026 09:18

zxpdemonio and others added 4 commits June 24, 2026 09:59

[Store] Tighten zero-copy write safety

c65bed7

Validate full registered ranges before RDMA fast paths and clean up pinned host registrations on teardown/failure paths so zero-copy writes fail closed without leaking resources. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Fix zero-copy write formatting

317d3ac

Apply the clang-format-20 layout expected by CI for the zero-copy write and pinned-memory changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Match CI clang-format output

ade3a43

Format pinned-memory logging with clang-format 20.1.8 to match the CI formatter exactly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Avoid CUDA-only host register flag

555c63a

Use the default host-register flag value directly so CUDA-alike builds such as MUSA do not require a cudaHostRegisterDefault macro mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 24, 2026

Comment thread mooncake-store/include/gpu_staging_utils.h Outdated

Comment thread mooncake-store/src/transfer_task.cpp Outdated

Comment thread mooncake-store/include/gpu_staging_utils.h Outdated

Comment thread mooncake-store/src/real_client.cpp Outdated

[Store] Address GPU device guard review

9ff8daf

Restore the caller's active GPU device after inline GPU copies and move pinned-memory tracking state into a single translation unit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio and others added 3 commits June 24, 2026 12:30

[Store] Tighten GPU copy lifecycle handling

f337b3d

Serialize pinned-memory registration state transitions and restore GPU device state in remaining copy paths so setup and transfer failure paths do not leak caller-visible state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] skip RDMA staging for TCP-only writes

f8c43ff

Avoid requiring a staging allocator for TCP-only transfer-engine writes, while keeping RDMA registration staging for non-TCP transports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Gate staging on RDMA transport

852cf75

Avoid routing CXL writes through RDMA staging by checking for an RDMA transport directly instead of treating every non-TCP path as RDMA. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zxpdemonio and others added 2 commits June 25, 2026 05:33

[Store] Gate staging on target protocol

5f2c69d

Check each target buffer protocol before RDMA staging so CXL replicas do not inherit RDMA-only requirements when RDMA is also installed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Store] Preserve zero local buffer write failure

d24edcd

Keep copy-style put and upsert APIs failing when no client buffer is configured, while leaving explicit external-buffer write paths unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alogfans reviewed Jun 26, 2026

zxpdemonio wants to merge 15 commits into kvcache-ai:main from openanolis:cruz/optimize_put

3 participants
