fix(rdma): auto-chunk MRs larger than device max_mr_size (#2017) by jiejingzhangamd · Pull Request #2644 · kvcache-ai/Mooncake

jiejingzhangamd · 2026-06-27T02:47:35Z

Closes #2017.

Problem

RdmaTransport::registerLocalMemory() registers a single MR for the whole buffer and publishes BufferDesc.length = full length, but RdmaContext::registerMemoryRegionInternal() silently shrinks length to the device max_mr_size (some RoCE NICs cap this at 2 GiB) and registers an MR of only that size. The metadata then advertises bytes past the registered region, so any remote RDMA op whose target address falls past max_mr_size completes with IBV_WC_REM_ACCESS_ERR.

It is address-driven, not size-driven: ops to low addresses succeed; ops past the boundary fail (even small ones). Any registered buffer larger than max_mr_size is affected (e.g. large disaggregated-PD KV pools) — it works for a while, then fails once an op targets a high address.
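The address-driven behaviour described above can be illustrated with a toy model. This is only a sketch, not Mooncake code: `ModelRegistration`, `registerCapped` and `remoteOpAllowed` are hypothetical names, and the model simply assumes the NIC rejects any access that is not fully inside the registered range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model (not Mooncake code): what the metadata advertises versus
// what the MR really covers when the length is silently capped.
struct ModelRegistration {
    uint64_t base;
    uint64_t advertised_len;  // published as BufferDesc.length
    uint64_t registered_len;  // actually covered by the MR
};

// Models the pre-fix behaviour: the full length is advertised, but only
// min(length, max_mr_size) bytes are registered.
inline ModelRegistration registerCapped(uint64_t base, uint64_t length,
                                        uint64_t max_mr_size) {
    assert(max_mr_size > 0);
    return {base, length, std::min(length, max_mr_size)};
}

// True if [addr, addr + len) lies entirely inside the registered MR.
// Written to avoid overflow in addr + len.
inline bool remoteOpAllowed(const ModelRegistration &r, uint64_t addr,
                            uint64_t len) {
    return addr >= r.base && len <= r.registered_len &&
           addr - r.base <= r.registered_len - len;
}
```

In this model a 4 KiB write at the base of a 3 GiB buffer capped at 2 GiB is allowed, while an 8-byte write at offset 2 GiB is rejected even though the metadata claims the address is valid.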

Fix

Mirror what the EFA transport already does: split a buffer larger than max_mr_size into <= max_mr_size chunks, register each as its own MR, and publish one BufferDesc per chunk. The per-context rkey()/lkey() lookups are address-range based (findMemoryRegionContaining), so each chunk resolves to the correct key. Chunk start-addresses are tracked (chunk_map_) so unregisterLocalMemory(base_addr) tears down every chunk; unregister continues past a per-chunk failure (reporting the first error) to avoid leaks; best-effort rollback on partial registration. Pre-touch is decided from the original buffer length, not the capped chunk length. No metadata schema change.
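A minimal sketch of the splitting and the address-range lookup, under stated assumptions: `Chunk`, `splitIntoChunks` and `findChunkContaining` are hypothetical names used for illustration, and the real code paths (`findMemoryRegionContaining`, `chunk_map_`) may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical chunk descriptor: one MR and one BufferDesc per chunk.
struct Chunk {
    uint64_t addr;
    uint64_t len;
};

// Split [base, base + length) into consecutive chunks of at most max_mr_size.
inline std::vector<Chunk> splitIntoChunks(uint64_t base, uint64_t length,
                                          uint64_t max_mr_size) {
    assert(max_mr_size > 0);
    std::vector<Chunk> chunks;
    uint64_t off = 0;
    while (off < length) {
        uint64_t len = std::min(max_mr_size, length - off);
        chunks.push_back({base + off, len});
        off += len;  // never exceeds length, so no overflow
    }
    return chunks;
}

// Address-range lookup: return the chunk that fully contains
// [addr, addr + len), or nullptr if no single chunk does.
inline const Chunk *findChunkContaining(const std::vector<Chunk> &chunks,
                                        uint64_t addr, uint64_t len) {
    for (const Chunk &c : chunks) {
        if (addr >= c.addr && len <= c.len && addr - c.addr <= c.len - len) {
            return &c;
        }
    }
    return nullptr;
}
```

In this model a request that straddles a chunk boundary matches no single chunk; how the real transport handles that case is not stated in the PR text.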

Test

tests/rdma_large_mr_test.cpp (RDMALargeMrTest.WritePastMaxMrSizeBoundary): sets a small MC_MAX_MR_SIZE (64 MiB), registers a larger buffer (256 MiB -> 4 chunks), then issues a loopback RDMA WRITE whose target lands past the boundary. Before: the transfer FAILS with a remote access error. After: it COMPLETES and the bytes match. Runs on any RDMA device (incl. soft-RoCE / loopback) plus a metadata server; built but left out of auto-ctest (needs a device), like the sibling rdma_loopback_test.

Verification (real workload)

A disaggregated prefill/decode KV transfer at high tensor-parallel degree (per-layer KV of 3213176832 bytes, i.e. ~3.2 GB or about 2.99 GiB, which exceeds a 2 GiB max_mr_size): the patched build logged ~488 "Auto-splitting buffer (3213176832 bytes) into 2 chunks" messages and ran the full window with 0 remote access errors across a concurrency sweep; the unpatched build failed mid-run with remote access errors.

gemini-code-assist

Code Review

This pull request addresses issue #2017 by splitting buffers larger than the device's max_mr_size into multiple smaller chunks during registration, preventing silent truncation and subsequent remote access errors. It also introduces a tracking map to ensure all chunks are properly cleaned up during unregistration, and adds a regression test. The review feedback highlights two critical issues: first, a performance regression where parallel pre-touch is disabled because the pre-touch check uses the capped chunk_len instead of the original buffer length; second, a potential resource leak during unregistration if an early chunk failure prevents the remaining chunks from being cleaned up.

gemini-code-assist · 2026-06-27T02:48:56Z

+        bool do_pre_touch = context_list_.size() > 0 &&
+                            std::thread::hardware_concurrency() >= 4 &&
+                            chunk_len >= (size_t)4 * 1024 * 1024 * 1024;


Using chunk_len to determine do_pre_touch introduces a performance regression. Since chunk_len is capped by max_mr_size (which is typically 2 GiB), chunk_len >= 4 GiB will always evaluate to false. This completely disables the parallel pre-touch optimization for large buffers (e.g., an 8 GiB buffer split into 2 GiB chunks). Using the original total length of the buffer instead of chunk_len ensures that large buffers are correctly pre-touched.

        bool do_pre_touch = context_list_.size() > 0 &&
                            std::thread::hardware_concurrency() >= 4 &&
                            length >= (size_t)4 * 1024 * 1024 * 1024;
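The effect of the two choices can be checked with a small standalone sketch. The names `shouldPreTouch`, `cappedChunkLen` and `kPreTouchThreshold` are hypothetical; only the three conditions themselves come from the diff above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint64_t kPreTouchThreshold = 4ull * 1024 * 1024 * 1024;  // 4 GiB

// Length of the first chunk once the buffer is capped at max_mr_size.
inline uint64_t cappedChunkLen(uint64_t length, uint64_t max_mr_size) {
    assert(max_mr_size > 0);
    return std::min(length, max_mr_size);
}

// Same three conditions as the diff, with the length passed in explicitly so
// that the caller decides whether it is the original or the capped length.
inline bool shouldPreTouch(size_t context_count, unsigned hw_threads,
                           uint64_t length) {
    return context_count > 0 && hw_threads >= 4 &&
           length >= kPreTouchThreshold;
}
```

With an 8 GiB buffer and a 2 GiB cap, passing the capped chunk length yields false, while passing the original length yields true, which is the regression the review comment describes.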

gemini-code-assist · 2026-06-27T02:48:56Z

+    if (!chunk_addrs.empty()) {
+        for (uint64_t ca : chunk_addrs) {
+            void *cap = reinterpret_cast<void *>(ca);
+            int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata);
+            if (rc) return rc;
+            for (auto &context : context_list_) {
+                int ret = context->unregisterMemoryRegion(cap);
+                if (ret) {
+                    LOG(ERROR) << "Failed to unregister chunk MR at " << cap;
+                    return ret;
+                }
+            }
+        }
+        return 0;
+    }


If unregistration fails for any chunk, returning early prevents the remaining chunks from being unregistered. Since the chunk addresses have already been erased from chunk_map_, those remaining chunks are leaked permanently (both their hardware MRs on the RDMA contexts and their metadata entries). A best-effort cleanup loop should be used to ensure all chunks are unregistered, returning the first encountered error at the end.

    if (!chunk_addrs.empty()) {
        int first_err = 0;
        for (uint64_t ca : chunk_addrs) {
            void *cap = reinterpret_cast<void *>(ca);
            int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata);
            if (rc && !first_err) {
                first_err = rc;
            }
            for (auto &context : context_list_) {
                int ret = context->unregisterMemoryRegion(cap);
                if (ret) {
                    LOG(ERROR) << "Failed to unregister chunk MR at " << cap;
                    if (!first_err) {
                        first_err = ret;
                    }
                }
            }
        }
        return first_err;
    }
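The "continue and report the first error" pattern can be exercised in isolation. The sketch below is an illustration only: `TeardownFn` and `unregisterAllChunks` are hypothetical stand-ins for the metadata removal and per-context MR deregistration calls, not Mooncake APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one teardown step (metadata removal or one
// context's MR deregistration): returns 0 on success, nonzero on failure.
using TeardownFn = std::function<int(uint64_t chunk_addr)>;

// Run every step for every chunk, never returning early, and report the
// first nonzero error code seen (0 if everything succeeded).
inline int unregisterAllChunks(const std::vector<uint64_t> &chunk_addrs,
                               const std::vector<TeardownFn> &steps) {
    int first_err = 0;
    for (uint64_t ca : chunk_addrs) {
        for (const TeardownFn &step : steps) {
            assert(step);  // every step must be callable
            int rc = step(ca);
            if (rc != 0 && first_err == 0) {
                first_err = rc;
            }
        }
    }
    return first_err;
}
```

Even if a middle chunk fails, later chunks are still visited, so a single failure does not leak the MRs and metadata entries that follow it.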

…2017)

RdmaTransport::registerLocalMemory registers a single MR for the whole buffer and publishes BufferDesc.length = full length, but RdmaContext::registerMemoryRegionInternal silently shrinks `length` to the device max_mr_size (some RoCE NICs cap this at 2 GiB) and registers an MR of only that size. The metadata then advertises bytes past the registered region, so any remote RDMA op whose target address falls past max_mr_size completes with IBV_WC_REM_ACCESS_ERR.

It is address-driven, not size-driven: ops to low addresses succeed, ops past the boundary fail (even small ones). Any buffer larger than max_mr_size is affected (e.g. large disaggregated-PD KV pools).

Fix (mirrors what the EFA transport already does): split a buffer larger than max_mr_size into <= max_mr_size chunks, register each as its own MR, and publish one BufferDesc per chunk. The per-context rkey/lkey lookups are address-range based (findMemoryRegionContaining), so each chunk resolves to the correct key. Track chunk start-addresses (chunk_map_) so unregisterLocalMemory(base_addr) tears down every chunk; unregister continues past a per-chunk failure (reporting the first error) to avoid leaks; best-effort rollback on partial registration. Pre-touch decision uses the original buffer length, not the capped chunk length. No metadata schema change.

Adds tests/rdma_large_mr_test.cpp: sets a small MC_MAX_MR_SIZE, registers a buffer larger than it, then does a loopback RDMA WRITE whose target lands past the boundary. Pre-fix the transfer FAILS (remote access error); post-fix it COMPLETES and the bytes match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jiejingzhangamd · 2026-06-27T02:58:41Z

Thanks for the review. Both points are addressed in the latest push:

Pre-touch regression — do_pre_touch now keys off the original buffer length (not the capped chunk_len), so the >= 4 GiB check fires as before and parallel pre-touch is preserved.
Unregister leak — the per-chunk unregister loop no longer returns early on a failure; it unregisters every chunk and reports the first error, so no chunk MR / metadata is leaked.

Also rebased onto latest main and clang-formatted.

codecov-commenter · 2026-06-27T03:32:01Z

Codecov Report

❌ Patch coverage is 0% with 162 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ne/src/transport/rdma_transport/rdma_transport.cpp	0.00%	116 Missing ⚠️
...ncake-transfer-engine/tests/rdma_large_mr_test.cpp	0.00%	46 Missing ⚠️

alogfans · 2026-06-28T05:12:30Z

-        for (auto &thread : reg_threads) {
-            thread.join();
+    // Best-effort rollback of already-registered chunks [0, up_to_ci].
+    auto rollbackChunks = [&](size_t up_to_ci) {


If the rollback fails, log a warning.
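One way to read this suggestion is sketched below. It is an assumption-laden illustration: the free function `rollbackChunks` and the `unregister_chunk` callback are hypothetical, the inclusive range [0, up_to_ci] follows the comment in the diff, and `std::cerr` stands in for the project's logging macro to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Best-effort rollback of chunks [0, up_to_ci] (inclusive). Each failure is
// logged as a warning and the loop continues; returns the number of failures.
inline size_t rollbackChunks(
    const std::vector<uint64_t> &chunk_addrs, size_t up_to_ci,
    const std::function<int(uint64_t)> &unregister_chunk) {
    assert(up_to_ci < chunk_addrs.size());
    size_t failures = 0;
    for (size_t ci = 0; ci <= up_to_ci; ++ci) {
        int rc = unregister_chunk(chunk_addrs[ci]);
        if (rc != 0) {
            std::cerr << "WARNING: rollback failed for chunk " << ci
                      << " at 0x" << std::hex << chunk_addrs[ci] << std::dec
                      << " (rc=" << rc << ")\n";
            ++failures;
        }
    }
    return failures;
}
```

Returning the failure count lets the caller decide whether a partial rollback deserves a stronger log level, without aborting the cleanup itself.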

jiejingzhangamd requested review from alogfans, chestnut-Q and doujiang24 as code owners June 27, 2026 02:47

github-actions Bot added the run-ci and Transfer Engine labels Jun 27, 2026

gemini-code-assist Bot reviewed Jun 27, 2026

jiejingzhangamd force-pushed the fix/rdma-max-mr-size-chunking-2017 branch from ce488fd to de46160 Compare June 27, 2026 02:55

jiejingzhangamd force-pushed the fix/rdma-max-mr-size-chunking-2017 branch from de46160 to 32e2f2f Compare June 27, 2026 02:57

alogfans reviewed Jun 28, 2026

jiejingzhangamd wants to merge 1 commit into kvcache-ai:main from jiejingzhangamd:fix/rdma-max-mr-size-chunking-2017
