Skip to content

fix(rdma): auto-chunk MRs larger than device max_mr_size (#2017)#2644

Open
jiejingzhangamd wants to merge 1 commit into
kvcache-ai:mainfrom
jiejingzhangamd:fix/rdma-max-mr-size-chunking-2017
Open

fix(rdma): auto-chunk MRs larger than device max_mr_size (#2017)#2644
jiejingzhangamd wants to merge 1 commit into
kvcache-ai:mainfrom
jiejingzhangamd:fix/rdma-max-mr-size-chunking-2017

Conversation

@jiejingzhangamd

@jiejingzhangamd jiejingzhangamd commented Jun 27, 2026

Copy link
Copy Markdown

Closes #2017.

Problem

RdmaTransport::registerLocalMemory() registers a single MR for the whole buffer and publishes BufferDesc.length = full length, but RdmaContext::registerMemoryRegionInternal() silently shrinks length to the device max_mr_size (some RoCE NICs cap this at 2 GiB) and registers an MR of only that size. The metadata then advertises bytes past the registered region, so any remote RDMA op whose target address falls past max_mr_size completes with IBV_WC_REM_ACCESS_ERR.

It is address-driven, not size-driven: ops to low addresses succeed; ops past the boundary fail (even small ones). Any registered buffer larger than max_mr_size is affected (e.g. large disaggregated-PD KV pools) — it works for a while, then fails once an op targets a high address.

Fix

Mirror what the EFA transport already does: split a buffer larger than max_mr_size into <= max_mr_size chunks, register each as its own MR, and publish one BufferDesc per chunk. The per-context rkey()/lkey() lookups are address-range based (findMemoryRegionContaining), so each chunk resolves to the correct key. Chunk start-addresses are tracked (chunk_map_) so unregisterLocalMemory(base_addr) tears down every chunk; unregister continues past a per-chunk failure (reporting the first error) to avoid leaks; best-effort rollback on partial registration. Pre-touch is decided from the original buffer length, not the capped chunk length. No metadata schema change.

Test

tests/rdma_large_mr_test.cpp (RDMALargeMrTest.WritePastMaxMrSizeBoundary): sets a small MC_MAX_MR_SIZE (64 MiB), registers a larger buffer (256 MiB -> 4 chunks), then issues a loopback RDMA WRITE whose target lands past the boundary. Before: the transfer FAILS with a remote access error. After: it COMPLETES and the bytes match. Runs on any RDMA device (incl. soft-RoCE / loopback) plus a metadata server; built but left out of auto-ctest (needs a device), like the sibling rdma_loopback_test.

Verification (real workload)

A disaggregated prefill/decode KV transfer at high tensor-parallel degree (per-layer KV ~3.2 GiB > a 2 GiB max_mr_size): the patched build logged ~488 Auto-splitting buffer (3213176832 bytes) into 2 chunks and ran the full window with 0 remote-access-errors across a concurrency sweep; unpatched it failed mid-run with remote access errors.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses issue #2017 by splitting buffers larger than the device's max_mr_size into multiple smaller chunks during registration, preventing silent truncation and subsequent remote access errors. It also introduces a tracking map to ensure all chunks are properly cleaned up during unregistration, and adds a regression test. The review feedback highlights two critical issues: first, a performance regression where parallel pre-touch is disabled because the pre-touch check uses the capped chunk_len instead of the original buffer length; second, a potential resource leak during unregistration if an early chunk failure prevents the remaining chunks from being cleaned up.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +249 to +251
bool do_pre_touch = context_list_.size() > 0 &&
std::thread::hardware_concurrency() >= 4 &&
chunk_len >= (size_t)4 * 1024 * 1024 * 1024;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using chunk_len to determine do_pre_touch introduces a performance regression. Since chunk_len is capped by max_mr_size (which is typically 2 GiB), chunk_len >= 4 GiB will always evaluate to false. This completely disables the parallel pre-touch optimization for large buffers (e.g., an 8 GiB buffer split into 2 GiB chunks). Using the original total length of the buffer instead of chunk_len ensures that large buffers are correctly pre-touched.

        bool do_pre_touch = context_list_.size() > 0 &&
                            std::thread::hardware_concurrency() >= 4 &&
                            length >= (size_t)4 * 1024 * 1024 * 1024;

Comment on lines +377 to +391
if (!chunk_addrs.empty()) {
for (uint64_t ca : chunk_addrs) {
void *cap = reinterpret_cast<void *>(ca);
int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata);
if (rc) return rc;
for (auto &context : context_list_) {
int ret = context->unregisterMemoryRegion(cap);
if (ret) {
LOG(ERROR) << "Failed to unregister chunk MR at " << cap;
return ret;
}
}
}
return 0;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If unregistration fails for any chunk, returning early prevents the remaining chunks from being unregistered. Since the chunk addresses have already been erased from chunk_map_, those remaining chunks are leaked permanently (both their hardware MRs on the RDMA contexts and their metadata entries). A best-effort cleanup loop should be used to ensure all chunks are unregistered, returning the first encountered error at the end.

    if (!chunk_addrs.empty()) {
        int first_err = 0;
        for (uint64_t ca : chunk_addrs) {
            void *cap = reinterpret_cast<void *>(ca);
            int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata);
            if (rc && !first_err) {
                first_err = rc;
            }
            for (auto &context : context_list_) {
                int ret = context->unregisterMemoryRegion(cap);
                if (ret) {
                    LOG(ERROR) << "Failed to unregister chunk MR at " << cap;
                    if (!first_err) {
                        first_err = ret;
                    }
                }
            }
        }
        return first_err;
    }

@jiejingzhangamd jiejingzhangamd force-pushed the fix/rdma-max-mr-size-chunking-2017 branch from ce488fd to de46160 Compare June 27, 2026 02:55
…2017)

RdmaTransport::registerLocalMemory registers a single MR for the whole buffer and
publishes BufferDesc.length = full length, but RdmaContext::registerMemoryRegionInternal
silently shrinks `length` to the device max_mr_size (some RoCE NICs cap this at 2 GiB)
and registers an MR of only that size. The metadata then advertises bytes past the
registered region, so any remote RDMA op whose target address falls past max_mr_size
completes with IBV_WC_REM_ACCESS_ERR. It is address-driven, not size-driven: ops to low
addresses succeed, ops past the boundary fail (even small ones). Any buffer larger than
max_mr_size is affected (e.g. large disaggregated-PD KV pools).

Fix (mirrors what the EFA transport already does): split a buffer larger than
max_mr_size into <= max_mr_size chunks, register each as its own MR, and publish one
BufferDesc per chunk. The per-context rkey/lkey lookups are address-range based
(findMemoryRegionContaining), so each chunk resolves to the correct key. Track chunk
start-addresses (chunk_map_) so unregisterLocalMemory(base_addr) tears down every chunk;
unregister continues past a per-chunk failure (reporting the first error) to avoid leaks;
best-effort rollback on partial registration. Pre-touch decision uses the original buffer
length, not the capped chunk length. No metadata schema change.

Adds tests/rdma_large_mr_test.cpp: sets a small MC_MAX_MR_SIZE, registers a buffer larger
than it, then does a loopback RDMA WRITE whose target lands past the boundary. Pre-fix the
transfer FAILS (remote access error); post-fix it COMPLETES and the bytes match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jiejingzhangamd jiejingzhangamd force-pushed the fix/rdma-max-mr-size-chunking-2017 branch from de46160 to 32e2f2f Compare June 27, 2026 02:57
@jiejingzhangamd

Copy link
Copy Markdown
Author

Thanks for the review. Both points are addressed in the latest push:

  1. Pre-touch regressiondo_pre_touch now keys off the original buffer length (not the capped chunk_len), so the >= 4 GiB check fires as before and parallel pre-touch is preserved.
  2. Unregister leak — the per-chunk unregister loop no longer returns early on a failure; it unregisters every chunk and reports the first error, so no chunk MR / metadata is leaked.

Also rebased onto latest main and clang-formatted.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 162 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ne/src/transport/rdma_transport/rdma_transport.cpp 0.00% 116 Missing ⚠️
...ncake-transfer-engine/tests/rdma_large_mr_test.cpp 0.00% 46 Missing ⚠️

📢 Thoughts on this report? Let us know!

for (auto &thread : reg_threads) {
thread.join();
// Best-effort rollback of already-registered chunks [0, up_to_ci].
auto rollbackChunks = [&](size_t up_to_ci) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If failed, print a warning log.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ibv_reg_mr is silently truncated to max_mr_size, but BufferDesc.length advertises the original size → IBV_WC_REM_ACCESS_ERR

4 participants