fix(rdma): auto-chunk MRs larger than device max_mr_size (#2017)#2644
fix(rdma): auto-chunk MRs larger than device max_mr_size (#2017)#2644jiejingzhangamd wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses issue #2017 by splitting buffers larger than the device's max_mr_size into multiple smaller chunks during registration, preventing silent truncation and subsequent remote access errors. It also introduces a tracking map to ensure all chunks are properly cleaned up during unregistration, and adds a regression test. The review feedback highlights two critical issues: first, a performance regression where parallel pre-touch is disabled because the pre-touch check uses the capped chunk_len instead of the original buffer length; second, a potential resource leak during unregistration if an early chunk failure prevents the remaining chunks from being cleaned up.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| bool do_pre_touch = context_list_.size() > 0 && | ||
| std::thread::hardware_concurrency() >= 4 && | ||
| chunk_len >= (size_t)4 * 1024 * 1024 * 1024; |
There was a problem hiding this comment.
Using chunk_len to determine do_pre_touch introduces a performance regression. Since chunk_len is capped by max_mr_size (which is typically 2 GiB), chunk_len >= 4 GiB will always evaluate to false. This completely disables the parallel pre-touch optimization for large buffers (e.g., an 8 GiB buffer split into 2 GiB chunks). Using the original total length of the buffer instead of chunk_len ensures that large buffers are correctly pre-touched.
bool do_pre_touch = context_list_.size() > 0 &&
std::thread::hardware_concurrency() >= 4 &&
length >= (size_t)4 * 1024 * 1024 * 1024;| if (!chunk_addrs.empty()) { | ||
| for (uint64_t ca : chunk_addrs) { | ||
| void *cap = reinterpret_cast<void *>(ca); | ||
| int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata); | ||
| if (rc) return rc; | ||
| for (auto &context : context_list_) { | ||
| int ret = context->unregisterMemoryRegion(cap); | ||
| if (ret) { | ||
| LOG(ERROR) << "Failed to unregister chunk MR at " << cap; | ||
| return ret; | ||
| } | ||
| } | ||
| } | ||
| return 0; | ||
| } |
There was a problem hiding this comment.
If unregistration fails for any chunk, returning early prevents the remaining chunks from being unregistered. Since the chunk addresses have already been erased from chunk_map_, those remaining chunks are leaked permanently (both their hardware MRs on the RDMA contexts and their metadata entries). A best-effort cleanup loop should be used to ensure all chunks are unregistered, returning the first encountered error at the end.
if (!chunk_addrs.empty()) {
int first_err = 0;
for (uint64_t ca : chunk_addrs) {
void *cap = reinterpret_cast<void *>(ca);
int rc = metadata_->removeLocalMemoryBuffer(cap, update_metadata);
if (rc && !first_err) {
first_err = rc;
}
for (auto &context : context_list_) {
int ret = context->unregisterMemoryRegion(cap);
if (ret) {
LOG(ERROR) << "Failed to unregister chunk MR at " << cap;
if (!first_err) {
first_err = ret;
}
}
}
}
return first_err;
}ce488fd to
de46160
Compare
…2017) RdmaTransport::registerLocalMemory registers a single MR for the whole buffer and publishes BufferDesc.length = full length, but RdmaContext::registerMemoryRegionInternal silently shrinks `length` to the device max_mr_size (some RoCE NICs cap this at 2 GiB) and registers an MR of only that size. The metadata then advertises bytes past the registered region, so any remote RDMA op whose target address falls past max_mr_size completes with IBV_WC_REM_ACCESS_ERR. It is address-driven, not size-driven: ops to low addresses succeed, ops past the boundary fail (even small ones). Any buffer larger than max_mr_size is affected (e.g. large disaggregated-PD KV pools). Fix (mirrors what the EFA transport already does): split a buffer larger than max_mr_size into <= max_mr_size chunks, register each as its own MR, and publish one BufferDesc per chunk. The per-context rkey/lkey lookups are address-range based (findMemoryRegionContaining), so each chunk resolves to the correct key. Track chunk start-addresses (chunk_map_) so unregisterLocalMemory(base_addr) tears down every chunk; unregister continues past a per-chunk failure (reporting the first error) to avoid leaks; best-effort rollback on partial registration. Pre-touch decision uses the original buffer length, not the capped chunk length. No metadata schema change. Adds tests/rdma_large_mr_test.cpp: sets a small MC_MAX_MR_SIZE, registers a buffer larger than it, then does a loopback RDMA WRITE whose target lands past the boundary. Pre-fix the transfer FAILS (remote access error); post-fix it COMPLETES and the bytes match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
de46160 to
32e2f2f
Compare
|
Thanks for the review. Both points are addressed in the latest push:
Also rebased onto latest main and clang-formatted. |
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| for (auto &thread : reg_threads) { | ||
| thread.join(); | ||
| // Best-effort rollback of already-registered chunks [0, up_to_ci]. | ||
| auto rollbackChunks = [&](size_t up_to_ci) { |
There was a problem hiding this comment.
If failed, print a warning log.
Closes #2017.
Problem
RdmaTransport::registerLocalMemory()registers a single MR for the whole buffer and publishesBufferDesc.length = full length, butRdmaContext::registerMemoryRegionInternal()silently shrinkslengthto the devicemax_mr_size(some RoCE NICs cap this at 2 GiB) and registers an MR of only that size. The metadata then advertises bytes past the registered region, so any remote RDMA op whose target address falls pastmax_mr_sizecompletes withIBV_WC_REM_ACCESS_ERR.It is address-driven, not size-driven: ops to low addresses succeed; ops past the boundary fail (even small ones). Any registered buffer larger than
max_mr_sizeis affected (e.g. large disaggregated-PD KV pools) — it works for a while, then fails once an op targets a high address.Fix
Mirror what the EFA transport already does: split a buffer larger than
max_mr_sizeinto<= max_mr_sizechunks, register each as its own MR, and publish oneBufferDescper chunk. The per-contextrkey()/lkey()lookups are address-range based (findMemoryRegionContaining), so each chunk resolves to the correct key. Chunk start-addresses are tracked (chunk_map_) sounregisterLocalMemory(base_addr)tears down every chunk; unregister continues past a per-chunk failure (reporting the first error) to avoid leaks; best-effort rollback on partial registration. Pre-touch is decided from the original buffer length, not the capped chunk length. No metadata schema change.Test
tests/rdma_large_mr_test.cpp(RDMALargeMrTest.WritePastMaxMrSizeBoundary): sets a smallMC_MAX_MR_SIZE(64 MiB), registers a larger buffer (256 MiB -> 4 chunks), then issues a loopback RDMA WRITE whose target lands past the boundary. Before: the transfer FAILS with a remote access error. After: it COMPLETES and the bytes match. Runs on any RDMA device (incl. soft-RoCE / loopback) plus a metadata server; built but left out of auto-ctest (needs a device), like the siblingrdma_loopback_test.Verification (real workload)
A disaggregated prefill/decode KV transfer at high tensor-parallel degree (per-layer KV ~3.2 GiB > a 2 GiB
max_mr_size): the patched build logged ~488Auto-splitting buffer (3213176832 bytes) into 2 chunksand ran the full window with 0 remote-access-errors across a concurrency sweep; unpatched it failed mid-run with remote access errors.