Endpoint MR for RDMA protocol #1056
Conversation
rauteric left a comment
Thanks Ryan, made a first pass through the PR.
Thanks for looking at this @rauteric. I have a big pile of work to do separate from this, so I will get back to it later. Really appreciate the comments.
Force-pushed from 92d1e19 to e8fcfe9
bot:aws:retest
Force-pushed from e8fcfe9 to 8d9d37c
bwbarrett left a comment
A couple of high-level questions (let's talk before you change code):
- The number of communicators is relatively bounded. It seems like the code is much more complex because we're trying to reuse buffers at the domain level. Does the code get simpler if we just allocate a buffer per communicator in MR_ENDPOINT mode?
- Can you help me understand why we need to split the rail binding behavior? I'm not a huge fan of using templates in that code in general, but I'm also not clear on why we need to do all that work.
Thanks for looking at this.
Per-communicator buffers in FI_MR_ENDPOINT mode would simplify this considerably. This would be much cleaner, and the resources used are small.
The templates let us choose the data rails or the control rails at registration time: control buffers are registered only on control rails and data buffers only on data rails. We could instead simply register on both data and control rails. We may have a hardware limit on the number of registrations, but it may be high enough that we never hit it.
An additional thought: binding and enabling uniformly across control and data rails would simplify the MR caching logic as well. A skeleton of that logic was introduced in commit 0b8bc0f; with MRs registered generically, the cache logic that tracks their type would become unnecessary.
Then let's make this the Plan of Record. I eventually want to rewrite the flush code anyway to better handle Neuron and AMD semantics, and I think just having it associated with endpoints is going to be considerably simpler.
Let's start with this. We can always iterate more later.
Force-pushed from 806788e to 890843d
Are there any clues as to why the CI is failing?
Force-pushed from 890843d to 9af5eec
Force-pushed from 0c679d3 to 7ae5af9
Force-pushed from 3f52284 to 317d85b
bwbarrett left a comment
I really dislike this approach; we're adding a ton of complexity because we reused MR data structures between control and data, and there's no reason to have done that. That said, I think asking you to fix that issue isn't fair, and this solves your immediate problem.
Force-pushed from cc034cd to 3e05e0e
I can take a shot at fixing this, ideally on top of this PR.
Definitely OK with waiting for another PR for the cleanup. I think your pushing updates caused my approval to disappear. Less OK with the merge commit you just pushed :)
Force-pushed from 68c1ea9 to 5e31f4a
bot:aws:retest
Force-pushed from 5c2f753 to e7be64f
In preparation for FI_MR_ENDPOINT support, replace the monolithic mr[]
array with three distinct fields on the MR handle:
- mr_data[0..num_rails-1]: owned fid_mr objects for data rails
- mr_ctrl_owned[0..num_ctrl_rails-1]: owned fid_mr objects for ctrl rails (unpopulated until FI_MR_ENDPOINT support is enabled)
- mr_ctrl[0..num_ctrl_rails-1]: non-owning view; always populated after registration
In non-endpoint-MR mode (the only mode at this stage), mr_ctrl[] is
aliased entry-for-entry to mr_data[], so IO-path callers can use
mr_ctrl[rail_id] without a mode check.
Update all read sites to their final form:
- data path (post_rdma_write, post_rdma_eager_send, post_eager_copy,
alloc_rdma_read_req, write, post_flush_req): mr_data[rail_id]
- ctrl path (send_ctrl_post, post_rdma_ctrl): mr_ctrl[rail_id]
- prepare_send_connect_message: mr_ctrl[rail_id] over num_control_rails
- post_rx_buffer: mr_data[rail_id] (updated in a later commit)
- get_mr_key: mr_data[0]
No behaviour change: mr_ctrl[] aliases mr_data[] in non-endpoint-MR mode.
Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
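For illustration, a minimal sketch of the reshaped MR handle this commit describes. The three field names and the aliasing behavior come from the commit message; the deleter, the helper, and the surrounding structure are assumptions:

```cpp
#include <memory>
#include <vector>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

// Close the libfabric MR when the owning handle is destroyed.
struct fid_mr_deleter {
	void operator()(struct fid_mr *mr) const
	{
		if (mr != nullptr)
			fi_close(&mr->fid);
	}
};
using owned_mr = std::unique_ptr<struct fid_mr, fid_mr_deleter>;

struct rdma_mr_handle {
	std::vector<owned_mr> mr_data;        /* one owned MR per data rail */
	std::vector<owned_mr> mr_ctrl_owned;  /* per ctrl rail; stays empty
	                                         until FI_MR_ENDPOINT lands */
	std::vector<struct fid_mr *> mr_ctrl; /* non-owning view, always set */

	/* In non-endpoint-MR mode, alias mr_ctrl[] entry-for-entry to
	 * mr_data[], so IO-path callers can use mr_ctrl[rail_id] without
	 * a mode check. */
	void alias_ctrl_to_data(size_t num_ctrl_rails)
	{
		mr_ctrl.resize(num_ctrl_rails);
		for (size_t i = 0; i < num_ctrl_rails; ++i)
			mr_ctrl[i] = mr_data[i].get();
	}
};
```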
Add an nccl_net_ofi_rdma_ep_t *ep parameter to: reg_mr_on_device(), reg_mr(), reg_internal_mr(), reg_internal_mr_dma_buf().

At all call sites the argument is `endpoint_mr ? ep : nullptr`, so fi_mr_bind is only invoked when FI_MR_ENDPOINT is actually negotiated:
- send_comm::regMr / recv_comm::regMr: endpoint_mr ? endpoint : nullptr
- create_send_comm ctrl_mailbox reg: endpoint_mr ? this : nullptr
- prepare_recv_comm ctrl_mailbox reg: endpoint_mr ? l_comm_ep : nullptr
- alloc_and_reg_flush_buff: nullptr (domain-owned)
- freelist_regmr_host_fn: nullptr (no ep binding yet)

The ep pointer is not yet used inside reg_mr_on_device; the bind/enable logic is added in the next commit.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
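A hypothetical sketch of the call-site guard; the signatures here are invented and much simpler than the plugin's, only the ternary pattern mirrors the commit text:

```cpp
#include <cstddef>

struct nccl_net_ofi_rdma_ep_t;  /* assumed forward declaration */
extern bool endpoint_mr;        /* true iff FI_MR_ENDPOINT was negotiated */

int reg_mr(void *data, size_t size, int type,
           nccl_net_ofi_rdma_ep_t *ep); /* ep == nullptr => no fi_mr_bind */

int reg_mr_for_comm(nccl_net_ofi_rdma_ep_t *endpoint,
                    void *data, size_t size, int type)
{
	/* Pass the endpoint through only when FI_MR_ENDPOINT is active, so
	 * domain-scoped-MR providers never see a bind attempt. */
	return reg_mr(data, size, type, endpoint_mr ? endpoint : nullptr);
}
```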
When ep != nullptr, reg_mr_on_device now performs the fi_mr_bind +
fi_mr_enable steps required by FI_MR_ENDPOINT providers:
- Data rails: each fid_mr is bound to the data ep rail and enabled.
- Ctrl rails: a separate fid_mr is registered per ctrl ep rail, bound,
and enabled; owned handles are stored in mr_ctrl_owned[] and
mr_ctrl[] is set to point into them.
A new private helper mr_bind_and_enable() encapsulates the bind+enable
sequence with uniform diagnostic messages.
When ep == nullptr (non-endpoint-MR mode) the function behaves exactly
as before: a single domain-level fid_mr per rail, no endpoint binding.
Also:
- Advertise FI_MR_ENDPOINT in the libfabric hints so providers that
require it can negotiate the mode bit.
- Remove the early-exit guard in nccl_net_ofi_rdma_init() that
rejected endpoint_mr configurations with -ENOTSUP.
Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
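fi_mr_bind and fi_mr_enable are standard libfabric calls; the helper below is a minimal sketch of the bind+enable sequence this commit describes, with the name taken from the commit message and the diagnostics simplified:

```cpp
#include <cstdio>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int mr_bind_and_enable(struct fid_mr *mr, struct fid_ep *ep,
                              const char *rail_kind, int rail_id)
{
	/* FI_MR_ENDPOINT providers require the MR to be bound to the
	 * endpoint it will be used with, then explicitly enabled. */
	int ret = fi_mr_bind(mr, &ep->fid, 0);
	if (ret != 0) {
		fprintf(stderr, "fi_mr_bind failed on %s rail %d: %s\n",
		        rail_kind, rail_id, fi_strerror(-ret));
		return ret;
	}

	ret = fi_mr_enable(mr);
	if (ret != 0)
		fprintf(stderr, "fi_mr_enable failed on %s rail %d: %s\n",
		        rail_kind, rail_id, fi_strerror(-ret));
	return ret;
}
```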
Each endpoint now allocates and registers its own flush buffer (GPU or host pinned memory) unconditionally, regardless of the endpoint_mr mode. The allocation and registration are performed in init_rx_buffers() via ep_t::alloc_and_reg_flush_buff(), which is called for every endpoint on both EFA (domain-scoped MR) and CXI (endpoint-scoped MR) paths.

fini_rx_buffers() deregisters and frees the per-endpoint flush buffer allocation unconditionally whenever flush_buff_mr_handle is non-null, covering both EFA and CXI endpoints.

The flush buffer and its MR handle are removed from nccl_net_ofi_rdma_domain_t and are now owned entirely by nccl_net_ofi_rdma_ep_t. The domain constructor and destructor no longer call alloc_and_reg_flush_buff / dealloc_and_dereg_flush_buff.

In CXI (endpoint_mr) mode, the destructor calls fini_rx_buffers() before release_rdma_ep_resources() to satisfy the libfabric requirement that endpoint-bound MRs are closed before the endpoint.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
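A sketch of the teardown-order constraint; the method names come from the commit message, while the class body and trivial implementations are invented:

```cpp
struct nccl_net_ofi_rdma_ep_t {
	int fini_rx_buffers()
	{
		/* deregister rx buffers and the flush buffer MR,
		 * closing any endpoint-bound MRs */
		return 0;
	}

	void release_rdma_ep_resources()
	{
		/* fi_close() the per-rail libfabric endpoints */
	}

	~nccl_net_ofi_rdma_ep_t()
	{
		/* libfabric requires MRs bound to an endpoint to be closed
		 * before that endpoint, so deregister the buffers first. */
		fini_rx_buffers();
		release_rdma_ep_resources();
	}
};
```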
Add freelist_regmr_ep_ctx_t to carry domain+ep through the freelist MR-registration callback, replacing the old domain_void_ptr opaque. freelist_regmr_host_fn accepts the new ctx and forwards ep to reg_internal_mr.

Add is_ctrl to ep_rail_t so post_rx_buffer can select mr_ctrl[] vs mr_data[] without a side-channel. Control rails are marked is_ctrl=true and data rails is_ctrl=false in init_rx_buffers().

Add rx_buff_regmr_ctx to ep_t (freed in fini_rx_buffers) and comm_buff_regmr_ctx to recv_comm (freed in recv_comm_destroy). init_rx_buffers() and prepare_recv_comm() allocate the respective ctx objects and pass them to the ctrl/eager/flush freelists.

post_rx_buffer now uses rail->is_ctrl to pick the correct MR handle: ctrl rails dereference mr_ctrl[rid] (a raw pointer already bound to the ctrl endpoint in FI_MR_ENDPOINT mode); data rails dereference mr_data[rid].get() (the owned unique_ptr).

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
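Hypothetical shapes for the freelist context and the rail-based MR selection; the names freelist_regmr_ep_ctx_t, is_ctrl, mr_ctrl, and mr_data come from the commit message, and the MR-handle layout reuses the earlier sketch. Everything else is invented:

```cpp
#include <memory>
#include <vector>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct nccl_net_ofi_rdma_domain_t;  /* assumed forward declarations */
struct nccl_net_ofi_rdma_ep_t;

/* Carries domain+ep through the freelist MR-registration callback,
 * replacing the old domain_void_ptr opaque. */
struct freelist_regmr_ep_ctx_t {
	nccl_net_ofi_rdma_domain_t *domain;
	nccl_net_ofi_rdma_ep_t *ep;  /* forwarded to reg_internal_mr */
};

struct ep_rail_t {
	bool is_ctrl;  /* ctrl rails true, data rails false;
	                  set in init_rx_buffers() */
};

struct fid_mr_deleter {
	void operator()(struct fid_mr *mr) const
	{
		if (mr != nullptr)
			fi_close(&mr->fid);
	}
};

struct rdma_mr_handle {  /* layout as in the earlier sketch */
	std::vector<std::unique_ptr<struct fid_mr, fid_mr_deleter>> mr_data;
	std::vector<struct fid_mr *> mr_ctrl;
};

/* post_rx_buffer picks the MR handle by rail kind, with no side-channel: */
static struct fid_mr *select_rx_mr(rdma_mr_handle *h, ep_rail_t *rail, int rid)
{
	if (rail->is_ctrl)
		return h->mr_ctrl[rid];  /* raw pointer, already bound to the
		                            ctrl endpoint in FI_MR_ENDPOINT mode */
	return h->mr_data[rid].get();    /* the owned unique_ptr */
}
```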
Force-pushed from e7be64f to 10330e4
FI_MR_ENDPOINT Implementation for RDMA protocol
In the AWS OFI NCCL plugin, an implementation of FI_MR_ENDPOINT mode is required for certain providers. This mode necessitates calls to fi_mr_bind and fi_mr_enable on the MR to associate it with the endpoint it will be used with. For the RDMA protocol, that association also affects the MR cache. In this series of commits, MRs are mapped to the appropriate rail endpoints depending on whether those MRs will be used for control or data transfer.
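To make the mode bit concrete, here is a minimal sketch of negotiating FI_MR_ENDPOINT through the libfabric hints. fi_allocinfo, fi_getinfo, fi_freeinfo, and the mr_mode bits are standard libfabric API; the API version, the extra mode bits, and the caps are illustrative:

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

// Returns true if the provider selected FI_MR_ENDPOINT; error handling
// trimmed for brevity.
bool negotiate_endpoint_mr(struct fi_info **info_out)
{
	struct fi_info *hints = fi_allocinfo();
	/* mr_mode in hints lists the MR modes the application can support;
	 * the provider clears the bits it does not require. */
	hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR |
	                              FI_MR_ALLOCATED | FI_MR_PROV_KEY |
	                              FI_MR_ENDPOINT;
	hints->caps = FI_MSG | FI_RMA;

	struct fi_info *info = nullptr;
	if (fi_getinfo(FI_VERSION(1, 18), nullptr, nullptr, 0,
	               hints, &info) != 0) {
		fi_freeinfo(hints);
		return false;
	}
	fi_freeinfo(hints);

	*info_out = info;
	/* If the bit survives negotiation, every MR must later be bound to
	 * its endpoint via fi_mr_bind() and enabled via fi_mr_enable(). */
	return (info->domain_attr->mr_mode & FI_MR_ENDPOINT) != 0;
}
```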
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.