
Endpoint MR for RDMA protocol #1056

Open
ryanhankins wants to merge 5 commits into aws:master from ryanhankins:endpoint_mr_for_rdma

Conversation

@ryanhankins
Contributor

FI_MR_ENDPOINT Implementation for RDMA protocol

In the AWS OFI NCCL plugin, support for the FI_MR_ENDPOINT mode is required for certain providers. This mode necessitates calls to fi_mr_bind and fi_mr_enable on the MR to associate it with the endpoint it will be used with. For the RDMA protocol, that association affects the MR cache. In this series of commits, MRs are mapped to the appropriate rail endpoints depending on whether they will be used for control or data transfer.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ryanhankins ryanhankins requested a review from a team as a code owner November 20, 2025 21:01
Contributor

@rauteric rauteric left a comment


Thanks Ryan, made a first pass through the PR.

Comment thread include/nccl_ofi_rdma.h Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment thread include/nccl_ofi_rdma.h Outdated
@ryanhankins
Contributor Author

Thanks for looking at this @rauteric. I have a big pile of work to do separate from this, so I will get back to this later. Really appreciate the comments.

@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 4 times, most recently from 92d1e19 to e8fcfe9 Compare January 26, 2026 15:56
@bwbarrett
Contributor

bot:aws:retest

Contributor

@bwbarrett bwbarrett left a comment


a couple high level questions (let's talk before you change code):

  1. The number of communicators is relatively bounded. It seems like the code is much more complex because we're trying to reuse buffers at the domain level. Does the code get simpler if we just allocate a buffer per communicator in MR_ENDPOINT mode?

  2. Can you help me understand why we need to split the rail binding behavior? I'm not a huge fan of using templates in that code in general, but also not super clear I understand why we need to do all that work.

Comment thread include/nccl_ofi_rdma.h Outdated
@ryanhankins
Contributor Author

Thanks for looking at this.

a couple high level questions (let's talk before you change code):

  1. The number of communicators is relatively bounded. It seems like the code is much more complex because we're trying to reuse buffers at the domain level. Does the code get simpler if we just allocate a buffer per communicator in MR_ENDPOINT mode?

Per-communicator buffers in FI_MR_ENDPOINT mode would simplify this considerably. This would be much cleaner, and the resources used are small.

  1. Can you help me understand why we need to split the rail binding behavior? I'm not a huge fan of using templates in that code in general, but also not super clear I understand why we need to do all that work.

The templates let us choose the data rails or control rails at registration time; control buffers are only registered on control rails and data buffers only on data rails. Alternatively, we could simply register every buffer on both data and control rails. We may have a hardware limit on MRs, but it may be high enough that we would not hit it.

@ryanhankins
Contributor Author

An additional thought here is that binding/enabling uniformly across control and data rails would simplify the MR caching logic as well; that was introduced as a skeleton in commit 0b8bc0f, but MRs would then be registered generically and the cache logic to track their type would become unnecessary.

@bwbarrett
Contributor

Per-communicator buffers in FI_MR_ENDPOINT mode would simplify this considerably. This would be much cleaner, and the resources used are small.

Then let's make this Plan of Record. I eventually want to rewrite the flush code anyway to better handle dealing with Neuron and AMD semantics, and I think just having it associated with endpoints is going to be considerably simpler.

The templates let us choose the data rails or control rails at registration time; control buffers are only registered on control rails and data buffers only on data rails. Alternatively, we could simply register every buffer on both data and control rails. We may have a hardware limit on MRs, but it may be high enough that we would not hit it.

Let's start with this. We can always iterate more later.

@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 4 times, most recently from 806788e to 890843d Compare April 10, 2026 15:33
@ryanhankins
Contributor Author

Are there any clues as to why the CI is failing?

@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch from 890843d to 9af5eec Compare April 13, 2026 16:08
@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 3 times, most recently from 0c679d3 to 7ae5af9 Compare April 22, 2026 19:14
Comment thread include/nccl_ofi_rdma.h Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment thread include/nccl_ofi_rdma.h Outdated
@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 4 times, most recently from 3f52284 to 317d85b Compare April 23, 2026 15:59
bwbarrett
bwbarrett previously approved these changes Apr 23, 2026
Contributor

@bwbarrett bwbarrett left a comment


I really dislike this approach; we're adding a ton of complexity because we reused MR data structures between control and data, and there's no reason to have done that. That said, I think asking you to fix that issue isn't fair, and this solves your immediate problem.

@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 3 times, most recently from cc034cd to 3e05e0e Compare April 23, 2026 20:42
@ryanhankins
Contributor Author

I really dislike this approach; we're adding a ton of complexity because we reused MR data structures between control and data, and there's no reason to have done that. That said, I think asking you to fix that issue isn't fair, and this solves your immediate problem.

I can take a shot at fixing this, ideally on top of this PR.

@bwbarrett
Contributor

Definitely ok with waiting for another PR for the cleanup. I think your pushing updates caused my approval to disappear.

Less ok with the merge commit you just pushed :)

@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch from 68c1ea9 to 5e31f4a Compare April 23, 2026 21:37
bwbarrett
bwbarrett previously approved these changes Apr 29, 2026
@tyler-aws
Contributor

bot:aws:retest

Comment thread include/nccl_ofi_rdma.h Outdated
anshumang
anshumang previously approved these changes Apr 30, 2026
Comment thread include/nccl_ofi_rdma.h
@ryanhankins ryanhankins dismissed stale reviews from anshumang and bwbarrett via 11c310f April 30, 2026 06:37
@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch 5 times, most recently from 5c2f753 to e7be64f Compare May 2, 2026 13:36
In preparation for FI_MR_ENDPOINT support, replace the monolithic mr[]
array with three distinct fields on the MR handle:

  mr_data[0..num_rails-1]            owned fid_mr objects for data rails
  mr_ctrl_owned[0..num_ctrl_rails-1] owned fid_mr objects for ctrl rails
                                     (unpopulated until FI_MR_ENDPOINT
                                      support is enabled)
  mr_ctrl[0..num_ctrl_rails-1]       non-owning view; always populated
                                     after registration

In non-endpoint-MR mode (the only mode at this stage), mr_ctrl[] is
aliased entry-for-entry to mr_data[], so IO-path callers can use
mr_ctrl[rail_id] without a mode check.

Update all read sites to their final form:
  - data path (post_rdma_write, post_rdma_eager_send, post_eager_copy,
    alloc_rdma_read_req, write, post_flush_req): mr_data[rail_id]
  - ctrl path (send_ctrl_post, post_rdma_ctrl): mr_ctrl[rail_id]
  - prepare_send_connect_message: mr_ctrl[rail_id] over num_control_rails
  - post_rx_buffer: mr_data[rail_id] (updated in a later commit)
  - get_mr_key: mr_data[0]

No behaviour change: mr_ctrl[] aliases mr_data[] in non-endpoint-MR mode.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
Add an nccl_net_ofi_rdma_ep_t *ep parameter to:

  reg_mr_on_device(), reg_mr(), reg_internal_mr(), reg_internal_mr_dma_buf()

At all call sites the argument is `endpoint_mr ? ep : nullptr`, so
fi_mr_bind is only invoked when FI_MR_ENDPOINT is actually negotiated:

  - send_comm::regMr / recv_comm::regMr   endpoint_mr ? endpoint : nullptr
  - create_send_comm ctrl_mailbox reg     endpoint_mr ? this : nullptr
  - prepare_recv_comm ctrl_mailbox reg    endpoint_mr ? l_comm_ep : nullptr
  - alloc_and_reg_flush_buff              nullptr (domain-owned)
  - freelist_regmr_host_fn                nullptr (no ep binding yet)

The ep pointer is not yet used inside reg_mr_on_device; the bind/enable
logic is added in the next commit.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
When ep != nullptr, reg_mr_on_device now performs the fi_mr_bind +
fi_mr_enable steps required by FI_MR_ENDPOINT providers:

  - Data rails: each fid_mr is bound to the data ep rail and enabled.
  - Ctrl rails: a separate fid_mr is registered per ctrl ep rail, bound,
    and enabled; owned handles are stored in mr_ctrl_owned[] and
    mr_ctrl[] is set to point into them.

A new private helper mr_bind_and_enable() encapsulates the bind+enable
sequence with uniform diagnostic messages.

When ep == nullptr (non-endpoint-MR mode) the function behaves exactly
as before: a single domain-level fid_mr per rail, no endpoint binding.

Also:
  - Advertise FI_MR_ENDPOINT in the libfabric hints so providers that
    require it can negotiate the mode bit.
  - Remove the early-exit guard in nccl_net_ofi_rdma_init() that
    rejected endpoint_mr configurations with -ENOTSUP.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
Each endpoint now allocates and registers its own flush buffer
(GPU or host pinned memory) unconditionally, regardless of the
endpoint_mr mode. The allocation and registration are performed
in init_rx_buffers() via ep_t::alloc_and_reg_flush_buff(), which
is called for every endpoint on both EFA (domain-scoped MR) and
CXI (endpoint-scoped MR) paths.

fini_rx_buffers() deregisters and frees the per-endpoint flush
buffer allocation unconditionally whenever flush_buff_mr_handle
is non-null, covering both EFA and CXI endpoints.

The flush buffer and its MR handle are removed from
nccl_net_ofi_rdma_domain_t and are now owned entirely by
nccl_net_ofi_rdma_ep_t. The domain constructor and destructor
no longer call alloc_and_reg_flush_buff / dealloc_and_dereg_flush_buff.

In CXI (endpoint_mr) mode, the destructor calls fini_rx_buffers()
before release_rdma_ep_resources() to satisfy the libfabric
requirement that endpoint-bound MRs are closed before the endpoint.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
Add freelist_regmr_ep_ctx_t to carry domain+ep through the freelist
MR-registration callback, replacing the old domain_void_ptr opaque.
freelist_regmr_host_fn accepts the new ctx and forwards ep to
reg_internal_mr.

Add is_ctrl to ep_rail_t so post_rx_buffer can select mr_ctrl[] vs
mr_data[] without a side-channel.  Control rails are marked is_ctrl=true
and data rails is_ctrl=false in init_rx_buffers().

Add rx_buff_regmr_ctx to ep_t (freed in fini_rx_buffers) and
comm_buff_regmr_ctx to recv_comm (freed in recv_comm_destroy).
init_rx_buffers() and prepare_recv_comm() allocate the respective ctx
objects and pass them to the ctrl/eager/flush freelists.

post_rx_buffer now uses rail->is_ctrl to pick the correct MR handle:
ctrl rails dereference mr_ctrl[rid] (a raw pointer already bound to the
ctrl endpoint in FI_MR_ENDPOINT mode); data rails dereference
mr_data[rid].get() (the owned unique_ptr).

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
@ryanhankins ryanhankins force-pushed the endpoint_mr_for_rdma branch from e7be64f to 10330e4 Compare May 6, 2026 13:11