Endpoint MR for RDMA protocol #1056
Conversation
rauteric left a comment
Thanks Ryan, made a first pass through the PR.
Thanks for looking at this @rauteric. I have a big pile of work to do separate from this, so I will get back to it later. Really appreciate the comments.
Force-pushed from 92d1e19 to e8fcfe9
bot:aws:retest
Force-pushed from e8fcfe9 to 8d9d37c
bwbarrett left a comment
A couple of high-level questions (let's talk before you change code):
- The number of communicators is relatively bounded. It seems like the code is much more complex because we're trying to reuse buffers at the domain level. Does the code get simpler if we just allocate a buffer per communicator in MR_ENDPOINT mode?
- Can you help me understand why we need to split the rail binding behavior? I'm not a huge fan of using templates in that code in general, but I'm also not clear on why we need to do all that work.
Thanks for looking at this.
Per-communicator buffers in FI_MR_ENDPOINT mode would simplify this considerably. This would be much cleaner, and the resources used are small.
The templates let us choose the data rails or the control rails at registration time: control buffers are registered only on control rails and data buffers only on data rails. We could instead simply register on both data and control rails. We may have a hardware limit on the number of registrations, but it may be high enough that we never hit it.
An additional thought: binding and enabling uniformly across control and data rails would simplify the MR caching logic as well. A skeleton of that logic was introduced in commit 0b8bc0f; with MRs registered generically, the cache logic that tracks their type would become unnecessary.
Then let's make this the Plan of Record. I eventually want to rewrite the flush code anyway to better handle Neuron and AMD semantics, and I think just having it associated with endpoints is going to be considerably simpler.
Let's start with this. We can always iterate more later.
Force-pushed from 806788e to 890843d
Are there any clues as to why the CI is failing?
Force-pushed from 890843d to 9af5eec
Force-pushed from 0c679d3 to 7ae5af9
Force-pushed from 3f52284 to 317d85b
bwbarrett left a comment
I really dislike this approach; we're adding a ton of complexity because we reused MR data structures between control and data, and there's no reason to have done that. That said, I think asking you to fix that issue isn't fair, and this solves your immediate problem.
Force-pushed from cc034cd to 3e05e0e
I can take a shot at fixing this, ideally on top of this PR.
Definitely OK with waiting for another PR for the cleanup. I think your pushing updates caused my approval to disappear. Less OK with the merge commit you just pushed :)
Force-pushed from 68c1ea9 to 5e31f4a
bot:aws:retest
Force-pushed from 5c2f753 to e7be64f
In preparation for FI_MR_ENDPOINT support, replace the monolithic mr[]
array with three distinct fields on the MR handle:
- mr_data[0..num_rails-1]: owned fid_mr objects for data rails
- mr_ctrl_owned[0..num_ctrl_rails-1]: owned fid_mr objects for ctrl rails (unpopulated until FI_MR_ENDPOINT support is enabled)
- mr_ctrl[0..num_ctrl_rails-1]: non-owning view; always populated after registration
In non-endpoint-MR mode (the only mode at this stage), mr_ctrl[] is
aliased entry-for-entry to mr_data[], so IO-path callers can use
mr_ctrl[rail_id] without a mode check.
Update all read sites to their final form:
- data path (post_rdma_write, post_rdma_eager_send, post_eager_copy,
alloc_rdma_read_req, write, post_flush_req): mr_data[rail_id]
- ctrl path (send_ctrl_post, post_rdma_ctrl): mr_ctrl[rail_id]
- prepare_send_connect_message: mr_ctrl[rail_id] over num_control_rails
- post_rx_buffer: mr_data[rail_id] (updated in a later commit)
- get_mr_key: mr_data[0]
No behaviour change: mr_ctrl[] aliases mr_data[] in non-endpoint-MR mode.
Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
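For illustration, a minimal sketch of the reshaped MR handle this commit describes. The three field names and the aliasing behavior come from the commit message; the deleter, the helper, and the surrounding structure are assumptions:

```cpp
#include <memory>
#include <vector>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

// Close the libfabric MR when the owning handle is destroyed.
struct fid_mr_deleter {
	void operator()(struct fid_mr *mr) const
	{
		if (mr != nullptr)
			fi_close(&mr->fid);
	}
};
using owned_mr = std::unique_ptr<struct fid_mr, fid_mr_deleter>;

struct rdma_mr_handle {
	std::vector<owned_mr> mr_data;        /* one owned MR per data rail */
	std::vector<owned_mr> mr_ctrl_owned;  /* per ctrl rail; stays empty
	                                         until FI_MR_ENDPOINT lands */
	std::vector<struct fid_mr *> mr_ctrl; /* non-owning view, always set */

	/* In non-endpoint-MR mode, alias mr_ctrl[] entry-for-entry to
	 * mr_data[], so IO-path callers can use mr_ctrl[rail_id] without
	 * a mode check. */
	void alias_ctrl_to_data(size_t num_ctrl_rails)
	{
		mr_ctrl.resize(num_ctrl_rails);
		for (size_t i = 0; i < num_ctrl_rails; ++i)
			mr_ctrl[i] = mr_data[i].get();
	}
};
```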
Add an nccl_net_ofi_rdma_ep_t *ep parameter to: reg_mr_on_device(), reg_mr(), reg_internal_mr(), reg_internal_mr_dma_buf().

At all call sites the argument is `endpoint_mr ? ep : nullptr`, so fi_mr_bind is only invoked when FI_MR_ENDPOINT is actually negotiated:
- send_comm::regMr / recv_comm::regMr: endpoint_mr ? endpoint : nullptr
- create_send_comm ctrl_mailbox reg: endpoint_mr ? this : nullptr
- prepare_recv_comm ctrl_mailbox reg: endpoint_mr ? l_comm_ep : nullptr
- alloc_and_reg_flush_buff: nullptr (domain-owned)
- freelist_regmr_host_fn: nullptr (no ep binding yet)

The ep pointer is not yet used inside reg_mr_on_device; the bind/enable logic is added in the next commit.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
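A hypothetical sketch of the call-site guard; the signatures here are invented and much simpler than the plugin's, only the ternary pattern mirrors the commit text:

```cpp
#include <cstddef>

struct nccl_net_ofi_rdma_ep_t;  /* assumed forward declaration */
extern bool endpoint_mr;        /* true iff FI_MR_ENDPOINT was negotiated */

int reg_mr(void *data, size_t size, int type,
           nccl_net_ofi_rdma_ep_t *ep); /* ep == nullptr => no fi_mr_bind */

int reg_mr_for_comm(nccl_net_ofi_rdma_ep_t *endpoint,
                    void *data, size_t size, int type)
{
	/* Pass the endpoint through only when FI_MR_ENDPOINT is active, so
	 * domain-scoped-MR providers never see a bind attempt. */
	return reg_mr(data, size, type, endpoint_mr ? endpoint : nullptr);
}
```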
When ep != nullptr, reg_mr_on_device now performs the fi_mr_bind +
fi_mr_enable steps required by FI_MR_ENDPOINT providers:
- Data rails: each fid_mr is bound to the data ep rail and enabled.
- Ctrl rails: a separate fid_mr is registered per ctrl ep rail, bound,
and enabled; owned handles are stored in mr_ctrl_owned[] and
mr_ctrl[] is set to point into them.
A new private helper mr_bind_and_enable() encapsulates the bind+enable
sequence with uniform diagnostic messages.
When ep == nullptr (non-endpoint-MR mode) the function behaves exactly
as before: a single domain-level fid_mr per rail, no endpoint binding.
Also:
- Advertise FI_MR_ENDPOINT in the libfabric hints so providers that
require it can negotiate the mode bit.
- Remove the early-exit guard in nccl_net_ofi_rdma_init() that
rejected endpoint_mr configurations with -ENOTSUP.
Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
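fi_mr_bind and fi_mr_enable are standard libfabric calls; the helper below is a minimal sketch of the bind+enable sequence this commit describes, with the name taken from the commit message and the diagnostics simplified:

```cpp
#include <cstdio>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int mr_bind_and_enable(struct fid_mr *mr, struct fid_ep *ep,
                              const char *rail_kind, int rail_id)
{
	/* FI_MR_ENDPOINT providers require the MR to be bound to the
	 * endpoint it will be used with, then explicitly enabled. */
	int ret = fi_mr_bind(mr, &ep->fid, 0);
	if (ret != 0) {
		fprintf(stderr, "fi_mr_bind failed on %s rail %d: %s\n",
		        rail_kind, rail_id, fi_strerror(-ret));
		return ret;
	}

	ret = fi_mr_enable(mr);
	if (ret != 0)
		fprintf(stderr, "fi_mr_enable failed on %s rail %d: %s\n",
		        rail_kind, rail_id, fi_strerror(-ret));
	return ret;
}
```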
Each endpoint now allocates and registers its own flush buffer (GPU or host pinned memory) unconditionally, regardless of the endpoint_mr mode. The allocation and registration are performed in init_rx_buffers() via ep_t::alloc_and_reg_flush_buff(), which is called for every endpoint on both EFA (domain-scoped MR) and CXI (endpoint-scoped MR) paths.

fini_rx_buffers() deregisters and frees the per-endpoint flush buffer allocation unconditionally whenever flush_buff_mr_handle is non-null, covering both EFA and CXI endpoints.

The flush buffer and its MR handle are removed from nccl_net_ofi_rdma_domain_t and are now owned entirely by nccl_net_ofi_rdma_ep_t. The domain constructor and destructor no longer call alloc_and_reg_flush_buff / dealloc_and_dereg_flush_buff.

In CXI (endpoint_mr) mode, the destructor calls fini_rx_buffers() before release_rdma_ep_resources() to satisfy the libfabric requirement that endpoint-bound MRs are closed before the endpoint.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
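A sketch of the teardown-order constraint; the method names come from the commit message, while the class body and trivial implementations are invented:

```cpp
struct nccl_net_ofi_rdma_ep_t {
	int fini_rx_buffers()
	{
		/* deregister rx buffers and the flush buffer MR,
		 * closing any endpoint-bound MRs */
		return 0;
	}

	void release_rdma_ep_resources()
	{
		/* fi_close() the per-rail libfabric endpoints */
	}

	~nccl_net_ofi_rdma_ep_t()
	{
		/* libfabric requires MRs bound to an endpoint to be closed
		 * before that endpoint, so deregister the buffers first. */
		fini_rx_buffers();
		release_rdma_ep_resources();
	}
};
```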
Add freelist_regmr_ep_ctx_t to carry domain+ep through the freelist MR-registration callback, replacing the old domain_void_ptr opaque. freelist_regmr_host_fn accepts the new ctx and forwards ep to reg_internal_mr.

Add is_ctrl to ep_rail_t so post_rx_buffer can select mr_ctrl[] vs mr_data[] without a side-channel. Control rails are marked is_ctrl=true and data rails is_ctrl=false in init_rx_buffers().

Add rx_buff_regmr_ctx to ep_t (freed in fini_rx_buffers) and comm_buff_regmr_ctx to recv_comm (freed in recv_comm_destroy). init_rx_buffers() and prepare_recv_comm() allocate the respective ctx objects and pass them to the ctrl/eager/flush freelists.

post_rx_buffer now uses rail->is_ctrl to pick the correct MR handle: ctrl rails dereference mr_ctrl[rid] (a raw pointer already bound to the ctrl endpoint in FI_MR_ENDPOINT mode); data rails dereference mr_data[rid].get() (the owned unique_ptr).

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
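Hypothetical shapes for the freelist context and the rail-based MR selection; the names freelist_regmr_ep_ctx_t, is_ctrl, mr_ctrl, and mr_data come from the commit message, and the MR-handle layout reuses the earlier sketch. Everything else is invented:

```cpp
#include <memory>
#include <vector>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

struct nccl_net_ofi_rdma_domain_t;  /* assumed forward declarations */
struct nccl_net_ofi_rdma_ep_t;

/* Carries domain+ep through the freelist MR-registration callback,
 * replacing the old domain_void_ptr opaque. */
struct freelist_regmr_ep_ctx_t {
	nccl_net_ofi_rdma_domain_t *domain;
	nccl_net_ofi_rdma_ep_t *ep;  /* forwarded to reg_internal_mr */
};

struct ep_rail_t {
	bool is_ctrl;  /* ctrl rails true, data rails false;
	                  set in init_rx_buffers() */
};

struct fid_mr_deleter {
	void operator()(struct fid_mr *mr) const
	{
		if (mr != nullptr)
			fi_close(&mr->fid);
	}
};

struct rdma_mr_handle {  /* layout as in the earlier sketch */
	std::vector<std::unique_ptr<struct fid_mr, fid_mr_deleter>> mr_data;
	std::vector<struct fid_mr *> mr_ctrl;
};

/* post_rx_buffer picks the MR handle by rail kind, with no side-channel: */
static struct fid_mr *select_rx_mr(rdma_mr_handle *h, ep_rail_t *rail, int rid)
{
	if (rail->is_ctrl)
		return h->mr_ctrl[rid];  /* raw pointer, already bound to the
		                            ctrl endpoint in FI_MR_ENDPOINT mode */
	return h->mr_data[rid].get();    /* the owned unique_ptr */
}
```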
Force-pushed from e7be64f to 10330e4
FI_MR_ENDPOINT Implementation for RDMA protocol
In the AWS OFI NCCL plugin, an implementation of FI_MR_ENDPOINT mode is required for certain providers. This mode necessitates calls to fi_mr_bind and fi_mr_enable on the MR to associate it with the endpoint it will be used with. For the RDMA protocol, that association also affects the MR cache. In this series of commits, MRs are mapped to the appropriate rail endpoints depending on whether those MRs will be used for control or data transfer.
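To make the mode bit concrete, here is a minimal sketch of negotiating FI_MR_ENDPOINT through the libfabric hints. fi_allocinfo, fi_getinfo, fi_freeinfo, and the mr_mode bits are standard libfabric API; the API version, the extra mode bits, and the caps are illustrative:

```cpp
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

// Returns true if the provider selected FI_MR_ENDPOINT; error handling
// trimmed for brevity.
bool negotiate_endpoint_mr(struct fi_info **info_out)
{
	struct fi_info *hints = fi_allocinfo();
	/* mr_mode in hints lists the MR modes the application can support;
	 * the provider clears the bits it does not require. */
	hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_VIRT_ADDR |
	                              FI_MR_ALLOCATED | FI_MR_PROV_KEY |
	                              FI_MR_ENDPOINT;
	hints->caps = FI_MSG | FI_RMA;

	struct fi_info *info = nullptr;
	if (fi_getinfo(FI_VERSION(1, 18), nullptr, nullptr, 0,
	               hints, &info) != 0) {
		fi_freeinfo(hints);
		return false;
	}
	fi_freeinfo(hints);

	*info_out = info;
	/* If the bit survives negotiation, every MR must later be bound to
	 * its endpoint via fi_mr_bind() and enabled via fi_mr_enable(). */
	return (info->domain_attr->mr_mode & FI_MR_ENDPOINT) != 0;
}
```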
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.