RDMA: Support eager sends for grouped receives #1194
amitrad-aws wants to merge 14 commits into aws:master from
Conversation
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. | ||
| * Now update num_recvs and populate entries[1..n-1]. */ | ||
| { |
Is there a specific reason to put lines 3157-3181 in a nested code block? If not, I'd prefer minimizing the number of indentation to make code easier to fit on one line without wrapping.
No, will remove
| private: | ||
| int n_recvs; | ||
| size_t buffer_size; | ||
| static constexpr int TAG = 1; |
For my understanding, is there a specific reason to use the same TAG value for all sends/recvs instead of having unique values per send/subgroup recv pair? Do we expect NCCL to usually use the same or distinct tag values?
NCCL uses the rank ID as the tag. In alltoall PXN, the data does not necessarily come from the origin rank, so the tags will all be the same in each step. That is what I tried to do here.
There are ordering rules on tags; I chose a simple one to help cover all the cases.
That surprises me; when I had implemented multi-receive before, the tags were all different within a multi-receive. (I think at the time, the tag was equal to the source rank.) But things may have changed in recent NCCL versions.
I'm assuming this is a development artifact that isn't a part of the PR.
rauteric
left a comment
Made it partially through the eager commit before I ran out of steam. Including a few minor comments; no major concerns so far.
Will pick this up tomorrow morning.
| /* Write msg_seq_num last as the ready bit */ | ||
| this->ctrl_mailbox[slot].entries[0].msg_seq_num = req->msg_seq_num & MSG_SEQ_NUM_MASK; |
Nit: I'm not sure why you need to write this last -- nobody should be polling on this location at this point.
| } | ||
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. |
Nit: I think it would be cleaner to populate all entries at once.
| uint16_t slot = seq_num % NCCL_OFI_CTRL_MAILBOX_SIZE; | ||
| return (s_comm->ctrl_mailbox[slot].entries[0].buff_len); | ||
| nccl_net_ofi_ctrl_msg_t *ctrl = &s_comm->ctrl_mailbox[slot]; | ||
| if (ctrl->entries[0].num_recvs <= 1) |
Nit: We generally put if statement bodies in braces in the plugin.
| /* Determine if this should be sent eagerly. | ||
| * Disable eager entirely when maxRecvs > 1: the sender must wait for | ||
| * the ctrl msg so it can detect grouped receives and reuse msg_seq_num. */ | ||
| if (NCCL_OFI_MAX_RECVS == 1 && !have_ctrl && |
Do you really want to use NCCL_OFI_MAX_RECVS here, or nr?
This is temporary; it's removed in the commit that enables eager.
| /* Memory synchronization point to ensure that the data in the control msg is not | ||
| * read before the sequence number is checked */ | ||
| } else if (!in_group || s_comm->group_sends_remaining == s_comm->group_num_recvs) { | ||
| /* Memory synchronization point - only on first send of a group or non-grouped */ |
Why don't you need the synchronization point after every send in the group?
Because in has_ctrl_msg we verified that all entries arrived before returning true.
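A minimal sketch of the synchronization pattern being discussed, with hypothetical names (`ctrl_entry`, `has_ctrl_msg` here are illustrative stand-ins, not the plugin's actual definitions): the receiver publishes `msg_seq_num` last with release semantics, and the sender needs only one acquire fence once the sequence number checks out, since all later reads of the slot are ordered after that fence.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct ctrl_entry {
    std::atomic<uint16_t> msg_seq_num;  // written last by the receiver (the "ready bit")
    uint32_t buff_len;                  // payload fields written before msg_seq_num
};

// Sender side: check the sequence number, then issue one acquire fence
// before reading the rest of the slot. Pairs with the receiver's
// release store of msg_seq_num.
static bool has_ctrl_msg(const ctrl_entry *e, uint16_t expected_seq)
{
    if (e->msg_seq_num.load(std::memory_order_relaxed) != expected_seq) {
        return false;
    }
    /* One synchronization point suffices: every later read of this slot
     * (and, once all grouped entries are verified, of the group's other
     * entries) is ordered after this fence. */
    std::atomic_thread_fence(std::memory_order_acquire);
    return true;
}
```

This is why a fence per send in the group is unnecessary: once `has_ctrl_msg` has verified every entry and fenced, subsequent sends in the same group read data that is already ordered.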
| * For a single recv (num_expected_senders=1), this registers once on the first completion. | ||
| * For grouped receives, each fi_writedata completion represents one sender. | ||
| * We use the original heuristic for non-grouped, and simple counting for grouped. */ | ||
| if (req->ncompls > recv_segms_data->total_expected_segms) { |
I didn't understand how this if condition does what the comment says it's supposed to do.
Updated the comment
/* Detect when a completion belongs to a new sender we haven't
* registered yet. Each sender's first segment pushes ncompls past
* the sum of all previously registered senders' segment counts.
* When that happens, we register this sender's total_nsegms.
* For single recv (1 sender), this fires once on the first completion.
* For grouped recv (N senders), this fires N times as each sender's
* first completion crosses the running total. */
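The detection the updated comment describes can be modeled with a small counting sketch (names are illustrative, not the plugin's actual symbols): a completion count that exceeds the sum of all previously registered senders' segment counts must be a new sender's first segment.

```cpp
#include <cassert>

// Illustrative model of the per-request completion bookkeeping.
struct recv_segms {
    int total_expected_segms = 0;  // sum over senders registered so far
    int registered_senders = 0;
};

// Called per completion with the request's running completion count.
// A count that exceeds the registered total means this completion is a
// new sender's first segment, so register that sender's segment count.
static bool register_if_new_sender(recv_segms &s, int ncompls, int sender_nsegms)
{
    if (ncompls > s.total_expected_segms) {
        s.total_expected_segms += sender_nsegms;
        s.registered_senders++;
        return true;
    }
    return false;
}
```

For a single sender this fires exactly once, on the first completion; for N senders it fires N times, as each sender's first completion crosses the running total.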
| /* For grouped receives, distribute total received | ||
| * size evenly across sub-receives. */ |
I don't think this adheres to the API, but I also don't think it matters in NCCL today.
The comment was wrong but the logic is correct; I removed the comment.
By the way, the workflow failures look real -- looks like an unused variable.
| * bytes of data, we just need a valid SGE) instead of the provided base | ||
| * pointer and MR | ||
| */ | ||
| /* Handle 0-byte messages: use flush buffer for valid SGE */ |
But I like my long comment that explained why we do this :)
Removed by accident; restored.
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. | ||
| * Now update num_recvs and populate entries[1..n-1]. */ | ||
| { |
we're writing C++; we don't need a new scope block here :)
| if (local_recv_data->num_recvs > 1) { | ||
| /* For grouped receives, distribute total received | ||
| * size evenly across sub-receives. */ | ||
| size_t per_sub = req_size / local_recv_data->num_recvs; |
I'm sure this works with NCCL (because if I remember correctly it only looks at zero / non-zero), but technically the README says that we should return the actual sizes (https://github.com/NVIDIA/nccl/blob/1933fdd6360a8bfccaa0166bd71bce363d32e5b6/plugins/net/README.md?plain=1#L413).
Already fixed in a later commit
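To illustrate the reviewer's point, a toy sketch (hypothetical helpers, not plugin code): an even split loses bytes when the total isn't divisible by the group size, and it misattributes bytes when senders differ, whereas recording each sender's actual byte count preserves the true per-sub-receive sizes the README asks for.

```cpp
#include <cassert>
#include <cstddef>

// Even split: loses the real boundaries (and possibly bytes) when
// senders differ in size or the total isn't evenly divisible.
static void split_evenly(size_t total, int n, size_t out[])
{
    for (int i = 0; i < n; i++) {
        out[i] = total / n;
    }
}

// Actual sizes: record each sender's byte count as its completion
// arrives, so test() can report the true size of every sub-receive.
static void record_actual(const size_t per_sender[], int n, size_t out[])
{
    for (int i = 0; i < n; i++) {
        out[i] = per_sender[i];
    }
}
```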
| props->latency = ofi_properties.latency; | ||
| props->maxComms = ofi_properties.max_communicators; | ||
| props->maxRecvs = ofi_properties.max_group_receives; | ||
| /* Neuron uses sendrecv protocol which does not support grouped receives */ |
Neuron does not use the sendrecv protocol most of the time. But I'm also fine if we don't support grouped receive for neuron right now. Let's just get the comment right.
| static inline int handle_eager_recv(nccl_net_ofi_rdma_recv_comm *r_comm, | ||
| uint16_t msg_seq_num, | ||
| nccl_net_ofi_rdma_req *rx_buff_req) | ||
| static int eager_match_recv(nccl_net_ofi_rdma_req *recv_req, int32_t eager_tag) |
for the love of a member function...
I agree, but it is called from another static function, so I don't want to refactor all that in this urgent feature.
| /* | ||
| * @brief Data of request responsible for receive operation | ||
| */ |
nit: duplicated comment from the rdma_req_recv_data_t datatype
| } | ||
| /* | ||
| * @brief Drain the receiver-side ordered eager queue. |
Nit: this comment for drain_recv_eager_queue is separated from the function declaration/definition
| reuse_listen_comm | ||
| ring | ||
| gin | ||
| grouped_recveager_multirecv |
Missing newline between grouped_recv and eager_multirecv.
rauteric
left a comment
Looked over all the code. The detail of review is limited given the size and time constraint, but the concepts generally make sense.
| @@ -4494,9 +4737,9 @@ static nccl_net_ofi_rdma_recv_comm *prepare_recv_comm(nccl_net_ofi_rdma_domain_t | |||
| /* Allocate request freelist */ | |||
| /* Maximum freelist entries is 4*NCCL_OFI_MAX_REQUESTS because each receive request | |||
Nit: update comment to 11*NCCL_OFI_MAX_REQUESTS
| - **Eager disabled for GPU buffers with 2-iovec sends**: The `fi_sendmsg` with | ||
| two iovecs (host header + GPU payload) requires provider support for | ||
| scatter-gather across host and device memory. |
It's not really clear what limitation is being described in this bullet.
- Which providers support this?
- What is the minimum required version of Libfabric to support this with EFA provider?
| /* Create freelist for eager header buffers */ | ||
| ret_s_comm->eager_hdr_fl = new nccl_ofi_freelist( | ||
| sizeof(nccl_ofi_eager_msg_header_t), | ||
| NCCL_OFI_CTRL_MAILBOX_SIZE, 16, 0, |
Can we use NCCL_OFI_MAX_EAGER_PENDING as the size limit for this freelist?
| return NULL; | ||
| } | ||
| static int test_struct_sizes() |
Nit: all tests in this function could also just be static asserts in the code itself.
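The suggestion in compilable form, with an illustrative stand-in struct (not the plugin's real layout): a `static_assert` turns the size check into a build-time failure, so no runtime test function is needed.

```cpp
#include <cstdint>

// Illustrative stand-in for a wire-format struct under test.
struct example_ctrl_entry {
    uint32_t buff_len;
    uint16_t msg_seq_num;
    uint16_t flags;
};

// If the wire format ever changes, compilation stops here instead of a
// unit test failing later.
static_assert(sizeof(example_ctrl_entry) == 8,
              "ctrl entry size changed - update the protocol accordingly");
```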
Looks like an executable was committed mistakenly.
| } | ||
| /* Replicate the sort helpers from nccl_ofi_rdma.cpp for unit testing */ |
I didn't understand what part of the actual code is being tested here -- it looks like you just duplicated a few functions from the actual code into this test.
If the goal is to test these functions, then duplicating them into the test doesn't seem like the right approach -- instead, they should be included from a header in the actual source code.
Implement sender-side eager queue (max 8 entries) with drain logic that matches eager sends against ctrl msgs by tag for grouped receives, or pops the front entry for single receives. Eager sends use fi_sendmsg with 2 iovecs (8-byte header + payload) carrying eager_offset and tag. The sender does not advance next_msg_seq_num for eager sends; drain advances it when ctrl msgs arrive. Drain also runs from update_send_request (test path) so eager sends can complete without requiring another send() call.

Implement receiver-side eager queue using a sorted doubly-linked list with a pre-allocated pool of CTRL_MAILBOX_SIZE entries. Entries are sorted by (msg_seq_num, eager_offset) using wraparound-aware comparison. Drain processes entries in order with batch boundary tracking via prev_batch_count/prev_msg_seq_num in the eager header.

Add entry_used field to ctrl_msg_entry_t to replace tag invalidation for preventing double-matching of ctrl entries. Add sub_recv_idx to eager_copy_data_t so eager copies target the correct sub-receive buffer. Record per-sub-receive size in set_eager_copy_completed. Add eager_copy_to_sub_recv helper that handles zero-sized messages, single recv, and grouped recv eager copies. Add eager_match_recv helper that matches eager entries to sub-receives by tag using a consumed flag to prevent double-matching. Add static_asserts to verify immediate data fields sum to 32 bits and that NCCL_OFI_MAX_RECVS fits in RECV_IDX_BITS.

Re-enable eager for grouped receives. Increase eager_rx_buff_size by NCCL_OFI_EAGER_HEADER_SIZE bytes for the header.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
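The wraparound-aware comparison this commit describes could look roughly like this, assuming a 10-bit sequence space as in the unit-test commit below (names are illustrative, not the plugin's actual symbols): a precedes b iff the forward distance from a to b, modulo 2^10, is less than half the space.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t SEQ_BITS = 10;
constexpr uint16_t SEQ_MASK = (1u << SEQ_BITS) - 1;  // 0x3FF

// True iff a precedes b under wraparound-aware ordering.
static bool seq_before(uint16_t a, uint16_t b)
{
    /* Forward distance from a to b, modulo 2^10. */
    uint16_t d = (b - a) & SEQ_MASK;
    return d != 0 && d < (1u << (SEQ_BITS - 1));
}

// Sort key for the receiver-side eager queue: (msg_seq_num, eager_offset),
// with the offset breaking ties within a batch.
static bool entry_before(uint16_t seq_a, uint32_t off_a,
                         uint16_t seq_b, uint32_t off_b)
{
    if (seq_a != seq_b) {
        return seq_before(seq_a, seq_b);
    }
    return off_a < off_b;
}
```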
Add test_eager_sorted_insert to ctrl_msg unit test covering:
- In-order and reverse-order insertion
- Same seq with interleaved offsets
- Sequence number wraparound (10-bit)
- Single element and duplicate keys
- Multiple seq batches interleaved

Signed-off-by: Amit Radzi <amitrad@amazon.com>
18 functional tests covering eager message support with multi-recv:
- T5/T6: Single recv eager (late/early recv posting)
- T7: 4 sequential single-recv eager messages
- T8: Grouped recv (n=2) all eager
- T9: Grouped recv (n=3) mixed eager + RDMA write
- T10: Grouped recv (n=2) all large (no eager, regression)
- T11: Eager across single + grouped recv
- T12: Eager tag pushback across two groups
- T13: 8 eager messages (sender queue full)
- T14: Eager size boundary (8184 vs 8185 bytes)
- T15: 2x grouped(n=4) all eager/write permutations per tag
- T16: 4x grouped(n=2) same tags, ordering per tag
- T17: Interleaved sends 2x grouped(n=8) all write
- T18: Interleaved sends 2x grouped(n=8) all eager
- T19: Interleaved sends 2x grouped(n=4) mixed eager+write
- T21: 3x grouped(n=2) interleaved mixed sizes
- T22: Max interleave 2x grouped(n=8) all write

Interleaved tests use a rotating queue with per-tag ordering (blocked_tags set) to ensure same-tag sends maintain message sequence order while allowing different-tag sends to proceed.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When data validation fails, log the byte offset and the received vs expected values before the generic error message. This helps identify which send was misrouted in multi-recv tests.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update the design document to cover the eager message extension for grouped receives. New sections:
- Eager message header format (8-byte prepended to bounce buffer)
- Sender-side eager queue and drain logic
- Receiver-side sorted eager queue with batch boundary tracking
- Ordering requirements and continuity checks
- First-ever batch sentinel (0xFFFF) for out-of-order safety
- Target recv resolution in the receiver drain
- Eager copy mechanism

Update ctrl_msg_entry struct to reflect entry_used field (replacing tag invalidation). Update limitations to include eager-specific constraints. Remove the note about eager being not yet implemented.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
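As a rough illustration of the 8-byte header this commit documents, with field widths guessed (16 bits each, chosen to total 8 bytes); the real nccl_ofi_eager_msg_header_t layout may differ.

```cpp
#include <cstdint>

// Hypothetical layout of the eager header prepended to the bounce
// buffer; field widths are guesses, not the actual plugin definition.
struct eager_msg_header {
    uint16_t eager_offset;       // position of this send within its batch
    uint16_t tag;                // matched against sub-receive tags
    uint16_t prev_batch_count;   // batch boundary tracking
    uint16_t prev_msg_seq_num;   // 0xFFFF sentinel marks the first-ever batch
};

static_assert(sizeof(eager_msg_header) == 8,
              "eager header must stay exactly 8 bytes");
```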
Change NCCL_OFI_MAX_EAGER_PENDING from 8 to NCCL_NET_MAX_REQUESTS (32) to allow more outstanding eager sends before requiring ctrl msg drain.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update T13 (queue full) from 8 to 32 messages to match the new NCCL_OFI_MAX_EAGER_PENDING value.

Add T23: 12 eager messages spanning 3 grouped recvs (n=4 each). Tests eager batch resolution across multiple multi-recv groups.

Add T24: 6 eager messages spanning alternating single + grouped(n=2) recv patterns. Tests eager batch crossing different recv types.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Eager sends for grouped receives use fi_sendmsg with 2 iovecs (host header + GPU payload). This requires the provider to support mixed host/device memory in a single send operation.

Add configure check for rdma/fi_ext_efa.h and FI_EFA_FEATURE_OPS. At runtime, query the EFA provider via fi_open_ops to check for mixed_hmem_iov support before advertising max_group_receives > 1.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When the provider does not support mixed host/device iovecs (use_eager_header=false), eager sends cannot prepend the 8-byte header because the 2-iovec fi_sendmsg with host header + GPU payload would fail.

Add use_eager_header flag to the endpoint, set from the device supports_eager_header property (determined by the EFA feature query for mixed_hmem_iov). Without the header:
- post_rdma_eager_send uses fi_senddata with a single iovec instead of fi_sendmsg with two iovecs.
- handle_eager_recv bypasses the sorted eager queue and looks up the recv directly in the msgbuff by msg_seq_num. If the recv is already posted, it initiates the eager copy immediately. If not, it stashes the bounce buffer as a BUFF entry for later.
- recv() checks the msgbuff for a stashed BUFF entry before inserting the request. If found, it replaces the BUFF with the REQ and initiates the eager copy.
- post_eager_copy conditionally skips the header offset when reading from the bounce buffer.
- eager_rx_buff_size does not include the header overhead.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
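The two posting layouts this commit describes can be sketched with plain POSIX iovecs; the real code hands these to libfabric's fi_sendmsg / fi_senddata, which are omitted here, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/uio.h>

struct eager_hdr { unsigned char bytes[8]; };  // 8-byte eager header

// Fill iovs for the eager send; returns the iovec count to post.
// With header support: two iovecs, host header + (possibly GPU) payload,
// posted via fi_sendmsg. Without: a single payload iovec for fi_senddata.
static size_t build_eager_iovs(bool use_eager_header,
                               eager_hdr *hdr, void *payload, size_t len,
                               struct iovec iovs[2])
{
    if (use_eager_header) {
        iovs[0] = { hdr, sizeof(*hdr) };   // host memory
        iovs[1] = { payload, len };        // may be device memory
        return 2;                          // -> fi_sendmsg path
    }
    iovs[0] = { payload, len };            // -> fi_senddata path
    return 1;
}
```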
Call getProperties on all devices in TestSuite::run_all() before any tests execute. This ensures device-level flags like supports_eager_header are initialized before communicators are created.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When eager sends are interleaved across grouped receives, the receiver drain cannot skip past an unresolved entry to reach entries for an already-posted recv. If the receiver blocks on completing one recv before posting the next, this can deadlock.

Adapt T11, T17, T18, T19, T21, T22, T23, T24 to post all grouped recvs before polling any completions, matching the behavior of NCCL's proxy thread which posts recvs independently.

Fix validation patterns in T17, T19, T21, T22 to match sender data patterns.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Add limitation about interleaved eager sends across grouped receives that can deadlock if the receiver serializes recv posting. Note that this does not occur in practice since NCCL's proxy thread posts recvs independently.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Remove unused 'matched' counter and stale comment.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Signed-off-by: Amit Radzi <amitrad@amazon.com>
Enable eager sends for grouped receives
Summary
This PR extends the eager send mechanism to work with grouped receives (multi-recv). Previously, eager sends were disabled when `maxRecvs > 1` because the sender doesn't know the receiver's grouping at eager-send time. This PR adds the necessary protocol extensions to support eager sends across both single and grouped receives.

Key Changes
Eager protocol for multi-recv:
- Eager header carries `eager_offset`, `tag`, and batch boundary info (`prev_batch_count`, `prev_msg_seq_num`)
- Eager sends use `fi_sendmsg` with 2 iovecs (host header + payload) instead of `fi_senddata`

Provider feature query:
- Query the EFA provider for `mixed_hmem_iov` support via `FI_EFA_FEATURE_OPS`
- Add m4 configure check for `rdma/fi_ext_efa.h` and `FI_EFA_FEATURE_OPS`

Eager queue sizing:
- Increase `NCCL_OFI_MAX_EAGER_PENDING` from 8 to `NCCL_NET_MAX_REQUESTS` (32)

Tests:
- Call `getProperties` for all devices before creating communicators

Known Limitation
When sends are interleaved across grouped receives, a deadlock can occur if the receiver blocks posting a new recv on the completion of a previous recv whose eager messages are stuck behind the unposted recv's messages in the processing queue. This does not occur in practice because NCCL's proxy thread posts recvs independently without waiting for prior completions.
Testing
NCCL_NET_SHARED_COMMS