RDMA: Support eager sends for grouped receives #1194
amitrad-aws wants to merge 14 commits into aws:master from
Conversation
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. | ||
| * Now update num_recvs and populate entries[1..n-1]. */ | ||
| { |
Is there a specific reason to put lines 3157-3181 in a nested code block? If not, I'd prefer minimizing the number of indentation to make code easier to fit on one line without wrapping.
No, will remove
| private: | ||
| int n_recvs; | ||
| size_t buffer_size; | ||
| static constexpr int TAG = 1; |
For my understanding, is there a specific reason to use the same TAG value for all sends/recvs instead of having unique values per send/subgroup recv pair? Do we expect NCCL to usually use the same or distinct tag values?
NCCL uses the rank ID as the tag. In alltoall PXN, the data does not necessarily come from the origin rank, so the tags will all be the same in each step. That is what I tried to do here.
There are ordering rules on tags; I chose a simple one to help cover all the cases.
That surprises me; when I had implemented multi-receive before, the tags were all different within a multi-receive. (I think at the time, the tag was equal to the source rank.) But things may have changed in recent NCCL versions.
I'm assuming this is a development artifact that isn't a part of the PR.
rauteric
left a comment
Made it partially through the eager commit before I ran out of steam. Including a few minor comments; no major concerns so far.
Will pick this up tomorrow morning.
| /* Write msg_seq_num last as the ready bit */ | ||
| this->ctrl_mailbox[slot].entries[0].msg_seq_num = req->msg_seq_num & MSG_SEQ_NUM_MASK; |
Nit: I'm not sure why you need to write this last -- nobody should be polling on this location at this point.
| } | ||
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. |
Nit: I think it would be cleaner to populate all entries at once.
| uint16_t slot = seq_num % NCCL_OFI_CTRL_MAILBOX_SIZE; | ||
| return (s_comm->ctrl_mailbox[slot].entries[0].buff_len); | ||
| nccl_net_ofi_ctrl_msg_t *ctrl = &s_comm->ctrl_mailbox[slot]; | ||
| if (ctrl->entries[0].num_recvs <= 1) |
Nit: We generally put if statement bodies in braces in the plugin.
| /* Determine if this should be sent eagerly. | ||
| * Disable eager entirely when maxRecvs > 1: the sender must wait for | ||
| * the ctrl msg so it can detect grouped receives and reuse msg_seq_num. */ | ||
| if (NCCL_OFI_MAX_RECVS == 1 && !have_ctrl && |
Do you really want to use NCCL_OFI_MAX_RECVS here, or nr?
This is temporary; it's removed in the commit that enables eager.
| /* Memory synchronization point to ensure that the data in the control msg is not | ||
| * read before the sequence number is checked */ | ||
| } else if (!in_group || s_comm->group_sends_remaining == s_comm->group_num_recvs) { | ||
| /* Memory synchronization point - only on first send of a group or non-grouped */ |
Why don't you need the synchronization point after every send in the group?
Because in has_ctrl_msg we verified that all entries arrived before returning true.
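A minimal sketch of the synchronization pattern being discussed, with hypothetical names (`ctrl_entry`, `has_ctrl_msg` here are illustrative stand-ins, not the plugin's actual definitions): the receiver publishes `msg_seq_num` last with release semantics, and the sender needs only one acquire fence once the sequence number checks out, since all later reads of the slot are ordered after that fence.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct ctrl_entry {
    std::atomic<uint16_t> msg_seq_num;  // written last by the receiver (the "ready bit")
    uint32_t buff_len;                  // payload fields written before msg_seq_num
};

// Sender side: check the sequence number, then issue one acquire fence
// before reading the rest of the slot. Pairs with the receiver's
// release store of msg_seq_num.
static bool has_ctrl_msg(const ctrl_entry *e, uint16_t expected_seq)
{
    if (e->msg_seq_num.load(std::memory_order_relaxed) != expected_seq) {
        return false;
    }
    /* One synchronization point suffices: every later read of this slot
     * (and, once all grouped entries are verified, of the group's other
     * entries) is ordered after this fence. */
    std::atomic_thread_fence(std::memory_order_acquire);
    return true;
}
```

This is why a fence per send in the group is unnecessary: once `has_ctrl_msg` has verified every entry and fenced, subsequent sends in the same group read data that is already ordered.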
| * For a single recv (num_expected_senders=1), this registers once on the first completion. | ||
| * For grouped receives, each fi_writedata completion represents one sender. | ||
| * We use the original heuristic for non-grouped, and simple counting for grouped. */ | ||
| if (req->ncompls > recv_segms_data->total_expected_segms) { |
I didn't understand how this if condition does what the comment says it's supposed to do.
Updated the comment
/* Detect when a completion belongs to a new sender we haven't
* registered yet. Each sender's first segment pushes ncompls past
* the sum of all previously registered senders' segment counts.
* When that happens, we register this sender's total_nsegms.
* For single recv (1 sender), this fires once on the first completion.
* For grouped recv (N senders), this fires N times as each sender's
* first completion crosses the running total. */
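The detection the updated comment describes can be modeled with a small counting sketch (names are illustrative, not the plugin's actual symbols): a completion count that exceeds the sum of all previously registered senders' segment counts must be a new sender's first segment.

```cpp
#include <cassert>

// Illustrative model of the per-request completion bookkeeping.
struct recv_segms {
    int total_expected_segms = 0;  // sum over senders registered so far
    int registered_senders = 0;
};

// Called per completion with the request's running completion count.
// A count that exceeds the registered total means this completion is a
// new sender's first segment, so register that sender's segment count.
static bool register_if_new_sender(recv_segms &s, int ncompls, int sender_nsegms)
{
    if (ncompls > s.total_expected_segms) {
        s.total_expected_segms += sender_nsegms;
        s.registered_senders++;
        return true;
    }
    return false;
}
```

For a single sender this fires exactly once, on the first completion; for N senders it fires N times, as each sender's first completion crosses the running total.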
| /* For grouped receives, distribute total received | ||
| * size evenly across sub-receives. */ |
I don't think this adheres to the API, but I also don't think it matters in NCCL today.
The comment was wrong but the logic is correct; I removed the comment.
By the way, the workflow failures look real -- looks like an unused variable.
| * bytes of data, we just need a valid SGE) instead of the provided base | ||
| * pointer and MR | ||
| */ | ||
| /* Handle 0-byte messages: use flush buffer for valid SGE */ |
But I like my long comment that explained why we do this :)
Removed by accident; restored.
| /* Update the ctrl mailbox slot with all n sub-entries. | ||
| * allocate_recv_req already populated entries[0] and the header. | ||
| * Now update num_recvs and populate entries[1..n-1]. */ | ||
| { |
we're writing C++; we don't need a new scope block here :)
| if (local_recv_data->num_recvs > 1) { | ||
| /* For grouped receives, distribute total received | ||
| * size evenly across sub-receives. */ | ||
| size_t per_sub = req_size / local_recv_data->num_recvs; |
I'm sure this works with NCCL (because if I remember correctly it only looks at zero / non-zero), but technically the README says that we should return the actual sizes (https://github.com/NVIDIA/nccl/blob/1933fdd6360a8bfccaa0166bd71bce363d32e5b6/plugins/net/README.md?plain=1#L413).
Already fixed in a later commit
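To illustrate the reviewer's point, a toy sketch (hypothetical helpers, not plugin code): an even split loses bytes when the total isn't divisible by the group size, and it misattributes bytes when senders differ, whereas recording each sender's actual byte count preserves the true per-sub-receive sizes the README asks for.

```cpp
#include <cassert>
#include <cstddef>

// Even split: loses the real boundaries (and possibly bytes) when
// senders differ in size or the total isn't evenly divisible.
static void split_evenly(size_t total, int n, size_t out[])
{
    for (int i = 0; i < n; i++) {
        out[i] = total / n;
    }
}

// Actual sizes: record each sender's byte count as its completion
// arrives, so test() can report the true size of every sub-receive.
static void record_actual(const size_t per_sender[], int n, size_t out[])
{
    for (int i = 0; i < n; i++) {
        out[i] = per_sender[i];
    }
}
```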
| props->latency = ofi_properties.latency; | ||
| props->maxComms = ofi_properties.max_communicators; | ||
| props->maxRecvs = ofi_properties.max_group_receives; | ||
| /* Neuron uses sendrecv protocol which does not support grouped receives */ |
Neuron does not use the sendrecv protocol most of the time. But I'm also fine if we don't support grouped receive for neuron right now. Let's just get the comment right.
| static inline int handle_eager_recv(nccl_net_ofi_rdma_recv_comm *r_comm, | ||
| uint16_t msg_seq_num, | ||
| nccl_net_ofi_rdma_req *rx_buff_req) | ||
| static int eager_match_recv(nccl_net_ofi_rdma_req *recv_req, int32_t eager_tag) |
for the love of a member function...
I agree, but it is called from another static function, so I don't want to refactor all that in this urgent feature.
| /* | ||
| * @brief Data of request responsible for receive operation | ||
| */ |
nit: duplicated comment from the rdma_req_recv_data_t datatype
| } | ||
| /* | ||
| * @brief Drain the receiver-side ordered eager queue. |
Nit: this comment for drain_recv_eager_queue is separated from the function declaration/definition
| reuse_listen_comm | ||
| ring | ||
| gin | ||
| grouped_recveager_multirecv |
Missing newline between grouped_recv and eager_multirecv.
rauteric
left a comment
Looked over all the code. The detail of review is limited given the size and time constraint, but the concepts generally make sense.
| @@ -4494,9 +4737,9 @@ static nccl_net_ofi_rdma_recv_comm *prepare_recv_comm(nccl_net_ofi_rdma_domain_t | |||
| /* Allocate request freelist */ | |||
| /* Maximum freelist entries is 4*NCCL_OFI_MAX_REQUESTS because each receive request | |||
Nit: update comment to 11*NCCL_OFI_MAX_REQUESTS
| - **Eager disabled for GPU buffers with 2-iovec sends**: The `fi_sendmsg` with | ||
| two iovecs (host header + GPU payload) requires provider support for | ||
| scatter-gather across host and device memory. |
It's not really clear what limitation is being described in this bullet.
- Which providers support this?
- What is the minimum required version of Libfabric to support this with EFA provider?
| /* Create freelist for eager header buffers */ | ||
| ret_s_comm->eager_hdr_fl = new nccl_ofi_freelist( | ||
| sizeof(nccl_ofi_eager_msg_header_t), | ||
| NCCL_OFI_CTRL_MAILBOX_SIZE, 16, 0, |
Can we use NCCL_OFI_MAX_EAGER_PENDING as the size limit for this freelist?
| return NULL; | ||
| } | ||
| static int test_struct_sizes() |
Nit: all tests in this function could also just be static asserts in the code itself.
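The suggestion in compilable form, with an illustrative stand-in struct (not the plugin's real layout): a `static_assert` turns the size check into a build-time failure, so no runtime test function is needed.

```cpp
#include <cstdint>

// Illustrative stand-in for a wire-format struct under test.
struct example_ctrl_entry {
    uint32_t buff_len;
    uint16_t msg_seq_num;
    uint16_t flags;
};

// If the wire format ever changes, compilation stops here instead of a
// unit test failing later.
static_assert(sizeof(example_ctrl_entry) == 8,
              "ctrl entry size changed - update the protocol accordingly");
```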
Looks like an executable was committed mistakenly.
| } | ||
| /* Replicate the sort helpers from nccl_ofi_rdma.cpp for unit testing */ |
I didn't understand what part of the actual code is being tested here -- it looks like you just duplicated a few functions from the actual code into this test.
If the goal is to test these functions, then duplicating them into the test doesn't seem like the right approach -- instead, they should be included from a header in the actual source code.
Implement sender-side eager queue (max 8 entries) with drain logic that matches eager sends against ctrl msgs by tag for grouped receives, or pops the front entry for single receives. Eager sends use fi_sendmsg with 2 iovecs (8-byte header + payload) carrying eager_offset and tag. The sender does not advance next_msg_seq_num for eager sends; drain advances it when ctrl msgs arrive. Drain also runs from update_send_request (test path) so eager sends can complete without requiring another send() call.

Implement receiver-side eager queue using a sorted doubly-linked list with a pre-allocated pool of CTRL_MAILBOX_SIZE entries. Entries are sorted by (msg_seq_num, eager_offset) using wraparound-aware comparison. Drain processes entries in order with batch boundary tracking via prev_batch_count/prev_msg_seq_num in the eager header.

Add entry_used field to ctrl_msg_entry_t to replace tag invalidation for preventing double-matching of ctrl entries. Add sub_recv_idx to eager_copy_data_t so eager copies target the correct sub-receive buffer. Record per-sub-receive size in set_eager_copy_completed. Add eager_copy_to_sub_recv helper that handles zero-sized messages, single recv, and grouped recv eager copies. Add eager_match_recv helper that matches eager entries to sub-receives by tag using a consumed flag to prevent double-matching. Add static_asserts to verify immediate data fields sum to 32 bits and that NCCL_OFI_MAX_RECVS fits in RECV_IDX_BITS.

Re-enable eager for grouped receives. Increase eager_rx_buff_size by NCCL_OFI_EAGER_HEADER_SIZE bytes for the header.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
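The wraparound-aware comparison this commit describes could look roughly like this, assuming a 10-bit sequence space as in the unit-test commit below (names are illustrative, not the plugin's actual symbols): a precedes b iff the forward distance from a to b, modulo 2^10, is less than half the space.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t SEQ_BITS = 10;
constexpr uint16_t SEQ_MASK = (1u << SEQ_BITS) - 1;  // 0x3FF

// True iff a precedes b under wraparound-aware ordering.
static bool seq_before(uint16_t a, uint16_t b)
{
    /* Forward distance from a to b, modulo 2^10. */
    uint16_t d = (b - a) & SEQ_MASK;
    return d != 0 && d < (1u << (SEQ_BITS - 1));
}

// Sort key for the receiver-side eager queue: (msg_seq_num, eager_offset),
// with the offset breaking ties within a batch.
static bool entry_before(uint16_t seq_a, uint32_t off_a,
                         uint16_t seq_b, uint32_t off_b)
{
    if (seq_a != seq_b) {
        return seq_before(seq_a, seq_b);
    }
    return off_a < off_b;
}
```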
Add test_eager_sorted_insert to ctrl_msg unit test covering:
- In-order and reverse-order insertion
- Same seq with interleaved offsets
- Sequence number wraparound (10-bit)
- Single element and duplicate keys
- Multiple seq batches interleaved

Signed-off-by: Amit Radzi <amitrad@amazon.com>
18 functional tests covering eager message support with multi-recv:
- T5/T6: Single recv eager (late/early recv posting)
- T7: 4 sequential single-recv eager messages
- T8: Grouped recv (n=2) all eager
- T9: Grouped recv (n=3) mixed eager + RDMA write
- T10: Grouped recv (n=2) all large (no eager, regression)
- T11: Eager across single + grouped recv
- T12: Eager tag pushback across two groups
- T13: 8 eager messages (sender queue full)
- T14: Eager size boundary (8184 vs 8185 bytes)
- T15: 2x grouped(n=4) all eager/write permutations per tag
- T16: 4x grouped(n=2) same tags, ordering per tag
- T17: Interleaved sends 2x grouped(n=8) all write
- T18: Interleaved sends 2x grouped(n=8) all eager
- T19: Interleaved sends 2x grouped(n=4) mixed eager+write
- T21: 3x grouped(n=2) interleaved mixed sizes
- T22: Max interleave 2x grouped(n=8) all write

Interleaved tests use a rotating queue with per-tag ordering (blocked_tags set) to ensure same-tag sends maintain message sequence order while allowing different-tag sends to proceed.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When data validation fails, log the byte offset and the received vs expected values before the generic error message. This helps identify which send was misrouted in multi-recv tests.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update the design document to cover the eager message extension for grouped receives. New sections:
- Eager message header format (8-byte prepended to bounce buffer)
- Sender-side eager queue and drain logic
- Receiver-side sorted eager queue with batch boundary tracking
- Ordering requirements and continuity checks
- First-ever batch sentinel (0xFFFF) for out-of-order safety
- Target recv resolution in the receiver drain
- Eager copy mechanism

Update ctrl_msg_entry struct to reflect entry_used field (replacing tag invalidation). Update limitations to include eager-specific constraints. Remove the note about eager being not yet implemented.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
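As a rough illustration of the 8-byte header this commit documents, with field widths guessed (16 bits each, chosen to total 8 bytes); the real nccl_ofi_eager_msg_header_t layout may differ.

```cpp
#include <cstdint>

// Hypothetical layout of the eager header prepended to the bounce
// buffer; field widths are guesses, not the actual plugin definition.
struct eager_msg_header {
    uint16_t eager_offset;       // position of this send within its batch
    uint16_t tag;                // matched against sub-receive tags
    uint16_t prev_batch_count;   // batch boundary tracking
    uint16_t prev_msg_seq_num;   // 0xFFFF sentinel marks the first-ever batch
};

static_assert(sizeof(eager_msg_header) == 8,
              "eager header must stay exactly 8 bytes");
```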
Change NCCL_OFI_MAX_EAGER_PENDING from 8 to NCCL_NET_MAX_REQUESTS (32) to allow more outstanding eager sends before requiring ctrl msg drain.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update T13 (queue full) from 8 to 32 messages to match the new NCCL_OFI_MAX_EAGER_PENDING value.

Add T23: 12 eager messages spanning 3 grouped recvs (n=4 each). Tests eager batch resolution across multiple multi-recv groups.

Add T24: 6 eager messages spanning alternating single + grouped(n=2) recv patterns. Tests eager batch crossing different recv types.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Eager sends for grouped receives use fi_sendmsg with 2 iovecs (host header + GPU payload). This requires the provider to support mixed host/device memory in a single send operation.

Add configure check for rdma/fi_ext_efa.h and FI_EFA_FEATURE_OPS. At runtime, query the EFA provider via fi_open_ops to check for mixed_hmem_iov support before advertising max_group_receives > 1.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When the provider does not support mixed host/device iovecs (use_eager_header=false), eager sends cannot prepend the 8-byte header because the 2-iovec fi_sendmsg with host header + GPU payload would fail.

Add use_eager_header flag to the endpoint, set from the device supports_eager_header property (determined by the EFA feature query for mixed_hmem_iov). Without the header:
- post_rdma_eager_send uses fi_senddata with a single iovec instead of fi_sendmsg with two iovecs.
- handle_eager_recv bypasses the sorted eager queue and looks up the recv directly in the msgbuff by msg_seq_num. If the recv is already posted, it initiates the eager copy immediately. If not, it stashes the bounce buffer as a BUFF entry for later.
- recv() checks the msgbuff for a stashed BUFF entry before inserting the request. If found, it replaces the BUFF with the REQ and initiates the eager copy.
- post_eager_copy conditionally skips the header offset when reading from the bounce buffer.
- eager_rx_buff_size does not include the header overhead.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
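The two posting layouts this commit describes can be sketched with plain POSIX iovecs; the real code hands these to libfabric's fi_sendmsg / fi_senddata, which are omitted here, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/uio.h>

struct eager_hdr { unsigned char bytes[8]; };  // 8-byte eager header

// Fill iovs for the eager send; returns the iovec count to post.
// With header support: two iovecs, host header + (possibly GPU) payload,
// posted via fi_sendmsg. Without: a single payload iovec for fi_senddata.
static size_t build_eager_iovs(bool use_eager_header,
                               eager_hdr *hdr, void *payload, size_t len,
                               struct iovec iovs[2])
{
    if (use_eager_header) {
        iovs[0] = { hdr, sizeof(*hdr) };   // host memory
        iovs[1] = { payload, len };        // may be device memory
        return 2;                          // -> fi_sendmsg path
    }
    iovs[0] = { payload, len };            // -> fi_senddata path
    return 1;
}
```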
Call getProperties on all devices in TestSuite::run_all() before any tests execute. This ensures device-level flags like supports_eager_header are initialized before communicators are created.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When eager sends are interleaved across grouped receives, the receiver drain cannot skip past an unresolved entry to reach entries for an already-posted recv. If the receiver blocks on completing one recv before posting the next, this can deadlock.

Adapt T11, T17, T18, T19, T21, T22, T23, T24 to post all grouped recvs before polling any completions, matching the behavior of NCCL's proxy thread which posts recvs independently.

Fix validation patterns in T17, T19, T21, T22 to match sender data patterns.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Add limitation about interleaved eager sends across grouped receives that can deadlock if the receiver serializes recv posting. Note that this does not occur in practice since NCCL's proxy thread posts recvs independently.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Remove unused 'matched' counter and stale comment.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Signed-off-by: Amit Radzi <amitrad@amazon.com>
Enable eager sends for grouped receives
Summary
This PR extends the eager send mechanism to work with grouped receives (multi-recv). Previously, eager sends were disabled when `maxRecvs > 1` because the sender doesn't know the receiver's grouping at eager-send time. This PR adds the necessary protocol extensions to support eager sends across both single and grouped receives.

Key Changes
Eager protocol for multi-recv:
- Eager header carries `eager_offset`, `tag`, and batch boundary info (`prev_batch_count`, `prev_msg_seq_num`)
- Eager sends use `fi_sendmsg` with 2 iovecs (host header + payload) instead of `fi_senddata`

Provider feature query:
- Query the EFA provider for `mixed_hmem_iov` support via `FI_EFA_FEATURE_OPS`
- Add m4 configure check for `rdma/fi_ext_efa.h` and `FI_EFA_FEATURE_OPS`

Eager queue sizing:
- Increase `NCCL_OFI_MAX_EAGER_PENDING` from 8 to `NCCL_NET_MAX_REQUESTS` (32)

Tests:
- Call `getProperties` for all devices before creating communicators

Known Limitation
When sends are interleaved across grouped receives, a deadlock can occur if the receiver blocks posting a new recv on the completion of a previous recv whose eager messages are stuck behind the unposted recv's messages in the processing queue. This does not occur in practice because NCCL's proxy thread posts recvs independently without waiting for prior completions.
Testing
NCCL_NET_SHARED_COMMS