
RDMA: Support eager sends for grouped receives #1194

Open
amitrad-aws wants to merge 14 commits into aws:master from amitrad-aws:maxrecvs-8

Conversation

@amitrad-aws
Contributor

@amitrad-aws amitrad-aws commented Apr 15, 2026

Enable eager sends for grouped receives

Summary

This PR extends the eager send mechanism to work with grouped receives (multi-recv). Previously, eager sends were disabled when maxRecvs > 1 because the sender doesn't know the receiver's grouping at eager-send time. This PR adds the necessary protocol extensions to support eager sends across both single and grouped receives.

Key Changes

Eager protocol for multi-recv:

  • Eager sends prepend an 8-byte header to the bounce buffer carrying eager_offset, tag, and batch boundary info (prev_batch_count, prev_msg_seq_num)
  • Sender uses fi_sendmsg with 2 iovecs (host header + payload) instead of fi_senddata
  • Sender-side circular eager queue with drain logic that matches eager sends against ctrl msg entries by tag
  • Receiver-side sorted doubly-linked list for ordered eager message processing with batch boundary tracking
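
The 8-byte header above can be sketched as a packed struct. This is an illustrative guess only: the field names come from the PR description, but the individual widths (and the 0xFFFF first-batch sentinel mentioned in a later commit) are assumptions, not the plugin's actual definition.

```cpp
#include <cassert>
#include <cstdint>

/* Hypothetical 8-byte eager header; field widths are illustrative only. */
struct eager_msg_header {
    uint16_t eager_offset;      /* position of this eager payload within its batch */
    uint16_t tag;               /* tag used to match against ctrl msg entries */
    uint16_t prev_batch_count;  /* number of recvs in the previous batch */
    uint16_t prev_msg_seq_num;  /* seq num of previous batch; 0xFFFF = first-ever batch */
};
static_assert(sizeof(eager_msg_header) == 8, "eager header must stay 8 bytes");
```

On the wire, a header like this would be the first (host) iovec of the 2-iovec fi_sendmsg, followed by the payload iovec.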

Provider feature query:

  • Query EFA provider for mixed_hmem_iov support via FI_EFA_FEATURE_OPS
  • When the provider doesn't support mixed host/device iovecs, fall back to the original single-iovec eager path (no header)
  • Add m4 configure check for rdma/fi_ext_efa.h and FI_EFA_FEATURE_OPS

Eager queue sizing:

  • Increase NCCL_OFI_MAX_EAGER_PENDING from 8 to NCCL_NET_MAX_REQUESTS (32)
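
The sender-side circular eager queue (sized by NCCL_OFI_MAX_EAGER_PENDING above) drains by matching ctrl msg entries by tag for grouped receives, or popping the front entry for single receives. A minimal sketch under stated assumptions: the type and member names here are hypothetical, and the real drain also advances next_msg_seq_num and completes requests.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

constexpr size_t MAX_EAGER_PENDING = 32;  /* value from the PR; name is illustrative */

struct pending_eager {
    int32_t tag;     /* tag carried by the eager send */
    bool in_use;     /* false once matched against a ctrl msg entry */
};

class eager_queue {
    std::array<pending_eager, MAX_EAGER_PENDING> slots{};
    size_t head = 0, tail = 0, count = 0;
public:
    bool push(int32_t tag) {
        if (count == MAX_EAGER_PENDING)
            return false;  /* queue full: sender must wait for ctrl msg drain */
        slots[tail] = {tag, true};
        tail = (tail + 1) % MAX_EAGER_PENDING;
        ++count;
        return true;
    }
    /* Grouped receive: match a ctrl msg entry against a pending eager send by tag. */
    std::optional<size_t> match_by_tag(int32_t tag) {
        for (size_t i = 0, idx = head; i < count; ++i, idx = (idx + 1) % MAX_EAGER_PENDING) {
            if (slots[idx].in_use && slots[idx].tag == tag) {
                slots[idx].in_use = false;  /* consume to prevent double-matching */
                return idx;
            }
        }
        return std::nullopt;
    }
    /* Single receive: pop the front entry, skipping already-consumed slots. */
    std::optional<size_t> pop_front() {
        while (count > 0) {
            size_t idx = head;
            head = (head + 1) % MAX_EAGER_PENDING;
            --count;
            if (slots[idx].in_use)
                return idx;
        }
        return std::nullopt;
    }
};
```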

Tests:

  • Unit tests for eager sorted insert and recv_idx
  • 19 functional tests (T5-T24) covering single eager, grouped eager, mixed eager+write, interleaved sends, queue full, size boundary, tag pushback, and multi-group spanning
  • Test infrastructure calls getProperties for all devices before creating communicators

Known Limitation

When sends are interleaved across grouped receives, a deadlock can occur if the receiver blocks posting a new recv on the completion of a previous recv whose eager messages are stuck behind the unposted recv's messages in the processing queue. This does not occur in practice because NCCL's proxy thread posts recvs independently without waiting for prior completions.

Testing

  • Unit tests: all pass
  • Functional tests: 19/19 pass (eager_multirecv)
  • NCCL tests on 2-node p5en (16 ranks), with libfabric both with and without mixed HMEM iov support:
    • AllGather, AllReduce, AllToAll with eager enabled and disabled
    • AllToAll with eager enabled/disabled and with/without NCCL_NET_SHARED_COMMS

@amitrad-aws amitrad-aws requested review from a team and bwbarrett as code owners April 15, 2026 17:51
Comment thread include/nccl_ofi_rdma.h
Comment thread src/nccl_ofi_rdma.cpp Outdated
/* Update the ctrl mailbox slot with all n sub-entries.
* allocate_recv_req already populated entries[0] and the header.
* Now update num_recvs and populate entries[1..n-1]. */
{
Contributor

Is there a specific reason to put lines 3157-3181 in a nested code block? If not, I'd prefer minimizing the number of indentation levels to make code easier to fit on one line without wrapping.

Contributor Author

No, will remove

Comment thread include/nccl_ofi_rdma.h
private:
int n_recvs;
size_t buffer_size;
static constexpr int TAG = 1;
Contributor

For my understanding, is there a specific reason to use the same TAG value for all sends/recvs instead of having unique values per send/subgroup recv pair? Do we expect NCCL to usually use the same or distinct tag values?

Contributor Author

NCCL uses the rank ID as the tag. In alltoall with PXN, the data does not necessarily come from the origin rank, so in that case the tags will all be the same in each step. That is what I tried to reproduce here.
There are ordering rules on tags; I chose a simple one to help cover all the cases.

Contributor

That surprises me; when I had implemented multi-receive before, the tags were all different within a multi-receive. (I think at the time, the tag was equal to the source rank.) But things may have changed in recent NCCL versions.

Comment thread tests/unit/dlist Outdated
Contributor

I'm assuming this is a development artifact that isn't a part of the PR.

Comment thread src/nccl_ofi_rdma.cpp Outdated
@amitrad-aws amitrad-aws requested a review from a team April 28, 2026 19:58
Contributor

@rauteric rauteric left a comment

Made it partially through the eager commit before I ran out of steam. Including a few minor comments; no major concerns so far.

Will pick this up tomorrow morning.

Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment on lines +2939 to +2940
/* Write msg_seq_num last as the ready bit */
this->ctrl_mailbox[slot].entries[0].msg_seq_num = req->msg_seq_num & MSG_SEQ_NUM_MASK;
Contributor

Nit: I'm not sure why you need to write this last -- nobody should be polling on this location at this point.

Contributor Author

Done

Comment thread src/nccl_ofi_rdma.cpp Outdated
}

/* Update the ctrl mailbox slot with all n sub-entries.
* allocate_recv_req already populated entries[0] and the header.
Contributor

Nit: I think it would be cleaner to populate all entries at once.

Contributor Author

Done

Comment thread src/nccl_ofi_rdma.cpp Outdated
uint16_t slot = seq_num % NCCL_OFI_CTRL_MAILBOX_SIZE;
return (s_comm->ctrl_mailbox[slot].entries[0].buff_len);
nccl_net_ofi_ctrl_msg_t *ctrl = &s_comm->ctrl_mailbox[slot];
if (ctrl->entries[0].num_recvs <= 1)
Contributor

Nit: We generally put if statement bodies in braces in the plugin.

Contributor Author

Done

Comment thread src/nccl_ofi_rdma.cpp Outdated
/* Determine if this should be sent eagerly.
* Disable eager entirely when maxRecvs > 1: the sender must wait for
* the ctrl msg so it can detect grouped receives and reuse msg_seq_num. */
if (NCCL_OFI_MAX_RECVS == 1 && !have_ctrl &&
Contributor

Do you really want to use NCCL_OFI_MAX_RECVS here, or nr?

Contributor Author

This is temporary; it is removed in the commit that enables eager.

Comment thread src/nccl_ofi_rdma.cpp
/* Memory synchronization point to ensure that the data in the control msg is not
* read before the sequence number is checked */
} else if (!in_group || s_comm->group_sends_remaining == s_comm->group_num_recvs) {
/* Memory synchronization point - only on first send of a group or non-grouped */
Contributor

Why don't you need the synchronization point after every send in the group?

Contributor Author

Because has_ctrl_msg verifies that all entries have arrived before returning true.

Comment thread src/nccl_ofi_rdma.cpp
* For a single recv (num_expected_senders=1), this registers once on the first completion.
* For grouped receives, each fi_writedata completion represents one sender.
* We use the original heuristic for non-grouped, and simple counting for grouped. */
if (req->ncompls > recv_segms_data->total_expected_segms) {
Contributor

I didn't understand how this if condition does what the comment says it's supposed to do.

Contributor Author

Updated the comment
/* Detect when a completion belongs to a new sender we haven't
* registered yet. Each sender's first segment pushes ncompls past
* the sum of all previously registered senders' segment counts.
* When that happens, we register this sender's total_nsegms.
* For single recv (1 sender), this fires once on the first completion.
* For grouped recv (N senders), this fires N times as each sender's
* first completion crosses the running total. */
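
The updated comment above can be illustrated with a small simulation of when the `ncompls > total_expected_segms` condition fires. The function and variable names here are mine, not the plugin's; the sketch only counts how often the condition triggers, once per newly seen sender.

```cpp
#include <cassert>
#include <vector>

/* Each element of `arrivals` is the total_nsegms value carried by that
 * completion's sender. A registration fires when the running completion
 * count exceeds the sum of already-registered senders' segment counts,
 * i.e. on the first completion that crosses into a new sender's range. */
static int count_sender_registrations(const std::vector<int>& arrivals) {
    int ncompls = 0, total_expected = 0, registrations = 0;
    for (int sender_total : arrivals) {
        ++ncompls;
        if (ncompls > total_expected) {   /* first segment past the running total */
            total_expected += sender_total;
            ++registrations;
        }
    }
    return registrations;
}
```

With one sender of 3 segments the condition fires once (on the first completion); with N senders it fires N times as each sender's first completion crosses the running total.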

Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment on lines +2489 to +2490
/* For grouped receives, distribute total received
* size evenly across sub-receives. */
Contributor

I don't think this adheres to the API, but I also don't think it matters in NCCL today.

Contributor Author

The comment was wrong but the logic is correct; removed the comment.

@rauteric
Contributor

By the way, the workflow failures look real -- looks like an unused variable.

Comment thread src/nccl_ofi_rdma.cpp Outdated
* bytes of data, we just need a valid SGE) instead of the provided base
* pointer and MR
*/
/* Handle 0-byte messages: use flush buffer for valid SGE */
Contributor

but i like my long comment that explained why we do this :)

Contributor Author

Removed by accident; restored.

Comment thread src/nccl_ofi_rdma.cpp Outdated
/* Update the ctrl mailbox slot with all n sub-entries.
* allocate_recv_req already populated entries[0] and the header.
* Now update num_recvs and populate entries[1..n-1]. */
{
Contributor

we're writing C++; we don't need a new scope block here :)

Contributor Author

Removed

Comment thread src/nccl_ofi_rdma.cpp Outdated
if (local_recv_data->num_recvs > 1) {
/* For grouped receives, distribute total received
* size evenly across sub-receives. */
size_t per_sub = req_size / local_recv_data->num_recvs;
Contributor

I'm sure this works with NCCL (because if I remember correctly it only looks at zero / non-zero), but technically the README says that we should return the actual sizes (https://github.com/NVIDIA/nccl/blob/1933fdd6360a8bfccaa0166bd71bce363d32e5b6/plugins/net/README.md?plain=1#L413).

Contributor Author

Already fixed in a later commit

Comment thread src/nccl_ofi_interface_neuron.cpp Outdated
props->latency = ofi_properties.latency;
props->maxComms = ofi_properties.max_communicators;
props->maxRecvs = ofi_properties.max_group_receives;
/* Neuron uses sendrecv protocol which does not support grouped receives */
Contributor

Neuron does not use the sendrecv protocol most of the time. But I'm also fine if we don't support grouped receive for neuron right now. Let's just get the comment right.

Comment thread src/nccl_ofi_rdma.cpp
static inline int handle_eager_recv(nccl_net_ofi_rdma_recv_comm *r_comm,
uint16_t msg_seq_num,
nccl_net_ofi_rdma_req *rx_buff_req)
static int eager_match_recv(nccl_net_ofi_rdma_req *recv_req, int32_t eager_tag)
Contributor

for the love of a member function...

Contributor Author

I agree, but it is called from another static function, so I don't want to refactor all of that for this urgent feature.

Comment thread include/nccl_ofi_rdma.h Outdated
Comment on lines 440 to 442
/*
* @brief Data of request responsible for receive operation
*/
Contributor

nit: duplicated comment from the rdma_req_recv_data_t datatype

Contributor Author

Done

Comment thread src/nccl_ofi_rdma.cpp
}

/*
* @brief Drain the receiver-side ordered eager queue.
Contributor

Nit: this comment for drain_recv_eager_queue is separated from the function declaration/definition

Contributor Author

Done

Comment thread tests/functional/.gitignore Outdated
reuse_listen_comm
ring
gin
grouped_recveager_multirecv
Contributor

Missing new line between grouped_recv and eager_multirecv

Contributor Author

Fixed

Contributor

@rauteric rauteric left a comment

Looked over all the code. The detail of review is limited given the size and time constraint, but the concepts generally make sense.

Comment thread src/nccl_ofi_rdma.cpp
@@ -4494,9 +4737,9 @@ static nccl_net_ofi_rdma_recv_comm *prepare_recv_comm(nccl_net_ofi_rdma_domain_t

/* Allocate request freelist */
/* Maximum freelist entries is 4*NCCL_OFI_MAX_REQUESTS because each receive request
Contributor

Nit: update comment to 11*NCCL_OFI_MAX_REQUESTS

Comment thread doc/multi-recv.md
Comment on lines +233 to +235
- **Eager disabled for GPU buffers with 2-iovec sends**: The `fi_sendmsg` with
two iovecs (host header + GPU payload) requires provider support for
scatter-gather across host and device memory.
Contributor

It's not really clear what limitation is being described in this bullet.

  • Which providers support this?
  • What is the minimum required version of Libfabric to support this with EFA provider?

Comment thread src/nccl_ofi_rdma.cpp
/* Create freelist for eager header buffers */
ret_s_comm->eager_hdr_fl = new nccl_ofi_freelist(
sizeof(nccl_ofi_eager_msg_header_t),
NCCL_OFI_CTRL_MAILBOX_SIZE, 16, 0,
Contributor

Can we use NCCL_OFI_MAX_EAGER_PENDING as the size limit for this freelist?


Comment thread tests/unit/ctrl_msg.cpp
return NULL;
}

static int test_struct_sizes()
Contributor

Nit: all tests in this function could also just be static asserts in the code itself.

Comment thread tests/unit/dlist Outdated
Contributor

Looks like an executable was committed mistakenly.

Comment thread tests/unit/ctrl_msg.cpp
}


/* Replicate the sort helpers from nccl_ofi_rdma.cpp for unit testing */
Contributor

I didn't understand what part of the actual code is being tested here -- it looks like you just duplicated a few functions from the actual code into this test.

If the goal is to test these functions, then duplicating them into the test doesn't seem like the right approach -- instead, they should be included from a header in the actual source code.

amitrad-aws and others added 12 commits May 4, 2026 13:27
Implement sender-side eager queue (max 8 entries) with drain logic
that matches eager sends against ctrl msgs by tag for grouped
receives, or pops the front entry for single receives. Eager sends
use fi_sendmsg with 2 iovecs (8-byte header + payload) carrying
eager_offset and tag. The sender does not advance next_msg_seq_num
for eager sends; drain advances it when ctrl msgs arrive. Drain
also runs from update_send_request (test path) so eager sends can
complete without requiring another send() call.

Implement receiver-side eager queue using a sorted doubly-linked
list with a pre-allocated pool of CTRL_MAILBOX_SIZE entries. Entries
are sorted by (msg_seq_num, eager_offset) using wraparound-aware
comparison. Drain processes entries in order with batch boundary
tracking via prev_batch_count/prev_msg_seq_num in the eager header.
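
The sorted insert described above hinges on a wraparound-aware key comparison. A minimal sketch, assuming the 10-bit sequence space mentioned in the unit-test commit; the helper names and the half-window tie-break rule are my assumptions, not the plugin's actual comparison:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint16_t SEQ_BITS = 10;                    /* from the unit-test description */
constexpr uint16_t SEQ_MASK = (1u << SEQ_BITS) - 1;  /* 0x3FF */

/* True if seq a precedes seq b, treating the 10-bit counter as circular:
 * b is "after" a when the forward distance is non-zero and within half
 * the window (e.g. 1022 precedes 1 across the wraparound). */
static bool seq_before(uint16_t a, uint16_t b) {
    uint16_t fwd = (b - a) & SEQ_MASK;
    return fwd != 0 && fwd < (1u << (SEQ_BITS - 1));
}

/* Full sort key: order by msg_seq_num first, then by eager_offset
 * within the same batch. */
static bool key_before(uint16_t seq_a, uint32_t off_a,
                       uint16_t seq_b, uint32_t off_b) {
    if (seq_a != seq_b)
        return seq_before(seq_a, seq_b);
    return off_a < off_b;
}
```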

Add entry_used field to ctrl_msg_entry_t to replace tag invalidation
for preventing double-matching of ctrl entries.

Add sub_recv_idx to eager_copy_data_t so eager copies target the
correct sub-receive buffer. Record per-sub-receive size in
set_eager_copy_completed.

Add eager_copy_to_sub_recv helper that handles zero-sized messages,
single recv, and grouped recv eager copies. Add eager_match_recv
helper that matches eager entries to sub-receives by tag using a
consumed flag to prevent double-matching.

Add static_asserts to verify immediate data fields sum to 32 bits
and that NCCL_OFI_MAX_RECVS fits in RECV_IDX_BITS.

Re-enable eager for grouped receives. Increase eager_rx_buff_size
by NCCL_OFI_EAGER_HEADER_SIZE bytes for the header.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Add test_eager_sorted_insert to ctrl_msg unit test covering:
- In-order and reverse-order insertion
- Same seq with interleaved offsets
- Sequence number wraparound (10-bit)
- Single element and duplicate keys
- Multiple seq batches interleaved

Signed-off-by: Amit Radzi <amitrad@amazon.com>
18 functional tests covering eager message support with multi-recv:
- T5/T6: Single recv eager (late/early recv posting)
- T7: 4 sequential single-recv eager messages
- T8: Grouped recv (n=2) all eager
- T9: Grouped recv (n=3) mixed eager + RDMA write
- T10: Grouped recv (n=2) all large (no eager, regression)
- T11: Eager across single + grouped recv
- T12: Eager tag pushback across two groups
- T13: 8 eager messages (sender queue full)
- T14: Eager size boundary (8184 vs 8185 bytes)
- T15: 2x grouped(n=4) all eager/write permutations per tag
- T16: 4x grouped(n=2) same tags, ordering per tag
- T17: Interleaved sends 2x grouped(n=8) all write
- T18: Interleaved sends 2x grouped(n=8) all eager
- T19: Interleaved sends 2x grouped(n=4) mixed eager+write
- T21: 3x grouped(n=2) interleaved mixed sizes
- T22: Max interleave 2x grouped(n=8) all write

Interleaved tests use a rotating queue with per-tag ordering
(blocked_tags set) to ensure same-tag sends maintain message
sequence order while allowing different-tag sends to proceed.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When data validation fails, log the byte offset and the received vs
expected values before the generic error message. This helps identify
which send was misrouted in multi-recv tests.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update the design document to cover the eager message extension
for grouped receives. New sections:
- Eager message header format (8-byte prepended to bounce buffer)
- Sender-side eager queue and drain logic
- Receiver-side sorted eager queue with batch boundary tracking
- Ordering requirements and continuity checks
- First-ever batch sentinel (0xFFFF) for out-of-order safety
- Target recv resolution in the receiver drain
- Eager copy mechanism

Update ctrl_msg_entry struct to reflect entry_used field (replacing
tag invalidation). Update limitations to include eager-specific
constraints.

Remove the note about eager being not yet implemented.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Change NCCL_OFI_MAX_EAGER_PENDING from 8 to NCCL_NET_MAX_REQUESTS
(32) to allow more outstanding eager sends before requiring ctrl
msg drain.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Update T13 (queue full) from 8 to 32 messages to match the new
NCCL_OFI_MAX_EAGER_PENDING value.

Add T23: 12 eager messages spanning 3 grouped recvs (n=4 each).
Tests eager batch resolution across multiple multi-recv groups.

Add T24: 6 eager messages spanning alternating single + grouped(n=2)
recv patterns. Tests eager batch crossing different recv types.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Eager sends for grouped receives use fi_sendmsg with 2 iovecs
(host header + GPU payload). This requires the provider to support
mixed host/device memory in a single send operation.

Add configure check for rdma/fi_ext_efa.h and FI_EFA_FEATURE_OPS.
At runtime, query the EFA provider via fi_open_ops to check for
mixed_hmem_iov support before advertising max_group_receives > 1.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When the provider does not support mixed host/device iovecs
(use_eager_header=false), eager sends cannot prepend the 8-byte
header because the 2-iovec fi_sendmsg with host header + GPU
payload would fail.

Add use_eager_header flag to the endpoint, set from the device
supports_eager_header property (determined by the EFA feature
query for mixed_hmem_iov).

Without the header:
- post_rdma_eager_send uses fi_senddata with a single iovec
  instead of fi_sendmsg with two iovecs.
- handle_eager_recv bypasses the sorted eager queue and looks up
  the recv directly in the msgbuff by msg_seq_num. If the recv
  is already posted, it initiates the eager copy immediately.
  If not, it stashes the bounce buffer as a BUFF entry for later.
- recv() checks the msgbuff for a stashed BUFF entry before
  inserting the request. If found, it replaces the BUFF with the
  REQ and initiates the eager copy.
- post_eager_copy conditionally skips the header offset when
  reading from the bounce buffer.
- eager_rx_buff_size does not include the header overhead.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Call getProperties on all devices in TestSuite::run_all() before
any tests execute. This ensures device-level flags like
supports_eager_header are initialized before communicators are
created.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
When eager sends are interleaved across grouped receives, the
receiver drain cannot skip past an unresolved entry to reach
entries for an already-posted recv. If the receiver blocks on
completing one recv before posting the next, this can deadlock.

Adapt T11, T17, T18, T19, T21, T22, T23, T24 to post all grouped
recvs before polling any completions, matching the behavior of
NCCL's proxy thread which posts recvs independently.

Fix validation patterns in T17, T19, T21, T22 to match sender
data patterns.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Add limitation about interleaved eager sends across grouped receives
that can deadlock if the receiver serializes recv posting. Note that
this does not occur in practice since NCCL's proxy thread posts
recvs independently.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
Remove unused 'matched' counter and stale comment.

Signed-off-by: Amit Radzi <amitrad@amazon.com>
@amitrad-aws amitrad-aws changed the title RDMA: Support grouped receives (maxRecvs=8) RDMA: Support eager sends for grouped receives May 4, 2026
Signed-off-by: Amit Radzi <amitrad@amazon.com>