[Bugfix] Gate RDMA sends until active side confirms QP readiness #2625
LCAIZJ wants to merge 2 commits
Code Review
This pull request introduces a ready ACK handshake mechanism for RDMA endpoints to ensure that passive endpoints confirm the peer has completed active setup before posting work requests. It adds tracking for ready ACK support in transfer metadata, implements readiness and timeout checks in RdmaEndPoint, and updates the worker pool to handle ready ACK timeouts. A critical race condition was identified in setupConnectionsByActive where ready_to_send_ is unconditionally set to true after sending a ready ACK without verifying if the endpoint remains in the CONNECTED state, which could bypass readiness checks on subsequent connection attempts.
    RWSpinlock::WriteGuard guard(lock_);
    if (ack_ret) {
        resetConnection("failed to send ready ACK");
        return ack_ret;
    }
    ready_wait_start_ts_.store(0, std::memory_order_relaxed);
    ready_to_send_.store(true, std::memory_order_relaxed);
    return 0;
There is a potential race condition here. Since sendReadyAck is called without holding the lock, another thread could concurrently call disconnect() or resetConnection(), which would transition the endpoint's status to UNCONNECTED or DESTROYING and reset ready_to_send_ to false.
When the current thread re-acquires the lock, it unconditionally sets ready_to_send_ to true without checking if the endpoint is still CONNECTED. If the endpoint was transitioned to UNCONNECTED, ready_to_send_ will remain true. On the next connection attempt, as soon as status_ becomes CONNECTED, readyToSend() will immediately return true before the new ready ACK is actually received, bypassing the gate entirely.
To prevent this, we should verify that the endpoint is still in the CONNECTED state after re-acquiring the lock.
    RWSpinlock::WriteGuard guard(lock_);
    if (status_.load(std::memory_order_relaxed) != CONNECTED) {
        LOG(WARNING) << "Endpoint is no longer CONNECTED after sending ready ACK: "
                     << toString();
        return ERR_ENDPOINT;
    }
    if (ack_ret) {
        resetConnection("failed to send ready ACK");
        return ack_ret;
    }
    ready_wait_start_ts_.store(0, std::memory_order_relaxed);
    ready_to_send_.store(true, std::memory_order_relaxed);
    return 0;
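The stale-gate scenario described above can be replayed deterministically with a small model. This is only a sketch: `EndpointModel`, `finishActiveSetup`, and the `recheck_status` switch are hypothetical stand-ins, not the actual `RdmaEndPoint` code, and the concurrent reset is simulated by calling `reset()` before the final step instead of using a second thread.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical, simplified model of the endpoint (not the real RdmaEndPoint).
enum class Status { UNCONNECTED, CONNECTED };

struct EndpointModel {
    std::mutex lock;
    std::atomic<Status> status{Status::UNCONNECTED};
    std::atomic<bool> ready_to_send{false};

    void connect() {
        std::lock_guard<std::mutex> guard(lock);
        status.store(Status::CONNECTED);
    }

    // Models disconnect()/resetConnection(): drops the connection and
    // closes the send gate.
    void reset() {
        std::lock_guard<std::mutex> guard(lock);
        status.store(Status::UNCONNECTED);
        ready_to_send.store(false);
    }

    bool readyToSend() const {
        return status.load() == Status::CONNECTED && ready_to_send.load();
    }

    // Last step of active-side setup, run after the ready ACK was sent
    // with the lock released. Returns 0 on success, or -1 if
    // recheck_status is true and the endpoint is no longer CONNECTED.
    int finishActiveSetup(bool recheck_status) {
        std::lock_guard<std::mutex> guard(lock);
        if (recheck_status && status.load() != Status::CONNECTED) {
            return -1;  // leave the gate closed
        }
        ready_to_send.store(true);
        return 0;
    }
};
```

With `recheck_status` set to false, a reset that lands between sending the ACK and the final step leaves `ready_to_send` true, so the next `connect()` opens the gate immediately. With the check enabled, the gate stays closed until a new ready ACK is processed.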
Description
This PR fixes a race in RDMA connection setup where the passive side may post WRs before the active side has finished moving its local QPs to RTR/RTS.
In the old flow, setupConnectionsByPassive() marks the endpoint as CONNECTED immediately after the passive side finishes its own QP setup. However, the active side still needs to process the handshake response and complete its own QP state transition. If the passive side posts WRs in this window, they may hit a remote QP that is not ready yet.
sequenceDiagram
    participant B as "B active side"
    participant A as "A passive side"
    B->>A: Initial handshake with B QPNs
    A->>A: Passive setup: A QPs -> RTR/RTS
    A-->>B: Response with A QPNs
    Note over A: Old behavior:<br/>status = CONNECTED<br/>WRs may be posted immediately
    A--xB: RDMA WR arrives too early
    Note over B: B may not have completed<br/>local QP RTR/RTS yet
Modify
This PR separates local QP setup from send readiness. CONNECTED now means the local QPs have reached RTS, while ready_to_send_ means the peer has also confirmed its active-side QP setup is complete.
After the active side finishes doSetupConnection(), it sends an explicit RDMA ready ACK through the handshake path. The passive side only enables sending after receiving this ACK with matching peer QPNs.
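As a rough illustration of the passive-side gate, the sketch below keeps the peer QPNs from the initial handshake and opens the gate only when a ready ACK carries the same list. `PassiveEndpointModel` and its method names are hypothetical and chosen for this example; the real logic lives in RdmaEndPoint and may differ in detail.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model of the passive side; names are illustrative only.
struct PassiveEndpointModel {
    bool connected = false;
    bool ready_to_send = false;
    std::vector<uint32_t> peer_qp_nums;  // QPNs from the initial handshake

    // Passive setup finished: local QPs are usable, but the gate is closed.
    void onPassiveSetupDone(std::vector<uint32_t> qpns) {
        peer_qp_nums = std::move(qpns);
        connected = true;
        ready_to_send = false;
    }

    // Opens the gate only if the ACK refers to the QPNs of this connection.
    // Returns true when the ACK was accepted.
    bool onReadyAck(const std::vector<uint32_t>& ack_qpns) {
        if (!connected || ack_qpns != peer_qp_nums) {
            return false;  // stale or mismatched ACK: keep the gate closed
        }
        ready_to_send = true;
        return true;
    }

    bool readyToSend() const { return connected && ready_to_send; }
};
```

Matching on QPNs means an ACK left over from an earlier connection attempt cannot open the gate for a newer one.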
sequenceDiagram
    participant B as "B active side"
    participant A as "A passive side"
    B->>A: Initial handshake with B QPNs
    A->>A: Passive setup succeeds
    A->>A: status = CONNECTED<br/>ready_to_send = false
    A-->>B: Response with A QPNs
    B->>B: Active setup succeeds<br/>B QPs -> RTR/RTS
    B->>A: Ready ACK with same B QPNs
    A->>A: Verify peer QPNs match
    A->>A: ready_to_send = true
    A->>B: Post RDMA WR safely
The worker now checks readyToSend() before posting WRs. If an endpoint is connected but not ready yet, slices remain queued instead of being treated as transfer failures, so waiting for the ACK does not consume retry count.
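The queueing behavior can be pictured with a minimal model of one worker pass. `SliceModel`, `WorkerModel`, and `drain` are hypothetical names for this sketch, and the `ready_to_send` argument stands in for the result of readyToSend(); the real worker pool has more states than shown here.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical slice record; only the fields needed for the sketch.
struct SliceModel {
    int id;
    int retry_count;
};

struct WorkerModel {
    std::deque<SliceModel> queued;  // waiting to be posted
    std::deque<SliceModel> posted;  // handed to the (imaginary) QP

    // One worker pass. Returns the number of slices posted.
    std::size_t drain(bool ready_to_send) {
        if (!ready_to_send) {
            // Gate closed: slices stay queued and retry_count is untouched.
            return 0;
        }
        std::size_t n = 0;
        while (!queued.empty()) {
            posted.push_back(queued.front());
            queued.pop_front();
            ++n;
        }
        return n;
    }
};
```

Deferring is deliberately different from failing: a pass with the gate closed changes nothing, so a slow ACK cannot exhaust a slice's retries.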
For rolling compatibility, peers that do not include the new ready_ack field keep the previous behavior.
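One way to picture the compatibility rule is a presence check on the peer's handshake fields. The sketch assumes the metadata can be viewed as a string map and that presence of a `ready_ack` key alone signals support; the map type, the helper names, and the `qp_num` key are illustrative assumptions, not the project's actual metadata format.

```cpp
#include <map>
#include <string>

// Hypothetical view of the peer's handshake metadata as key/value fields.
using HandshakeFields = std::map<std::string, std::string>;

// Assumption: a peer advertises support simply by including "ready_ack".
bool peerSupportsReadyAck(const HandshakeFields& fields) {
    return fields.find("ready_ack") != fields.end();
}

// Initial gate state on the passive side after its own setup succeeds.
// Legacy peers never send an ACK, so the gate starts open for them.
bool initialReadyToSend(const HandshakeFields& fields) {
    return !peerSupportsReadyAck(fields);
}
```

Starting with the gate open for legacy peers preserves the old behavior during a rolling upgrade, at the cost of keeping the original race window for those peers.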