Skip to content

[Bugfix] Gate RDMA sends until active side confirms QP readiness#2625

Open
LCAIZJ wants to merge 2 commits into
kvcache-ai:mainfrom
LCAIZJ:fix-rdma-passive-ready-ack
Open

[Bugfix] Gate RDMA sends until active side confirms QP readiness#2625
LCAIZJ wants to merge 2 commits into
kvcache-ai:mainfrom
LCAIZJ:fix-rdma-passive-ready-ack

Conversation

@LCAIZJ

@LCAIZJ LCAIZJ commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR fixes a race in RDMA connection setup where the passive side may post WRs before the active side has finished moving its local QPs to RTR/RTS.

In the old flow, setupConnectionsByPassive() marks the endpoint as CONNECTED immediately after the passive side finishes its own QP setup. However, the active side still needs to process the handshake response and complete its own QP state transition. If the passive side posts WRs in this window, they may hit a remote QP that is not ready yet

sequenceDiagram
    participant B as "B active side"
    participant A as "A passive side"

    B->>A: Initial handshake with B QPNs
    A->>A: Passive setup: A QPs -> RTR/RTS
    A-->>B: Response with A QPNs

    Note over A: Old behavior:<br/>status = CONNECTED<br/>WRs may be posted immediately

    A--xB: RDMA WR arrives too early

    Note over B: B may not have completed<br/>local QP RTR/RTS yet
Loading

Modify

This PR separates local QP setup from send readiness. CONNECTED now means the local QPs have reached RTS, while ready_to_send_ means the peer has also confirmed its active-side QP setup is complete.

After the active side finishes doSetupConnection(), it sends an explicit RDMA ready ACK through the handshake path. The passive side only enables sending after receiving this ACK with matching peer QPNs.

sequenceDiagram
    participant B as "B active side"
    participant A as "A passive side"

    B->>A: Initial handshake with B QPNs
    A->>A: Passive setup succeeds
    A->>A: status = CONNECTED<br/>ready_to_send = false
    A-->>B: Response with A QPNs

    B->>B: Active setup succeeds<br/>B QPs -> RTR/RTS
    B->>A: Ready ACK with same B QPNs

    A->>A: Verify peer QPNs match
    A->>A: ready_to_send = true

    A->>B: Post RDMA WR safely
Loading

The worker now checks readyToSend() before posting WRs. If an endpoint is connected but not ready yet, slices remain queued instead of being treated as transfer failures, so waiting for the ACK does not consume retry count.

For rolling compatibility, peers that do not include the new ready_ack field keep the previous behavior.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a ready ACK handshake mechanism for RDMA endpoints to ensure that passive endpoints confirm the peer has completed active setup before posting work requests. It adds tracking for ready ACK support in transfer metadata, implements readiness and timeout checks in RdmaEndPoint, and updates the worker pool to handle ready ACK timeouts. A critical race condition was identified in setupConnectionsByActive where ready_to_send_ is unconditionally set to true after sending a ready ACK without verifying if the endpoint remains in the CONNECTED state, which could bypass readiness checks on subsequent connection attempts.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +559 to +566
RWSpinlock::WriteGuard guard(lock_);
if (ack_ret) {
resetConnection("failed to send ready ACK");
return ack_ret;
}
ready_wait_start_ts_.store(0, std::memory_order_relaxed);
ready_to_send_.store(true, std::memory_order_relaxed);
return 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a potential race condition here. Since sendReadyAck is called without holding the lock, another thread could concurrently call disconnect() or resetConnection(), which would transition the endpoint's status to UNCONNECTED or DESTROYING and reset ready_to_send_ to false.

When the current thread re-acquires the lock, it unconditionally sets ready_to_send_ to true without checking if the endpoint is still CONNECTED. If the endpoint was transitioned to UNCONNECTED, ready_to_send_ will remain true. On the next connection attempt, as soon as status_ becomes CONNECTED, readyToSend() will immediately return true before the new ready ACK is actually received, bypassing the gate entirely.

To prevent this, we should verify that the endpoint is still in the CONNECTED state after re-acquiring the lock.

            RWSpinlock::WriteGuard guard(lock_);
            if (status_.load(std::memory_order_relaxed) != CONNECTED) {
                LOG(WARNING) << "Endpoint is no longer CONNECTED after sending ready ACK: " << toString();
                return ERR_ENDPOINT;
            }
            if (ack_ret) {
                resetConnection("failed to send ready ACK");
                return ack_ret;
            }
            ready_wait_start_ts_.store(0, std::memory_order_relaxed);
            ready_to_send_.store(true, std::memory_order_relaxed);
            return 0;

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.27389% with 155 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ine/src/transport/rdma_transport/rdma_endpoint.cpp 1.42% 138 Missing ⚠️
...ngine/src/transport/rdma_transport/worker_pool.cpp 0.00% 10 Missing ⚠️
mooncake-transfer-engine/src/transfer_metadata.cpp 0.00% 5 Missing ⚠️
...e/include/transport/rdma_transport/rdma_endpoint.h 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants