MCP-374 Fix concurrent sendMessage race in ManagedStdioClientTransport#290
When two MCP tool calls arrive in parallel (e.g. get_current_architecture with depth=0 and depth=1), the SQ MCP server dispatches them on separate reactor threads that both call sendMessage() on the same transport. The unicast Reactor sink's SinkManySerialized wrapper uses a CAS-based guard that returns FAIL_NON_SERIALIZED when two threads call tryEmitNext concurrently, causing "Failed to enqueue message for 'sonar-cag'". This test reproduces the race: 19/20 repetitions fail before the fix.
Summary

Fixes a production race condition in `ManagedStdioClientTransport`.

What reviewers should know

Where to start: read the concurrency test first (`ManagedStdioClientTransportConcurrencyTest`).
Replace `tryEmitNext` (fail-fast) with `emitNext` + `busyLooping(100ms)` in `ManagedStdioClientTransport.sendMessage()`. The unicast sink's `SinkManySerialized` wrapper returns `FAIL_NON_SERIALIZED` when two threads call `tryEmitNext` concurrently. `busyLooping` retries the CAS instead of immediately failing, making concurrent sends safe. The contention window is microseconds (a single CAS operation), so the 100 ms duration is just a generous upper bound for pathological cases like GC pauses.

Before: 18/20 test repetitions fail
After: 20/20 pass
SonarQube reviewer guide
Conclusion: Correct, minimal fix. The scope is right — only `outboundSink` needs `busyLooping` because it's the only sink with multiple producers; `inboundSink` and `errorSink` are exclusively written from their own single-threaded schedulers. The test is solid and the 100 ms spin ceiling is a safe upper bound given the microsecond contention window.
nquinquenel left a comment
FYI, this class is a pure copy/paste of the SDK class, which we extended with a very small layer to improve the shutdown mechanism.
I suggest also reaching out to the SDK project about this issue. I saw they have a similar comment about it: https://github.com/modelcontextprotocol/java-sdk/blob/main/mcp-core/src/main/java/io/modelcontextprotocol/client/transport/StdioClientTransport.java#L229
// TODO: essentially we could reschedule ourselves in some time and make
// another attempt with the already read data but pause reading until
// success
// In this approach we delegate the retry and the backpressure onto the
// caller. This might be enough for most cases.
Thanks @nicolas-gauthier-sonarsource for the pointer, and for the quick review. I will also try to open a PR on their side.
@nicolas-gauthier-sonarsource, FYI: PR created on their side: modelcontextprotocol/java-sdk#876




Problem
When an MCP client (e.g. Claude, Cursor) sends multiple tool calls in the same turn — which is standard MCP behavior — the SQ MCP server dispatches them on separate reactor `boundedElastic` threads. If two of those calls target the same proxied server (e.g. `sonar-cag`), both threads call `sendMessage()` on the same `ManagedStdioClientTransport` instance concurrently. The first call succeeds; the second fails with "Failed to enqueue message for 'sonar-cag'".

When does this happen?
This is triggered by normal MCP client behavior — not anything unusual. For example, an LLM calling `get_current_architecture` twice in parallel with different `depth` values (depth=0 and depth=1) in the same response. The two calls arrive ~9 ms apart, both get dispatched to the proxied `sonar-cag` server, and one of them silently fails. This was observed in production while testing the CAG feature. The root cause is purely in the Java transport layer.
Root cause
`ManagedStdioClientTransport` (introduced in a1b2062, MCP-326) uses a Reactor unicast sink for outbound messages, and `sendMessage()` calls `tryEmitNext()` on that sink. Reactor wraps unicast sinks in `SinkManySerialized`, which uses a CAS-based guard (`tryAcquire`) to protect the underlying SPSC (Single-Producer, Single-Consumer) queue. When two threads call `tryEmitNext` simultaneously, the CAS loser immediately gets `FAIL_NON_SERIALIZED` — the method does not retry, it just fails.

Note: the upstream SDK's `StdioClientTransport` has the same pattern and the same latent bug; not fixing it when writing the custom transport was a missed opportunity.

Timeline from production logs

- `tools/call` id=647 (depth=0) arrives
- `tools/call` id=648 (depth=1) arrives ~9 ms later
- Thread 1 enters `sendMessage()` for id=647 and wins the CAS
- Thread 2 enters `sendMessage()` for id=648, gets `FAIL_NON_SERIALIZED` → "Failed to enqueue message"
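The fail-fast behavior can be illustrated without Reactor. Below is a hypothetical plain-Java analogue of the serialized sink's guard (class and field names are illustrative, not the actual SDK code): a single CAS flag admits one producer; a concurrent caller loses the CAS and fails immediately, just like `tryEmitNext` returning `FAIL_NON_SERIALIZED`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CasGuardDemo {
    // Stand-in for SinkManySerialized's guard: a single "work in progress" flag.
    static final AtomicBoolean WIP = new AtomicBoolean(false);

    // Fail-fast emit, analogous to tryEmitNext: the CAS loser is not retried,
    // it immediately reports failure (Reactor's FAIL_NON_SERIALIZED).
    static boolean tryEmit() {
        if (!WIP.compareAndSet(false, true)) {
            return false;
        }
        // ...the real code enqueues into the SPSC queue here...
        WIP.set(false);
        return true;
    }

    public static void main(String[] args) {
        WIP.set(true);                   // simulate another thread mid-emit
        boolean contended = tryEmit();   // loses the CAS -> immediate failure
        WIP.set(false);                  // the competing emit completes
        boolean uncontended = tryEmit(); // no contention -> success
        System.out.println("contended=" + contended + ", uncontended=" + uncontended);
        // prints: contended=false, uncontended=true
    }
}
```

This is exactly the window hit in production: thread 2's emit lands while thread 1 still holds the flag.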
Fix

Replace `tryEmitNext` (fail-fast) with `emitNext` + `busyLooping(100ms)`. The `busyLooping` handler spin-retries on `FAIL_NON_SERIALIZED` until the competing thread finishes its emit, instead of failing immediately.
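A minimal sketch of what the `busyLooping` semantics amount to, again using a plain-Java stand-in for the sink (names are illustrative; Reactor's real handler is `Sinks.EmitFailureHandler.busyLooping(Duration)`): on contention, spin and retry until a deadline instead of failing.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

public class BusyLoopingDemo {
    static final AtomicBoolean WIP = new AtomicBoolean(false);

    // Same fail-fast emit as the CAS guard above: the loser fails immediately.
    static boolean tryEmit() {
        if (!WIP.compareAndSet(false, true)) return false;
        WIP.set(false);
        return true;
    }

    // Sketch of busyLooping: retry the emit in a spin loop until it succeeds
    // or the deadline passes. The contention window is a single CAS, so the
    // deadline only matters in pathological cases (e.g. GC pauses).
    static boolean emitWithBusyLooping(Duration maxSpin) {
        long deadline = System.nanoTime() + maxSpin.toNanos();
        while (!tryEmit()) {
            if (System.nanoTime() >= deadline) return false;
            Thread.onSpinWait(); // hint to the CPU that we are spinning
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        WIP.set(true); // another thread is "mid-emit" for ~5 ms
        Thread holder = new Thread(() -> {
            try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            WIP.set(false);
        });
        holder.start();
        // A bare tryEmit() would fail here; the busy-looping variant waits it out.
        boolean ok = emitWithBusyLooping(Duration.ofMillis(100));
        holder.join();
        System.out.println("emitted=" + ok);
    }
}
```

With `emitNext`, the second concurrent `sendMessage()` simply waits out the microsecond-scale critical section rather than surfacing an error to the caller.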
Test results

The included test (`ManagedStdioClientTransportConcurrencyTest`) reproduces the exact scenario: two threads sending messages concurrently through the same transport. Before the fix it fails with `FAIL_NON_SERIALIZED`; after the fix it passes. Additionally verified with a standalone Reactor test: a 28/100 failure rate with `tryEmitNext`, 0/100 with `emitNext` + `busyLooping`.
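The shape of such a concurrency test can be sketched with the same plain-Java stand-in (hypothetical names; the real test exercises the actual transport): start two senders behind a latch so they emit at the same instant, then assert that neither one fails.

```java
import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentEmitTest {
    // Stand-in for the serialized sink's CAS guard.
    static final AtomicBoolean WIP = new AtomicBoolean(false);

    static boolean tryEmit() {
        if (!WIP.compareAndSet(false, true)) return false; // FAIL_NON_SERIALIZED analogue
        WIP.set(false); // the real code enqueues here
        return true;
    }

    static boolean emitWithBusyLooping(Duration maxSpin) {
        long deadline = System.nanoTime() + maxSpin.toNanos();
        while (!tryEmit()) {
            if (System.nanoTime() >= deadline) return false;
            Thread.onSpinWait();
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        Thread[] senders = new Thread[2]; // mirrors the two parallel tool calls
        for (int i = 0; i < senders.length; i++) {
            senders[i] = new Thread(() -> {
                try {
                    start.await(); // release both senders at once to force contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                if (!emitWithBusyLooping(Duration.ofMillis(100))) {
                    failures.incrementAndGet();
                }
            });
            senders[i].start();
        }
        start.countDown();
        for (Thread t : senders) t.join();
        System.out.println("failures=" + failures.get());
    }
}
```

Swapping `emitWithBusyLooping` for a bare `tryEmit` in this harness is what produces the intermittent failures; with the retry loop, repeated runs stay at zero failures.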