MCP-374 Fix concurrent sendMessage race in ManagedStdioClientTransport #290

Merged
AbbasNS merged 2 commits into master from as/fix-concurrent-sendmessage-race on Mar 20, 2026

Conversation

@AbbasNS
Contributor

@AbbasNS AbbasNS commented Mar 20, 2026

Problem

When an MCP client (e.g. Claude, Cursor) sends multiple tool calls in the same turn — which is standard MCP behavior — the SQ MCP server dispatches them on separate reactor boundedElastic threads. If two of those calls target the same proxied server (e.g. sonar-cag), both threads call sendMessage() on the same ManagedStdioClientTransport instance concurrently. The first call succeeds; the second fails with:

Failed to execute proxied tool 'get_current_architecture' on server 'sonar-cag': Failed to enqueue message for 'sonar-cag'

When does this happen?

This is triggered by normal MCP client behavior — not anything unusual. For example, an LLM calling get_current_architecture twice in parallel with different depth values (depth=0 and depth=1) in the same response. The two calls arrive ~9ms apart, both get dispatched to the proxied sonar-cag server, and one of them silently fails.

This was observed in production while testing the CAG feature. The root cause is purely in the Java transport layer.

Root cause

ManagedStdioClientTransport (introduced in a1b2062, MCP-326) uses a Reactor unicast sink:

this.outboundSink = Sinks.many().unicast().onBackpressureBuffer();

The sendMessage() method calls tryEmitNext() on this sink:

return outboundSink.tryEmitNext(message).isSuccess()
  ? Mono.empty()
  : Mono.error(new RuntimeException("Failed to enqueue message for '" + serverName + "'"));

Reactor wraps unicast sinks in SinkManySerialized, which uses a CAS-based guard (tryAcquire) to protect the underlying SPSC (Single-Producer, Single-Consumer) queue. When two threads call tryEmitNext simultaneously, the CAS loser immediately gets FAIL_NON_SERIALIZED — the method does not retry, it just fails.
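To make the fail-fast behavior concrete, here is a minimal, illustrative model of that CAS guard (simplified; this is not Reactor's actual SinkManySerialized code, and the class and field names are invented for the sketch):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified model of a serialized sink: one CAS flag ("work in progress")
// protects the queue, and the CAS loser fails immediately instead of retrying.
class FailFastSink<T> {
    // Package-visible only so contention can be simulated in a test.
    final AtomicBoolean wip = new AtomicBoolean(false);
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();

    /** Mirrors tryEmitNext: returning false stands in for FAIL_NON_SERIALIZED. */
    boolean tryEmitNext(T value) {
        if (!wip.compareAndSet(false, true)) {
            return false; // another thread is mid-emit: give up at once, no retry
        }
        try {
            queue.offer(value);
            return true;
        } finally {
            wip.set(false);
        }
    }
}
```

With this model, a second emit that arrives while the guard is held fails instantly, which is exactly the shape of the production bug: the message is not queued late, it is dropped.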

Note: the upstream SDK's StdioClientTransport uses the same pattern and carries the same latent bug; it was a missed opportunity not to fix it when the custom transport was written.

Timeline from production logs

| Time | Thread | Event |
| --- | --- | --- |
| 10:33:31.585 | pool-8 | Receives tools/call id=647 (depth=0) |
| 10:33:31.594 | pool-8 | Receives tools/call id=648 (depth=1) |
| 10:33:31.595 | boundedElastic-5 | sendMessage() for id=647 |
| 10:33:31.595 | boundedElastic-6 | sendMessage() for id=648 |
| 10:33:31.605 | boundedElastic-5 | FAIL_NON_SERIALIZED → "Failed to enqueue message" |
| 10:33:31.614 | sonar-cag | Receives only id=648 |
| 10:34:10.923 | boundedElastic-6 | id=648 completes (39s due to branch reload) |

Fix

Replace tryEmitNext (fail-fast) with emitNext + busyLooping(100ms). The busyLooping handler spin-retries on FAIL_NON_SERIALIZED until the competing thread finishes its emit, instead of immediately failing.

outboundSink.emitNext(message, Sinks.EmitFailureHandler.busyLooping(Duration.ofMillis(100)));
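Conceptually, busyLooping turns the fail-fast emit into a bounded spin-retry. The following sketch shows the idea in plain Java (simplified and with invented names; it is not Reactor's internal implementation):

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Illustrative sketch of what a busyLooping-style handler amounts to:
// retry the failed emission in a spin loop until it succeeds or the
// configured ceiling elapses.
final class BusyLoopSketch {
    static boolean emitWithRetry(BooleanSupplier tryEmit, Duration maxWait) {
        long deadline = System.nanoTime() + maxWait.toNanos();
        while (!tryEmit.getAsBoolean()) {   // emit failed: retry instead of erroring
            if (System.nanoTime() > deadline) {
                return false;               // give up only after the full ceiling
            }
            Thread.onSpinWait();            // hint to the CPU that we are busy-waiting
        }
        return true;
    }
}
```

Since the real contention window is a single CAS operation, the loop almost always exits on the first or second iteration; the 100ms ceiling only matters in pathological cases.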

Test results

The included test (ManagedStdioClientTransportConcurrencyTest) reproduces the exact scenario: two threads sending messages concurrently through the same transport.

  • Before fix: 19/20 repetitions fail with FAIL_NON_SERIALIZED
  • After fix: 20/20 pass

Additionally verified with a standalone Reactor test: 28/100 failure rate with tryEmitNext, 0/100 with emitNext + busyLooping.
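For readers without access to the repository, the shape of such a two-thread race test can be sketched as follows (a hedged illustration with invented names; the actual ManagedStdioClientTransportConcurrencyTest is not reproduced here):

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// A barrier releases both senders at the same instant so their emissions
// collide on the shared transport, maximizing the chance of hitting the race.
class ConcurrentSendSketch {
    static int racingSuccesses(BooleanSupplier send) throws InterruptedException {
        CyclicBarrier start = new CyclicBarrier(2);
        AtomicInteger ok = new AtomicInteger();
        Runnable sender = () -> {
            try {
                start.await();                       // line both threads up
                if (send.getAsBoolean()) ok.incrementAndGet();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        };
        Thread t1 = new Thread(sender);
        Thread t2 = new Thread(sender);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return ok.get();                             // 2 means neither send was dropped
    }
}
```

With a fail-fast emit, the loser thread would leave this count at 1 in most repetitions; with a retrying emit, both sends go through every time.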

When two MCP tool calls arrive in parallel (e.g. get_current_architecture
with depth=0 and depth=1), the SQ MCP server dispatches them on separate
reactor threads that both call sendMessage() on the same transport.

The unicast Reactor sink's SinkManySerialized wrapper uses a CAS-based
guard that returns FAIL_NON_SERIALIZED when two threads call tryEmitNext
concurrently, causing "Failed to enqueue message for 'sonar-cag'".

This test reproduces the race: 19/20 repetitions fail before the fix.
@sonar-review-alpha

sonar-review-alpha bot commented Mar 20, 2026

Summary

Fixes a production race condition in ManagedStdioClientTransport.sendMessage() where concurrent calls from multiple threads were silently failing with 'Failed to enqueue message'. The fix replaces fail-fast tryEmitNext() with retry-on-contention emitNext() using Reactor's busyLooping handler, which spins briefly (100ms ceiling) instead of immediately erroring when the CAS guard detects concurrent access on the unicast sink. The included concurrency test validates the fix by reproducing the exact two-thread scenario observed in production.

What reviewers should know

Where to start: Read the concurrency test first (ManagedStdioClientTransportConcurrencyTest); it documents the exact production failure and validates the fix. Then review the 8-line change in sendMessage().

What to understand: The unicast sink's FAIL_NON_SERIALIZED is not an error condition; it's expected when two threads race on the CAS guard. The fix accepts brief spinning (microseconds in practice) instead of rejecting the message. The 100ms duration is an upper bound for the retry loop; the actual contention window is single-digit microseconds.

Why this is safe: The change is isolated to a single method and doesn't affect the transport's public interface or concurrency semantics; it just fixes a latent bug in the SDK's pattern that the SonarQube wrapper inherited.



@hashicorp-vault-sonar-prod hashicorp-vault-sonar-prod bot changed the title Fix concurrent sendMessage race in ManagedStdioClientTransport MCP-374 Fix concurrent sendMessage race in ManagedStdioClientTransport Mar 20, 2026
@hashicorp-vault-sonar-prod

hashicorp-vault-sonar-prod bot commented Mar 20, 2026

MCP-374

@AbbasNS AbbasNS force-pushed the as/fix-concurrent-sendmessage-race branch from 57680e3 to d3743bc on March 20, 2026 at 12:57
Replace tryEmitNext (fail-fast) with emitNext + busyLooping(100ms)
in ManagedStdioClientTransport.sendMessage().

The unicast sink's SinkManySerialized wrapper returns FAIL_NON_SERIALIZED
when two threads call tryEmitNext concurrently. busyLooping retries the
CAS spin instead of immediately failing, making concurrent sends safe.
The contention window is microseconds (single CAS operation), so the
100ms duration is just a generous upper bound for pathological cases
like GC pauses.

Before: 18/20 test repetitions fail
After:  20/20 pass
@AbbasNS AbbasNS force-pushed the as/fix-concurrent-sendmessage-race branch from d3743bc to 3c3784f on March 20, 2026 at 13:01
@sonar-review-alpha sonar-review-alpha bot left a comment


Conclusion: Correct, minimal fix. The scope is right — only outboundSink needs busyLooping because it's the only sink with multiple producers; inboundSink and errorSink are exclusively written from their own single-threaded schedulers. The test is solid and the 100ms spin ceiling is a safe upper bound given the microsecond contention window.


@antonioaversa antonioaversa requested review from nquinquenel and removed request for nquinquenel March 20, 2026 13:48
Member

@nquinquenel nquinquenel left a comment


FYI, this class is a pure copy/paste of the SDK class, which we extended with a very small layer to improve the shutdown mechanism.

I suggest also reaching out to the SDK project about this issue. I saw they have a similar comment about it: https://github.com/modelcontextprotocol/java-sdk/blob/main/mcp-core/src/main/java/io/modelcontextprotocol/client/transport/StdioClientTransport.java#L229

			// TODO: essentially we could reschedule ourselves in some time and make
			// another attempt with the already read data but pause reading until
			// success
			// In this approach we delegate the retry and the backpressure onto the
			// caller. This might be enough for most cases.

@AbbasNS
Contributor Author

AbbasNS commented Mar 20, 2026

Thanks @nicolas-gauthier-sonarsource for the pointer, and for the quick review. I'll try to open a PR on their side as well.

@AbbasNS AbbasNS merged commit 6b84225 into master Mar 20, 2026
11 checks passed
@AbbasNS AbbasNS deleted the as/fix-concurrent-sendmessage-race branch March 20, 2026 14:05
@AbbasNS
Contributor Author

AbbasNS commented Mar 20, 2026

@nicolas-gauthier-sonarsource, FYI: PR created on their side: modelcontextprotocol/java-sdk#876

