Skip to content

fix(api): eliminate token-drop race in _send_text_generation_with_images#2044

Open
Drifter4242 wants to merge 2 commits into
exo-explore:mainfrom
Drifter4242:pr/fix-token-stream-race
Open

fix(api): eliminate token-drop race in _send_text_generation_with_images#2044
Drifter4242 wants to merge 2 commits into
exo-explore:mainfrom
Drifter4242:pr/fix-token-stream-race

Conversation

@Drifter4242
Copy link
Copy Markdown
Contributor

Problem

I've been experimenting and testing Kimi K2.6 on 2 M3 Mac Ultras using exo and ran into an issue. Sonnet identified it as a race condition between queue creation and the send function that uses the queue. The send can happen before the queue creation. I agree with Sonnet's assessment and worked with it to refine the code.

The rest is written by Sonnet (reviewed by myself):

The token queue was not registered until _token_chunk_stream's inner generator was first iterated by the HTTP consumer. On a fast worker or loaded event loop, ChunkGenerated events could be dispatched before the queue existed and were silently dropped.

Fix

Register the token queue in _send_text_generation_with_images -- synchronously, before the await self._send(command) call that dispatches the command to workers. This guarantees the queue is live before any worker can produce tokens.

_send_text_generation_with_images:
self._text_generation_queues[command.command_id], recv = channel...
token_stream = self._token_chunk_stream(command.command_id, recv)
await self._send(command) # queue is already registered

_token_chunk_stream is now a plain async def generator that receives the pre-registered recv channel as a parameter. The outer sync wrapper and nested _stream() function are gone -- the diff is straightforward.

All six API endpoints are updated: chat_completions, bench_chat_completions, claude_messages, openai_responses, ollama_chat, ollama_generate.

Motivation

_token_chunk_stream previously created and registered the token queue inside
the generator body, which only runs when the HTTP response handler begins
iterating it. On fast hardware or a loaded event loop, the worker can dispatch
ChunkGenerated events — including the final stop signal — before the HTTP
handler has scheduled its first iteration. Those events are dropped silently,
causing the streaming response to hang indefinitely.

Changes

  • In _send_text_generation_with_images: create the channel and register
    self._text_generation_queues[command.command_id] before
    await self._send(command), so the queue is live before any worker can
    produce tokens.
  • _token_chunk_stream is now a plain async def generator that receives the
    pre-registered recv channel as a parameter. The outer sync wrapper and
    nested _stream() function are removed.
  • All six streaming endpoints updated: chat_completions,
    bench_chat_completions, claude_messages, openai_responses,
    ollama_chat, ollama_generate.

Why It Works

By registering the queue synchronously before await self._send(command),
there is no window between dispatch and registration. Any ChunkGenerated
event fired by a worker lands in an already-live queue, regardless of when the
HTTP consumer starts iterating.

Test Plan

Manual Testing

Hardware: 2× Mac Studio M3 Ultra 512 GB, connected via Thunderbolt 5
(direct bridge), running MlxJaccl RDMA tensor-parallel inference
(moonshotai/Kimi-K2.6, 595 GB INT4, 61 layers split across both nodes).

  • Sent streaming chat/completions request via curl; confirmed full SSE token
    stream delivered without hang or truncation.
  • Previously, this hardware configuration (extremely low latency between nodes)
    is the most likely to trigger the race, as workers dispatch tokens faster
    than a typical event loop can schedule the HTTP consumer's first iteration.

Automated Testing

All existing tests pass: pytest src -m "not slow" --import-mode=importlib
— 422/422 passed. No new tests added; the fix is a structural refactor of the
registration ordering, not a behaviour change under normal conditions.

## Problem

The token queue was not registered until _token_chunk_stream's inner
generator was first iterated by the HTTP consumer. On a fast worker or
loaded event loop, ChunkGenerated events could be dispatched before the
queue existed and were silently dropped.

## Fix

Register the token queue in _send_text_generation_with_images --
synchronously, before the await self._send(command) call that dispatches
the command to workers. This guarantees the queue is live before any
worker can produce tokens.

  _send_text_generation_with_images:
    self._text_generation_queues[command.command_id], recv = channel[...]()
    token_stream = self._token_chunk_stream(command.command_id, recv)
    await self._send(command)   # queue is already registered

_token_chunk_stream is now a plain async def generator that receives the
pre-registered recv channel as a parameter. The outer sync wrapper and
nested _stream() function are gone -- the diff is straightforward.

All six API endpoints are updated: chat_completions,
bench_chat_completions, claude_messages, openai_responses,
ollama_chat, ollama_generate.
@Drifter4242 Drifter4242 force-pushed the pr/fix-token-stream-race branch from 4f4ea70 to 1230141 Compare May 4, 2026 15:02
@Evanev7
Copy link
Copy Markdown
Member

Evanev7 commented May 5, 2026

i think this issue is legitimate but im currently rewriting a significant portion of our networking to avoid this exact issue at a deeper level so im going to hold off on it for now - thanks for the submission though, i've never seen this particular combination of latency before

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants