fix(api): eliminate token-drop race in _send_text_generation_with_images by Drifter4242 · Pull Request #2044 · exo-explore/exo

Drifter4242 · 2026-05-03T23:18:36Z

Problem

I've been experimenting and testing Kimi K2.6 on 2 M3 Mac Ultras using exo and ran into an issue. Sonnet identified it as a race condition between queue creation and the send function that uses the queue. The send can happen before the queue creation. I agree with Sonnet's assessment and worked with it to refine the code.

The rest is written by Sonnet (reviewed by myself):

The token queue was not registered until _token_chunk_stream's inner generator was first iterated by the HTTP consumer. On a fast worker or loaded event loop, ChunkGenerated events could be dispatched before the queue existed and were silently dropped.

Fix

Register the token queue in _send_text_generation_with_images -- synchronously, before the await self._send(command) call that dispatches the command to workers. This guarantees the queue is live before any worker can produce tokens.

_send_text_generation_with_images:
self._text_generation_queues[command.command_id], recv = channel...
token_stream = self._token_chunk_stream(command.command_id, recv)
await self._send(command) # queue is already registered

_token_chunk_stream is now a plain async def generator that receives the pre-registered recv channel as a parameter. The outer sync wrapper and nested _stream() function are gone -- the diff is straightforward.

All six API endpoints are updated: chat_completions, bench_chat_completions, claude_messages, openai_responses, ollama_chat, ollama_generate.

Motivation

_token_chunk_stream previously created and registered the token queue inside
the generator body, which only runs when the HTTP response handler begins
iterating it. On fast hardware or a loaded event loop, the worker can dispatch
ChunkGenerated events — including the final stop signal — before the HTTP
handler has scheduled its first iteration. Those events are dropped silently,
causing the streaming response to hang indefinitely.

Changes

In _send_text_generation_with_images: create the channel and register
self._text_generation_queues[command.command_id] before
await self._send(command), so the queue is live before any worker can
produce tokens.
_token_chunk_stream is now a plain async def generator that receives the
pre-registered recv channel as a parameter. The outer sync wrapper and
nested _stream() function are removed.
All six streaming endpoints updated: chat_completions,
bench_chat_completions, claude_messages, openai_responses,
ollama_chat, ollama_generate.

Why It Works

By registering the queue synchronously before await self._send(command),
there is no window between dispatch and registration. Any ChunkGenerated
event fired by a worker lands in an already-live queue, regardless of when the
HTTP consumer starts iterating.

Test Plan

Manual Testing

Hardware: 2× Mac Studio M3 Ultra 512 GB, connected via Thunderbolt 5
(direct bridge), running MlxJaccl RDMA tensor-parallel inference
(moonshotai/Kimi-K2.6, 595 GB INT4, 61 layers split across both nodes).

Sent streaming chat/completions request via curl; confirmed full SSE token
stream delivered without hang or truncation.
Previously, this hardware configuration (extremely low latency between nodes)
is the most likely to trigger the race, as workers dispatch tokens faster
than a typical event loop can schedule the HTTP consumer's first iteration.

Automated Testing

All existing tests pass: pytest src -m "not slow" --import-mode=importlib
— 422/422 passed. No new tests added; the fix is a structural refactor of the
registration ordering, not a behaviour change under normal conditions.

## Problem The token queue was not registered until _token_chunk_stream's inner generator was first iterated by the HTTP consumer. On a fast worker or loaded event loop, ChunkGenerated events could be dispatched before the queue existed and were silently dropped. ## Fix Register the token queue in _send_text_generation_with_images -- synchronously, before the await self._send(command) call that dispatches the command to workers. This guarantees the queue is live before any worker can produce tokens. _send_text_generation_with_images: self._text_generation_queues[command.command_id], recv = channel[...]() token_stream = self._token_chunk_stream(command.command_id, recv) await self._send(command) # queue is already registered _token_chunk_stream is now a plain async def generator that receives the pre-registered recv channel as a parameter. The outer sync wrapper and nested _stream() function are gone -- the diff is straightforward. All six API endpoints are updated: chat_completions, bench_chat_completions, claude_messages, openai_responses, ollama_chat, ollama_generate.

Evanev7 · 2026-05-05T15:03:26Z

i think this issue is legitimate but im currently rewriting a significant portion of our networking to avoid this exact issue at a deeper level so im going to hold off on it for now - thanks for the submission though, i've never seen this particular combination of latency before

Drifter4242 force-pushed the pr/fix-token-stream-race branch from 4f4ea70 to 1230141 Compare May 4, 2026 15:02

Merge branch 'main' into pr/fix-token-stream-race

3f83a35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(api): eliminate token-drop race in _send_text_generation_with_images#2044

fix(api): eliminate token-drop race in _send_text_generation_with_images#2044
Drifter4242 wants to merge 2 commits into
exo-explore:mainfrom
Drifter4242:pr/fix-token-stream-race

Drifter4242 commented May 3, 2026

Uh oh!

Evanev7 commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Drifter4242 commented May 3, 2026

Problem

Fix

Motivation

Changes

Why It Works

Test Plan

Manual Testing

Automated Testing

Uh oh!

Evanev7 commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants