fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution #2121
Upstream's 2026-05-25 refactor removed the 'if self.device_rank == 0:' guard around event_sender.send(ChunkGenerated(...)). The intent on upstream's side appears to be that runners outside rank 0 either don't reach this method or don't have an active event_sender.

On our 2-rank TP setup that assumption breaks: both rank 0 and rank 1 hit send_chunk on every accepted token, both emit ChunkGenerated events, and the API sees every token twice.

Symptom: 'Repeat exactly: FALCON-MERCURY-7749' returns 'FALCONFALCON-MERCURY-MERCURY-7749-7749'. Quality probe: needle_found=0/3.

Restore the guard. If we ever switch to upstream's runner topology this might need a different mechanism, but for the current 2-Studio TP setup this is the right semantics.
Flagging this as a regression from #2000 (engine abstraction) for visibility, @Evanev7, since you authored that refactor.

Before #2000,

Minimal reproducer on a 2-rank TP setup:

This PR restores the rank-0 guard (1 line + comment). Verified on our 2-node M4 Max RDMA cluster: output clean again, throughput unchanged (30.7 t/s), 100K-context quality probe back to 3/3 needles. Existing runner-supervisor tests still pass.

Happy to adjust if you'd prefer the dedup to live somewhere else (supervisor or API layer), but rank-0-at-emission seemed like the smallest-blast-radius fix that matches the pre-#2000 behavior.
Repo: exo-explore/exo
Branch: adurham:pr/runner-send-chunk-rank-0-guard
Target: exo-explore/exo:main
Commits: 1 (1e1943d6)
Title
fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution
Body
Summary
Add a `device_rank != 0` early-return to `Runner.send_chunk()` so only rank 0 emits `ChunkGenerated` events. Without this guard, multi-rank tensor-parallel deployments emit every accepted token from every rank, and the client sees duplicated text.
Reproducer
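On a 2-node TP setup (e.g. JACCL across two Mac Studios), with current main, send the prompt "Repeat exactly: FALCON-MERCURY-7749".
Returns: "FALCONFALCON-MERCURY-MERCURY-7749-7749"
Each token is emitted twice. With this guard: "FALCON-MERCURY-7749"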
Why
`Runner.main()` is invoked on every TP rank. Both ranks run the same forward pass and reach `send_chunk()` for every accepted token. The `event_sender` channel is shared with the supervisor and ultimately the API server's chunk stream. Both ranks emit, so the API server sees duplicates.
Rank 0 is canonical for this purpose; deduplicating at the emission point is the minimal-blast-radius fix. Alternative places to dedupe (supervisor, API server) would require additional state to identify "which rank's chunk is canonical", which is strictly more code for the same effect.
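The mechanism above can be illustrated with a toy model. This is a minimal sketch, not the exo implementation: `ToyRunner`, `stream`, the list used in place of the `event_sender` channel, and the token boundaries are all assumptions made for illustration. It only shows why emitting from every rank doubles each token and why an early return on non-zero ranks restores clean output.

```python
from dataclasses import dataclass


@dataclass
class ChunkGenerated:
    """Stand-in for the real event; it only carries the token text."""
    text: str


@dataclass
class ToyRunner:
    """Hypothetical, simplified runner: one instance per TP rank."""
    device_rank: int
    event_sink: list  # shared list standing in for the event_sender channel
    rank0_guard: bool = True

    def send_chunk(self, text: str) -> None:
        # With the guard enabled, non-zero ranks return before emitting.
        if self.rank0_guard and self.device_rank != 0:
            return
        self.event_sink.append(ChunkGenerated(text))


def stream(tokens, world_size, rank0_guard):
    """Run every rank over the same accepted tokens; return what the client sees."""
    sink = []
    runners = [ToyRunner(rank, sink, rank0_guard) for rank in range(world_size)]
    for token in tokens:        # every rank accepts the same token...
        for runner in runners:  # ...and every rank calls send_chunk for it
            runner.send_chunk(token)
    return "".join(event.text for event in sink)


tokens = ["FALCON", "-MERCURY", "-7749"]  # assumed token boundaries
print(stream(tokens, world_size=2, rank0_guard=False))  # FALCONFALCON-MERCURY-MERCURY-7749-7749
print(stream(tokens, world_size=2, rank0_guard=True))   # FALCON-MERCURY-7749
```

In this model the guard needs no extra state, whereas a downstream dedup would have to know which rank produced each event, which is the trade-off described above.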
Affected Topologies
History
A similar guard existed prior to PR #2000 (engine abstraction refactor) and PR #1570 (runner split). The refactors removed it. I noticed because the cluster I'm running (DeepSeek-V4-Flash-8bit on 2× M4 Max with `MlxJaccl` TP backend) regressed after I pulled the recent batch of upstream changes; quality-probe needle retrieval went from 3/3 to 0/3 because all output tokens were duplicated.
Test Plan
- `python -m pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py`: 2/2 pass
- Without the guard: "Repeat exactly: FALCON-MERCURY-7749" returns "FALCONFALCON-MERCURY-MERCURY-7749-7749"
- With the guard: returns "FALCON-MERCURY-7749" (clean)
Diff