fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution#2121

Open
adurham wants to merge 1 commit into
exo-explore:mainfrom
adurham:pr/runner-send-chunk-rank-0-guard

@adurham adurham commented May 27, 2026

Repo: exo-explore/exo
Branch: adurham:pr/runner-send-chunk-rank-0-guard
Target: exo-explore/exo:main
Commits: 1 (1e1943d6)


Title

fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution

Body

Summary

Add a device_rank != 0 early-return to Runner.send_chunk() so only rank 0 emits ChunkGenerated events. Without this guard, multi-rank tensor-parallel deployments emit every accepted token from every rank, and the client sees duplicated text.

Reproducer

On a 2-node TP setup (e.g. JACCL across two Mac Studios), with current main:

curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<any-TP-capable-model>",
    "messages": [{"role":"user","content":"Repeat exactly: FALCON-MERCURY-7749"}],
    "max_tokens": 30,
    "temperature": 0
  }'

Returns:

FALCONFALCON-MERCURY-MERCURY-7749-7749

Each token is emitted twice. With this guard:

FALCON-MERCURY-7749

Why

Runner.main() is invoked on every TP rank. Every rank runs the same forward pass and reaches send_chunk() for every accepted token. The event_sender channel is shared with the supervisor and ultimately feeds the API server's chunk stream, so when both ranks emit, the API server receives each chunk twice.

Rank 0 is canonical for this purpose; deduplicating at the emission point is the minimal-blast-radius fix. Alternative places to dedupe (supervisor, API server) would require additional state to identify "which rank's chunk is canonical" — strictly more code for the same effect.
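The mechanism above can be illustrated with a small toy model. This is only a sketch: ToyRunner, stream, and the three-way token split are hypothetical stand-ins invented for illustration, not exo's Runner, its event channel, or the model's real tokenization.

```python
from dataclasses import dataclass


@dataclass
class ToyRunner:
    """Hypothetical stand-in for one TP rank; not exo's Runner."""

    device_rank: int
    channel: list  # shared list standing in for the event channel
    guard: bool = False

    def send_chunk(self, text: str) -> None:
        # With the guard enabled, only rank 0 writes to the shared channel.
        if self.guard and self.device_rank != 0:
            return
        self.channel.append(text)


def stream(tokens, world_size, guard):
    """Run every rank over the same tokens and join what the channel saw."""
    channel = []
    runners = [ToyRunner(rank, channel, guard) for rank in range(world_size)]
    for token in tokens:
        # Every rank runs the same step and reaches send_chunk() per token.
        for runner in runners:
            runner.send_chunk(token)
    return "".join(channel)


# Assumed token split, chosen only so the output mirrors the reproducer.
TOKENS = ["FALCON", "-MERCURY", "-7749"]
unguarded = stream(TOKENS, world_size=2, guard=False)
guarded = stream(TOKENS, world_size=2, guard=True)
print(unguarded)
print(guarded)
```

In this toy model the unguarded run interleaves each token twice, while the guarded run and any single-rank run produce the text once, which is why the guard is a no-op for the single-rank topology.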

Affected Topologies

  • Single-node, single-rank: unchanged (rank is always 0)
  • Multi-node, pipeline-parallel: unchanged (only one rank generates per shard)
  • Multi-node, tensor-parallel (e.g. JACCL on 2× Mac Studio + Thunderbolt RDMA): was producing duplicates, now fixed

History

A similar guard existed prior to PR #2000 (engine abstraction refactor) and PR #1570 (runner split). The refactors removed it. I noticed because the cluster I'm running (DeepSeek-V4-Flash-8bit on 2× M4 Max with MlxJaccl TP backend) regressed after I pulled the recent batch of upstream changes; quality-probe needle retrieval went from 3/3 to 0/3 because all output tokens were duplicated.

Test Plan

  • python -m pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py — 2/2 pass
  • Manual on 2-node M4 Max TP cluster:
    • Before: "Repeat exactly: FALCON-MERCURY-7749" returns "FALCONFALCON-MERCURY-MERCURY-7749-7749"
    • After: "FALCON-MERCURY-7749" (clean)
  • Quality probe at 100K context: 3/3 needles found, 0 special-token leaks, 0 bistability events
  • Throughput unchanged: 30.7 t/s vs 30.7 t/s pre/post
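A dedicated regression test for the guard could look roughly like the sketch below. It is not part of this PR: GuardedSender is a hypothetical stand-in that copies only the guarded branch of send_chunk(), since constructing the real Runner needs fixtures not shown here, and the tuple passed to send() replaces the real ChunkGenerated event.

```python
from unittest.mock import MagicMock


class GuardedSender:
    """Hypothetical stand-in mirroring only the guarded part of send_chunk()."""

    def __init__(self, device_rank, event_sender):
        self.device_rank = device_rank
        self.event_sender = event_sender

    def send_chunk(self, chunk, command_id):
        # Same early return as in the diff: non-zero ranks emit nothing.
        if self.device_rank != 0:
            return
        # A plain tuple stands in for ChunkGenerated(command_id=..., chunk=...).
        self.event_sender.send((command_id, chunk))


def sends_for_rank(rank):
    """Return how many events a runner with this rank pushed to the channel."""
    sender = MagicMock()
    GuardedSender(rank, sender).send_chunk("tok", "cmd-1")
    return sender.send.call_count


def test_only_rank_zero_emits():
    assert sends_for_rank(0) == 1
    assert sends_for_rank(1) == 0
```

A version of this against the real Runner would catch a future refactor dropping the guard again, without needing a 2-node cluster to notice.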

Diff

def send_chunk(
    self,
    chunk: Chunk,
    command_id: CommandId,
):
    assert isinstance(self.generator, Engine)
+   # Only rank 0 emits ChunkGenerated. Under tensor-parallel execution
+   # across multiple nodes (e.g. JACCL on 2 Mac Studios), every rank
+   # runs the same forward pass and reaches this method on every
+   # accepted token. Without this guard the API server's event channel
+   # receives the same ChunkGenerated event from each rank, and the
+   # client sees every token duplicated — e.g. asking the model to
+   # repeat "FALCON-MERCURY-7749" produces "FALCONFALCON-MERCURY-MERCURY-7749-7749".
+   # Rank 0 is canonical, so we de-duplicate at the emission point.
+   if self.device_rank != 0:
+       return
    self.event_sender.send(ChunkGenerated(command_id=command_id, chunk=chunk))

Upstream's 2026-05-25 refactor removed the 'if self.device_rank == 0:'
guard around event_sender.send(ChunkGenerated(...)). The intent on
upstream's side appears to be that runners outside rank 0 either don't
reach this method or don't have an active event_sender. On OUR 2-rank
TP setup that assumption breaks: both rank 0 AND rank 1 hit send_chunk
on every accepted token, both emit ChunkGenerated events, and the API
sees every token twice.

Symptom: 'Repeat exactly: FALCON-MERCURY-7749' returns
'FALCONFALCON-MERCURY-MERCURY-7749-7749'. Quality probe needle_found=0/3.

Restore the guard. If we ever switch to upstream's runner topology this
might need a different mechanism, but for the current 2-Studio TP setup
this is the right semantics.

adurham commented May 28, 2026

Flagging this as a regression from #2000 (engine abstraction) for visibility — @Evanev7 since you authored that refactor.

Before #2000, Runner.send_chunk() had an if self.device_rank == 0: guard so only rank 0 emitted ChunkGenerated. The refactor dropped it. On a single-rank deployment that's harmless, but under multi-node tensor-parallel execution (e.g. JACCL across 2 Mac Studios) every rank runs the same forward pass and reaches send_chunk() for every accepted token — so the API server's event channel receives each token's chunk from both ranks and the client sees every token duplicated.

Minimal reproducer on a 2-rank TP setup:

"Repeat exactly: FALCON-MERCURY-7749"  ->  "FALCONFALCON-MERCURY-MERCURY-7749-7749"

This PR restores the rank-0 guard (a two-line early return plus a comment). Verified on our 2-node M4 Max RDMA cluster: output is clean again, throughput is unchanged (30.7 t/s), and the 100K-context quality probe is back to 3/3 needles. Existing runner-supervisor tests still pass.

Happy to adjust if you'd prefer the dedup to live somewhere else (supervisor or API layer) — but rank-0-at-emission seemed like the smallest-blast-radius fix that matches the pre-#2000 behavior.
