fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution by adurham · Pull Request #2121 · exo-explore/exo

adurham · 2026-05-27T18:14:53Z

Repo: exo-explore/exo
Branch: adurham:pr/runner-send-chunk-rank-0-guard
Target: exo-explore/exo:main
Commits: 1 (1e1943d6)

Title

fix(runner): only rank 0 emits ChunkGenerated under tensor-parallel execution

Body

Summary

Add a device_rank != 0 early-return to Runner.send_chunk() so only rank 0 emits ChunkGenerated events. Without this guard, multi-rank tensor-parallel deployments emit every accepted token from every rank, and the client sees duplicated text.

Reproducer

On a 2-node TP setup (e.g. JACCL across two Mac Studios), with current main:

curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<any-TP-capable-model>",
    "messages": [{"role":"user","content":"Repeat exactly: FALCON-MERCURY-7749"}],
    "max_tokens": 30,
    "temperature": 0
  }'

Returns:

FALCONFALCON-MERCURY-MERCURY-7749-7749

Each token is emitted twice. With this guard:

FALCON-MERCURY-7749
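For scripted checks, the same request can be sent from Python. This is a minimal sketch using only the standard library. It assumes the endpoint returns an OpenAI-style, non-streaming body with the text at choices[0].message.content; the helper names (build_payload, extract_text, run_reproducer) are hypothetical and not part of exo.

```python
import json
import urllib.request

URL = "http://localhost:52415/v1/chat/completions"
NEEDLE = "FALCON-MERCURY-7749"


def build_payload(model: str) -> dict:
    """Build the same request body as the curl reproducer."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"Repeat exactly: {NEEDLE}"}],
        "max_tokens": 30,
        "temperature": 0,
    }


def extract_text(response: dict) -> str:
    """Pull the generated text out of the response.

    Assumption: OpenAI-style layout, choices[0].message.content.
    """
    return response["choices"][0]["message"]["content"]


def run_reproducer(model: str) -> str:
    """POST the request to a running cluster and return the generated text."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_text(json.load(reply))
```

Calling run_reproducer("<any-TP-capable-model>") against a live cluster and comparing the result with NEEDLE gives a quick pass/fail signal for the duplication bug.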

Why

Runner.main() is invoked on every TP rank. Every rank runs the same forward pass and reaches send_chunk() for every accepted token. The event_sender channel is shared with the supervisor and ultimately feeds the API server's chunk stream, so when both ranks emit, the API server sees each chunk twice.

Rank 0 is canonical for this purpose; deduplicating at the emission point is the minimal-blast-radius fix. Alternative places to dedupe (supervisor, API server) would require additional state to identify "which rank's chunk is canonical" — strictly more code for the same effect.
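The mechanism can be illustrated with a toy model. This is only a sketch: SharedChannel and ToyRunner are hypothetical stand-ins for the real event channel and Runner, and the three-token split of the test string is an assumption chosen to match the observed output.

```python
from dataclasses import dataclass, field


@dataclass
class SharedChannel:
    """Hypothetical stand-in for the event channel shared by all ranks."""

    chunks: list[str] = field(default_factory=list)

    def send(self, text: str) -> None:
        self.chunks.append(text)


@dataclass
class ToyRunner:
    """Hypothetical stand-in for Runner; only the emission path is modelled."""

    device_rank: int
    channel: SharedChannel
    guard: bool

    def send_chunk(self, text: str) -> None:
        if self.guard and self.device_rank != 0:
            return  # non-zero ranks stay silent
        self.channel.send(text)


def stream(tokens: list[str], world_size: int, guard: bool) -> str:
    """Return the text a client would see after all ranks process all tokens."""
    channel = SharedChannel()
    runners = [ToyRunner(rank, channel, guard) for rank in range(world_size)]
    # Every rank runs the same forward pass, so every rank reaches
    # send_chunk() once per accepted token.
    for token in tokens:
        for runner in runners:
            runner.send_chunk(token)
    return "".join(channel.chunks)


tokens = ["FALCON", "-MERCURY", "-7749"]  # assumed token split
print(stream(tokens, world_size=2, guard=False))  # FALCONFALCON-MERCURY-MERCURY-7749-7749
print(stream(tokens, world_size=2, guard=True))   # FALCON-MERCURY-7749
```

With world_size=1 the guard never triggers, which is why single-rank deployments are unaffected either way.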

Affected Topologies

✅ Single-node, single-rank: unchanged (rank is always 0)
✅ Multi-node, pipeline-parallel: unchanged (only one rank generates per shard)
❌ Multi-node, tensor-parallel (e.g. JACCL on 2× Mac Studio + Thunderbolt RDMA): was producing duplicates, now fixed

History

A similar guard existed prior to PR #2000 (engine abstraction refactor) and PR #1570 (runner split). The refactors removed it. I noticed because the cluster I'm running (DeepSeek-V4-Flash-8bit on 2× M4 Max with MlxJaccl TP backend) regressed after I pulled the recent batch of upstream changes; quality-probe needle retrieval went from 3/3 to 0/3 because all output tokens were duplicated.

Test Plan

python -m pytest src/exo/worker/tests/unittests/test_runner/test_runner_supervisor.py — 2/2 pass
Manual on 2-node M4 Max TP cluster:
- Before: "Repeat exactly: FALCON-MERCURY-7749" returns "FALCONFALCON-MERCURY-MERCURY-7749-7749"
- After: "FALCON-MERCURY-7749" (clean)
Quality probe at 100K context: 3/3 needles found, 0 special-token leaks, 0 bistability events
Throughput unchanged: 30.7 t/s vs 30.7 t/s pre/post
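A regression test for the guard could look roughly like the following. It is a sketch, not the test shipped with this PR: StubRunner is a hypothetical class that mirrors only the guarded send_chunk() from the diff, since the real Runner's constructor is not shown here, and a plain tuple stands in for ChunkGenerated.

```python
from unittest.mock import MagicMock


class StubRunner:
    """Hypothetical stub mirroring only the guarded send_chunk() from the diff."""

    def __init__(self, device_rank: int, event_sender: MagicMock) -> None:
        self.device_rank = device_rank
        self.event_sender = event_sender

    def send_chunk(self, chunk: str, command_id: str) -> None:
        if self.device_rank != 0:
            return
        # The real method wraps these in ChunkGenerated; a tuple stands in here.
        self.event_sender.send((command_id, chunk))


def test_rank_zero_emits() -> None:
    sender = MagicMock()
    StubRunner(0, sender).send_chunk("FALCON", "cmd-1")
    sender.send.assert_called_once_with(("cmd-1", "FALCON"))


def test_nonzero_rank_is_silent() -> None:
    sender = MagicMock()
    StubRunner(1, sender).send_chunk("FALCON", "cmd-1")
    sender.send.assert_not_called()
```

A test of this shape against the real Runner would catch a future refactor that drops the guard again, without needing a multi-node cluster.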

Diff

def send_chunk(
    self,
    chunk: Chunk,
    command_id: CommandId,
):
    assert isinstance(self.generator, Engine)
+   # Only rank 0 emits ChunkGenerated. Under tensor-parallel execution
+   # across multiple nodes (e.g. JACCL on 2 Mac Studios), every rank
+   # runs the same forward pass and reaches this method on every
+   # accepted token. Without this guard the API server's event channel
+   # receives the same ChunkGenerated event from each rank, and the
+   # client sees every token duplicated — e.g. asking the model to
+   # repeat "FALCON-MERCURY-7749" produces "FALCONFALCON-MERCURY-MERCURY-7749-7749".
+   # Rank 0 is canonical, so we de-duplicate at the emission point.
+   if self.device_rank != 0:
+       return
    self.event_sender.send(ChunkGenerated(command_id=command_id, chunk=chunk))

Upstream's 2026-05-25 refactor removed the 'if self.device_rank == 0:' guard around event_sender.send(ChunkGenerated(...)). The intent on upstream's side appears to be that runners outside rank 0 either don't reach this method or don't have an active event_sender.

On our 2-rank TP setup that assumption breaks: both rank 0 and rank 1 hit send_chunk on every accepted token, both emit ChunkGenerated events, and the API sees every token twice. Symptom: 'Repeat exactly: FALCON-MERCURY-7749' returns 'FALCONFALCON-MERCURY-MERCURY-7749-7749'. Quality probe needle_found=0/3.

Restore the guard. If we ever switch to upstream's runner topology this might need a different mechanism, but for the current 2-Studio TP setup this is the right semantics.

adurham · 2026-05-28T16:26:12Z

Flagging this as a regression from #2000 (engine abstraction) for visibility — @Evanev7 since you authored that refactor.

Before #2000, Runner.send_chunk() had an if self.device_rank == 0: guard so only rank 0 emitted ChunkGenerated. The refactor dropped it. On a single-rank deployment that's harmless, but under multi-node tensor-parallel execution (e.g. JACCL across 2 Mac Studios) every rank runs the same forward pass and reaches send_chunk() for every accepted token — so the API server's event channel receives each token's chunk from both ranks and the client sees every token duplicated.

Minimal reproducer on a 2-rank TP setup:

"Repeat exactly: FALCON-MERCURY-7749"  ->  "FALCONFALCON-MERCURY-MERCURY-7749-7749"

This PR restores the rank-0 guard (a two-line early return plus a comment). Verified on our 2-node M4 Max RDMA cluster: output clean again, throughput unchanged (30.7 t/s), 100K-context quality probe back to 3/3 needles. Existing runner-supervisor tests still pass.

Happy to adjust if you'd prefer the dedup to live somewhere else (supervisor or API layer) — but rank-0-at-emission seemed like the smallest-blast-radius fix that matches the pre-#2000 behavior.
