Cluster discovery + lifecycle stability fixes by team-wcv · Pull Request #17 · team-wcv/exo

team-wcv · 2026-05-09T03:55:46Z

Summary

Four orthogonal fixes that piggybacked on the gemma-4 drafter branch but logically belong on their own track. They're independent of the drafter work and can land in any order relative to the drafter PR stack.

Commits

persist node ids in .cache (85f3bcbe) - keeps the node ID stable across restarts so reconnect logic can identify the same physical node deterministically.
fix: make darwin mdns discovery reliable (701838aa) - adds mdns_announcer.py; works around darwin-specific mDNS announce dropouts that left some peers undiscoverable on a Thunderbolt cluster.
Reconcile worker instance backoff from state (2aefa296) - on resume, the worker rebuilds its in-memory backoff state from cluster state, so a worker that comes back after master indexing has already seen failed launches doesn't immediately retry. Includes unit tests for KeyedBackoff and worker instance backoff.
Tune cluster liveness polling cadence (b3ab680a) - reduces master-side liveness polling overhead.
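
The backoff reconciliation described in 2aefa296 can be illustrated with a minimal sketch. This is not the KeyedBackoff class from the PR: the class name, method names, and the idea that cluster state exposes a per-instance failure count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedBackoffSketch:
    """Hypothetical per-key exponential backoff (not exo's KeyedBackoff)."""

    base_seconds: float = 1.0
    max_seconds: float = 60.0
    _failures: dict = field(default_factory=dict)
    _retry_at: dict = field(default_factory=dict)

    def delay_for(self, failures: int) -> float:
        # Delay doubles with each failure, capped at max_seconds.
        if failures <= 0:
            return 0.0
        return min(self.base_seconds * 2 ** (failures - 1), self.max_seconds)

    def record_failure(self, key: str, now: float) -> None:
        count = self._failures.get(key, 0) + 1
        self._failures[key] = count
        self._retry_at[key] = now + self.delay_for(count)

    def reconcile(self, failure_counts: dict, now: float) -> None:
        # Assumption: cluster state can be reduced to {instance_id: failures}.
        # Only adopt counts that are higher than what this worker remembers.
        for key, count in failure_counts.items():
            if count > self._failures.get(key, 0):
                self._failures[key] = count
                self._retry_at[key] = now + self.delay_for(count)

    def should_attempt(self, key: str, now: float) -> bool:
        return now >= self._retry_at.get(key, 0.0)
```

A freshly resumed worker would call reconcile() once with the counts derived from cluster state before its first launch attempt, so instances that already failed elsewhere start out in backoff instead of being retried at once.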

Risk

Low. Three of the four commits add tests; the fourth is a small constant tweak. No drafter or speculative-decoding code touched.

Test plan

uv run pytest src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py
Manual: 3-node cluster restart, confirm peers re-discover within ~3s.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3ab680a20

chatgpt-codex-connector · 2026-05-09T04:02:10Z


 # Identity (config)
-EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair"
+EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair"


Preserve legacy node-id key path when enabling persistence

Changing EXO_NODE_ID_KEYPAIR from config to cache without a fallback migration means any installation that already has node_id.keypair in the previous config location will generate a new peer identity after upgrade. That breaks node identity continuity across restarts/upgrades and can invalidate cluster membership/reconnect behavior for existing deployments. Load the legacy config-path key when the new cache-path key is absent, then migrate it.
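
One way to implement the suggested fallback is sketched below. The function name and its two path parameters are hypothetical; in exo they would presumably be derived from EXO_CACHE_HOME and EXO_CONFIG_HOME. The sketch copies rather than moves the legacy file, which is a design assumption that keeps a downgrade working.

```python
import shutil
from pathlib import Path


def resolve_node_id_keypair(cache_path: Path, legacy_path: Path) -> Path:
    """Return the path to load the node-id key from, migrating if needed.

    Hypothetical helper: prefers the new cache location, falls back to the
    legacy config location and copies it over on first use.
    """
    if cache_path.exists():
        # New location already populated; nothing to migrate.
        return cache_path
    if legacy_path.exists():
        # Preserve the existing identity by copying it to the new location.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(legacy_path, cache_path)
        return cache_path
    # Neither exists: the caller generates a fresh keypair at cache_path.
    return cache_path
```

With this in place, an upgraded node keeps its previous peer identity, and a node with no prior key behaves exactly as it does in the diff above.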

chatgpt-codex-connector · 2026-05-09T04:02:10Z

+        self.event_sender.close()
+        self.command_sender.close()
+        self.download_command_sender.close()
+        for runner in self.runners.values():
+            runner.shutdown()


Delay sender closure until worker tasks finish shutting down

Closing event_sender/command_sender before canceling worker tasks and runner supervisors can drop terminal status events during shutdown. RunnerSupervisor attempts to emit failure/completion updates while winding down, but those sends will raise ClosedResourceError once this early close runs, so the master can keep stale running state until timeout-based cleanup. Cancel and drain runner/task shutdown first, then close senders after _stopped is set.
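
The ordering issue can be shown with a small self-contained sketch. It uses plain asyncio and a toy sender rather than exo's actual channels; ToySender, ClosedSenderError, and the runner coroutine are stand-ins invented for illustration, with ClosedSenderError playing the role of the ClosedResourceError named above.

```python
import asyncio


class ClosedSenderError(Exception):
    """Stand-in for the ClosedResourceError mentioned in the review."""


class ToySender:
    """Toy channel end that rejects sends once closed."""

    def __init__(self) -> None:
        self.closed = False
        self.sent: list = []

    def send(self, item: str) -> None:
        if self.closed:
            raise ClosedSenderError(item)
        self.sent.append(item)

    def close(self) -> None:
        self.closed = True


async def runner(sender: ToySender) -> None:
    try:
        await asyncio.sleep(3600)  # simulated long-running work
    finally:
        # Terminal status emitted while winding down.
        sender.send("runner_shutdown")


async def shutdown(tasks: list, sender: ToySender) -> None:
    # 1. Cancel tasks, 2. wait for them to drain, 3. only then close senders.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    sender.close()


async def demo() -> list:
    sender = ToySender()
    task = asyncio.create_task(runner(sender))
    await asyncio.sleep(0)  # let the runner start before shutting down
    await shutdown([task], sender)
    return sender.sent
```

If sender.close() were moved to the top of shutdown(), the send in the runner's finally block would raise and the terminal status would never reach the receiver, which is the failure mode the comment describes.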

team-wcv · 2026-05-09T04:15:30Z

Closing in favour of #16, which is now the canonical upstream-import lane.

The unique commit from this PR (701838aa fix: make darwin mdns discovery reliable, +188/-2 across src/exo/main.py, src/exo/routing/mdns_announcer.py, src/exo/worker/main.py) has been cherry-picked into PR #16 as commit 33d72520. The 3 commits this PR shared with #16 by content (persist node ids in .cache, Reconcile worker instance backoff from state, Tune cluster liveness polling cadence) were already there.

PR #16 (15 commits total) now contains a strict superset of this PR's content plus broader upstream features (peer-to-peer downloads, KV cache quantization, --trust-remote-code, rdma_ctl gate, peer file server hardening).

No content lost; all 4 commits from this PR are reflected in PR #16.

Evanev7 and others added 4 commits May 8, 2026 20:54

persist node ids in .cache

85f3bcb

brings back EXO_CACHE_HOME as always ~/.cache/exo/, and store the node id in there. no random copies now!

fix: make darwin mdns discovery reliable

701838a

Reconcile worker instance backoff from state

2aefa29

Tune cluster liveness polling cadence

b3ab680

chatgpt-codex-connector Bot reviewed May 9, 2026

team-wcv mentioned this pull request May 9, 2026

Gemma-4 drafter tuning + DrafterTransport architecture (model | pipelined | ngram | none) #15

Closed

team-wcv closed this May 9, 2026

team-wcv mentioned this pull request May 9, 2026

Import useful upstream exo PRs #16

Merged

team-wcv wants to merge 4 commits into main from chore/cluster-discovery-and-lifecycle

4 participants
