Cluster discovery + lifecycle stability fixes #17

Closed
team-wcv wants to merge 4 commits into main from chore/cluster-discovery-and-lifecycle

team-wcv (Owner) commented May 9, 2026

Summary

Four orthogonal fixes that piggybacked on the gemma-4 drafter branch but logically belong on their own track. They're independent of the drafter work and can land in any order relative to the drafter PR stack.

Commits

  • persist node ids in .cache (85f3bcbe) - node ID stability across restarts so reconnect logic can identify the same physical node deterministically.
  • fix: make darwin mdns discovery reliable (701838aa) - adds mdns_announcer.py; works around darwin-specific mDNS announce dropouts that left some peers undiscoverable on a Thunderbolt cluster.
  • Reconcile worker instance backoff from state (2aefa296) - worker rebuilds its in-memory backoff state from cluster state on resume, so a worker that came back after master indexing already saw failed launches doesn't immediately retry. Includes unit tests for KeyedBackoff and worker instance backoff.
  • Tune cluster liveness polling cadence (b3ab680a) - reduces master-side liveness polling overhead.
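
The backoff reconciliation described in the third commit can be illustrated with a minimal sketch. Everything below is hypothetical: SketchKeyedBackoff is not the KeyedBackoff class from src/exo/utils, and the method names, the 1-second base delay, and the 60-second cap are assumptions chosen only to show the idea of rebuilding per-key backoff state from failure counts recorded elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class SketchKeyedBackoff:
    """Hypothetical per-key exponential backoff (not exo's KeyedBackoff)."""

    base: float = 1.0  # assumed delay after the first failure, in seconds
    cap: float = 60.0  # assumed upper bound on any delay, in seconds
    failures: dict[str, int] = field(default_factory=dict, init=False)
    retry_at: dict[str, float] = field(default_factory=dict, init=False)

    def delay_for(self, failure_count: int) -> float:
        """Exponential delay: base, 2*base, 4*base, ... limited by cap."""
        if failure_count <= 0:
            return 0.0
        return min(self.cap, self.base * 2 ** (failure_count - 1))

    def record_failure(self, key: str, now: float) -> None:
        """Count one more failure for key and push its next retry time out."""
        count = self.failures.get(key, 0) + 1
        self.failures[key] = count
        self.retry_at[key] = now + self.delay_for(count)

    def reconcile(self, failures_by_key: dict[str, int], now: float) -> None:
        """Adopt failure counts seen in shared state when they exceed ours."""
        for key, count in failures_by_key.items():
            if count > self.failures.get(key, 0):
                self.failures[key] = count
                self.retry_at[key] = now + self.delay_for(count)

    def should_attempt(self, key: str, now: float) -> bool:
        """True once the backoff window for key has elapsed."""
        return now >= self.retry_at.get(key, 0.0)
```

With these assumed constants, a worker that resumes at t=100 and reconciles three recorded failures for one instance waits 4 seconds: should_attempt returns False at t=101 and True at t=104. A freshly constructed backoff without reconciliation would allow an immediate retry, which is the behaviour the commit is described as avoiding.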

Risk

Low. Three of the four commits add tests; the fourth is a small constant tweak. No drafter or speculative-decoding code touched.

Test plan

  • uv run pytest src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py
  • Manual: 3-node cluster restart, confirm peers re-discover within ~3s.

Evanev7 and others added 4 commits May 8, 2026 20:54

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3ab680a20

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


# Identity (config)
- EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair"
+ EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair"
P1: Preserve legacy node-id key path when enabling persistence

Changing EXO_NODE_ID_KEYPAIR from config to cache without a fallback migration means any installation that already has node_id.keypair in the previous config location will generate a new peer identity after upgrade. That breaks node identity continuity across restarts/upgrades and can invalidate cluster membership/reconnect behavior for existing deployments. Load the legacy config-path key when the new cache-path key is absent, then migrate it.

Useful? React with 👍 / 👎.
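
The fallback suggested in the P1 comment could look roughly like the sketch below. It is an illustration only: migrate_legacy_keypair is a hypothetical helper that does not appear in the PR, and it treats the keypair as an opaque file rather than using whatever loading code exo actually has. It copies rather than moves the legacy file, on the assumption that a downgrade should still find the old key.

```python
import shutil
from pathlib import Path


def migrate_legacy_keypair(cache_path: Path, legacy_path: Path) -> bool:
    """Copy the legacy keypair to the cache path if only the legacy one exists.

    Returns True when a migration happened, False otherwise. Hypothetical
    helper; the caller still loads (or generates) the key at cache_path.
    """
    if cache_path.exists():
        return False  # the new location already has a key, so keep it
    if not legacy_path.exists():
        return False  # nothing to migrate; a fresh key will be generated
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(legacy_path, cache_path)  # preserves file metadata
    return True
```

Called once before the keypair is loaded, for example as migrate_legacy_keypair(EXO_CACHE_HOME / "node_id.keypair", EXO_CONFIG_HOME / "node_id.keypair"), this would keep the existing peer identity for installations that created their key under the config directory.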

Comment thread src/exo/worker/main.py
Comment on lines +371 to +375
self.event_sender.close()
self.command_sender.close()
self.download_command_sender.close()
for runner in self.runners.values():
    runner.shutdown()
P2: Delay sender closure until worker tasks finish shutting down

Closing event_sender/command_sender before canceling worker tasks and runner supervisors can drop terminal status events during shutdown. RunnerSupervisor attempts to emit failure/completion updates while winding down, but those sends will raise ClosedResourceError once this early close runs, so the master can keep stale running state until timeout-based cleanup. Cancel and drain runner/task shutdown first, then close senders after _stopped is set.

Useful? React with 👍 / 👎.
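
The ordering recommended in the P2 comment (cancel and drain first, close senders last) can be sketched as follows. This is a simplified stand-in, not the worker's code: it uses plain asyncio instead of the worker's own concurrency primitives, and SketchSender, ClosedSenderError, and supervise are hypothetical names invented for the illustration.

```python
import asyncio


class ClosedSenderError(Exception):
    """Stand-in for the ClosedResourceError mentioned in the review."""


class SketchSender:
    """Hypothetical sender that rejects sends after close()."""

    def __init__(self) -> None:
        self.closed = False
        self.sent: list[str] = []

    def send(self, item: str) -> None:
        if self.closed:
            raise ClosedSenderError(item)
        self.sent.append(item)

    def close(self) -> None:
        self.closed = True


async def supervise(sender: SketchSender) -> None:
    """Pretend runner supervisor that reports a terminal status on exit."""
    try:
        await asyncio.sleep(3600)
    finally:
        sender.send("runner-shutdown")  # would fail if sender were closed


async def shutdown(sender: SketchSender, tasks: list[asyncio.Task]) -> None:
    """Cancel tasks, wait for their cleanup to finish, then close the sender."""
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)  # drain cleanup
    sender.close()


async def demo() -> list[str]:
    sender = SketchSender()
    tasks = [asyncio.create_task(supervise(sender))]
    await asyncio.sleep(0)  # give the supervisor a chance to start
    await shutdown(sender, tasks)
    return sender.sent
```

Running asyncio.run(demo()) returns ["runner-shutdown"], because the terminal status is sent before the sender closes. If sender.close() were moved ahead of the cancellation loop, the send in the finally block would raise ClosedSenderError and the status would never be recorded, which mirrors the stale-state risk the comment describes.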

team-wcv (Owner, Author) commented May 9, 2026

Closing in favour of #16, which is now the canonical upstream-import lane.

The unique commit from this PR (701838aa fix: make darwin mdns discovery reliable, +188/-2 across src/exo/main.py, src/exo/routing/mdns_announcer.py, src/exo/worker/main.py) has been cherry-picked into PR #16 as commit 33d72520. The 3 commits this PR shared with #16 by content (persist node ids in .cache, Reconcile worker instance backoff from state, Tune cluster liveness polling cadence) were already there.

PR #16 (15 commits total) now contains a strict superset of this PR's content plus broader upstream features (peer-to-peer downloads, KV cache quantization, --trust-remote-code, rdma_ctl gate, peer file server hardening).

No content lost; all 4 commits from this PR are reflected in PR #16.

4 participants