Cluster discovery + lifecycle stability fixes #17
Brings back EXO_CACHE_HOME as always ~/.cache/exo/ and stores the node ID there. No random copies now!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3ab680a20
    # Identity (config)
    - EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair"
    + EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair"
Preserve legacy node-id key path when enabling persistence
Changing EXO_NODE_ID_KEYPAIR from config to cache without a fallback migration means any installation that already has node_id.keypair in the previous config location will generate a new peer identity after upgrade. That breaks node identity continuity across restarts/upgrades and can invalidate cluster membership/reconnect behavior for existing deployments. Load the legacy config-path key when the new cache-path key is absent, then migrate it.
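The fallback suggested above could look roughly like the sketch below. It is an illustration under assumptions, not the project's loader: the function name `load_or_migrate_keypair`, the `generate` callback, and treating the keypair as raw bytes are all hypothetical. The legacy file is copied rather than moved, so a downgrade would still find it.

```python
import shutil
from pathlib import Path
from typing import Callable


def load_or_migrate_keypair(
    cache_path: Path,
    legacy_path: Path,
    generate: Callable[[], bytes],
) -> bytes:
    """Return keypair bytes, preferring the cache path (hypothetical helper)."""
    # Normal case after migration: the key already lives in the cache dir.
    if cache_path.exists():
        return cache_path.read_bytes()

    cache_path.parent.mkdir(parents=True, exist_ok=True)

    # Upgrade case: reuse the key from the old config location so the
    # node keeps its identity, and copy it to the new location.
    if legacy_path.exists():
        shutil.copy2(legacy_path, cache_path)
        return cache_path.read_bytes()

    # First run: no key anywhere, so create and persist a new identity.
    key = generate()
    cache_path.write_bytes(key)
    return key
```

A caller would pass the new cache path and the old config path for `node_id.keypair`, plus whatever function currently creates a fresh keypair.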
    self.event_sender.close()
    self.command_sender.close()
    self.download_command_sender.close()
    for runner in self.runners.values():
        runner.shutdown()
Delay sender closure until worker tasks finish shutting down
Closing event_sender/command_sender before canceling worker tasks and runner supervisors can drop terminal status events during shutdown. RunnerSupervisor attempts to emit failure/completion updates while winding down, but those sends will raise ClosedResourceError once this early close runs, so the master can keep stale running state until timeout-based cleanup. Cancel and drain runner/task shutdown first, then close senders after _stopped is set.
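The ordering recommended above (cancel, drain, then close) can be illustrated with a small standalone sketch. It uses plain asyncio instead of the project's actual channel types, and `Sender`, `ClosedSenderError`, `runner`, and `shutdown` are hypothetical stand-ins rather than names from the codebase.

```python
import asyncio


class ClosedSenderError(Exception):
    """Stand-in for the error a closed channel raises on send."""


class Sender:
    """Hypothetical minimal sender that records events until closed."""

    def __init__(self) -> None:
        self.events: list[str] = []
        self.closed = False

    def send(self, event: str) -> None:
        if self.closed:
            raise ClosedSenderError(event)
        self.events.append(event)

    def close(self) -> None:
        self.closed = True


async def runner(name: str, sender: Sender) -> None:
    """Pretend runner that reports a terminal status while winding down."""
    try:
        await asyncio.sleep(3600)
    finally:
        # This send only succeeds if the sender is still open.
        sender.send(f"{name}:stopped")


async def shutdown(tasks: list[asyncio.Task], sender: Sender) -> None:
    # 1. Cancel the workers first so they can still emit terminal events.
    for task in tasks:
        task.cancel()
    # 2. Drain: wait until every task has finished its cleanup.
    await asyncio.gather(*tasks, return_exceptions=True)
    # 3. Only now close the sender.
    sender.close()


async def main() -> list[str]:
    sender = Sender()
    tasks = [asyncio.create_task(runner(f"r{i}", sender)) for i in range(2)]
    await asyncio.sleep(0)  # let the runners start before shutting down
    await shutdown(tasks, sender)
    return sender.events
```

Running `asyncio.run(main())` collects one terminal event per runner. Moving `sender.close()` ahead of the cancellation step would make each runner's final `send` raise instead, which is the failure mode the review comment describes.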
Closing in favour of #16, which is now the canonical upstream-import lane. The unique commit from this PR ( PR #16 (15 commits total) now contains a strict superset of this PR's content plus broader upstream features (peer-to-peer downloads, KV cache quantization, No content lost; all 4 commits from this PR are reflected in PR #16.
Summary
Four orthogonal fixes that piggybacked on the gemma-4 drafter branch but logically belong on their own track. They're independent of the drafter work and can land in any order relative to the drafter PR stack.
Commits
- 85f3bcbe - node ID stability across restarts so reconnect logic can identify the same physical node deterministically.
- 701838aa - adds mdns_announcer.py; works around darwin-specific mDNS announce dropouts that left some peers undiscoverable on Thunderbolt cluster.
- 2aefa296 - worker rebuilds its in-memory backoff state from cluster state on resume, so a worker that came back after master indexing already saw failed launches doesn't immediately retry. Includes unit tests for KeyedBackoff and worker instance backoff.
- b3ab680a - reduces master-side liveness polling overhead.
Risk
Low. Three of the four commits add tests; the fourth is a small constant tweak. No drafter or speculative-decoding code touched.
Test plan
uv run pytest src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py
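For readers unfamiliar with the backoff behaviour these tests cover, the idea of rebuilding backoff state from cluster state (commit 2aefa296) might be sketched as follows. This is not the repository's KeyedBackoff class: `KeyedBackoffSketch`, `rebuild_from_cluster_state`, the doubling delay schedule, and representing cluster state as a mapping of instance ID to failed-launch count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KeyedBackoffSketch:
    """Hypothetical per-key exponential backoff (illustration only)."""

    base: float = 1.0   # delay in seconds after the first failure
    cap: float = 60.0   # upper bound on the delay
    _failures: dict[str, int] = field(default_factory=dict)
    _retry_at: dict[str, float] = field(default_factory=dict)

    def record_failure(self, key: str, now: float) -> None:
        count = self._failures.get(key, 0) + 1
        self._failures[key] = count
        # Delay doubles with each consecutive failure, up to the cap.
        delay = min(self.cap, self.base * 2 ** (count - 1))
        self._retry_at[key] = now + delay

    def should_attempt(self, key: str, now: float) -> bool:
        # Unknown keys have no recorded failures and may run at once.
        return now >= self._retry_at.get(key, 0.0)

    def reset(self, key: str) -> None:
        self._failures.pop(key, None)
        self._retry_at.pop(key, None)


def rebuild_from_cluster_state(
    failed_launches: dict[str, int], now: float
) -> KeyedBackoffSketch:
    """Replay failure counts seen in cluster state into a fresh backoff."""
    backoff = KeyedBackoffSketch()
    for instance_id, count in failed_launches.items():
        for _ in range(count):
            backoff.record_failure(instance_id, now)
    return backoff
```

With this sketch, a worker resuming at time `now` with three recorded failures for an instance would wait `base * 4` seconds before retrying that instance, while instances with no recorded failures can launch immediately.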