Skip to content

Import useful upstream exo PRs#16

Merged
team-wcv merged 29 commits into
mainfrom
chore/import-useful-upstream-prs
May 9, 2026
Merged

Import useful upstream exo PRs#16
team-wcv merged 29 commits into
mainfrom
chore/import-useful-upstream-prs

Conversation

@team-wcv
Copy link
Copy Markdown
Owner

@team-wcv team-wcv commented May 7, 2026

Summary

  • port selected useful upstream exo PRs into the team-wcv fork
  • add RDMA placement gating from rdma_ctl state, worker backoff reconciliation, liveness cadence tuning, persistent node IDs, trust-remote-code CLI wiring, KV cache quantization controls, and peer-to-peer model downloads
  • adapt the imported code to the current team-wcv branch shape, including current model directory constants and strict typing cleanup

Upstream PRs included

Validation

  • uv run basedpyright
  • uv run ruff check
  • EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest -q -> 511 passed, 1 skipped, 199 deselected
  • focused port suite: EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest src/exo/shared/tests/test_xdg_paths.py src/exo/download/tests/test_peer_download.py src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py src/exo/shared/tests/test_apply/test_apply_rdma_gating.py src/exo/master/tests/test_placement.py -q -> 61 passed

Notes

  • nix fmt was not run because Nix is not installed on this Mac. Official multi-user install requires an interactive sudo password; official single-user install is no longer supported on Darwin. The only post-check change was Python-only and was formatted with uv run ruff format src/exo/main.py.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f4fb54649

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/download/peer_file_server.py Outdated
Comment thread src/exo/download/peer_shard_downloader.py Outdated
Comment thread src/exo/download/peer_file_server.py Outdated
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 7, 2026

Patched the Codex review findings in 575c595:

  • P1 path traversal: peer file requests now resolve requested model/file paths and reject anything outside the model cache before stat/open.
  • P2 shared peer list race: coordinator now passes the shard into PeerAwareShardDownloader, and peers are queued per shard/download instead of a single mutable field.
  • P2 nested status listing: peer status now walks model directories recursively and reports relative nested paths, including partial files.

Validation run locally:

  • uv run ruff check
  • uv run basedpyright
  • EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest -q -> 516 passed, 1 skipped, 199 deselected

@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

Cherry-picked 701838aa fix: make darwin mdns discovery reliable from PR #17 (since closed) onto this branch as commit 33d72520. Adds src/exo/routing/mdns_announcer.py (97 lines) and 81 lines of darwin-mDNS-reliability glue in src/exo/main.py + 12 lines in src/exo/worker/main.py.

Trivial conflict resolved: combined the new _libp2p_port: int field with this branch's peer_file_server: PeerFileServer | None = None field on the Node dataclass; both kept, constructor arg order matched.

Verified clean: basedpyright src/exo/main.py 0 errors, ruff check src/exo/main.py clean.

Branch is now 15 commits ahead of team-wcv/main. PR #17 is closed and points here.

@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33d725205a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/worker/engines/mlx/cache.py Outdated
Comment thread src/exo/download/peer_download.py
team-wcv pushed a commit that referenced this pull request May 9, 2026
Two fixes for PR #16 round 2 (Codex):

1. P1: Skip quantized caches for non-PP make_cache models.
   ``make_kv_cache`` had a single-node safeguard
   (``_model_is_pipeline_parallel``) on the make_cache-LESS branch,
   but the make_cache-aware branch above it unconditionally replaced
   ``KVCache`` entries with ``QuantizedKVCache`` whenever
   ``EXO_KV_CACHE_BITS`` was set. Models that expose ``make_cache()``
   (e.g. Gemma3 with mixed attention layers) and run single-node
   would therefore crash at runtime with::

       <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet
       support batching with history

   because mlx-lm's single-node ``BatchGenerator`` calls
   ``_merge_caches`` on every step and ``QuantizedKVCache`` doesn't
   implement ``merge``. Fix: apply the same ``_model_is_pipeline_parallel``
   guard to the make_cache branch and emit the same warning when
   the env var is ignored.

2. P2: Verify peer file integrity before marking download complete.
   ``download_one`` in ``peer_shard_downloader.py`` marked peer
   downloads successful as soon as ``n_read == expected_size``,
   with no content-integrity check. A peer serving wrong bytes
   with the right length (stale/corrupt/malicious) was therefore
   silently accepted as model data, causing hard-to-diagnose
   inference failures or bad outputs.

   Fix: after the peer download completes, fetch the authoritative
   etag/hash from HuggingFace via ``file_meta()`` and validate the
   downloaded file via ``calc_hash``. On mismatch the file is
   removed and the caller falls back to direct HF download. Trusts
   HF as canonical source rather than peer-advertised hash to
   defend against malicious peers that lie about both content and
   hash.

   ``file_meta`` adds one HEAD round-trip per file; ``fetch_file_list_with_cache``
   already requires HF connectivity at this code path so the
   network requirement is not new. The semantics now match the
   direct HuggingFace download path which has done identical
   validation since import.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6babb76dca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/download/peer_shard_downloader.py
Comment thread src/exo/download/peer_shard_downloader.py Outdated
team-wcv pushed a commit that referenced this pull request May 9, 2026
Address Codex round-2 P1s on PR #16.

P1: Reuse ``ignore_patterns`` when selecting peer download files
- ``download_shard`` (download_utils.py:983) excludes ``original/*`` and
  ``metal/*`` because HuggingFace never downloads them. The peer path
  applied ``allow_patterns`` only, so any repo containing those paths
  (e.g. Llama 3.x) had a required-files list that included files the
  peer never had locally; the strict ``peer_info missing => fail``
  check then aborted the entire transfer and forced a HF fallback for
  every download. Pass the same ``ignore_patterns=["original/*",
  "metal/*"]`` into ``filter_repo_objects`` to match selection.

P1: Honor offline mode in peer file-list fetch
- ``_try_peer_download`` hard-coded ``skip_internet=False`` when calling
  ``fetch_file_list_with_cache``, so offline-configured nodes still
  reached out to HuggingFace before downloading from a LAN peer. On
  cold/offline nodes without a cached file list this raised, the
  ``except: return None`` exited early, and the peer download could
  not even start. Add an ``offline`` parameter to
  ``PeerAwareShardDownloader.__init__`` (defaulting to ``False``) and
  thread the existing ``DownloadCoordinator.offline`` value through
  ``exo_shard_downloader`` so the peer file-list fetch honors the
  same offline contract as ``ResumableShardDownloader``.

Tests
- ``test_offline_flag_defaults_to_false`` / ``..._propagates`` cover
  the constructor wiring directly.
- ``test_try_peer_download_passes_offline_to_fetch_file_list`` patches
  the import binding and asserts ``skip_internet=True`` is forwarded
  when the downloader is constructed with ``offline=True``.
- ``test_try_peer_download_filters_ignore_patterns`` records every
  ``filter_repo_objects`` call and asserts that the peer path
  requested ``ignore_patterns=["original/*", "metal/*"]`` -- the
  exact set ``download_shard`` uses.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f48d5f001e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/download/peer_shard_downloader.py
Comment thread src/exo/main.py Outdated
team-wcv pushed a commit that referenced this pull request May 9, 2026
… port

Codex P1 (round 3): peer downloads were calling ``file_meta()`` for every
file even when the coordinator ran with ``--offline`` / ``EXO_OFFLINE=true``.
Any failure to reach HF (the entire point of offline mode) was treated as
an integrity-check failure, the peer-fetched bytes were deleted, and the
cold offline node was left with no path to complete model sync. When the
downloader is in offline mode we now trust the LAN peer's bytes (size
already enforced by ``download_file_from_peer``) and skip the HF
canonical-hash call entirely. Online mode still validates against HF.

Codex P2 (round 3): the peer-download listener was hard-coded to a
single module-level constant, so a same-host multi-node deployment
crashed on the second process with ``address already in use``. Add a
``--peer-download-port`` CLI flag (default ``EXO_PEER_DOWNLOAD_PORT``)
and thread it through ``Node`` -> ``Worker`` and ``PeerFileServer``,
replacing the import-time constant. Cluster-wide convention: every
node uses the same value (peer discovery still uses the local value as
the assumed remote port). Cross-node port advertisement via state is a
documented follow-up.

Tests:
- ``TestPeerDownloadIntegrityCheckRespectsOfflineMode`` covers both
  modes: offline must not call ``file_meta`` and must keep the bytes,
  online must still call ``file_meta``.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6154483f54

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/main.py Outdated
Comment on lines +117 to +120
peer_file_server = PeerFileServer(
host="0.0.0.0",
port=args.peer_download_port,
models_dir=EXO_DEFAULT_MODELS_DIR,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Serve peer files from all writable model directories

The peer file server is hard-wired to EXO_DEFAULT_MODELS_DIR, but downloads are placed via select_download_dir_for_shard() across EXO_MODELS_DIRS; when a node stores a model in a non-default writable directory (custom models path, low-space fallback, or default excluded as read-only), /status and /files will not expose that model and peers will always fall back to HuggingFace. This silently disables the new peer-download path for valid multi-directory deployments.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit fe02cea6. The peer file server now serves files from every writable model directory via _locate_all_model_dirs; downloads landing in any of EXO_DEFAULT_MODELS_DIR / EXO_DOWNLOAD_DIRECTORIES are advertised to peers and downloadable from peers.

Comment thread src/exo/download/peer_state.py Outdated
Comment on lines +103 to +104
else:
return PeerEndpoint(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prefer RDMA edge before returning socket endpoint

This function returns immediately on the first non-RDMA edge for the peer, so if topology iteration yields a socket edge before an RDMA edge, the same peer is mislabeled as socket and never gets RDMA priority. That breaks the intended "RDMA first" peer ordering and can consistently select slower peers depending on edge insertion order.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit fe02cea6 (Pick RDMA edges and serve from every model dir for peer downloads). _resolve_endpoint_for_peer now collects ALL edges between the pair before choosing, prefers the first RDMA edge if one exists, and only falls back to a socket edge when no RDMA edge is present. See peer_state.py:97-126.

team-wcv pushed a commit that referenced this pull request May 9, 2026
This addresses two Codex P2 findings on PR #16.

`_resolve_peer_endpoint` returned on the *first* topology edge it
visited for a peer, so when ``out_edges`` happened to yield the
SocketConnection before the RDMAConnection (insertion order is not a
stable contract on the topology graph), the peer was silently labelled
``socket`` and lost its RDMA priority in the peer ordering. The lookup
now scans all edges for a peer, prefers RDMA whenever any RDMA edge
exists (using the companion socket address for the actual TCP connect),
and only falls back to ``socket`` when no RDMA edge is present. The
unused `_find_socket_ip` helper that was meant as a fallback for the
RDMA branch is removed -- its behaviour is folded into the new scan.

`PeerFileServer` was hard-wired to ``EXO_DEFAULT_MODELS_DIR``, but
``select_download_dir_for_shard`` legitimately writes shards across
``EXO_MODELS_DIRS`` (custom paths, low-disk fallback) and we also have
shards available in ``EXO_MODELS_READ_ONLY_DIRS`` mounts. Pre-fix, any
model that landed outside the default directory was invisible to /status
and /files, so peers always fell back to HuggingFace and the new peer
path was a no-op for valid multi-directory deployments. The server now
takes ``models_dirs`` (a sequence) and probes every configured root in
caller-specified priority, checking each candidate against
``_resolve_child`` for path-traversal safety.

Adds regression coverage:
* `test_peer_state.py` covers RDMA-first ordering for both edge insertion
  orders and the socket-only / RDMA-only edge cases.
* `TestPeerFileServerMultipleDirectories` covers serving from a secondary
  writable directory and a read-only mount, plus the constructor's
  empty-list rejection.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe02cea608

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/shared/constants.py Outdated

# Identity (config)
EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair"
EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep node identity keypair out of cache storage

Saving EXO_NODE_ID_KEYPAIR under EXO_CACHE_HOME makes node identity vulnerable to normal cache eviction/cleanup, which causes nodes to come up with a new peer ID and breaks the intended persistence of cluster identity across restarts. Identity material should remain in a persistent config/state location rather than cache.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit 09d0279b (Reject oversized peer partials; relocate node-ID keypair to config dir). EXO_NODE_ID_KEYPAIR now defaults under EXO_CONFIG_HOME rather than EXO_CACHE_HOME, so node identity is no longer subject to cache eviction. A migration helper (_migrate_legacy_node_id_keypair) moves any pre-existing keypair from the cache location to the config location on first startup.

chunk_size = 8 * 1024 * 1024 # 8MB, matching HF download

try:
while n_read < expected_size and poll_count < max_poll_attempts:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject oversized stale partials before peer resume

The resume loop only runs while n_read < expected_size, so if an existing .partial is already larger than expected_size (stale/corrupt prior download), this path skips downloading and later renames that file as a successful result. In offline mode (where hash verification is intentionally skipped), this can silently accept incorrect model bytes and produce broken inference artifacts.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit 09d0279b. The peer-download resume loop now rejects oversized stale partial files: if partial_size > expected_size, the partial is deleted and the download restarts from byte 0 rather than appending a fresh body to a stale buffer.

team-wcv pushed a commit that referenced this pull request May 9, 2026
Two Codex round-(N+1) P1 fixes for PR #16:

1. peer_download.py:120 -- "Reject oversized stale partials before
   peer resume". The resume loop ran ``while n_read < expected_size``,
   so any pre-existing ``.partial`` larger than ``expected_size``
   skipped the loop entirely and the final ``rename`` accepted the
   bad bytes as the canonical download. In offline mode (where peer
   download integrity check is intentionally skipped) this would
   permanently poison the model cache. Discard the stale oversized
   partial up front and restart from zero on this peer.

2. constants.py:73 -- "Keep node identity keypair out of cache
   storage". ``EXO_NODE_ID_KEYPAIR`` was rooted at ``EXO_CACHE_HOME``,
   which is subject to normal cache cleanup (e.g. ``trash
   ~/.cache/exo``); a wipe silently regenerated a fresh peer ID and
   broke cluster membership / mDNS routes. Move it to
   ``EXO_CONFIG_HOME`` (matching the existing
   ``test_node_id_in_config_dir`` invariant) and add a one-shot
   transparent migrator in ``router._migrate_legacy_node_id_keypair``
   so existing nodes retain identity after the upgrade.

Regression tests:

* ``test_oversized_stale_partial_is_discarded_and_retransferred`` --
  pre-fix would rename the junk ``.partial`` as the downloaded
  weights file; post-fix the file is re-fetched and matches the
  canonical bytes.
* ``test_legacy_keypair_is_migrated_to_new_location``,
  ``test_migration_is_idempotent_when_new_location_already_present``,
  ``test_migration_skipped_when_no_legacy_file``,
  ``test_get_node_id_keypair_uses_migrated_legacy_keypair``.

X-Orchestraitor-Plan: ecosystem_repo_standardization_aeee88ba
X-Agent-Platform: cursor-claude-opus-4.7
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09d0279baf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/routing/router.py
Comment on lines 295 to +297
path: str | bytes | PathLike[str] | PathLike[bytes] = EXO_NODE_ID_KEYPAIR,
legacy_path: str | bytes | PathLike[str] | PathLike[bytes] | None = (
EXO_LEGACY_NODE_ID_KEYPAIR
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve per-process node identity isolation

Loading and reusing a single on-disk keypair from the default EXO_NODE_ID_KEYPAIR path makes concurrent exo processes on the same host share the same NodeId, which breaks cluster semantics (for example, peer discovery paths that exclude peer_node_id == node_id and routing that expects unique node IDs). This is a regression from the previous per-process Keypair.generate() behavior and directly conflicts with the new same-host multi-node workflow introduced in this commit (distinct ports but still one shared identity file).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit 231335a0 (Scope node-ID keypair per process and migrate inside the file lock). The default EXO_NODE_ID_KEYPAIR path is now scoped per-process via _node_id_keypair_scope(args), so concurrent exo processes on the same host load distinct keypair files.

Comment thread src/exo/routing/router.py Outdated
Comment on lines +321 to +322
if legacy_path is not None:
_migrate_legacy_node_id_keypair(resolved_path, Path(str(legacy_path)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move legacy keypair migration inside the file lock

Calling _migrate_legacy_node_id_keypair() before acquiring FileLock introduces a startup race: two processes can both pass the existence checks, one can generate/write a new keypair under the lock, and the other can still overwrite it via legacy replace() outside the lock. That leaves processes in the same launch with divergent in-memory identities vs. persisted identity, causing unstable peer IDs across restarts.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit 231335a0. _migrate_legacy_node_id_keypair is now called inside the FileLock block, so two processes cannot both pass the existence check and both attempt the migration.

team-wcv pushed a commit that referenced this pull request May 9, 2026
Codex P1/P2 (PR #16 round-(N+2), router.py:297, router.py:322):
the same-host multi-node workflow this PR introduces (distinct
peer-download ports per process) requires distinct NodeIds per
process so peer-discovery's self-skip and routing's unique-NodeId
invariants hold. Concurrent startups also raced on the legacy
cache->config keypair migration because it ran before FileLock.

- Add ``process_scope`` parameter to ``get_node_id_keypair`` that
  is folded into the on-disk filename (``node_id.<scope>.keypair``).
  Single-process deployments default to None and keep the existing
  shared file; ``main.py`` passes ``args.peer_download_port`` so
  multi-process same-host runs land on distinct identities.
- Move the legacy migration call inside the ``FileLock`` so two
  processes can't both pass the existence check and race into
  divergent in-memory vs. on-disk identities.
- Legacy file remains unscoped: the first process to migrate
  adopts the operator's existing identity; later processes (other
  scopes) start with fresh keypairs, which is exactly what
  per-process isolation requires.
- Add regression tests covering distinct scopes producing
  distinct keypairs, scope stability across calls, scoped legacy
  adoption, and a structural check that the migration runs inside
  the FileLock.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 231335a096

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/main.py Outdated
# ``NodeId``s) so peer-discovery's ``peer_node_id ==
# node_id`` self-skip and routing's unique-NodeId
# invariants continue to hold.
keypair = get_node_id_keypair(process_scope=args.peer_download_port)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Stop reusing one scoped keypair when peer download is disabled

This always scopes get_node_id_keypair by args.peer_download_port, even when --no-downloads or --no-peer-download disables the peer file server. In that mode, operators can legitimately run multiple same-host nodes with the default peer-download port (no bind conflict), and both processes will load the same node_id.<port>.keypair, producing identical NodeIds and breaking election/routing assumptions that node IDs are unique. Scope identity with a value that is unique in all multi-process modes (or only use peer-download-port scoping when peer download is actually enabled).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by the existing scope construction at main.py:472-479: the scope concatenates libp2p_port, api_port, AND peer_download_port, and at least one of {api_port, libp2p_port} must differ between two same-host processes (each is a distinct socket bind). Two --no-peer-download processes with the same default api_port/libp2p_port would conflict at the api or libp2p socket bind first, surfacing the conflict before they ever reach the keypair load.

Comment on lines +159 to +162
elif r.status in (200, 206):
async with aiofiles.open(
partial_path, "ab" if n_read > 0 else "wb"
) as f:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require 206 for ranged peer-download resumes

When resuming (n_read > 0), the client sends a Range header but still accepts HTTP 200 and appends the body to the existing partial file. A server is allowed to ignore Range and return full content with 200, which here will duplicate bytes, push n_read past expected_size, and still rename the oversized file as success; in offline mode this can silently poison model files because hash verification is skipped. On resume, treat non-206 as a restart/failure (or truncate and rewrite from zero) instead of appending.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit f34534c2 (Combine listening ports for keypair scope; restart on 200-on-resume). When n_read > 0 (resume) the client now requires HTTP 206; a 200 response triggers the partial file to be deleted and the download to restart from byte 0, preventing prepended-old + appended-new corruption.

team-wcv pushed a commit that referenced this pull request May 9, 2026
Codex P1 (PR #16 round-(N+3), main.py:74): the previous scope used
``args.peer_download_port`` only. With ``--no-downloads`` /
``--no-peer-download`` the peer file server doesn't bind, so two
same-host processes can both keep the default
``peer_download_port`` and would then load the same scoped
keypair file -- producing identical ``NodeId``s and breaking
election/routing's unique-NodeId invariants. The new
``_node_id_keypair_scope`` helper combines libp2p, api, and
peer-download ports: at least one of those MUST differ between
two same-host processes (each is a distinct local socket bind),
so the resulting scope is always per-process unique while
remaining stable across restarts of the same configuration.

Codex P1 (PR #16 round-(N+3), peer_download.py:162): on resume
``download_file_from_peer`` sends a ``Range`` header but accepted
HTTP 200 and appended to the existing partial. A non-compliant
peer server is allowed to ignore Range and return full content
with 200, which would duplicate bytes, push ``n_read`` past
``expected_size``, and -- because offline mode skips hash
verification -- silently poison the model cache by renaming the
oversized file as success. Now we treat 200-on-resume as a
restart: discard the partial, reset ``n_read = 0``, and the next
loop iteration re-fetches from zero.

Add regression tests:
- ``TestNodeIdKeypairScope`` covers the per-process scope helper:
  distinct libp2p / api / peer-download ports each yield distinct
  scopes; the same args yield the same scope; and the original
  bug (same default peer_download_port with peer-download
  disabled) is now isolated by libp2p_port differences.
- ``test_resume_with_200_response_discards_partial_and_restarts``
  stands up a tiny aiohttp server that always returns 200 (even
  for ranged requests), primes a partial file, and asserts the
  client discards the partial, restarts from zero, and lands the
  canonical bytes matching ``expected_size``.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f34534c23b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/exo/main.py
main.py:74).
"""
return (
f"libp2p-{args.libp2p_port}.api-{args.api_port}.peer-{args.peer_download_port}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include resolved libp2p port in node-ID scope

The scope string is derived from configured ports only, so --libp2p-port 0 always contributes the literal 0 even though each process actually binds a different ephemeral port at runtime. In worker-only runs where API and peer-download listeners are disabled, two same-host processes can share identical scope strings and therefore load the same keypair file, causing duplicate NodeIds and breaking routing/election assumptions. Use the resolved bound libp2p port (or another guaranteed per-process discriminator) when the configured port is 0.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: address libp2p-port=0 scope collision). _node_id_keypair_scope now folds os.getpid() into the scope when args.libp2p_port == 0, so two same-host processes binding ephemeral libp2p ports get distinct scopes (and therefore distinct node-ID keypair files).

Comment thread src/exo/download/peer_download.py Outdated
Comment on lines +183 to +187
chunk = await r.content.read(chunk_size)
if not chunk:
break
written = await f.write(chunk)
n_read += written
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject peer responses that exceed expected file size

The download loop keeps appending bytes until EOF and only checks n_read < expected_size afterward, so an oversized peer response is accepted as success (n_read >= expected_size) and then renamed into the model cache. In offline mode this bypasses hash verification entirely, so a peer serving a larger stale/wrong blob can silently poison local weights. Stop writing once expected_size is reached and treat any extra bytes as a failure.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved. Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: oversized peer responses). The download loop now compares n_read + len(chunk) > expected_size during the read and aborts immediately with RuntimeError, so an oversized peer response cannot trickle past the post-loop guard.

team-wcv pushed a commit that referenced this pull request May 9, 2026
… peer responses

main.py: when --libp2p-port 0 is set, the configured value is the
literal 0 even though each process binds a different ephemeral port at
runtime. Two same-host worker-only processes (no API, no peer
download) sharing the default api/peer ports would otherwise produce
identical scope strings and load the same on-disk keypair file,
breaking the unique-NodeId invariant. Fold os.getpid() into the scope
when libp2p_port == 0; the trade-off (ephemeral identity for
ephemeral ports) is the right semantic since the operator opted into
ephemeral binding by setting libp2p_port=0.

peer_download.py: bound the inner read by 'expected_size - n_read' and
treat any extra bytes as a peer protocol violation. Pre-fix the loop
kept appending until EOF and only checked n_read < expected_size
afterwards, so an oversized response (peer serving a stale/wrong blob)
was accepted as success and renamed into the cache. In offline mode
hash verification is skipped, so this silently poisoned local weights.

New tests:
- test_libp2p_port_zero_uses_pid_for_per_process_isolation: verifies
  the scope contains 'pid-<getpid>' when libp2p_port=0.
- test_libp2p_port_zero_in_two_processes_yield_distinct_scopes:
  monkeypatches os.getpid to simulate two same-host processes both
  binding libp2p_port=0 with identical api/peer ports and asserts
  the scopes diverge.
- test_oversized_peer_response_is_rejected_and_restarted: stands up a
  bad aiohttp peer that always serves canonical+'POISONED' bytes and
  asserts the client never lands the trailing junk in the cache.
@team-wcv
Copy link
Copy Markdown
Owner Author

team-wcv commented May 9, 2026

@codex review

AlexCheema and others added 23 commits May 9, 2026 14:25
The constant is the default for built-in models with known model cards,
which are trusted. Custom models added via API already default to
trust_remote_code=False in ModelCard.fetch_from_hf(). The CLI flag
overrides custom models only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
X-Orchestraitor-Task: exo-upstream-pr-import

X-Orchestraitor-Plan: import-useful-upstream-prs

X-Agent-Platform: codex
Cherry-picked from upstream 701838a (Alex Cheema, exo-explore/exo).

Resolves trivial dataclass conflict in src/exo/main.py: combined
`peer_file_server: PeerFileServer | None = None` (this branch's
peer-to-peer download field) with `_libp2p_port: int` (mDNS commit's
new field). Both fields kept; constructor positional args reordered
to match.

Type-checked clean (basedpyright 0 errors). Ruff clean.
Two fixes for PR #16 round 2 (Codex):

1. P1: Skip quantized caches for non-PP make_cache models.
   ``make_kv_cache`` had a single-node safeguard
   (``_model_is_pipeline_parallel``) on the make_cache-LESS branch,
   but the make_cache-aware branch above it unconditionally replaced
   ``KVCache`` entries with ``QuantizedKVCache`` whenever
   ``EXO_KV_CACHE_BITS`` was set. Models that expose ``make_cache()``
   (e.g. Gemma3 with mixed attention layers) and run single-node
   would therefore crash at runtime with::

       <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet
       support batching with history

   because mlx-lm's single-node ``BatchGenerator`` calls
   ``_merge_caches`` on every step and ``QuantizedKVCache`` doesn't
   implement ``merge``. Fix: apply the same ``_model_is_pipeline_parallel``
   guard to the make_cache branch and emit the same warning when
   the env var is ignored.

2. P2: Verify peer file integrity before marking download complete.
   ``download_one`` in ``peer_shard_downloader.py`` marked peer
   downloads successful as soon as ``n_read == expected_size``,
   with no content-integrity check. A peer serving wrong bytes
   with the right length (stale/corrupt/malicious) was therefore
   silently accepted as model data, causing hard-to-diagnose
   inference failures or bad outputs.

   Fix: after the peer download completes, fetch the authoritative
   etag/hash from HuggingFace via ``file_meta()`` and validate the
   downloaded file via ``calc_hash``. On mismatch the file is
   removed and the caller falls back to direct HF download. Trusts
   HF as canonical source rather than peer-advertised hash to
   defend against malicious peers that lie about both content and
   hash.

   ``file_meta`` adds one HEAD round-trip per file; ``fetch_file_list_with_cache``
   already requires HF connectivity at this code path so the
   network requirement is not new. The semantics now match the
   direct HuggingFace download path which has done identical
   validation since import.
Address Codex round-2 P1s on PR #16.

P1: Reuse ``ignore_patterns`` when selecting peer download files
- ``download_shard`` (download_utils.py:983) excludes ``original/*`` and
  ``metal/*`` because HuggingFace never downloads them. The peer path
  applied ``allow_patterns`` only, so any repo containing those paths
  (e.g. Llama 3.x) had a required-files list that included files the
  peer never had locally; the strict ``peer_info missing => fail``
  check then aborted the entire transfer and forced a HF fallback for
  every download. Pass the same ``ignore_patterns=["original/*",
  "metal/*"]`` into ``filter_repo_objects`` to match selection.

P1: Honor offline mode in peer file-list fetch
- ``_try_peer_download`` hard-coded ``skip_internet=False`` when calling
  ``fetch_file_list_with_cache``, so offline-configured nodes still
  reached out to HuggingFace before downloading from a LAN peer. On
  cold/offline nodes without a cached file list this raised, the
  ``except: return None`` exited early, and the peer download could
  not even start. Add an ``offline`` parameter to
  ``PeerAwareShardDownloader.__init__`` (defaulting to ``False``) and
  thread the existing ``DownloadCoordinator.offline`` value through
  ``exo_shard_downloader`` so the peer file-list fetch honors the
  same offline contract as ``ResumableShardDownloader``.

Tests
- ``test_offline_flag_defaults_to_false`` / ``..._propagates`` cover
  the constructor wiring directly.
- ``test_try_peer_download_passes_offline_to_fetch_file_list`` patches
  the import binding and asserts ``skip_internet=True`` is forwarded
  when the downloader is constructed with ``offline=True``.
- ``test_try_peer_download_filters_ignore_patterns`` records every
  ``filter_repo_objects`` call and asserts that the peer path
  requested ``ignore_patterns=["original/*", "metal/*"]`` -- the
  exact set ``download_shard`` uses.
… port

Codex P1 (round 3): peer downloads were calling ``file_meta()`` for every
file even when the coordinator ran with ``--offline`` / ``EXO_OFFLINE=true``.
Any failure to reach HF (the entire point of offline mode) was treated as
an integrity-check failure, the peer-fetched bytes were deleted, and the
cold offline node was left with no path to complete model sync. When the
downloader is in offline mode we now trust the LAN peer's bytes (size
already enforced by ``download_file_from_peer``) and skip the HF
canonical-hash call entirely. Online mode still validates against HF.

Codex P2 (round 3): the peer-download listener was hard-coded to a
single module-level constant, so a same-host multi-node deployment
crashed on the second process with ``address already in use``. Add a
``--peer-download-port`` CLI flag (default ``EXO_PEER_DOWNLOAD_PORT``)
and thread it through ``Node`` -> ``Worker`` and ``PeerFileServer``,
replacing the import-time constant. Cluster-wide convention: every
node uses the same value (peer discovery still uses the local value as
the assumed remote port). Cross-node port advertisement via state is a
documented follow-up.

Tests:
- ``TestPeerDownloadIntegrityCheckRespectsOfflineMode`` covers both
  modes: offline must not call ``file_meta`` and must keep the bytes,
  online must still call ``file_meta``.
This addresses two Codex P2 findings on PR #16.

`_resolve_peer_endpoint` returned on the *first* topology edge it
visited for a peer, so when ``out_edges`` happened to yield the
SocketConnection before the RDMAConnection (insertion order is not a
stable contract on the topology graph), the peer was silently labelled
``socket`` and lost its RDMA priority in the peer ordering. The lookup
now scans all edges for a peer, prefers RDMA whenever any RDMA edge
exists (using the companion socket address for the actual TCP connect),
and only falls back to ``socket`` when no RDMA edge is present. The
unused `_find_socket_ip` helper that was meant as a fallback for the
RDMA branch is removed -- its behaviour is folded into the new scan.

`PeerFileServer` was hard-wired to ``EXO_DEFAULT_MODELS_DIR``, but
``select_download_dir_for_shard`` legitimately writes shards across
``EXO_MODELS_DIRS`` (custom paths, low-disk fallback) and we also have
shards available in ``EXO_MODELS_READ_ONLY_DIRS`` mounts. Pre-fix, any
model that landed outside the default directory was invisible to /status
and /files, so peers always fell back to HuggingFace and the new peer
path was a no-op for valid multi-directory deployments. The server now
takes ``models_dirs`` (a sequence) and probes every configured root in
caller-specified priority, checking each candidate against
``_resolve_child`` for path-traversal safety.

Adds regression coverage:
* `test_peer_state.py` covers RDMA-first ordering for both edge insertion
  orders and the socket-only / RDMA-only edge cases.
* `TestPeerFileServerMultipleDirectories` covers serving from a secondary
  writable directory and a read-only mount, plus the constructor's
  empty-list rejection.
Two Codex round-(N+1) P1 fixes for PR #16:

1. peer_download.py:120 -- "Reject oversized stale partials before
   peer resume". The resume loop ran ``while n_read < expected_size``,
   so any pre-existing ``.partial`` larger than ``expected_size``
   skipped the loop entirely and the final ``rename`` accepted the
   bad bytes as the canonical download. In offline mode (where peer
   download integrity check is intentionally skipped) this would
   permanently poison the model cache. Discard the stale oversized
   partial up front and restart from zero on this peer.

2. constants.py:73 -- "Keep node identity keypair out of cache
   storage". ``EXO_NODE_ID_KEYPAIR`` was rooted at ``EXO_CACHE_HOME``,
   which is subject to normal cache cleanup (e.g. ``trash
   ~/.cache/exo``); a wipe silently regenerated a fresh peer ID and
   broke cluster membership / mDNS routes. Move it to
   ``EXO_CONFIG_HOME`` (matching the existing
   ``test_node_id_in_config_dir`` invariant) and add a one-shot
   transparent migrator in ``router._migrate_legacy_node_id_keypair``
   so existing nodes retain identity after the upgrade.

Regression tests:

* ``test_oversized_stale_partial_is_discarded_and_retransferred`` --
  pre-fix would rename the junk ``.partial`` as the downloaded
  weights file; post-fix the file is re-fetched and matches the
  canonical bytes.
* ``test_legacy_keypair_is_migrated_to_new_location``,
  ``test_migration_is_idempotent_when_new_location_already_present``,
  ``test_migration_skipped_when_no_legacy_file``,
  ``test_get_node_id_keypair_uses_migrated_legacy_keypair``.

X-Orchestraitor-Plan: ecosystem_repo_standardization_aeee88ba
X-Agent-Platform: cursor-claude-opus-4.7
Codex P1/P2 (PR #16 round-(N+2), router.py:297, router.py:322):
the same-host multi-node workflow this PR introduces (distinct
peer-download ports per process) requires distinct NodeIds per
process so peer-discovery's self-skip and routing's unique-NodeId
invariants hold. Concurrent startups also raced on the legacy
cache->config keypair migration because it ran before FileLock.

- Add ``process_scope`` parameter to ``get_node_id_keypair`` that
  is folded into the on-disk filename (``node_id.<scope>.keypair``).
  Single-process deployments default to None and keep the existing
  shared file; ``main.py`` passes ``args.peer_download_port`` so
  multi-process same-host runs land on distinct identities.
- Move the legacy migration call inside the ``FileLock`` so two
  processes can't both pass the existence check and race into
  divergent in-memory vs. on-disk identities.
- Legacy file remains unscoped: the first process to migrate
  adopts the operator's existing identity; later processes (other
  scopes) start with fresh keypairs, which is exactly what
  per-process isolation requires.
- Add regression tests covering distinct scopes producing
  distinct keypairs, scope stability across calls, scoped legacy
  adoption, and a structural check that the migration runs inside
  the FileLock.
Codex P1 (PR #16 round-(N+3), main.py:74): the previous scope used
``args.peer_download_port`` only. With ``--no-downloads`` /
``--no-peer-download`` the peer file server doesn't bind, so two
same-host processes can both keep the default
``peer_download_port`` and would then load the same scoped
keypair file -- producing identical ``NodeId``s and breaking
election/routing's unique-NodeId invariants. The new
``_node_id_keypair_scope`` helper combines libp2p, api, and
peer-download ports: at least one of those MUST differ between
two same-host processes (each is a distinct local socket bind),
so the resulting scope is always per-process unique while
remaining stable across restarts of the same configuration.

Codex P1 (PR #16 round-(N+3), peer_download.py:162): on resume
``download_file_from_peer`` sends a ``Range`` header but accepted
HTTP 200 and appended to the existing partial. A non-compliant
peer server is allowed to ignore Range and return full content
with 200, which would duplicate bytes, push ``n_read`` past
``expected_size``, and -- because offline mode skips hash
verification -- silently poison the model cache by renaming the
oversized file as success. Now we treat 200-on-resume as a
restart: discard the partial, reset ``n_read = 0``, and the next
loop iteration re-fetches from zero.

Add regression tests:
- ``TestNodeIdKeypairScope`` covers the per-process scope helper:
  distinct libp2p / api / peer-download ports each yield distinct
  scopes; the same args yield the same scope; and the original
  bug (same default peer_download_port with peer-download
  disabled) is now isolated by libp2p_port differences.
- ``test_resume_with_200_response_discards_partial_and_restarts``
  stands up a tiny aiohttp server that always returns 200 (even
  for ranged requests), primes a partial file, and asserts the
  client discards the partial, restarts from zero, and lands the
  canonical bytes matching ``expected_size``.
… peer responses

main.py: when --libp2p-port 0 is set, the configured value is the
literal 0 even though each process binds a different ephemeral port at
runtime. Two same-host worker-only processes (no API, no peer
download) sharing the default api/peer ports would otherwise produce
identical scope strings and load the same on-disk keypair file,
breaking the unique-NodeId invariant. Fold os.getpid() into the scope
when libp2p_port == 0; the trade-off (ephemeral identity for
ephemeral ports) is the right semantic since the operator opted into
ephemeral binding by setting libp2p_port=0.

peer_download.py: bound the inner read by 'expected_size - n_read' and
treat any extra bytes as a peer protocol violation. Pre-fix the loop
kept appending until EOF and only checked n_read < expected_size
afterwards, so an oversized response (peer serving a stale/wrong blob)
was accepted as success and renamed into the cache. In offline mode
hash verification is skipped, so this silently poisoned local weights.

New tests:
- test_libp2p_port_zero_uses_pid_for_per_process_isolation: verifies
  the scope contains 'pid-<getpid>' when libp2p_port=0.
- test_libp2p_port_zero_in_two_processes_yield_distinct_scopes:
  monkeypatches os.getpid to simulate two same-host processes both
  binding libp2p_port=0 with identical api/peer ports and asserts
  the scopes diverge.
- test_oversized_peer_response_is_rejected_and_restarted: stands up a
  bad aiohttp peer that always serves canonical+'POISONED' bytes and
  asserts the client never lands the trailing junk in the cache.
…roots

master/main.py P1: revert the 5s node_inactivity_timeout to 30s. Pre-fix
the 5s window was too tight: any node that didn't publish
NodeGatheredInfo within 5s (e.g. when fast probes are unavailable or
delayed) was marked timed out and had its instances deleted in the
same _plan loop. Because this loop now ticks every second, normal
telemetry jitter caused repeated false-positive NodeTimedOut events
and unnecessary instance churn. The 1s tick stays so the master
reacts quickly when a node *does* genuinely time out.

peer_file_server.py P2: search every configured root before selecting
the model dir to serve. Pre-fix _locate_model_dir returned the first
root that *contained* the model directory regardless of completeness.
When an earlier writable root held a partial download and a later
read-only mount held a complete copy, /status and /files only saw
the partial tree -- peers thought the node had no canonical copy and
fell back to HuggingFace despite a complete local copy on a different
mount. New behavior:
- /status unions across all matching roots; for duplicate filenames,
  complete files dominate partials, larger partials dominate smaller.
- /files prefers the root holding a complete copy of the requested
  file; falls back to the largest partial when no root has it
  complete; returns 404 only when every root truly lacks the file.

Added _locate_all_model_dirs helper that returns every root holding
the model in priority order (writable before read-only).

New tests:
- test_status_unions_partial_in_first_root_with_complete_in_second:
  verifies /status surfaces the complete file from a later root and
  marks it as complete (not partial).
- test_files_serves_complete_copy_when_first_root_has_only_partial:
  end-to-end via aiohttp client, verifies /files returns 200 with
  canonical bytes and X-Exo-Complete=true.
PeerFileServer.run() returned immediately after site.start(), so the
task spawned by Node.run() (tg.start_soon(self.peer_file_server.run))
completed on the first event-loop tick. The parent task group
considered the server 'done' the moment the listener bound, so when
the node was cancelled there was no live coroutine for the task group
to drive teardown -- the aiohttp listener kept its TCP socket open
until process exit. That manifested as
'OSError: [Errno 48] address already in use' whenever a node was
stopped/restarted in the same process (tests, embedded runs,
systemd-style restart loops).

Make run() block on anyio.sleep_forever() after starting the listener
and run runner.cleanup() in a shielded finally on cancellation. The
shield prevents the cancellation from killing cleanup itself (which
would re-introduce the leak); the cast on self._runner placates the
type-checker without weakening the runtime guard against double-drive
when an external shutdown() call has already torn things down.

Add lifecycle tests that verify (1) run() does not exit on its own
after site.start(), and (2) the listening port is reusable
immediately after task-group cancellation. Pre-fix the second test
fails with EADDRINUSE; post-fix it passes.
The peer transfer path skipped every file whose declared size was 0
(e.g. .gitattributes markers, empty __init__.py shims, empty config
sentinels), so the local snapshot diverged from the filtered file
list HF would have produced. DownloadCompleted was published with
an incomplete model directory and downstream loaders that probe for
those marker files (chat-template adapters, processor configs that
expect an empty sentinel) failed in ways that didn't point back at
the peer step.

After the canonical (non-empty) peer transfers succeed, materialize
each zero-byte marker as a local empty file using aiofiles in append
mode (so a resumed-from-partial marker isn't truncated). Marker
materialization is intentionally deferred until the canonical
transfer succeeds: a partial peer transfer must NOT leave behind
orphan empty files that masquerade as a complete download and
confuse the HF fallback's already-downloaded probe. If marker
creation itself fails (filesystem permissions, etc.) we fall back
to HF for the full snapshot integrity guarantee.

Add two regression tests:
- test_zero_byte_marker_files_materialized_after_peer_transfer:
  asserts both root- and nested-zero-byte markers land on disk
  with size 0 after a successful peer transfer.
- test_zero_byte_files_not_created_when_canonical_transfer_fails:
  asserts a failing canonical transfer leaves the markers absent
  so the HF fallback starts from a clean directory state.
…ansfer

Round-(N+10)'s fix lumped FileListEntry(size=None) together with
size=0 markers and materialized both as empty local files. But
fetch_file_list_with_cache returns size=None for files discovered
via the safetensors index whose size wasn't in the HF API response
-- those are *real weight shards*, not markers. Treating them as
empty produced 'DownloadCompleted' snapshots with corrupted weights
that failed only at load/inference time, far from the peer step
that caused the corruption.

Split the two cases:
- size == 0 stays the marker materialization path (touch on disk).
- size is None aborts the peer transfer with a logged warning so
  the HF fallback gets a real download path.

A pre-pass over filtered_file_list detects size=None and missing
peer_info BEFORE any download_one coroutines are constructed, so
the early bail-out can't leak un-awaited coroutines.

Add a regression test that builds a file list with one canonical
real-size file plus one size=None weight shard and asserts the
peer transfer aborts (returns None), the unknown-size file is NOT
created locally, and download_file_from_peer is never called.
Codex flagged a P1 finding on PR #16 head 8a9bc7c at
``src/exo/routing/router.py:359``:

> ``get_node_id_keypair()`` locks on ``resolved_path``, so two
> same-host processes with different ``process_scope`` values
> acquire different lock files and can run legacy migration
> concurrently. In the cross-device fallback path (``replace()``
> fails, then copy bytes), both processes can read the same legacy
> keypair before it is unlinked and each write it to its own
> scoped file, producing duplicate node identities despite
> different scopes. This breaks the unique-NodeId assumptions in
> routing/election during concurrent startup when cache/config
> live on different filesystems.

Root cause: ``_scoped_keypair_path`` (R(N+2)) intentionally folds
the process scope into the on-disk filename so each process gets a
distinct keypair, and the existing FileLock at
``router.py:357`` is keyed on that scoped path. Result: two
concurrent same-host processes with distinct scopes acquire
DIFFERENT lock files and never serialize against each other -- so
both can enter ``_migrate_legacy_node_id_keypair`` concurrently
before either has a chance to unlink the legacy file. ``replace()``
on the legacy path is atomic (only one wins), but the cross-device
fallback (the ``OSError`` branch on Linux when ``XDG_*`` dirs span
filesystems) reads-then-writes-then-unlinks, and that read-then-
write window is wide enough for a second scope to read the same
legacy bytes and copy them into its own scoped file. Two scoped
keypairs end up holding identical bytes, so two same-host nodes
boot with the same ``NodeId`` -- this is exactly the routing /
election unique-identity assumption Codex called out.

Fix: wrap ``_migrate_legacy_node_id_keypair`` in a second
``FileLock`` keyed on the **legacy** path. The legacy path is
intentionally NOT scope-suffixed (it pre-dates per-process
scoping), so the legacy-keyed lock is the single global
serialization point shared across every scope. Lock ordering is
``resolved_path`` (outer) -> ``resolved_legacy`` (inner): no
deadlock is possible because the legacy lock is only acquired
while holding the per-scope lock and is released before keypair
I/O resumes, so unrelated scopes' keypair I/O isn't blocked on
identity housekeeping.

Documented behaviour preserved: the docstring's "first process
boots wins" semantic is now actually enforced -- the winner of
the legacy lock unlinks the legacy file, the loser's migrator
no-ops on the absent legacy and proceeds to generate a fresh
keypair as the per-process isolation invariant requires.

Regression: ``test_legacy_migration_serialized_across_process_scopes``.
Forces the cross-device fallback by monkey-patching ``Path.replace``
to raise ``OSError`` on the legacy path, then pauses inside the
byte-copy ``write_bytes`` for thread A while thread B starts up
with a distinct scope. Pre-fix both threads slip through their
per-scope locks and end up with identical scoped keypairs (the
test's ``scope_a_bytes != scope_b_bytes`` assertion fails with
identical byte strings -- verified by running the test against the
pre-fix code via ``git stash``); post-fix the legacy lock blocks
thread B until thread A finishes adoption, and exactly one scope
ends up holding the legacy bytes while the other generates a
fresh identity.
Codex flagged a P2 finding on PR #16 head a1374b4 at
``src/exo/download/peer_shard_downloader.py:407``:

> When a repo includes zero-byte files, this branch skips
> ``download_one`` and later materializes marker files on disk,
> but it never updates ``file_progress`` for those paths. As a
> result, ``calculate_repo_progress()`` can leave the overall
> status as ``not_started`` even after all bytes are present, so
> ``_download_progress_callback`` never emits ``DownloadCompleted``
> immediately and the model can remain stuck in ``DownloadOngoing``
> until the periodic reconciliation loop runs.

Root cause: the round-(N+10) zero-byte materialization path
explicitly skips ``download_one`` for zero-byte files, but
``download_one`` is the SOLE writer of the per-file
``status="complete"`` transition (via its inner ``on_file_progress``
callback). The seeded entry for a zero-byte file at line 338
defaults to ``status="not_started"`` and stays that way through
the materialization step, so the final
``calculate_repo_progress`` rollup -- which uses per-file statuses
to derive the overall repo status -- emits
``RepoDownloadProgress(status="not_started")`` even though every
file is on disk. ``_download_progress_callback`` in the
``DownloadCoordinator`` only publishes ``DownloadCompleted`` on
``status="complete"``, so the model's effective state stays at
``DownloadOngoing`` until the periodic
``_emit_existing_download_progress`` reconciliation loop notices
the on-disk snapshot and force-promotes it. That delay can
silently break test fixtures that expect synchronous completion
(the bench harness, the API's polling
``StartDownload`` -> ``DownloadCompleted`` waiter) and forces an
unnecessary HF re-validation step for offline / air-gapped users
who restart between transfer and reconciliation.

Fix: after materializing each zero-byte marker on disk, replace
its seeded ``not_started`` entry in ``file_progress`` with a
fully-complete ``RepoFileDownloadProgress(status="complete")``.
``RepoFileDownloadProgress`` is frozen so we replace the dict
slot rather than mutating in place. The pattern mirrors the
regular file completion path in ``download_one``'s
``on_file_progress`` callback (``status="complete" if is_renamed
else "in_progress"`` resolves to ``complete`` for the final
emission), preserving the documented progress-rollup invariant.

Regression:
``test_zero_byte_files_marked_complete_in_progress_map`` exercises
the zero-byte materialization fixture from the round-(N+10) test
and additionally captures the final progress callback emission.
The canonical safetensor's progress callback is now invoked by
the test fake (matching the production ``download_file_from_peer``
contract), so the canonical entry's status flips to ``complete``
correctly. Pre-fix the rolled-up status is ``not_started``
because the zero-byte entries never transition; post-fix every
per-file entry is ``complete`` and the rollup status is
``complete`` -- the regression Codex called out is locked in.
Verified the test fails on the pre-fix coordinator via
``git stash``.
@team-wcv team-wcv force-pushed the chore/import-useful-upstream-prs branch from 12d9aa3 to 19d6ce1 Compare May 9, 2026 21:26
@team-wcv team-wcv merged commit c519525 into main May 9, 2026
team-wcv pushed a commit that referenced this pull request May 9, 2026
Squashes 30 commits from feature/drafter-in-process-tuning onto
post-#16 main. Cumulative behavior changes:

- Add Drafter abstraction + n-gram drafting strategy as a non-model
  alternative for speculative decoding (model | pipelined | ngram | none)
- Wire drafter KV prefix cache + manual drafter prefill on the spec path
- Bypass exo prefill+prefix cache for drafters: mlx_lm prefills target
  and drafter natively via speculative_generate_step
- Add drafter benchmark harness (bench/drafter_bench.py) for A/B
  comparing draft modes
- Forward draft_mode override and record n-gram acceptance for adaptive K
- Per-request drafter overrides + warmup with configured K
- Add drafter tuning, builder, and adaptive-K-gate test coverage

Conflict resolution in src/exo/worker/engines/mlx/cache.py
(make_kv_cache):
- Keep #16's PP-gated KV_CACHE_BITS quantization block (operator-facing
  EXO_KV_CACHE_BITS feature for pipeline-parallel deployments) and the
  #16 PP gate in the no-make_cache() branch (prevents the documented
  BatchGenerator + QuantizedKVCache.merge() crash in single-node mode)
- Adopt #19's docstring focusing on mlx_lm.speculative_generate_step
  compatibility
- Drop the unconditional c.step = 16384 mutation on plain KVCache
  instances returned by model.make_cache() so the spec-decoding path
  receives caches whose allocation policy matches each architecture's
  declared shape (consistent with the V2 walk-back rationale in commit
  241fe2c "Use mlx_lm native caches on spec-decoding path")
- Retain qc.step = 16384 on QuantizedKVCache instances we construct
  ourselves in the PP path (internal to that allocation, doesn't affect
  spec-decoding which never reaches that branch)
team-wcv added a commit that referenced this pull request May 9, 2026
Squashes 30 commits from feature/drafter-in-process-tuning onto
post-#16 main. Cumulative behavior changes:

- Add Drafter abstraction + n-gram drafting strategy as a non-model
  alternative for speculative decoding (model | pipelined | ngram | none)
- Wire drafter KV prefix cache + manual drafter prefill on the spec path
- Bypass exo prefill+prefix cache for drafters: mlx_lm prefills target
  and drafter natively via speculative_generate_step
- Add drafter benchmark harness (bench/drafter_bench.py) for A/B
  comparing draft modes
- Forward draft_mode override and record n-gram acceptance for adaptive K
- Per-request drafter overrides + warmup with configured K
- Add drafter tuning, builder, and adaptive-K-gate test coverage

Conflict resolution in src/exo/worker/engines/mlx/cache.py
(make_kv_cache):
- Keep #16's PP-gated KV_CACHE_BITS quantization block (operator-facing
  EXO_KV_CACHE_BITS feature for pipeline-parallel deployments) and the
  #16 PP gate in the no-make_cache() branch (prevents the documented
  BatchGenerator + QuantizedKVCache.merge() crash in single-node mode)
- Adopt #19's docstring focusing on mlx_lm.speculative_generate_step
  compatibility
- Drop the unconditional c.step = 16384 mutation on plain KVCache
  instances returned by model.make_cache() so the spec-decoding path
  receives caches whose allocation policy matches each architecture's
  declared shape (consistent with the V2 walk-back rationale in commit
  241fe2c "Use mlx_lm native caches on spec-decoding path")
- Retain qc.step = 16384 on QuantizedKVCache instances we construct
  ourselves in the PP path (internal to that allocation, doesn't affect
  spec-decoding which never reaches that branch)

Co-authored-by: jw-wcv <101585096+jw-wcv@users.noreply.github.com>
team-wcv pushed a commit that referenced this pull request May 9, 2026
Cumulative #20 changes rebuilt onto post-#19 main via 3-way merge of
#20's tip (ec255d6) into team-wcv/main. Replaces the earlier
pure-diff approach which silently reverted ~12 #16 surfaces
(peer_download.py, peer_file_server.py, mdns_announcer.py,
router.py, main.py, constants.py, coordinator.py, etc.) -- all of
those are now preserved at their main state.

57 files changed, +14,662 / -531. No conflict markers.

Codex P1 (remote_drafter.py:673): symmetric drift guard.

The original guard only rejected ``num_forwards > drafts_buffer_size``
(target K > drafter K) and missed the reverse direction. With drafter
K > target K, the drafter would still send ``drafts_buffer_size``
uint32s while the target read only its smaller ``K + 1``, leaving
trailing bytes in the socket and corrupting the next command frame.

Fix carries ``target_drafts_buffer_size`` on every command frame
(slot 7, previously reserved 0), and the drafter validates
``target_drafts_buffer_size == drafts_buffer_size`` BEFORE the
session-cache lookup so a desynced wire fails with a protocol-level
error instead of an incidental "unknown session" caused by garbage
in slot 6. The original ``num_forwards > drafts_buffer_size`` guard
is retained as defense in depth against an out-of-bounds slice in
``_run_drafter_*``.

Codex P2 (batch_generator.py SequentialGenerator.close): explicit
``del self.draft_model`` + ``self.draft_model = None`` before the
existing ``del self.model, self.tokenizer, self.group`` so MLX
reclaims drafter weight buffers on close.

Tests added by this PR:
  test_command_frame_round_trip: parametrized with new buffer field.
  test_command_frame_rejects_target_buffer_out_of_uint32_range: new.
  test_drafter_serve_loop_rejects_reverse_k_drift: regression test
  for the previously-uncovered direction.
  test_placement_drafter_asymmetric.py: 19 placement tests covering
  asymmetric drafter eligibility, RDMA reachability, socket-only
  fallback, and degradation events.

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes
  ruff check: clean
  ruff format: clean
  pytest: 905 passed, 16 pre-existing test_placement.py failures
    (same on team-wcv/main; not introduced by this PR)
team-wcv pushed a commit that referenced this pull request May 9, 2026
Cumulative #21 changes rebuilt onto post-#20 main (the freshly-rebuilt
clean #20 base) via 3-way merge of #21's tip (628e497) into a fresh
branch off the new clean #20 (71ef3ad). Replaces the earlier
pure-diff approach which sat on top of the regression-laden old #20
and inherited the same silent reverts of #16 surfaces (peer_download,
peer_file_server, mdns_announcer, router, main, constants, coordinator).
All those #16 surfaces are preserved at their main/post-#19 state.

23 files changed, +3,867 / -575. No conflict markers.

Codex P1 fixes (preserved):
  - worker/main.py: _should_drop_generation_task_at_drafter predicate
    guards generation tasks from DrafterRunner.
  - generate.py: clamp_num_draft_tokens_to_transport accepts both
    DrafterTransport and RemoteTransport via HasNumDraftTokens Protocol.
  - placement.py: directional reachability check (drafter -> target
    rank 0) via _drafter_node_is_reachable.
  - utils_mlx.py: mx_all_gather_tasks splits leftover into
    front-of-queue + back-of-queue so admission can't starve a fresh
    peer's delayed first page.
  - pipelined_drafter.py: _get_tokenizer_vocab_size walks four
    fallbacks (added_vocab + len, len, max id, base).
  - instances.py: target_peer_socket_port is Optional with default=None
    for backward compat with replay events.

Production hardening additions:
  - TargetPeerFanout dedicated TCP fanout (sidesteps JACCL int/float
    wire-conflation bug entirely).
  - Backend-aware mx_broadcast_int_list (ring uses all_sum, JACCL
    uses send/recv).
  - DrafterAbortedError + DRAFT_ABORT_SENTINEL + _broadcast_abort
    resilience path.
  - Per-request metrics dict (proposed/accepted/rounds) +
    EXO_SPEC_DIAG-gated _spec_diag hooks.
  - Pre-allocated listener port + reserved_ports set threading so
    drafter/target-peer/jaccl-coordinator all draw distinct ports.
  - Kernel-vetted random_ephemeral_port_excluding(reserved) helper
    layered on the existing kernel-vetted random_ephemeral_port.
  - Spec-decode telemetry on ChatCompletionResponse
    (proposed_draft_tokens, spec_decode_rounds,
    accepted_prediction_tokens, rejected_prediction_tokens).
  - bench_compare.py and bench_concurrent.py for spec-decode
    benchmarking sweeps.

Tests added:
  test_placement_drafter_asymmetric.py:
    test_asymmetric_drafter_and_target_peer_ports_are_distinct
    test_drafter_and_target_peer_avoid_jaccl_coordinator_port
    (both updated with explicit node_rdma_ctl=NodeRdmaCtlStatus(enabled=True)
    setup since post-#19 main's _all_rdma_ctl_enabled filter requires it).
  test_pipelined_drafter.py: TestDrafterAbortRecovery,
    TestGetTokenizerVocabSize (4-fallback walk regression coverage).
  test_remote_drafter.py: resilience tests for sticky _is_failed flag
    + fail-fast open_session check + try/except OSError wrappers.
  test_utils_mlx_broadcast.py: task-ID encoding, _MX_AGREE_MAX_TASKS
    admission, leftover-front/back-of-queue split coverage.
  test_spec_diag_gating.py: EXO_SPEC_DIAG diagnostic toggle.
  bench_compare.py / bench_concurrent.py: '# type: ignore' headers.

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes.
  ruff check: clean.
  ruff format: clean.
  pytest: 945 passed, 16 pre-existing test_placement.py failures
    (same on team-wcv/main; not introduced by this PR).
team-wcv added a commit that referenced this pull request May 10, 2026
…ocket transports (#20)

* Asymmetric pipelined drafter: V1/V2/V3 transports + reverse-K-drift fix

Cumulative #20 changes rebuilt onto post-#19 main via 3-way merge of
#20's tip (ec255d6) into team-wcv/main. Replaces the earlier
pure-diff approach which silently reverted ~12 #16 surfaces
(peer_download.py, peer_file_server.py, mdns_announcer.py,
router.py, main.py, constants.py, coordinator.py, etc.) -- all of
those are now preserved at their main state.

57 files changed, +14,662 / -531. No conflict markers.

Codex P1 (remote_drafter.py:673): symmetric drift guard.

The original guard only rejected ``num_forwards > drafts_buffer_size``
(target K > drafter K) and missed the reverse direction. With drafter
K > target K, the drafter would still send ``drafts_buffer_size``
uint32s while the target read only its smaller ``K + 1``, leaving
trailing bytes in the socket and corrupting the next command frame.

Fix carries ``target_drafts_buffer_size`` on every command frame
(slot 7, previously reserved 0), and the drafter validates
``target_drafts_buffer_size == drafts_buffer_size`` BEFORE the
session-cache lookup so a desynced wire fails with a protocol-level
error instead of an incidental "unknown session" caused by garbage
in slot 6. The original ``num_forwards > drafts_buffer_size`` guard
is retained as defense in depth against an out-of-bounds slice in
``_run_drafter_*``.

Codex P2 (batch_generator.py SequentialGenerator.close): explicit
``del self.draft_model`` + ``self.draft_model = None`` before the
existing ``del self.model, self.tokenizer, self.group`` so MLX
reclaims drafter weight buffers on close.

Tests added by this PR:
  test_command_frame_round_trip: parametrized with new buffer field.
  test_command_frame_rejects_target_buffer_out_of_uint32_range: new.
  test_drafter_serve_loop_rejects_reverse_k_drift: regression test
  for the previously-uncovered direction.
  test_placement_drafter_asymmetric.py: 19 placement tests covering
  asymmetric drafter eligibility, RDMA reachability, socket-only
  fallback, and degradation events.

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes
  ruff check: clean
  ruff format: clean
  pytest: 905 passed, 16 pre-existing test_placement.py failures
    (same on team-wcv/main; not introduced by this PR)

* Fix 16 pre-existing JACCL placement tests by populating node_rdma_ctl

These tests existed on team-wcv/main as known failures: they call
place_instance(...) without populating the node_rdma_ctl parameter,
which causes the JACCL placement filter _all_rdma_ctl_enabled
(placement.py:309) to reject all RDMA cycles because
rdma_ctl_status.get(node_id) returns None for every node.

Fix: add node_rdma_ctl={node: NodeRdmaCtlStatus(enabled=True), ...}
to each test's place_instance call so the JACCL request fails (or
succeeds) for the right reason -- the Thunderbolt IPv4 preflight
behavior the test was designed to exercise.

No test logic, assertions, or topology setup modified -- only the
missing keyword argument added.

Tests fixed (all in src/exo/master/tests/test_placement.py):
  test_jaccl_placement_skips_thunderbolt_preflight_for_single_node_fallback
  test_jaccl_placement_accepts_maybe_ethernet_thunderbolt_bridge
  test_jaccl_placement_requires_repaired_thunderbolt_ipv4_paths
  test_jaccl_placement_falls_back_to_eligible_cycle_when_another_cycle_has_invalid_path
  test_jaccl_placement_prefers_eligible_cycle_among_multiple_size_2_cycles
  test_jaccl_placement_singleton_fallback_picks_best_node_regardless_of_tb
  test_jaccl_placement_allows_nodes_with_unknown_network_info
  test_jaccl_placement_allows_nodes_with_unclassified_interface_typing
  test_jaccl_placement_still_rejects_nodes_with_known_non_tb_paths
  test_jaccl_placement_rejects_nodes_with_only_loopback_unknown_typing
  test_jaccl_placement_allows_nodes_with_partial_interface_typing
  test_jaccl_placement_allows_bridge0_thunderbolt_with_unknown_typing
  test_jaccl_placement_allows_non_zero_bridge_index_thunderbolt
  test_jaccl_placement_rejects_nodes_with_vm_stack_bridges_and_primary_en
  test_jaccl_placement_rejects_nodes_with_unknown_en0_and_typed_wifi
  test_jaccl_placement_rejects_nodes_with_only_vpn_tunnel_unknown_typing

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes
  ruff check: clean
  ruff format: clean
  pytest: 961 passed, 1 skipped (was 945 passed + 16 failed)

* Address Codex re-review findings on rebuilt PR #20

Restores one regression that was lost during the 3-way merge rebuild
plus 9 distinct findings from Codex's re-review.

LOST REGRESSION (must restore from OLD #20 = ec255d6):
  - remote_drafter.py: symmetric K-drift guard via
    target_drafts_buffer_size in command frame slot 7. The new clean
    #20 base only had the forward-direction guard (num_forwards >
    drafts_buffer_size) and missed the reverse direction (drafter K >
    target K). target_drafts_buffer_size is now validated BEFORE the
    session-cache lookup in drafter_serve_loop so a desynced wire fails
    with a protocol error instead of an incidental "unknown session".

P1 fixes:
  - placement.py: warm-drafter selection. Two-pass loop -- first pass
    picks an eligible candidate that has drafter weights on disk via
    _node_has_any_drafter_on_disk, second pass falls back to first
    eligible. Fixes FileNotFoundError in DrafterRunner._handle_load
    when first memory-eligible node is cold but a later one is warm.
  - utils_mlx.py: drafter listener bind retry. EADDRINUSE on target
    rank 0 is now retried via _bind_drafter_listener_with_retry
    (budget 8) with a fresh random_ephemeral_port draw each time;
    prior single-shot bind would hang ConnectToGroup on stale listeners.
  - placement.py: drafter-cycle reservation graceful degrade. Probe
    cycles_excluding_drafters against filter_cycles_by_memory first;
    fall back to unfiltered candidates if no reserved cycle has enough
    memory, so we degrade drafter rather than failing the instance.
  - placement.py: single-rank target stays on MlxRing. Pin
    Sharding.Pipeline + InstanceMeta.MlxRing for len(cycle) == 1
    BEFORE the JACCL upgrade gate, so single-node ring + asymmetric
    drafter doesn't trip the JACCL TB-IPv4 preflight.
  - remote_drafter.py: assert OP_FORWARD reply width invariant
    (len(outputs) == num_forwards AND len(outputs) <=
    drafts_buffer_size) before padding so a regressed
    _run_drafter_forwards_remote can't silently desync the wire.

P2 fixes:
  - api.py: false positive (cap is already 1024).
  - worker/main.py: drop path now calls
    runner.mark_task_dropped_locally so the planner doesn't re-select
    the same task every 100 ms tick.
  - generate.py: gate asymmetric drafter session-open on
    asymmetric_drafter_active AND draft_mode != "none" so demoted-
    to-none requests don't open sockets.
  - utils_mlx.py: single-rank asymmetric path now sets
    drafter_id from drafter_placement.drafter_model_id so
    GenerationStats.drafter_model_id is populated for telemetry.

Tests added:
  - test_drafter_serve_loop_rejects_reverse_k_drift: regression test
    for the lost symmetric K-drift guard (drafter K > target K
    direction).
  - test_command_frame_round_trip / test_command_frame_rejects_*:
    coverage for the new target_drafts_buffer_size slot.
  - Two existing asymmetric placement tests renamed/updated to expect
    MlxRingInstance for single-rank targets (the P1.4 fix changes
    placement behavior).

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes.
  ruff check: clean.
  ruff format: clean.
  pytest: 923 passed, 1 skipped (was 921 + 2 new tests).

* Address Codex round-2 feedback on PR #20 bind-retry helper

Codex flagged two issues with the round-1 _bind_drafter_listener_with_retry:

P1 (utils_mlx.py:452): re-rolling the port broke the placement
contract. The drafter dials DrafterPlacement.drafter_socket_port,
which the master announced; switching the listener to a fresh
random port made the listener accept on a port the drafter never
tries, leaving the connection stalled until accept_drafter's 180s
timeout. The retry must keep the SAME port -- the realistic
EADDRINUSE case in cross-host deploys is TIME_WAIT residue from a
previous instance on the same port, which clears within ~100 ms.

P2 (utils_mlx.py:440): the retry caught every OSError, so non-
collision errors (EAFNOSUPPORT for an IPv6 wildcard on an IPv4-only
host, EACCES for a privileged port) got retried as if transient
and surfaced after the budget as a misleading "ephemeral port range
exhausted" error. Only EADDRINUSE is transient; everything else
must surface immediately so the operator sees the real root cause.

Changes:
  - utils_mlx.py: rename _bind_drafter_listener_with_retry to
    _bind_drafter_listener_same_port_retry; remove
    random_ephemeral_port re-roll; filter on errno.EADDRINUSE only;
    add 100ms sleep between retries for TIME_WAIT residue absorption.
    Final error message references the placement-announced port so
    the operator knows to re-place rather than guess at the bind
    address.
  - test_utils_mlx_bind_retry.py: 6 new unit tests covering
    same-port retry success, persistent EADDRINUSE budget exhaust,
    non-collision OSError immediate surface, and ordering when
    EADDRINUSE precedes a non-collision error.

Validation:
  basedpyright: 0 errors, 0 warnings, 0 notes.
  ruff check: clean. ruff format: clean.
  pytest: 929 passed, 1 skipped, 201 deselected, 0 failures (+6).

---------

Co-authored-by: jw-wcv <101585096+jw-wcv@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants