Import useful upstream exo PRs by team-wcv · Pull Request #16 · team-wcv/exo

team-wcv · 2026-05-07T06:48:55Z

Summary

port selected useful upstream exo PRs into the team-wcv fork
add RDMA placement gating from rdma_ctl state, worker backoff reconciliation, liveness cadence tuning, persistent node IDs, trust-remote-code CLI wiring, KV cache quantization controls, and peer-to-peer model downloads
adapt the imported code to the current team-wcv branch shape, including current model directory constants and strict typing cleanup

Upstream PRs included

Validation

uv run basedpyright
uv run ruff check
EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest -q -> 511 passed, 1 skipped, 199 deselected
focused port suite: EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest src/exo/shared/tests/test_xdg_paths.py src/exo/download/tests/test_peer_download.py src/exo/utils/tests/test_keyed_backoff.py src/exo/worker/tests/unittests/test_worker_instance_backoff.py src/exo/shared/tests/test_apply/test_apply_rdma_gating.py src/exo/master/tests/test_placement.py -q -> 61 passed

Notes

nix fmt was not run because Nix is not installed on this Mac. Official multi-user install requires an interactive sudo password; official single-user install is no longer supported on Darwin. The only post-check change was Python-only and was formatted with uv run ruff format src/exo/main.py.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f4fb54649

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

team-wcv · 2026-05-07T07:33:14Z

Patched the Codex review findings in 575c595:

P1 path traversal: peer file requests now resolve requested model/file paths and reject anything outside the model cache before stat/open.
P2 shared peer list race: coordinator now passes the shard into PeerAwareShardDownloader, and peers are queued per shard/download instead of a single mutable field.
P2 nested status listing: peer status now walks model directories recursively and reports relative nested paths, including partial files.
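
The P1 path-traversal check above can be illustrated with a minimal sketch. It assumes the model cache is a single root directory; `resolve_inside` is a hypothetical helper name for illustration, not the function in this branch.

```python
from pathlib import Path
from typing import Optional


def resolve_inside(root: Path, requested: str) -> Optional[Path]:
    """Return the resolved path if it stays under root, else None.

    Hypothetical helper: resolve first, then check containment, so that
    "../" segments and absolute paths are rejected before any stat/open.
    """
    root_resolved = root.resolve()
    candidate = (root_resolved / requested).resolve()
    if not candidate.is_relative_to(root_resolved):
        return None
    return candidate
```

Resolving before the containment check matters: comparing the raw request string against the root would miss `a/../../outside.bin` style inputs.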

Validation run locally:

uv run ruff check
uv run basedpyright
EXO_DASHBOARD_DIR=Development/Tooling/exo/dashboard/build uv run pytest -q -> 516 passed, 1 skipped, 199 deselected

team-wcv · 2026-05-09T04:15:32Z

Cherry-picked 701838aa ("fix: make darwin mdns discovery reliable") from PR #17 (since closed) onto this branch as commit 33d72520. It adds src/exo/routing/mdns_announcer.py (97 lines), 81 lines of darwin-mDNS-reliability glue in src/exo/main.py, and 12 lines in src/exo/worker/main.py.

Trivial conflict resolved: combined the new _libp2p_port: int field with this branch's peer_file_server: PeerFileServer | None = None field on the Node dataclass; both kept, constructor arg order matched.

Verified clean: basedpyright src/exo/main.py 0 errors, ruff check src/exo/main.py clean.

Branch is now 15 commits ahead of team-wcv/main. PR #17 is closed and points here.

team-wcv · 2026-05-09T04:48:32Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 33d725205a

Two fixes for PR #16 round 2 (Codex):

1. P1: Skip quantized caches for non-PP make_cache models.

``make_kv_cache`` had a single-node safeguard (``_model_is_pipeline_parallel``) on the make_cache-LESS branch, but the make_cache-aware branch above it unconditionally replaced ``KVCache`` entries with ``QuantizedKVCache`` whenever ``EXO_KV_CACHE_BITS`` was set. Models that expose ``make_cache()`` (e.g. Gemma3 with mixed attention layers) and run single-node would therefore crash at runtime with::

    <class 'mlx_lm.models.cache.QuantizedKVCache'> does not yet support batching with history

because mlx-lm's single-node ``BatchGenerator`` calls ``_merge_caches`` on every step and ``QuantizedKVCache`` doesn't implement ``merge``.

Fix: apply the same ``_model_is_pipeline_parallel`` guard to the make_cache branch and emit the same warning when the env var is ignored.

2. P2: Verify peer file integrity before marking download complete.

``download_one`` in ``peer_shard_downloader.py`` marked peer downloads successful as soon as ``n_read == expected_size``, with no content-integrity check. A peer serving wrong bytes with the right length (stale/corrupt/malicious) was therefore silently accepted as model data, causing hard-to-diagnose inference failures or bad outputs.

Fix: after the peer download completes, fetch the authoritative etag/hash from HuggingFace via ``file_meta()`` and validate the downloaded file via ``calc_hash``. On mismatch the file is removed and the caller falls back to direct HF download. Trusts HF as canonical source rather than peer-advertised hash to defend against malicious peers that lie about both content and hash.

``file_meta`` adds one HEAD round-trip per file; ``fetch_file_list_with_cache`` already requires HF connectivity at this code path so the network requirement is not new. The semantics now match the direct HuggingFace download path which has done identical validation since import.
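
The integrity check in the second fix can be sketched roughly as below. This is an illustration only: it assumes the canonical hash is a SHA-256 hex digest, and `sha256_of` and `keep_if_hash_matches` are hypothetical names, not the branch's ``calc_hash`` or ``file_meta()``.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file through SHA-256 so large weights are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def keep_if_hash_matches(path: Path, expected_sha256: str) -> bool:
    """Keep the file on a match; delete it on a mismatch.

    A False return is the signal for the caller to fall back to another
    source (in the PR: a direct HuggingFace download).
    """
    if sha256_of(path) == expected_sha256.lower():
        return True
    path.unlink()
    return False
```

The expected digest is passed in by the caller on purpose: the point of the fix is that it comes from the canonical source, not from the peer that served the bytes.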

team-wcv · 2026-05-09T05:21:54Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6babb76dca

Address Codex round-2 P1s on PR #16.

P1: Reuse ``ignore_patterns`` when selecting peer download files

``download_shard`` (download_utils.py:983) excludes ``original/*`` and ``metal/*`` because HuggingFace never downloads them. The peer path applied ``allow_patterns`` only, so any repo containing those paths (e.g. Llama 3.x) had a required-files list that included files the peer never had locally; the strict ``peer_info missing => fail`` check then aborted the entire transfer and forced a HF fallback for every download. Pass the same ``ignore_patterns=["original/*", "metal/*"]`` into ``filter_repo_objects`` to match selection.

P1: Honor offline mode in peer file-list fetch

``_try_peer_download`` hard-coded ``skip_internet=False`` when calling ``fetch_file_list_with_cache``, so offline-configured nodes still reached out to HuggingFace before downloading from a LAN peer. On cold/offline nodes without a cached file list this raised, the ``except: return None`` exited early, and the peer download could not even start. Add an ``offline`` parameter to ``PeerAwareShardDownloader.__init__`` (defaulting to ``False``) and thread the existing ``DownloadCoordinator.offline`` value through ``exo_shard_downloader`` so the peer file-list fetch honors the same offline contract as ``ResumableShardDownloader``.

Tests

- ``test_offline_flag_defaults_to_false`` / ``..._propagates`` cover the constructor wiring directly.
- ``test_try_peer_download_passes_offline_to_fetch_file_list`` patches the import binding and asserts ``skip_internet=True`` is forwarded when the downloader is constructed with ``offline=True``.
- ``test_try_peer_download_filters_ignore_patterns`` records every ``filter_repo_objects`` call and asserts that the peer path requested ``ignore_patterns=["original/*", "metal/*"]`` -- the exact set ``download_shard`` uses.
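
To make the ``ignore_patterns`` mismatch concrete, here is a small stand-in for the selection step. It uses the standard library's `fnmatch` rather than the real ``filter_repo_objects``, and `select_files` plus the sample file names are assumptions made for illustration.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# The ignore set quoted in the commit message above.
IGNORE_PATTERNS = ["original/*", "metal/*"]


def select_files(
    paths: Iterable[str],
    allow_patterns: List[str],
    ignore_patterns: List[str],
) -> List[str]:
    """Keep paths that match an allow pattern and no ignore pattern."""
    selected = []
    for path in paths:
        if not any(fnmatch(path, pat) for pat in allow_patterns):
            continue
        if any(fnmatch(path, pat) for pat in ignore_patterns):
            continue
        selected.append(path)
    return selected
```

With `allow_patterns=["*"]` and an empty ignore list, a file such as `original/consolidated.00.pth` lands in the required list even though the direct-download path would never fetch it, which is the mismatch the fix removes.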

team-wcv · 2026-05-09T05:42:45Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f48d5f001e

… port

Codex P1 (round 3): peer downloads were calling ``file_meta()`` for every file even when the coordinator ran with ``--offline`` / ``EXO_OFFLINE=true``. Any failure to reach HF (the entire point of offline mode) was treated as an integrity-check failure, the peer-fetched bytes were deleted, and the cold offline node was left with no path to complete model sync. When the downloader is in offline mode we now trust the LAN peer's bytes (size already enforced by ``download_file_from_peer``) and skip the HF canonical-hash call entirely. Online mode still validates against HF.

Codex P2 (round 3): the peer-download listener was hard-coded to a single module-level constant, so a same-host multi-node deployment crashed on the second process with ``address already in use``. Add a ``--peer-download-port`` CLI flag (default ``EXO_PEER_DOWNLOAD_PORT``) and thread it through ``Node`` -> ``Worker`` and ``PeerFileServer``, replacing the import-time constant. Cluster-wide convention: every node uses the same value (peer discovery still uses the local value as the assumed remote port). Cross-node port advertisement via state is a documented follow-up.

Tests:

- ``TestPeerDownloadIntegrityCheckRespectsOfflineMode`` covers both modes: offline must not call ``file_meta`` and must keep the bytes, online must still call ``file_meta``.
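
The offline gate described in the P1 item can be sketched as a small decision function. This is an assumption-laden illustration: `accept_peer_file` and the injected `verify_online` callable are hypothetical names standing in for the real ``file_meta()`` / ``calc_hash`` flow.

```python
from pathlib import Path
from typing import Callable


def accept_peer_file(
    path: Path,
    offline: bool,
    verify_online: Callable[[Path], bool],
) -> bool:
    """Decide whether peer-fetched bytes are kept.

    Offline: trust the peer and never call the online verifier.
    Online: keep the file only if the verifier approves it; otherwise
    delete it so the caller can fall back to another source.
    """
    if offline:
        return True
    if verify_online(path):
        return True
    path.unlink(missing_ok=True)
    return False
```

Injecting the verifier keeps the two behaviours the commit's test class checks easy to assert: the callable is never invoked offline, and always invoked online.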

team-wcv · 2026-05-09T06:14:37Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6154483f54

chatgpt-codex-connector · 2026-05-09T06:22:16Z

+            peer_file_server = PeerFileServer(
+                host="0.0.0.0",
+                port=args.peer_download_port,
+                models_dir=EXO_DEFAULT_MODELS_DIR,


Serve peer files from all writable model directories

The peer file server is hard-wired to EXO_DEFAULT_MODELS_DIR, but downloads are placed via select_download_dir_for_shard() across EXO_MODELS_DIRS; when a node stores a model in a non-default writable directory (custom models path, low-space fallback, or default excluded as read-only), /status and /files will not expose that model and peers will always fall back to HuggingFace. This silently disables the new peer-download path for valid multi-directory deployments.

Resolved by commit fe02cea6. The peer file server now serves files from every writable model directory via _locate_all_model_dirs; downloads landing in any of EXO_DEFAULT_MODELS_DIR / EXO_DOWNLOAD_DIRECTORIES are advertised to peers and can be downloaded by them.

chatgpt-codex-connector · 2026-05-09T06:22:17Z

+            else:
+                return PeerEndpoint(


Prefer RDMA edge before returning socket endpoint

This function returns immediately on the first non-RDMA edge for the peer, so if topology iteration yields a socket edge before an RDMA edge, the same peer is mislabeled as socket and never gets RDMA priority. That breaks the intended "RDMA first" peer ordering and can consistently select slower peers depending on edge insertion order.

Resolved by commit fe02cea6 (Pick RDMA edges and serve from every model dir for peer downloads). _resolve_endpoint_for_peer now collects ALL edges between the pair before choosing, prefers the first RDMA edge if one exists, and only falls back to a socket edge when no RDMA edge is present. See peer_state.py:97-126.

This addresses two Codex P2 findings on PR #16.

`_resolve_peer_endpoint` returned on the *first* topology edge it visited for a peer, so when ``out_edges`` happened to yield the SocketConnection before the RDMAConnection (insertion order is not a stable contract on the topology graph), the peer was silently labelled ``socket`` and lost its RDMA priority in the peer ordering. The lookup now scans all edges for a peer, prefers RDMA whenever any RDMA edge exists (using the companion socket address for the actual TCP connect), and only falls back to ``socket`` when no RDMA edge is present. The unused `_find_socket_ip` helper that was meant as a fallback for the RDMA branch is removed -- its behaviour is folded into the new scan.

`PeerFileServer` was hard-wired to ``EXO_DEFAULT_MODELS_DIR``, but ``select_download_dir_for_shard`` legitimately writes shards across ``EXO_MODELS_DIRS`` (custom paths, low-disk fallback) and we also have shards available in ``EXO_MODELS_READ_ONLY_DIRS`` mounts. Pre-fix, any model that landed outside the default directory was invisible to /status and /files, so peers always fell back to HuggingFace and the new peer path was a no-op for valid multi-directory deployments. The server now takes ``models_dirs`` (a sequence) and probes every configured root in caller-specified priority, checking each candidate against ``_resolve_child`` for path-traversal safety.

Adds regression coverage:

* `test_peer_state.py` covers RDMA-first ordering for both edge insertion orders and the socket-only / RDMA-only edge cases.
* `TestPeerFileServerMultipleDirectories` covers serving from a secondary writable directory and a read-only mount, plus the constructor's empty-list rejection.
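
The order-independent edge scan can be sketched as follows. The `Edge` and `Endpoint` types and `resolve_endpoint` are simplified stand-ins invented for this example, not the classes in peer_state.py; in particular, returning `ip=None` for an RDMA-only peer is an assumption, since the thread does not say how that case is handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    kind: str            # "rdma" or "socket" (illustrative labels)
    ip: Optional[str]    # socket edges carry the TCP address


@dataclass(frozen=True)
class Endpoint:
    transport: str
    ip: Optional[str]


def resolve_endpoint(edges: List[Edge]) -> Optional[Endpoint]:
    """Scan every edge before deciding, so iteration order cannot matter."""
    if not edges:
        return None
    # The companion socket address is used for the actual TCP connect.
    socket_ip = next((e.ip for e in edges if e.kind == "socket"), None)
    if any(e.kind == "rdma" for e in edges):
        return Endpoint("rdma", socket_ip)
    return Endpoint("socket", socket_ip)
```

Because the function never returns from inside the loop over edges, `[socket, rdma]` and `[rdma, socket]` produce the same label.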

team-wcv · 2026-05-09T07:03:27Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe02cea608

chatgpt-codex-connector · 2026-05-09T07:12:50Z


 # Identity (config)
-EXO_NODE_ID_KEYPAIR = EXO_CONFIG_HOME / "node_id.keypair"
+EXO_NODE_ID_KEYPAIR = EXO_CACHE_HOME / "node_id.keypair"


Keep node identity keypair out of cache storage

Saving EXO_NODE_ID_KEYPAIR under EXO_CACHE_HOME makes node identity vulnerable to normal cache eviction/cleanup, which causes nodes to come up with a new peer ID and breaks the intended persistence of cluster identity across restarts. Identity material should remain in a persistent config/state location rather than cache.

Resolved by commit 09d0279b (Reject oversized peer partials; relocate node-ID keypair to config dir). EXO_NODE_ID_KEYPAIR now defaults under EXO_CONFIG_HOME rather than EXO_CACHE_HOME, so node identity is no longer subject to cache eviction. A migration helper (_migrate_legacy_node_id_keypair) moves any pre-existing keypair from the cache location to the config location on first startup.

chatgpt-codex-connector · 2026-05-09T07:12:50Z

+    chunk_size = 8 * 1024 * 1024  # 8MB, matching HF download
+
+    try:
+        while n_read < expected_size and poll_count < max_poll_attempts:


Reject oversized stale partials before peer resume

The resume loop only runs while n_read < expected_size, so if an existing .partial is already larger than expected_size (stale/corrupt prior download), this path skips downloading and later renames that file as a successful result. In offline mode (where hash verification is intentionally skipped), this can silently accept incorrect model bytes and produce broken inference artifacts.

Resolved by commit 09d0279b. The peer-download resume loop now rejects oversized stale partial files: if partial_size > expected_size, the partial is deleted and the download restarts from byte 0 instead of the stale file being renamed as a completed download.
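
A minimal sketch of the up-front partial check, assuming the partial lives at a known path; `resume_offset` is a hypothetical name and the real loop in peer_download.py may be structured differently.

```python
from pathlib import Path


def resume_offset(partial_path: Path, expected_size: int) -> int:
    """Return the byte offset to resume from, discarding oversized partials.

    A partial larger than expected_size cannot be a prefix of the real
    file, so it is deleted and the transfer starts again at byte 0.
    """
    if not partial_path.exists():
        return 0
    partial_size = partial_path.stat().st_size
    if partial_size > expected_size:
        partial_path.unlink()
        return 0
    return partial_size
```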

Two Codex round-(N+1) P1 fixes for PR #16:

1. peer_download.py:120 -- "Reject oversized stale partials before peer resume". The resume loop ran ``while n_read < expected_size``, so any pre-existing ``.partial`` larger than ``expected_size`` skipped the loop entirely and the final ``rename`` accepted the bad bytes as the canonical download. In offline mode (where peer download integrity check is intentionally skipped) this would permanently poison the model cache. Discard the stale oversized partial up front and restart from zero on this peer.

2. constants.py:73 -- "Keep node identity keypair out of cache storage". ``EXO_NODE_ID_KEYPAIR`` was rooted at ``EXO_CACHE_HOME``, which is subject to normal cache cleanup (e.g. ``trash ~/.cache/exo``); a wipe silently regenerated a fresh peer ID and broke cluster membership / mDNS routes. Move it to ``EXO_CONFIG_HOME`` (matching the existing ``test_node_id_in_config_dir`` invariant) and add a one-shot transparent migrator in ``router._migrate_legacy_node_id_keypair`` so existing nodes retain identity after the upgrade.

Regression tests:

* ``test_oversized_stale_partial_is_discarded_and_retransferred`` -- pre-fix would rename the junk ``.partial`` as the downloaded weights file; post-fix the file is re-fetched and matches the canonical bytes.
* ``test_legacy_keypair_is_migrated_to_new_location``, ``test_migration_is_idempotent_when_new_location_already_present``, ``test_migration_skipped_when_no_legacy_file``, ``test_get_node_id_keypair_uses_migrated_legacy_keypair``.

X-Orchestraitor-Plan: ecosystem_repo_standardization_aeee88ba
X-Agent-Platform: cursor-claude-opus-4.7
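
The one-shot migrator in item 2 might look roughly like this. It is a sketch under stated assumptions: `migrate_legacy_file` is a hypothetical name, both locations are assumed to be on the same filesystem (so `Path.replace` is a plain rename), and no locking is shown here.

```python
from pathlib import Path


def migrate_legacy_file(new_path: Path, legacy_path: Path) -> bool:
    """Move legacy_path to new_path once; return True only if a move happened.

    Idempotent: a no-op when the new file already exists or when there
    is no legacy file to adopt.
    """
    if new_path.exists() or not legacy_path.exists():
        return False
    new_path.parent.mkdir(parents=True, exist_ok=True)
    legacy_path.replace(new_path)
    return True
```

Checking `new_path.exists()` first is what keeps a later stray legacy file from overwriting an identity that is already in use.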

team-wcv · 2026-05-09T08:29:48Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09d0279baf

chatgpt-codex-connector · 2026-05-09T08:37:47Z

    path: str | bytes | PathLike[str] | PathLike[bytes] = EXO_NODE_ID_KEYPAIR,
+    legacy_path: str | bytes | PathLike[str] | PathLike[bytes] | None = (
+        EXO_LEGACY_NODE_ID_KEYPAIR


Preserve per-process node identity isolation

Loading and reusing a single on-disk keypair from the default EXO_NODE_ID_KEYPAIR path makes concurrent exo processes on the same host share the same NodeId, which breaks cluster semantics (for example, peer discovery paths that exclude peer_node_id == node_id and routing that expects unique node IDs). This is a regression from the previous per-process Keypair.generate() behavior and directly conflicts with the new same-host multi-node workflow introduced in this commit (distinct ports but still one shared identity file).

Resolved by commit 231335a0 (Scope node-ID keypair per process and migrate inside the file lock). The default EXO_NODE_ID_KEYPAIR path is now scoped per-process via _node_id_keypair_scope(args), so concurrent exo processes on the same host load distinct keypair files.

chatgpt-codex-connector · 2026-05-09T08:37:47Z

+    if legacy_path is not None:
+        _migrate_legacy_node_id_keypair(resolved_path, Path(str(legacy_path)))


Move legacy keypair migration inside the file lock

Calling _migrate_legacy_node_id_keypair() before acquiring FileLock introduces a startup race: two processes can both pass the existence checks, one can generate/write a new keypair under the lock, and the other can still overwrite it via legacy replace() outside the lock. That leaves processes in the same launch with divergent in-memory identities vs. persisted identity, causing unstable peer IDs across restarts.

Resolved by commit 231335a0. _migrate_legacy_node_id_keypair is now called inside the FileLock block, so two processes cannot both pass the existence check and both attempt the migration.

Codex P1/P2 (PR #16 round-(N+2), router.py:297, router.py:322): the same-host multi-node workflow this PR introduces (distinct peer-download ports per process) requires distinct NodeIds per process so peer-discovery's self-skip and routing's unique-NodeId invariants hold. Concurrent startups also raced on the legacy cache->config keypair migration because it ran before FileLock.

- Add ``process_scope`` parameter to ``get_node_id_keypair`` that is folded into the on-disk filename (``node_id.<scope>.keypair``). Single-process deployments default to None and keep the existing shared file; ``main.py`` passes ``args.peer_download_port`` so multi-process same-host runs land on distinct identities.
- Move the legacy migration call inside the ``FileLock`` so two processes can't both pass the existence check and race into divergent in-memory vs. on-disk identities.
- Legacy file remains unscoped: the first process to migrate adopts the operator's existing identity; later processes (other scopes) start with fresh keypairs, which is exactly what per-process isolation requires.
- Add regression tests covering distinct scopes producing distinct keypairs, scope stability across calls, scoped legacy adoption, and a structural check that the migration runs inside the FileLock.
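
The ``node_id.<scope>.keypair`` naming in the first bullet can be sketched with `pathlib` alone. `scoped_keypair_path` is a hypothetical helper, and the example path and port value used with it are illustrative.

```python
from pathlib import Path
from typing import Optional, Union


def scoped_keypair_path(base: Path, process_scope: Optional[Union[str, int]]) -> Path:
    """Fold the scope into the filename: node_id.keypair -> node_id.<scope>.keypair.

    A scope of None keeps the shared, unscoped file.
    """
    if process_scope is None:
        return base
    return base.with_name(f"{base.stem}.{process_scope}{base.suffix}")
```

For example, `scoped_keypair_path(Path("node_id.keypair"), 52416)` yields a file named `node_id.52416.keypair`, so two processes with different scopes never share an identity file.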

team-wcv · 2026-05-09T08:55:40Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 231335a096

chatgpt-codex-connector · 2026-05-09T09:02:00Z

+        # ``NodeId``s) so peer-discovery's ``peer_node_id ==
+        # node_id`` self-skip and routing's unique-NodeId
+        # invariants continue to hold.
+        keypair = get_node_id_keypair(process_scope=args.peer_download_port)


Stop reusing one scoped keypair when peer download is disabled

This always scopes get_node_id_keypair by args.peer_download_port, even when --no-downloads or --no-peer-download disables the peer file server. In that mode, operators can legitimately run multiple same-host nodes with the default peer-download port (no bind conflict), and both processes will load the same node_id.<port>.keypair, producing identical NodeIds and breaking election/routing assumptions that node IDs are unique. Scope identity with a value that is unique in all multi-process modes (or only use peer-download-port scoping when peer download is actually enabled).

Resolved by the existing scope construction at main.py:472-479: the scope concatenates libp2p_port, api_port, AND peer_download_port, and at least one of {api_port, libp2p_port} must differ between two same-host processes (each is a distinct socket bind). Two --no-peer-download processes with the same default api_port/libp2p_port would fail at the api or libp2p socket bind first, surfacing the conflict before they ever reach the keypair load.

chatgpt-codex-connector · 2026-05-09T09:02:00Z

+                elif r.status in (200, 206):
+                    async with aiofiles.open(
+                        partial_path, "ab" if n_read > 0 else "wb"
+                    ) as f:


Require 206 for ranged peer-download resumes

When resuming (n_read > 0), the client sends a Range header but still accepts HTTP 200 and appends the body to the existing partial file. A server is allowed to ignore Range and return full content with 200, which here will duplicate bytes, push n_read past expected_size, and still rename the oversized file as success; in offline mode this can silently poison model files because hash verification is skipped. On resume, treat non-206 as a restart/failure (or truncate and rewrite from zero) instead of appending.

Resolved by commit f34534c2 (Combine listening ports for keypair scope; restart on 200-on-resume). When n_read > 0 (resume) the client now requires HTTP 206; a 200 response causes the partial file to be deleted and the download to restart from byte 0, so a full response body is never appended to stale partial bytes.

Codex P1 (PR #16 round-(N+3), main.py:74): the previous scope used ``args.peer_download_port`` only. With ``--no-downloads`` / ``--no-peer-download`` the peer file server doesn't bind, so two same-host processes can both keep the default ``peer_download_port`` and would then load the same scoped keypair file -- producing identical ``NodeId``s and breaking election/routing's unique-NodeId invariants. The new ``_node_id_keypair_scope`` helper combines libp2p, api, and peer-download ports: at least one of those MUST differ between two same-host processes (each is a distinct local socket bind), so the resulting scope is always per-process unique while remaining stable across restarts of the same configuration.

Codex P1 (PR #16 round-(N+3), peer_download.py:162): on resume ``download_file_from_peer`` sends a ``Range`` header but accepted HTTP 200 and appended to the existing partial. A peer server is allowed to ignore Range and return full content with 200, which would duplicate bytes, push ``n_read`` past ``expected_size``, and -- because offline mode skips hash verification -- silently poison the model cache by renaming the oversized file as success. Now we treat 200-on-resume as a restart: discard the partial, reset ``n_read = 0``, and the next loop iteration re-fetches from zero.

Add regression tests:

- ``TestNodeIdKeypairScope`` covers the per-process scope helper: distinct libp2p / api / peer-download ports each yield distinct scopes; the same args yield the same scope; and the original bug (same default peer_download_port with peer-download disabled) is now isolated by libp2p_port differences.
- ``test_resume_with_200_response_discards_partial_and_restarts`` stands up a tiny aiohttp server that always returns 200 (even for ranged requests), primes a partial file, and asserts the client discards the partial, restarts from zero, and lands the canonical bytes matching ``expected_size``.
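
The 200-on-resume rule can be written as a small decision table. This is a sketch only: `plan_write` and its string return values are invented for illustration and are not the control flow in ``download_file_from_peer``.

```python
def plan_write(status: int, n_read: int) -> str:
    """Decide how to handle a peer response given bytes already on disk.

    Returns one of:
      "write"   - fresh download, open the partial for writing
      "append"  - resume honoured (206), append to the partial
      "restart" - resume ignored (200), discard the partial and refetch
      "fail"    - any other status
    """
    if status not in (200, 206):
        return "fail"
    if n_read == 0:
        return "write"
    # Resuming: only 206 Partial Content proves the Range was honoured.
    return "append" if status == 206 else "restart"
```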

team-wcv · 2026-05-09T09:32:04Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f34534c23b

chatgpt-codex-connector · 2026-05-09T09:38:37Z

+    main.py:74).
+    """
+    return (
+        f"libp2p-{args.libp2p_port}.api-{args.api_port}.peer-{args.peer_download_port}"


Include resolved libp2p port in node-ID scope

The scope string is derived from configured ports only, so --libp2p-port 0 always contributes the literal 0 even though each process actually binds a different ephemeral port at runtime. In worker-only runs where API and peer-download listeners are disabled, two same-host processes can share identical scope strings and therefore load the same keypair file, causing duplicate NodeIds and breaking routing/election assumptions. Use the resolved bound libp2p port (or another guaranteed per-process discriminator) when the configured port is 0.

Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: address libp2p-port=0 scope collision). _node_id_keypair_scope now folds os.getpid() into the scope when args.libp2p_port == 0, so two same-host processes binding ephemeral libp2p ports get distinct scopes (and therefore distinct node-ID keypair files).
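
A rough sketch of the scope helper with the pid fallback. The base format string is the one shown in the diff above; where exactly the ``pid-<getpid>`` component is placed is an assumption, and the port numbers used to exercise it are arbitrary.

```python
import os
from argparse import Namespace


def node_id_keypair_scope(args: Namespace) -> str:
    """Build a per-process scope string from the configured listening ports.

    When libp2p_port is 0 the real port is chosen at bind time, so the
    configured value cannot distinguish processes; the pid is appended.
    """
    scope = (
        f"libp2p-{args.libp2p_port}.api-{args.api_port}"
        f".peer-{args.peer_download_port}"
    )
    if args.libp2p_port == 0:
        # Assumed placement of the pid component.
        scope += f".pid-{os.getpid()}"
    return scope
```

The trade-off is the one the thread describes: a pid-based scope changes on every restart, so identity is only stable when the libp2p port is fixed.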

chatgpt-codex-connector · 2026-05-09T09:38:37Z

+                            chunk = await r.content.read(chunk_size)
+                            if not chunk:
+                                break
+                            written = await f.write(chunk)
+                            n_read += written


Reject peer responses that exceed expected file size

The download loop keeps appending bytes until EOF and only checks n_read < expected_size afterward, so an oversized peer response is accepted as success (n_read >= expected_size) and then renamed into the model cache. In offline mode this bypasses hash verification entirely, so a peer serving a larger stale/wrong blob can silently poison local weights. Stop writing once expected_size is reached and treat any extra bytes as a failure.

Resolved by commit ffcb67f4 (PR #16 R(N+8) P1: oversized peer responses). The download loop now checks n_read + len(chunk) > expected_size during the read and aborts immediately with RuntimeError, so an oversized peer response cannot slip past the post-loop guard.

… peer responses

main.py: when --libp2p-port 0 is set, the configured value is the literal 0 even though each process binds a different ephemeral port at runtime. Two same-host worker-only processes (no API, no peer download) sharing the default api/peer ports would otherwise produce identical scope strings and load the same on-disk keypair file, breaking the unique-NodeId invariant. Fold os.getpid() into the scope when libp2p_port == 0; the trade-off (ephemeral identity for ephemeral ports) is the right semantic since the operator opted into ephemeral binding by setting libp2p_port=0.

peer_download.py: bound the inner read by 'expected_size - n_read' and treat any extra bytes as a peer protocol violation. Pre-fix the loop kept appending until EOF and only checked n_read < expected_size afterwards, so an oversized response (peer serving a stale/wrong blob) was accepted as success and renamed into the cache. In offline mode hash verification is skipped, so this silently poisoned local weights.

New tests:

- test_libp2p_port_zero_uses_pid_for_per_process_isolation: verifies the scope contains 'pid-<getpid>' when libp2p_port=0.
- test_libp2p_port_zero_in_two_processes_yield_distinct_scopes: monkeypatches os.getpid to simulate two same-host processes both binding libp2p_port=0 with identical api/peer ports and asserts the scopes diverge.
- test_oversized_peer_response_is_rejected_and_restarted: stands up a bad aiohttp peer that always serves canonical+'POISONED' bytes and asserts the client never lands the trailing junk in the cache.
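
The size guard inside the read loop can be illustrated with synchronous file-like objects. This is a simplified stand-in for the aiohttp/aiofiles loop in peer_download.py; `copy_bounded` and `OversizedResponse` are hypothetical names.

```python
import io
from typing import BinaryIO


class OversizedResponse(RuntimeError):
    """Raised when the source yields more bytes than expected_size."""


def copy_bounded(
    src: BinaryIO,
    dst: BinaryIO,
    expected_size: int,
    chunk_size: int = 8 * 1024 * 1024,
) -> int:
    """Copy src to dst, failing as soon as the total would exceed expected_size."""
    n_read = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if n_read + len(chunk) > expected_size:
            # Check before writing so the excess bytes never reach dst.
            raise OversizedResponse(
                f"peer sent more than {expected_size} bytes"
            )
        dst.write(chunk)
        n_read += len(chunk)
    return n_read
```

On failure `dst` may already hold earlier chunks, so the caller is still expected to discard the partial before retrying.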

team-wcv · 2026-05-09T12:19:42Z

@codex review

The constant is the default for built-in models with known model cards, which are trusted. Custom models added via API already default to trust_remote_code=False in ModelCard.fetch_from_hf(). The CLI flag overrides custom models only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

X-Orchestraitor-Task: exo-upstream-pr-import X-Orchestraitor-Plan: import-useful-upstream-prs X-Agent-Platform: codex

Cherry-picked from upstream 701838a (Alex Cheema, exo-explore/exo). Resolves trivial dataclass conflict in src/exo/main.py: combined `peer_file_server: PeerFileServer | None = None` (this branch's peer-to-peer download field) with `_libp2p_port: int` (mDNS commit's new field). Both fields kept; constructor positional args reordered to match. Type-checked clean (basedpyright 0 errors). Ruff clean.