Thunderbolt 5 + nested-hub discovery; JACCL init retry; RDMA host selection by team-wcv · Pull Request #2063 · exo-explore/exo

team-wcv · 2026-05-07T00:06:23Z

Motivation

A bundle of correctness fixes for running exo on a Thunderbolt 4/5 RDMA fabric. We hit each of these on a 3-node Apple Silicon RDMA cluster (2x M5 Max 128GB + 1x M5 Max 48GB) plus a TCP-only 4th node (1x M1 Ultra 128GB), chained over a TB5 fabric that includes one iVANKY Fusiondock Ultra hub leg.

- Nested Thunderbolt `_items` aren't walked, so peers attached behind a TB hub never register on the master and the cluster sees them as offline.
- JACCL init is single-shot, so if any worker engine isn't yet registered when the master begins the collective handshake, the whole boot path fails. With several nodes coming up in parallel this happens probabilistically.
- RDMA placement picks an arbitrary IP per node instead of preferring the LAN/RFC1918 address that's actually on the RDMA fabric, so once a node has multiple network interfaces (Tailscale, LAN, TB Bridge, etc.) the master ends up picking the wrong one.
- Dashboard treats every TB-detected interface as TB5, prompting users to enable RDMA on TB4 hardware that can't run `rdma_ctl` in TB5 mode.

Changes

1. `shared/types/thunderbolt.py` — walk nested `_items` (commit 1)

Recurse into the _items array on every Thunderbolt port entry. Hub-attached peers were previously dropped because the parser only looked at the top level.
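For reviewers who want the shape of the walk without opening the diff, here is a minimal sketch of a depth-first search for the first descendant `domain_uuid_key`. It works on plain dicts rather than the real `_ConnectivityItem` model. The function names and the placeholder UUIDs are hypothetical, and the sample layout assumes the peer sits in the port's `_items` as described in this PR.

```python
from typing import Any, Optional


def first_descendant_uuid(item: dict[str, Any]) -> Optional[str]:
    """Return the first domain_uuid_key found at or below `item` (depth-first)."""
    uuid = item.get("domain_uuid_key")
    if uuid:
        return uuid
    for child in item.get("_items", []):
        found = first_descendant_uuid(child)
        if found:
            return found
    return None


def peer_uuid(port: dict[str, Any]) -> Optional[str]:
    """Resolve the peer behind a port entry.

    The port's own domain_uuid_key is the local side, so only its `_items`
    subtree is searched for the peer.
    """
    for child in port.get("_items", []):
        found = first_descendant_uuid(child)
        if found:
            return found
    return None


# Hypothetical sample entries (placeholder UUIDs, not real system_profiler output).
direct = {"domain_uuid_key": "LOCAL", "_items": [{"domain_uuid_key": "PEER"}]}
via_hub = {
    "domain_uuid_key": "LOCAL",
    # The hub layer has no domain_uuid_key of its own; the peer is one level down.
    "_items": [{"_name": "hub", "_items": [{"domain_uuid_key": "PEER"}]}],
}
empty = {"domain_uuid_key": "LOCAL"}
```

A top-level-only lookup would return nothing for `via_hub`, which is the dropped-peer case this change addresses.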

2. `master/placement_utils.py` — RDMA host selection (commit 1, refined in commits 4 and 6)

When a peer is socket-reachable on multiple addresses (e.g. both LAN 192.168.1.x and Tailscale 100.64/10), break the tie by IP class so the LAN-class address wins:

0: RFC1918 (192.168/16, 10/8, 172.16-31)
1: anything else

Interface type still drives the primary preference (TB first for ring=True, ethernet first for the RDMA coordinator with ring=False); the IP class is only a secondary tiebreaker. This avoids inverting the ring-prefers-TB intent when a Thunderbolt Bridge happens to use a link-local (169.254/16) IP.
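The ordering described above can be sketched as a `min(...)` over a `(type_priority, address_priority)` tuple. This is an illustration under assumptions, not the code in `placement_utils.py`: the interface-type strings, the `(ip, interface_type)` candidate shape, and the helper names are hypothetical.

```python
import ipaddress
from typing import Optional

# The three RFC1918 private ranges.
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def address_priority(ip: str) -> int:
    """0 for RFC1918 addresses, 1 for anything else (CGNAT, link-local, public)."""
    addr = ipaddress.ip_address(ip)
    return 0 if any(addr in net for net in RFC1918) else 1


def type_priority(iface_type: str, ring: bool) -> int:
    """Thunderbolt first for ring=True, ethernet first for ring=False."""
    preferred = "thunderbolt" if ring else "ethernet"
    return 0 if iface_type == preferred else 1


def pick_ip(candidates: list[tuple[str, str]], ring: bool) -> Optional[str]:
    """Pick from (ip, interface_type) pairs; interface type is the primary key."""
    if not candidates:
        return None
    ip, _ = min(
        candidates,
        key=lambda c: (type_priority(c[1], ring), address_priority(c[0])),
    )
    return ip
```

With type as the primary key, a link-local Thunderbolt Bridge address still beats a LAN ethernet address for `ring=True`, while two same-type candidates (LAN and Tailscale) resolve to the LAN one.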

Update vs the original revision of this PR: the earlier draft also added a _fallback_interface_ips branch that fell back to advertised node IPs when the topology had RDMA edges but no SocketConnection edges. Per maintainer feedback (#2063 thread, "im not keen on the fallback ips as they stand … i plan to replace discovery somewhat soon"), that branch has been dropped. The 10-second check_reachable socket-discovery cycle plus the new JACCL retry loop below already cover the startup race the fallback was papering over.

3. `master/placement.py` — rotate cycle so rank 0 is the most socket-reachable node (commit 1)

_prefer_socket_reachable_rank_zero rotates the chosen cycle so the node with the most inbound SocketConnection edges sits at rank 0 (the listener for both MLX ring and JACCL). Discovery can produce asymmetric edges in practice, and this avoids assigning the listener role to a node that peers cannot dial.
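A rough sketch of the rotation, assuming the cycle is a list of node IDs and socket edges are `(source, sink)` pairs. Both shapes and the function name are hypothetical stand-ins for the real topology types.

```python
from collections import Counter


def rotate_for_rank_zero(
    cycle: list[str], socket_edges: list[tuple[str, str]]
) -> list[str]:
    """Rotate `cycle` so the node with the most inbound socket edges is first.

    Only edges between cycle members are counted. On a tie the earliest
    node in the cycle wins, so an already suitable cycle is left unchanged.
    """
    if not cycle:
        return cycle
    members = set(cycle)
    inbound = Counter(
        dst for src, dst in socket_edges if src in members and dst in members
    )
    best = max(range(len(cycle)), key=lambda i: inbound[cycle[i]])
    return cycle[best:] + cycle[:best]
```

Rotating rather than re-sorting keeps the ring order intact; only the starting point, and therefore the listener, changes.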

4. `worker/engines/mlx/utils_mlx.py` — JACCL init retry loop (commit 2)

Wrap mx.distributed.init(backend="jaccl", strict=True) in a retry loop with backoff (8 attempts, ~30s total). Re-raise on the final attempt.
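The retry shape, sketched generically so it runs without MLX. The source only states 8 attempts and roughly 30s in total; the exponential schedule below (0.5s base, 8s cap, about 31.5s of sleeping in the worst case) is an assumption chosen to land near that budget, and `init_with_retry` is a hypothetical name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def init_with_retry(
    init: Callable[[], T],
    attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call `init` until it succeeds, backing off between attempts.

    The exception from the final attempt is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return init()
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff, capped: 0.5, 1, 2, 4, 8, 8, 8 seconds.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise ValueError("attempts must be >= 1")
```

In this PR the wrapped call would be something like `lambda: mx.distributed.init(backend="jaccl", strict=True)`. Returning from inside the `try` also avoids leaving a possibly unbound result variable after the loop.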

5. `dashboard/src/routes/+page.svelte` — TB5 detection (commit 3)

Filter TB nodes by link speed > 40 Gb/s rather than "any TB interface". TB5 negotiates 80 Gb/s symmetric or 120/40 Gb/s asymmetric; TB4 stays at 40 Gb/s.
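The dashboard change itself is in Svelte; the check is sketched here in Python to match the rest of the repo. The speed-string formats (`"80 Gb/s"`, `"120/40 Gb/s"`, `"Up to 40 Gb/s x1"`) are assumptions about what the link-speed field may contain, and the helper names are hypothetical.

```python
import re
from typing import Optional

# One or more numbers separated by "/", followed by "Gb/s" (e.g. "120/40 Gb/s").
_SPEED = re.compile(
    r"(\d+(?:\.\d+)?(?:\s*/\s*\d+(?:\.\d+)?)*)\s*Gb/s", re.IGNORECASE
)


def link_speed_gbps(text: str) -> Optional[float]:
    """Return the highest Gb/s figure in a speed string, or None if absent."""
    match = _SPEED.search(text)
    if match is None:
        return None
    return max(float(part) for part in match.group(1).split("/"))


def is_tb5(text: str) -> bool:
    """Treat a link as TB5 only when it negotiated more than 40 Gb/s."""
    speed = link_speed_gbps(text)
    return speed is not None and speed > 40
```

For an asymmetric link this takes the larger direction, so a 120/40 link counts as TB5 while a TB4 link at 40 Gb/s does not.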

6. Tests

master/tests/test_placement.py: test_ring_placement_prefers_lan_ip_over_tailscale_ip, test_jaccl_placement_prefers_lan_ip_over_tailscale_ip, test_placement_prefers_socket_reachable_rank_zero.
utils/info_gatherer/tests/test_tb_parsing.py: test_conn_resolves_peer_through_intermediate_hub, test_conn_returns_first_peer_for_direct_link, test_conn_returns_none_when_no_peer_present.

Why It Works

The TB _items parser fix is just walking a tree the right way; system_profiler has reported Thunderbolt topology this way as long as macOS has had hubs.

JACCL init retry is the standard pattern for an n-node init handshake where boot order isn't synchronized: instead of failing on the first race, retry with backoff until the network stabilizes. The retry budget (~30s) is well under the cluster's idle timeout, so it doesn't mask real failures.

The address-class tiebreaker only kicks in when a peer's SocketConnection set is non-empty and contains multiple same-type IPs, which is the empirical pain point: nodes that have both LAN and Tailscale interfaces advertise both, and we want the LAN one for RDMA.

The dashboard TB5 filter just reads the negotiated link speed from system_profiler output rather than assuming any TB interface is TB5.

Test Plan

Automated Testing

$ uv run pytest src/exo/master/tests/test_placement.py src/exo/utils/info_gatherer/tests/test_tb_parsing.py
............................                                             [100%]
28 passed in 0.72s

uv run basedpyright on the changed files: 0 errors. uv run ruff check + ruff format --check: clean.

Manual Testing

Hardware: 3-node M5 Max RDMA cluster (wc-smbp 128GB master + wc-smbpt 128GB worker + wc-bmbp 48GB worker) chained over a TB5 fabric. wc-smbp ↔ wc-smbpt is a direct TB5 cable; wc-smbp ↔ wc-bmbp runs through one iVANKY Fusiondock Ultra, which is the hub leg that exercises the nested-_items parser. (TCP-only wc-studio M1 Ultra is on the cluster but not on the RDMA fabric.)

TB hub topology — nested-_items fix: with the hub on the wc-smbp ↔ wc-bmbp leg, wc-bmbp's system_profiler output shows the peer Mac one level below an iVANKY entry that has no domain_uuid_key of its own. Running the new parser against live system_profiler -json output on each node:

wc-bmbp (peer Mac sits behind iVANKY hub):
  OLD parser → DROPPED bus_0 (no top-level domain_uuid_key in _items)
  NEW parser → src=CFB268CB  sink=F389A8EB

i.e. the old parser silently lost the hub-attached peer on bmbp's side of the connection; the new one walks through the hub and resolves the actual peer.

Hub latency / throughput vs direct TB5 (real numbers, since you asked):

| Path | RTT min / avg / max / σ (ms) | iperf3 (4 streams, 10s) |
|---|---|---|
| `wc-smbp ↔ wc-smbpt` direct TB5 | 0.341 / 0.575 / 0.698 / 0.083 | 52.6 → 56.1 Gbit/s (re-runs) |
| `wc-smbp ↔ wc-bmbp` through iVANKY | 0.354 / 0.482 / 0.588 / 0.076 | 63.6 Gbit/s (smbp→bmbp) |
| `wc-bmbp ↔ wc-smbp` reverse, hub | 0.345 / 0.616 / 2.699 / 0.487 | 60.9 Gbit/s |
| `wc-smbpt ↔ wc-smbp` reverse, direct | (asym route; separate issue) | 60.7 Gbit/s |

receptacle_1_tag.current_speed_key is 80 Gb/s on every connected port; the hub does not knock the link down. Within run-to-run noise the hub leg is indistinguishable from a direct TB5 cable.

JACCL init: held one worker engine 8s behind master startup. Master used to fail outright on the first mx.distributed.init attempt; with the retry loop the cluster boots cleanly.

RDMA placement: with both LAN (192.168.1.0/24) and Tailscale (100.64.0.0/10) sockets advertised, master picks the LAN address for both ring hosts and JACCL coordinators (covered by the two new placement tests).

Dashboard: TB4-only ports no longer show the "enable RDMA" prompt; TB5 ports still do.

Notes for reviewers

Earlier internal RCA writeup (RCA-3node-shard-fix.md) was dropped from history since it's not relevant upstream, and two no-op pairs (apply.py change-then-revert and a duplicate TB4/TB5 dashboard cherry-pick) were squashed into clean diffs against current upstream main.
Commit 5 (chore: ruff/basedpyright cleanup) is SIM103/PIE810/reportPossiblyUnboundVariable resolution; no behavior change.
Commit 6 (Address PR feedback) drops _fallback_interface_ips, reorders the priority tuple, and simplifies _address_priority per the maintainer comment thread on this PR.

Evanev7 · 2026-05-07T08:04:54Z

have you tested this setup? do you have any idea of the latency impact of the fusiondock ultra? the testing here seems to all be theoretical

team-wcv · 2026-05-10T00:26:34Z

> have you tested this setup? do you have any idea of the latency impact of the fusiondock ultra? the testing here seems to all be theoretical

yes I have, particularly because the dock is such a PITA and I wanted to avoid additional connections cluttering my setup if I can! I can get you some metrics if you want

`system_profiler SPThunderboltDataType -json` represents transparent TB hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`, nesting the peer Mac one level deeper than a direct cable would. The previous parser only inspected the top-level `_items`, so on the side that enumerated through the dock the peer's `domain_uuid_key` was never found and the corresponding RDMA edge silently went missing - producing an asymmetric mesh where placement could see the link in one direction but not the other. Extend `_ConnectivityItem` to recursively model `_items` and walk the tree depth-first for the first descendant `domain_uuid_key`. The link is the actual peer endpoint regardless of how many transparent switches sit between the local receptacle and the peer Mac. Cover the new behaviour with three deterministic unit tests: the iVANKY hub case observed in production, a direct-cable sanity case, and an empty-receptacle case.

Retry the JACCL initialization handshake when peers come online out of order, which we frequently observed when a 4-node Apple Silicon cluster booted on a Thunderbolt RDMA fabric — the master would attempt the JACCL collective before all worker engines had registered, yielding a single-shot failure that took the whole inference path with it. Adds a small retry loop with backoff in mlx utils so the master will reattempt init for ~30s before giving up.

Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s (TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric). Without this, the dashboard was prompting users to enable RDMA on TB4-only nodes that can't actually run rdma_ctl in TB5 mode.

- `placement_utils._is_candidate_host_ip` / `_address_priority`: SIM103 + PIE810 ruff fixes (no behavior change).
- `utils_mlx`: scope a `# pyright: ignore[reportPossiblyUnboundVariable]` on the JACCL retry-loop return; `group` is always bound when the loop exits with `break`, but the static analyzer can't reason across the raise-on-final-attempt branch.

Evanev7 · 2026-05-10T17:00:01Z

sweet - last time i asked about this pr the response made it out that this code was essentially unused. i'll check this out after #2061 merges

Evanev7 · 2026-05-10T17:03:48Z

broadly speaking the sys profiler changes look good, im not keen on the fallback ips as they stand. what problem are you solving? is our discovery service not finding a real connection that exists?

i plan to replace discovery somewhat soon (#2076 and beyond) but that will take a while to iron out

Maintainer feedback on exo-explore#2063 was that the `_fallback_interface_ips` path was papering over a discovery race instead of solving a concrete bug: "is our discovery service not finding a real connection that exists? i plan to replace discovery somewhat soon (exo-explore#2076 and beyond)." Agreed on review.

Changes:

- `find_ip_prioritised`: restore `if not ips: return None`. The 10s socket-discovery window is short, placement decisions can wait, and the JACCL retry loop added in this same PR already absorbs the startup race at the next layer down.
- Drop `_fallback_interface_ips` and `_is_candidate_host_ip` helpers.
- Reorder the `min(...)` tuple so interface type is the primary key and `_address_priority` is only a tiebreaker. Without this, a `ring=True` ring placement could prefer a LAN-class ethernet IP over a Thunderbolt IP that happened to be link-local (169.254/16), which inverts the ring-prefers-TB intent.
- Simplify `_address_priority` to a two-class RFC1918/everything-else split. The previous CGNAT vs link-local vs catch-all ranking only mattered together with the dropped fallback; with type as primary key the simpler split is enough to keep LAN ahead of Tailscale at parity.

Test impact:

- Drop `test_ring_placement_uses_advertised_lan_ips_for_rdma_only_topology` and `test_jaccl_placement_uses_advertised_lan_ip_for_rdma_coordinator`. Both depended on the fallback by constructing RDMA-only topologies.
- Replace them with `test_ring_placement_prefers_lan_ip_over_tailscale_ip` and `test_jaccl_placement_prefers_lan_ip_over_tailscale_ip`, which exercise the address-priority tiebreaker against realistic dual-homed (LAN + Tailscale) socket-reachable topologies.
- `test_placement_prefers_socket_reachable_rank_zero` now seeds an explicit listener->peer socket edge (so placement can resolve without the fallback) and a second peer->listener edge (so the rank-zero rotation has a strict winner).

team-wcv · 2026-05-11T19:19:41Z

Thanks for the careful review — addressed both threads:

On the fallback IPs (your second comment). Want to surface the original rationale before agreeing to drop it, since "is our discovery service not finding a real connection that exists?" is a fair question that deserves a real answer.

The fallback was solving a timing race, not a correctness gap in discovery. At master startup the Thunderbolt RDMA topology is populated immediately from system_profiler (sync, single call), but SocketConnection edges only land after check_reachable's 10-second polling cycle in worker/main.py:_poll_connection_updates HTTP-pings every peer on :52415. If a place command lands inside that 10s window — which we hit in practice during bring-up benches — _find_connection_ip returned [] for every (i, j) pair, even though the master did know from system_profiler that the nodes were physically connected. The fallback was: if topology says these peers are RDMA-linked but socket-discovery hasn't observed them yet, dial the advertised LAN/RFC1918 interfaces and let TCP fail loudly if that guess is wrong. With our cluster wired statically on a known subnet that guess is essentially always right; on a more general fabric it isn't.

That said, you're right that the right place to fix this is in discovery itself, and I shouldn't be papering over it here:

The race window is small. 10s of master-startup latency is well inside RunnerSupervisor's tolerance and inside the JACCL retry budget that this same PR adds.
The JACCL init retry loop already absorbs the bring-up race at the next layer down. If placement does fire before socket-discovery has a TCP edge for some (i, j), the worker that ends up trying to dial will just hit the retry path until discovery catches up. No information was being added by the fallback that JACCL retry wasn't already covering downstream.
The fallback can't actually validate dialability. It's picking an IP off the advertised interface list, not off a successful TCP probe. If discovery is genuinely broken (not just slow), the fallback picks an undialable address and we fail in JACCL init anyway — same end state, more code path.
libp2p -> zenoh pure #2076 is the right home for this. A redesigned discovery service that emits socket edges synchronously (or that admits "not yet discovered" as a placement-blocking state) makes the fallback unnecessary by construction. Layering a workaround in placement_utils.py while you're rewriting the layer below it is exactly the kind of debt that bites later — you'd have to consciously remove it.

So on net the fallback was buying ~10 seconds of bring-up time on cold clusters in exchange for an extra branch in placement and a couple of brittle tests. Worth the trade if discovery were stable long-term; not worth it if you're about to replace discovery anyway. Dropped in ce31509:

Drops _fallback_interface_ips and _is_candidate_host_ip entirely.
Restores if not ips: return None in find_ip_prioritised.
Reorders the min(...) key so interface type stays primary and the IP-class score is only a tiebreaker. This also fixes a separate latent ordering bug: with the old (address_priority, type_priority) ordering a TB Bridge on link-local could lose to a LAN ethernet IP even for ring=True, which is the opposite of what the ring placement wants.
Simplifies _address_priority to a two-class RFC1918/everything-else split. The CGNAT-vs-link-local-vs-catch-all gradient only had a use case together with the fallback; with the fallback gone and type as primary key, RFC1918 vs not-RFC1918 is all that needs to be there to keep LAN ahead of Tailscale at parity.
Replaces the two RDMA-only-topology tests with tests that exercise the tiebreaker against realistic dual-homed (LAN + Tailscale) socket-reachable topologies, since the original tests strictly depended on the fallback path.

Net diff vs current upstream main: +377 / -21 → now +209 / -21 after dropping the fallback. Happy to revisit the IP-class tiebreaker too once #2076 lands if discovery starts pre-scoring its own outputs.

On the FusionDock Ultra latency / hardware (your first comment). Got numbers from the actual cluster. The fabric I tested on is 3-node M5 Max RDMA (smbp 128GB master + smbpt 128GB + bmbp 48GB; one TCP-only M1 Ultra 128GB on the side, not on the RDMA fabric). I corrected the hardware claims in the PR body which previously said "3x M5 Max 128GB + 1x M1 Ultra 192GB" — those numbers were sloppy. Real spec: 2× M5 Max 128GB, 1× M5 Max 48GB, 1× M1 Ultra 128GB.

The hub sits between smbp and bmbp. smbp ↔ smbpt is direct TB5 for comparison. Measured numbers:

| Path | Ping RTT (20 packets, ms) | iperf3 (4 streams, 10s) |
|---|---|---|
| `smbp ↔ smbpt` direct TB5 | min 0.341 / avg 0.575 | 52.6–56.1 Gbit/s (two runs) |
| `smbp ↔ bmbp` through iVANKY hub | min 0.354 / avg 0.482 | 63.6 Gbit/s |
| `bmbp ↔ smbp` reverse, hub leg | min 0.345 / avg 0.616 | 60.9 Gbit/s |
| `smbpt ↔ smbp` reverse, direct | (asym route, separate) | 60.7 Gbit/s |

receptacle_1_tag.current_speed_key reports 80 Gb/s on every connected port (TB5 spec, symmetric). Within run-to-run noise the hub leg is indistinguishable from a direct TB5 cable. So at least on this specific hub the latency cost is in the noise floor for both control-plane and data-plane traffic.

Concrete proof that the nested-_items fix matters on this hardware: with the hub on smbp ↔ bmbp, bmbp's system_profiler output puts the peer Mac inside an iVANKY entry that has no domain_uuid_key of its own. Running the parser against the live JSON:

wc-bmbp (peer Mac one level below iVANKY hub on bus_0):
  OLD parser → DROPPED bus_0 (no top-level domain_uuid_key in _items)
  NEW parser → src=CFB268CB  sink=F389A8EB

So the old parser silently lost the hub-attached peer from bmbp's side and master would have seen the connection as one-way (smbp sees bmbp because smbp's side has the peer at top level; bmbp doesn't see smbp). Full benchmark table and the full audit are in the updated PR body.

Happy to wait until after #2061 merges before you take another pass.

team-wcv · 2026-05-11T21:23:34Z

Heads up: #2061 just merged, so the dependency you mentioned is cleared. Ready for the next pass when you have a moment.

Quick TL;DR of where we are post-review:

Nested _items parser: unchanged from the original commit; live system_profiler output on the working cluster confirms the recursive walk picks up hub-attached peers that the old top-level-only parser dropped (parser run printed src=CFB268CB sink=F389A8EB for the formerly-dropped peer).
JACCL init retry: unchanged. Retry loop with backoff around mx.distributed.init; absorbs the same startup race the dropped fallback IPs were papering over, so this is the path of record for that race condition.
RDMA host selection: refined per your feedback. Dropped _fallback_interface_ips + _is_candidate_host_ip entirely. Restored the early return None in find_ip_prioritised. Reordered the min(...) key so interface type stays primary and IP class is only a tiebreaker (this also fixes a separate ordering bug: with (address_priority, type_priority), a TB Bridge on link-local could lose to a LAN ethernet IP for ring=True, which is the opposite of what ring placement wants). Simplified _address_priority to RFC1918 / not-RFC1918.
Dashboard TB5 detection: unchanged. Filters by linkSpeed > 40 Gb/s.

Net diff dropped from +377 / −21 to +209 / −21 after removing the fallback. All four existing tests on test_placement.py pass; the two RDMA-only-topology tests that strictly depended on the fallback path were replaced with tests exercising the type/IP-class tiebreaker on realistic dual-homed (LAN + Tailscale) socket-reachable topologies.

Also added a hub-vs-direct bench table to the PR body (above) — happy to re-run on a different fabric if you want comparable numbers.

team-wcv mentioned this pull request May 7, 2026

Asymmetric tensor parallelism for uneven-memory clusters #2064

Open

jw-wcv and others added 5 commits May 10, 2026 01:52

Fix RDMA placement host selection

370979e

team-wcv force-pushed the feature/thunderbolt-rdma branch from c90aa0e to f469af1 Compare May 10, 2026 08:56

davidtkeane mentioned this pull request May 12, 2026

[BUG] 2-node Mac cluster: asymmetric topology blocks 2-node placements #2077

Open

Merge branch 'main' into feature/thunderbolt-rdma

d9d7f74

team-wcv wants to merge 7 commits into exo-explore:main from team-wcv:feature/thunderbolt-rdma


4 participants
