
Thunderbolt 5 + nested-hub discovery; JACCL init retry; RDMA host selection #2063

Open
team-wcv wants to merge 7 commits into exo-explore:main from team-wcv:feature/thunderbolt-rdma

Conversation

@team-wcv
Contributor

@team-wcv team-wcv commented May 7, 2026

Motivation

A bundle of correctness fixes for running exo on a Thunderbolt 4/5 RDMA fabric. We hit each of these on a 3-node Apple Silicon RDMA cluster (2x M5 Max 128GB + 1x M5 Max 48GB) plus a TCP-only 4th node (1x M1 Ultra 128GB), chained over a TB5 fabric that includes one iVANKY Fusiondock Ultra hub leg.

  1. Nested Thunderbolt _items aren't walked, so peers attached behind a TB hub never register on the master and the cluster sees them as offline.
  2. JACCL init is single-shot, so if any worker engine isn't yet registered when the master begins the collective handshake, the whole boot path fails. With several nodes coming up in parallel this happens probabilistically.
  3. RDMA placement picks an arbitrary IP per node instead of preferring the LAN/RFC1918 address that's actually on the RDMA fabric, so once we have multiple network interfaces (Tailscale, LAN, TB Bridge, etc.) the master ends up picking the wrong one.
  4. Dashboard treats every TB-detected interface as TB5, prompting users to enable RDMA on TB4 hardware that can't run rdma_ctl in TB5 mode.

Changes

1. shared/types/thunderbolt.py — walk nested _items (commit 1)

Recurse into the _items array on every Thunderbolt port entry. Hub-attached peers were previously dropped because the parser only looked at the top level.
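A minimal sketch of the recursive walk, written against plain dicts rather than the PR's `_ConnectivityItem` model. The function name and the sample entries are illustrative assumptions; only the `_items` and `domain_uuid_key` field names come from the description above.

```python
from typing import Any, Optional


def first_descendant_uuid(item: dict[str, Any]) -> Optional[str]:
    """Depth-first search for the first domain_uuid_key below `item`."""
    for child in item.get("_items", []):
        uuid = child.get("domain_uuid_key")
        if uuid is not None:
            return uuid
        # A transparent hub carries no domain_uuid_key of its own,
        # so keep looking one level deeper.
        nested = first_descendant_uuid(child)
        if nested is not None:
            return nested
    return None


# Hypothetical receptacle entries, trimmed to the fields used here.
direct = {"_items": [{"_name": "peer Mac", "domain_uuid_key": "PEER-UUID"}]}
via_hub = {
    "_items": [
        {
            "_name": "hub",
            "_items": [{"_name": "peer Mac", "domain_uuid_key": "PEER-UUID"}],
        }
    ]
}
empty: dict[str, Any] = {}

print(first_descendant_uuid(direct))
print(first_descendant_uuid(via_hub))
print(first_descendant_uuid(empty))
```

A top-level-only lookup would resolve `direct` but miss `via_hub`, which is the hub case described above.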

2. master/placement_utils.py — RDMA host selection (commit 1, refined in commits 4 and 6)

When a peer is socket-reachable on multiple addresses (e.g. both LAN 192.168.1.x and Tailscale 100.64/10), break the tie by IP class so the LAN-class address wins:

  • 0: RFC1918 (192.168/16, 10/8, 172.16/12)
  • 1: anything else

Interface type still drives the primary preference (TB first for ring=True, ethernet first for the RDMA coordinator with ring=False); the IP class is only a secondary tiebreaker. This avoids inverting the ring-prefers-TB intent when a Thunderbolt Bridge happens to use a link-local (169.254/16) IP.
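The two-level ordering can be sketched as follows. This is an illustration under assumptions, not the code in `placement_utils.py`: `address_priority`, `pick_host`, and the `"thunderbolt"` / `"ethernet"` labels are hypothetical stand-ins, and candidates are modelled as simple `(interface_type, ip)` pairs.

```python
import ipaddress
from typing import Optional

RFC1918_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def address_priority(ip: str) -> int:
    """Return 0 for RFC1918 addresses and 1 for everything else (lower wins)."""
    addr = ipaddress.ip_address(ip)
    return 0 if any(addr in net for net in RFC1918_NETWORKS) else 1


def pick_host(candidates: list[tuple[str, str]], ring: bool) -> Optional[str]:
    """Pick an IP from (interface_type, ip) pairs.

    Interface type is the primary key; the IP class only breaks ties
    between candidates of the same type.
    """
    if not candidates:
        return None
    preferred = "thunderbolt" if ring else "ethernet"
    best = min(
        candidates,
        key=lambda c: (0 if c[0] == preferred else 1, address_priority(c[1])),
    )
    return best[1]


# LAN beats Tailscale when both are ethernet-class candidates.
print(pick_host([("ethernet", "100.64.0.7"), ("ethernet", "192.168.1.20")], ring=False))
# A link-local Thunderbolt Bridge address still wins for ring placement.
print(pick_host([("ethernet", "192.168.1.20"), ("thunderbolt", "169.254.10.2")], ring=True))
```

Putting the type score first in the tuple is what keeps a link-local Thunderbolt address ahead of an RFC1918 ethernet address when `ring=True`.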

Update vs the original revision of this PR: the earlier draft also added a _fallback_interface_ips branch that fell back to advertised node IPs when the topology had RDMA edges but no SocketConnection edges. Per maintainer feedback (#2063 thread, "im not keen on the fallback ips as they stand … i plan to replace discovery somewhat soon"), that branch has been dropped. The 10-second check_reachable socket-discovery cycle plus the new JACCL retry loop below already cover the startup race the fallback was papering over.

3. master/placement.py — rotate cycle so rank 0 is the most socket-reachable node (commit 1)

_prefer_socket_reachable_rank_zero rotates the chosen cycle so the node with the most inbound SocketConnection edges sits at rank 0 (the listener for both MLX ring and JACCL). Discovery can produce asymmetric edges in practice, and this avoids assigning the listener role to a node that peers cannot dial.
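A rough sketch of the rotation, assuming the cycle is a list of node IDs and socket edges are `(source, sink)` pairs. The function name and data shapes are hypothetical and are not taken from `placement.py`.

```python
from collections import Counter


def rotate_to_most_reachable(
    cycle: list[str], socket_edges: list[tuple[str, str]]
) -> list[str]:
    """Rotate `cycle` so the node with the most inbound socket edges is first.

    Ties keep the node that appears earliest in the original cycle.
    """
    if not cycle:
        return cycle
    inbound = Counter(sink for _, sink in socket_edges if sink in cycle)
    # max() returns the first maximal index, which preserves the tie rule.
    best = max(range(len(cycle)), key=lambda i: inbound[cycle[i]])
    return cycle[best:] + cycle[:best]


edges = [("a", "b"), ("c", "b"), ("b", "a")]
print(rotate_to_most_reachable(["a", "b", "c"], edges))
```

Rotating rather than re-sorting keeps the neighbour order of the ring intact; only the starting point (rank 0) changes.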

4. worker/engines/mlx/utils_mlx.py — JACCL init retry loop (commit 2)

Wrap mx.distributed.init(backend="jaccl", strict=True) in a retry loop with backoff (8 attempts, ~30s total). Re-raise on the final attempt.
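The retry shape looks roughly like the sketch below. The helper name, the backoff schedule, and the choice of `RuntimeError` as the retryable exception are all assumptions for illustration; the PR only states 8 attempts and a budget of about 30s.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def init_with_retry(
    init: Callable[[], T],
    attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `init` until it succeeds, backing off between attempts.

    The last failure is re-raised so a genuinely broken fabric still fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return init()
        except RuntimeError:  # assumed failure type, see lead-in
            if attempt == attempts:
                raise
            # Exponential backoff, capped so the total stays bounded.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

With these illustrative defaults the sleeps are 0.5, 1, 2, 4, 6, 6, 6 seconds, about 25s in total before the final attempt. In the worker, the callable would be something like `lambda: mx.distributed.init(backend="jaccl", strict=True)`.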

5. dashboard/src/routes/+page.svelte — TB5 detection (commit 3)

Filter TB nodes by link speed > 40 Gb/s rather than "any TB interface". TB5 negotiates 80 Gb/s symmetric or 120/40 Gb/s asymmetric; TB4 stays at 40 Gb/s.
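The dashboard itself is Svelte, but the predicate can be sketched in Python. The speed string formats handled here (for example `"80 Gb/s"` or `"120/40 Gb/s"`) and both function names are assumptions; the real field may be formatted differently.

```python
import re
from typing import Optional


def link_speed_gbps(speed: str) -> Optional[float]:
    """Extract the highest Gb/s figure from a link-speed string.

    Taking the maximum handles an asymmetric "120/40 Gb/s" style value,
    which is an assumed format.
    """
    if "Gb/s" not in speed:
        return None
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", speed)]
    return max(numbers) if numbers else None


def is_tb5_link(speed: str) -> bool:
    """Treat a link as TB5 only when it negotiates more than 40 Gb/s."""
    gbps = link_speed_gbps(speed)
    return gbps is not None and gbps > 40


for sample in ("80 Gb/s", "120/40 Gb/s", "40 Gb/s", ""):
    print(repr(sample), is_tb5_link(sample))
```

The strict `> 40` comparison is what keeps TB4 links, which top out at 40 Gb/s, from triggering the RDMA prompt.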

6. Tests

  • master/tests/test_placement.py: test_ring_placement_prefers_lan_ip_over_tailscale_ip, test_jaccl_placement_prefers_lan_ip_over_tailscale_ip, test_placement_prefers_socket_reachable_rank_zero.
  • utils/info_gatherer/tests/test_tb_parsing.py: test_conn_resolves_peer_through_intermediate_hub, test_conn_returns_first_peer_for_direct_link, test_conn_returns_none_when_no_peer_present.

Why It Works

The TB _items parser fix is just walking a tree the right way; system_profiler has reported Thunderbolt topology this way as long as macOS has had hubs.

JACCL init retry is the standard pattern for an n-node init handshake where boot order isn't synchronized: instead of failing on the first race, retry with backoff until the network stabilizes. The retry budget (~30s) is well under the cluster's idle timeout, so it doesn't mask real failures.

The address-class tiebreaker only kicks in when a peer's SocketConnection set is non-empty and contains multiple same-type IPs, which is the empirical pain point: nodes that have both LAN and Tailscale interfaces advertise both, and we want the LAN one for RDMA.

The dashboard TB5 filter just reads the negotiated link speed from system_profiler output rather than assuming any TB interface is TB5.

Test Plan

Automated Testing

$ uv run pytest src/exo/master/tests/test_placement.py src/exo/utils/info_gatherer/tests/test_tb_parsing.py
............................                                             [100%]
28 passed in 0.72s

uv run basedpyright on the changed files: 0 errors. uv run ruff check + ruff format --check: clean.

Manual Testing

Hardware: 3-node M5 Max RDMA cluster (wc-smbp 128GB master + wc-smbpt 128GB worker + wc-bmbp 48GB worker) chained over a TB5 fabric. wc-smbp ↔ wc-smbpt is a direct TB5 cable; wc-smbp ↔ wc-bmbp runs through one iVANKY Fusiondock Ultra, which is the hub leg that exercises the nested-_items parser. (TCP-only wc-studio M1 Ultra is on the cluster but not on the RDMA fabric.)

TB hub topology — nested-_items fix: with the hub on the wc-smbp ↔ wc-bmbp leg, wc-bmbp's system_profiler output shows the peer Mac one level below an iVANKY entry that has no domain_uuid_key of its own. Running the new parser against live system_profiler -json output on each node:

wc-bmbp (peer Mac sits behind iVANKY hub):
  OLD parser → DROPPED bus_0 (no top-level domain_uuid_key in _items)
  NEW parser → src=CFB268CB  sink=F389A8EB

i.e. the old parser silently lost the hub-attached peer on bmbp's side of the connection; the new one walks through the hub and resolves the actual peer.

Hub latency / throughput vs direct TB5 (real numbers, since you asked):

| Path | RTT min / avg / max / σ (ms) | iperf3 (4 streams, 10s) |
| --- | --- | --- |
| wc-smbp ↔ wc-smbpt direct TB5 | 0.341 / 0.575 / 0.698 / 0.083 | 52.6 → 56.1 Gbit/s (re-runs) |
| wc-smbp ↔ wc-bmbp through iVANKY | 0.354 / 0.482 / 0.588 / 0.076 | 63.6 Gbit/s (smbp→bmbp) |
| wc-bmbp ↔ wc-smbp reverse, hub | 0.345 / 0.616 / 2.699 / 0.487 | 60.9 Gbit/s |
| wc-smbpt ↔ wc-smbp reverse, direct | (asym route, separate issue) | 60.7 Gbit/s |

receptacle_1_tag.current_speed_key is 80 Gb/s on every connected port; the hub does not knock the link down. Within run-to-run noise the hub leg is indistinguishable from a direct TB5 cable.

JACCL init: held one worker engine 8s behind master startup. Master used to fail outright on the first mx.distributed.init attempt; with the retry loop the cluster boots cleanly.

RDMA placement: with both LAN (192.168.1.0/24) and Tailscale (100.64.0.0/10) sockets advertised, master picks the LAN address for both ring hosts and JACCL coordinators (covered by the two new placement tests).

Dashboard: TB4-only ports no longer show the "enable RDMA" prompt; TB5 ports still do.

Notes for reviewers

  • Earlier internal RCA writeup (RCA-3node-shard-fix.md) was dropped from history since it's not relevant upstream, and two no-op pairs (apply.py change-then-revert and a duplicate TB4/TB5 dashboard cherry-pick) were squashed into clean diffs against current upstream main.
  • Commit 5 (chore: ruff/basedpyright cleanup) is SIM103/PIE810/reportPossiblyUnboundVariable resolution; no behavior change.
  • Commit 6 (Address PR feedback) drops _fallback_interface_ips, reorders the priority tuple, and simplifies _address_priority per the maintainer comment thread on this PR.

@Evanev7
Member

Evanev7 commented May 7, 2026

have you tested this setup? do you have any idea of the latency impact of the fusiondock ultra? the testing here seems to all be theoretical

@team-wcv
Contributor Author

> have you tested this setup? do you have any idea of the latency impact of the fusiondock ultra? the testing here seems to all be theoretical

yes I have, particularly because the dock is such a PITA and I wanted to avoid additional connections cluttering my setup if I can! I can get you some metrics if you want

jw-wcv and others added 5 commits May 10, 2026 01:52
`system_profiler SPThunderboltDataType -json` represents transparent TB
hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`,
nesting the peer Mac one level deeper than a direct cable would. The
previous parser only inspected the top-level `_items`, so on the side
that enumerated through the dock the peer's `domain_uuid_key` was never
found and the corresponding RDMA edge silently went missing - producing
an asymmetric mesh where placement could see the link in one direction
but not the other.

Extend `_ConnectivityItem` to recursively model `_items` and walk the
tree depth-first for the first descendant `domain_uuid_key`. The link is
the actual peer endpoint regardless of how many transparent switches sit
between the local receptacle and the peer Mac.

Cover the new behaviour with three deterministic unit tests: the iVANKY
hub case observed in production, a direct-cable sanity case, and an
empty-receptacle case.

Retry the JACCL initialization handshake when peers come online out of order,
which we frequently observed when a 4-node Apple Silicon cluster booted on a
Thunderbolt RDMA fabric — the master would attempt the JACCL collective
before all worker engines had registered, yielding a single-shot failure
that took the whole inference path with it.

Adds a small retry loop with backoff in mlx utils so the master will reattempt
init for ~30s before giving up.

Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s
(TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric).
Without this, the dashboard was prompting users to enable RDMA on
TB4-only nodes that can't actually run rdma_ctl in TB5 mode.

- placement_utils._is_candidate_host_ip / _address_priority: SIM103 + PIE810
  ruff fixes (no behavior change).
- utils_mlx: scope a # pyright: ignore[reportPossiblyUnboundVariable] on the
  JACCL retry-loop return; group is always bound when the loop exits with
  break, but the static analyzer can't reason across the raise-on-final-attempt
  branch.

@team-wcv team-wcv force-pushed the feature/thunderbolt-rdma branch from c90aa0e to f469af1 Compare May 10, 2026 08:56
@Evanev7
Member

Evanev7 commented May 10, 2026

sweet - last time i asked about this pr the response made it out that this code was essentially unused. i'll check this out after #2061 merges

@Evanev7
Member

Evanev7 commented May 10, 2026

broadly speaking the sys profiler changes look good, im not keen on the fallback ips as they stand. what problem are you solving? is our discovery service not finding a real connection that exists?

i plan to replace discovery somewhat soon (#2076 and beyond) but that will take a while to iron out

Maintainer feedback on exo-explore#2063 was that the `_fallback_interface_ips` path
was papering over a discovery race instead of solving a concrete bug:
"is our discovery service not finding a real connection that exists?
i plan to replace discovery somewhat soon (exo-explore#2076 and beyond)." Agreed
on review.

Changes:

- `find_ip_prioritised`: restore `if not ips: return None`. The
  10s socket-discovery window is short, placement decisions can wait,
  and the JACCL retry loop added in this same PR already absorbs the
  startup race at the next layer down.
- Drop `_fallback_interface_ips` and `_is_candidate_host_ip` helpers.
- Reorder the `min(...)` tuple so interface type is the primary key and
  `_address_priority` is only a tiebreaker. Without this, a `ring=True`
  ring placement could prefer a LAN-class ethernet IP over a Thunderbolt
  IP that happened to be link-local (169.254/16), which inverts the
  ring-prefers-TB intent.
- Simplify `_address_priority` to a two-class RFC1918/everything-else
  split. The previous CGNAT vs link-local vs catch-all ranking only
  mattered together with the dropped fallback; with type as primary key
  the simpler split is enough to keep LAN ahead of Tailscale at parity.

Test impact:

- Drop `test_ring_placement_uses_advertised_lan_ips_for_rdma_only_topology`
  and `test_jaccl_placement_uses_advertised_lan_ip_for_rdma_coordinator`.
  Both depended on the fallback by constructing RDMA-only topologies.
- Replace them with
  `test_ring_placement_prefers_lan_ip_over_tailscale_ip` and
  `test_jaccl_placement_prefers_lan_ip_over_tailscale_ip`, which exercise
  the address-priority tiebreaker against realistic dual-homed
  (LAN + Tailscale) socket-reachable topologies.
- `test_placement_prefers_socket_reachable_rank_zero` now seeds an
  explicit listener->peer socket edge (so placement can resolve without
  the fallback) and a second peer->listener edge (so the rank-zero
  rotation has a strict winner).

@team-wcv
Contributor Author

team-wcv commented May 11, 2026

Thanks for the careful review — addressed both threads:

On the fallback IPs (your second comment). Want to surface the original rationale before agreeing to drop it, since "is our discovery service not finding a real connection that exists?" is a fair question that deserves a real answer.

The fallback was solving a timing race, not a correctness gap in discovery. At master startup the Thunderbolt RDMA topology is populated immediately from system_profiler (sync, single call), but SocketConnection edges only land after check_reachable's 10-second polling cycle in worker/main.py:_poll_connection_updates HTTP-pings every peer on :52415. If a place command lands inside that 10s window — which we hit in practice during bring-up benches — _find_connection_ip returned [] for every (i, j) pair, even though the master did know from system_profiler that the nodes were physically connected. The fallback was: if topology says these peers are RDMA-linked but socket-discovery hasn't observed them yet, dial the advertised LAN/RFC1918 interfaces and let TCP fail loudly if that guess is wrong. With our cluster wired statically on a known subnet that guess is essentially always right; on a more general fabric it isn't.

That said, you're right that the right place to fix this is in discovery itself, and I shouldn't be papering over it here:

  1. The race window is small. 10s of master-startup latency is well inside RunnerSupervisor's tolerance and inside the JACCL retry budget that this same PR adds.
  2. The JACCL init retry loop already absorbs the bring-up race at the next layer down. If placement does fire before socket-discovery has a TCP edge for some (i, j), the worker that ends up trying to dial will just hit the retry path until discovery catches up. No information was being added by the fallback that JACCL retry wasn't already covering downstream.
  3. The fallback can't actually validate dialability. It's picking an IP off the advertised interface list, not off a successful TCP probe. If discovery is genuinely broken (not just slow), the fallback picks an undialable address and we fail in JACCL init anyway — same end state, more code path.
  4. libp2p -> zenoh pure #2076 is the right home for this. A redesigned discovery service that emits socket edges synchronously (or that admits "not yet discovered" as a placement-blocking state) makes the fallback unnecessary by construction. Layering a workaround in placement_utils.py while you're rewriting the layer below it is exactly the kind of debt that bites later — you'd have to consciously remove it.

So on net the fallback was buying ~10 seconds of bring-up time on cold clusters in exchange for an extra branch in placement and a couple of brittle tests. Worth the trade if discovery were stable long-term; not worth it if you're about to replace discovery anyway. Dropped in ce31509:

  • Drops _fallback_interface_ips and _is_candidate_host_ip entirely.
  • Restores if not ips: return None in find_ip_prioritised.
  • Reorders the min(...) key so interface type stays primary and the IP-class score is only a tiebreaker. This also fixes a separate latent ordering bug: with the old (address_priority, type_priority) ordering a TB Bridge on link-local could lose to a LAN ethernet IP even for ring=True, which is the opposite of what the ring placement wants.
  • Simplifies _address_priority to a two-class RFC1918/everything-else split. The CGNAT-vs-link-local-vs-catch-all gradient only had a use case together with the fallback; with the fallback gone and type as primary key, RFC1918 vs not-RFC1918 is all that needs to be there to keep LAN ahead of Tailscale at parity.
  • Replaces the two RDMA-only-topology tests with tests that exercise the tiebreaker against realistic dual-homed (LAN + Tailscale) socket-reachable topologies, since the original tests strictly depended on the fallback path.

Net diff vs current upstream main: +377 / -21 → now +209 / -21 after dropping the fallback. Happy to revisit the IP-class tiebreaker too once #2076 lands if discovery starts pre-scoring its own outputs.

On the FusionDock Ultra latency / hardware (your first comment). Got numbers from the actual cluster. The fabric I tested on is 3-node M5 Max RDMA (smbp 128GB master + smbpt 128GB + bmbp 48GB; one TCP-only M1 Ultra 128GB on the side, not on the RDMA fabric). I corrected the hardware claims in the PR body which previously said "3x M5 Max 128GB + 1x M1 Ultra 192GB" — those numbers were sloppy. Real spec: 2× M5 Max 128GB, 1× M5 Max 48GB, 1× M1 Ultra 128GB.

The hub sits between smbp and bmbp. smbp ↔ smbpt is direct TB5 for comparison. Measured numbers:

| Path | Ping RTT (20 packets, ms) | iperf3 (4 streams, 10s) |
| --- | --- | --- |
| smbp ↔ smbpt direct TB5 | min 0.341 / avg 0.575 | 52.6–56.1 Gbit/s (two runs) |
| smbp ↔ bmbp through iVANKY hub | min 0.354 / avg 0.482 | 63.6 Gbit/s |
| bmbp ↔ smbp reverse, hub leg | min 0.345 / avg 0.616 | 60.9 Gbit/s |
| smbpt ↔ smbp reverse, direct | (asym route, separate) | 60.7 Gbit/s |

receptacle_1_tag.current_speed_key reports 80 Gb/s on every connected port (TB5 spec, symmetric). Within run-to-run noise the hub leg is indistinguishable from a direct TB5 cable. So at least on this specific hub the latency cost is in the noise floor for both control-plane and data-plane traffic.

Concrete proof that the nested-_items fix matters on this hardware: with the hub on smbp ↔ bmbp, bmbp's system_profiler output puts the peer Mac inside an iVANKY entry that has no domain_uuid_key of its own. Running the parser against the live JSON:

wc-bmbp (peer Mac one level below iVANKY hub on bus_0):
  OLD parser → DROPPED bus_0 (no top-level domain_uuid_key in _items)
  NEW parser → src=CFB268CB  sink=F389A8EB

So the old parser silently lost the hub-attached peer from bmbp's side and master would have seen the connection as one-way (smbp sees bmbp because smbp's side has the peer at top level; bmbp doesn't see smbp). Full benchmark table and the full audit are in the updated PR body.

Happy to wait until after #2061 merges before you take another pass.

@team-wcv
Contributor Author

Heads up: #2061 just merged, so the dependency you mentioned is cleared. Ready for the next pass when you have a moment.

Quick TL;DR of where we are post-review:

  • Nested _items parser: unchanged from the original commit; live system_profiler output on the working cluster confirms the recursive walk picks up hub-attached peers that the old top-level-only parser dropped (parser run printed src=CFB268CB sink=F389A8EB for the formerly-dropped peer).
  • JACCL init retry: unchanged. Retry loop with backoff around mx.distributed.init; absorbs the same startup race the dropped fallback IPs were papering over, so this is the path of record for that race condition.
  • RDMA host selection: refined per your feedback. Dropped _fallback_interface_ips + _is_candidate_host_ip entirely. Restored the early return None in find_ip_prioritised. Reordered the min(...) key so interface type stays primary and IP class is only a tiebreaker (this also fixes a separate ordering bug: with (address_priority, type_priority), a TB Bridge on link-local could lose to a LAN ethernet IP for ring=True, which is the opposite of what ring placement wants). Simplified _address_priority to RFC1918 / not-RFC1918.
  • Dashboard TB5 detection: unchanged. Filters by linkSpeed > 40 Gb/s.

Net diff dropped from +377 / −21 to +209 / −21 after removing the fallback. All four existing tests on test_placement.py pass; the two RDMA-only-topology tests that strictly depended on the fallback path were replaced with tests exercising the type/IP-class tiebreaker on realistic dual-homed (LAN + Tailscale) socket-reachable topologies.

Also added a hub-vs-direct bench table to the PR body (above) — happy to re-run on a different fabric if you want comparable numbers.


4 participants