Thunderbolt 5 + nested-hub discovery; JACCL init retry; RDMA host selection#2063
Thunderbolt 5 + nested-hub discovery; JACCL init retry; RDMA host selection#2063team-wcv wants to merge 7 commits into
Conversation
|
have you tested this setup? do you have any idea of the latency impact of the fusiondock ultra? the testing here seems to all be theoretical |
yes I have, particularly because the dock is such a PITA and I wantd to try to avoid additional connections cluttering my setup if I can! I can get you some metrics if you want |
`system_profiler SPThunderboltDataType -json` represents transparent TB hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`, nesting the peer Mac one level deeper than a direct cable would. The previous parser only inspected the top-level `_items`, so on the side that enumerated through the dock the peer's `domain_uuid_key` was never found and the corresponding RDMA edge silently went missing - producing an asymmetric mesh where placement could see the link in one direction but not the other. Extend `_ConnectivityItem` to recursively model `_items` and walk the tree depth-first for the first descendant `domain_uuid_key`. The link is the actual peer endpoint regardless of how many transparent switches sit between the local receptacle and the peer Mac. Cover the new behaviour with three deterministic unit tests: the iVANKY hub case observed in production, a direct-cable sanity case, and an empty-receptacle case.
Retry the JACCL initialization handshake when peers come online out of order, which we frequently observed when a 4-node Apple Silicon cluster booted on a Thunderbolt RDMA fabric — the master would attempt the JACCL collective before all worker engines had registered, yielding a single-shot failure that took the whole inference path with it. Adds a small retry loop with backoff in mlx utils so the master will reattempt init for ~30s before giving up.
Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s (TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric). Without this, the dashboard was prompting users to enable RDMA on TB4-only nodes that can't actually run rdma_ctl in TB5 mode.
- placement_utils._is_candidate_host_ip / _address_priority: SIM103 + PIE810 ruff fixes (no behavior change). - utils_mlx: scope a # pyright: ignore[reportPossiblyUnboundVariable] on the JACCL retry-loop return; group is always bound when the loop exits with break, but the static analyzer can't reason across the raise-on-final-attempt branch.
c90aa0e to
f469af1
Compare
|
sweet - last time i asked about this pr the response made it out that this code was essentially unused. i'll check this out after #2061 merges |
|
broadly speaking the sys profiler changes look good, im not keen on the fallback ips as they stand. what problem are you solving? is our discovery service not finding a real connection that exists? i plan to replace discovery somewhat soon (#2076 and beyond) but that will take a while to iron out |
Maintainer feedback on exo-explore#2063 was that the `_fallback_interface_ips` path was papering over a discovery race instead of solving a concrete bug: "is our discovery service not finding a real connection that exists? i plan to replace discovery somewhat soon (exo-explore#2076 and beyond)." Agreed on review. Changes: - `find_ip_prioritised`: restore `if not ips: return None`. The 20s socket-discovery window is short, placement decisions can wait, and the JACCL retry loop added in this same PR already absorbs the startup race at the next layer down. - Drop `_fallback_interface_ips` and `_is_candidate_host_ip` helpers. - Reorder the `min(...)` tuple so interface type is the primary key and `_address_priority` is only a tiebreaker. Without this, a `ring=True` ring placement could prefer a LAN-class ethernet IP over a Thunderbolt IP that happened to be link-local (169.254/16), which inverts the ring-prefers-TB intent. - Simplify `_address_priority` to a two-class RFC1918/everything-else split. The previous CGNAT vs link-local vs catch-all ranking only mattered together with the dropped fallback; with type as primary key the simpler split is enough to keep LAN ahead of Tailscale at parity. Test impact: - Drop `test_ring_placement_uses_advertised_lan_ips_for_rdma_only_topology` and `test_jaccl_placement_uses_advertised_lan_ip_for_rdma_coordinator`. Both depended on the fallback by constructing RDMA-only topologies. - Replace them with `test_ring_placement_prefers_lan_ip_over_tailscale_ip` and `test_jaccl_placement_prefers_lan_ip_over_tailscale_ip`, which exercise the address-priority tiebreaker against realistic dual-homed (LAN + Tailscale) socket-reachable topologies. - `test_placement_prefers_socket_reachable_rank_zero` now seeds an explicit listener->peer socket edge (so placement can resolve without the fallback) and a second peer->listener edge (so the rank-zero rotation has a strict winner).
|
Thanks for the careful review — addressed both threads: On the fallback IPs (your second comment). Want to surface the original rationale before agreeing to drop it, since "is our discovery service not finding a real connection that exists?" is a fair question that deserves a real answer. The fallback was solving a timing race, not a correctness gap in discovery. At master startup the Thunderbolt RDMA topology is populated immediately from That said, you're right that the right place to fix this is in discovery itself, and I shouldn't be papering over it here:
So on net the fallback was buying ~10 seconds of bring-up time on cold clusters in exchange for an extra branch in placement and a couple of brittle tests. Worth the trade if discovery were stable long-term; not worth it if you're about to replace discovery anyway. Dropped in
Net diff vs current upstream On the FusionDock Ultra latency / hardware (your first comment). Got numbers from the actual cluster. The fabric I tested on is 3-node M5 Max RDMA ( The hub sits between
Concrete proof that the nested- So the old parser silently lost the hub-attached peer from bmbp's side and master would have seen the connection as one-way (smbp sees bmbp because smbp's side has the peer at top level; bmbp doesn't see smbp). Full benchmark table and the full audit are in the updated PR body. Happy to wait until after #2061 merges before you take another pass. |
|
Heads up: #2061 just merged, so the dependency you mentioned is cleared. Ready for the next pass when you have a moment. Quick TL;DR of where we are post-review:
Net diff dropped from Also added a hub-vs-direct bench table to the PR body (above) — happy to re-run on a different fabric if you want comparable numbers. |
Motivation
A bundle of correctness fixes for running exo on a Thunderbolt 4/5 RDMA fabric. We hit each of these on a 3-node Apple Silicon RDMA cluster (2x M5 Max 128GB + 1x M5 Max 48GB) plus a TCP-only 4th node (1x M1 Ultra 128GB), chained over a TB5 fabric that includes one iVANKY Fusiondock Ultra hub leg.
_itemsaren't walked, so peers attached behind a TB hub never register on the master and the cluster sees them as offline.rdma_ctlin TB5 mode.Changes
1.
shared/types/thunderbolt.py— walk nested_items(commit 1)Recurse into the
_itemsarray on every Thunderbolt port entry. Hub-attached peers were previously dropped because the parser only looked at the top level.2.
master/placement_utils.py— RDMA host selection (commit 1, refined in commits 4 and 6)When a peer is socket-reachable on multiple addresses (e.g. both LAN
192.168.1.xand Tailscale100.64/10), break the tie by IP class so the LAN-class address wins:0: RFC1918 (192.168/16,10/8,172.16-31)1: anything elseInterface type still drives the primary preference (TB first for
ring=True, ethernet first for the RDMA coordinator withring=False); the IP class is only a secondary tiebreaker. This avoids inverting the ring-prefers-TB intent when a Thunderbolt Bridge happens to use a link-local (169.254/16) IP.3.
master/placement.py— rotate cycle so rank 0 is the most socket-reachable node (commit 1)_prefer_socket_reachable_rank_zerorotates the chosen cycle so the node with the most inboundSocketConnectionedges sits at rank 0 (the listener for both MLX ring and JACCL). Discovery can produce asymmetric edges in practice, and this avoids assigning the listener role to a node that peers cannot dial.4.
worker/engines/mlx/utils_mlx.py— JACCL init retry loop (commit 2)Wrap
mx.distributed.init(backend="jaccl", strict=True)in a retry loop with backoff (8 attempts, ~30s total). Re-raise on the final attempt.5.
dashboard/src/routes/+page.svelte— TB5 detection (commit 3)Filter TB nodes by link speed > 40 Gb/s rather than "any TB interface". TB5 negotiates 80 Gb/s symmetric or 120/40 Gb/s asymmetric; TB4 stays at 40 Gb/s.
6. Tests
master/tests/test_placement.py:test_ring_placement_prefers_lan_ip_over_tailscale_ip,test_jaccl_placement_prefers_lan_ip_over_tailscale_ip,test_placement_prefers_socket_reachable_rank_zero.utils/info_gatherer/tests/test_tb_parsing.py:test_conn_resolves_peer_through_intermediate_hub,test_conn_returns_first_peer_for_direct_link,test_conn_returns_none_when_no_peer_present.Why It Works
The TB
_itemsparser fix is just walking a tree the right way;system_profilerhas reported Thunderbolt topology this way as long as macOS has had hubs.JACCL init retry is the standard pattern for an n-node init handshake where boot order isn't synchronized: instead of failing on the first race, retry with backoff until the network stabilizes. The retry budget (~30s) is well under the cluster's idle timeout, so it doesn't mask real failures.
The address-class tiebreaker only kicks in when a peer's
SocketConnectionset is non-empty and contains multiple same-type IPs, which is the empirical pain point: nodes that have both LAN and Tailscale interfaces advertise both, and we want the LAN one for RDMA.The dashboard TB5 filter just reads the negotiated link speed from
system_profileroutput rather than assuming any TB interface is TB5.Test Plan
Automated Testing
uv run basedpyrighton the changed files: 0 errors.uv run ruff check+ruff format --check: clean.Manual Testing
Hardware: 3-node M5 Max RDMA cluster (
wc-smbp128GB master +wc-smbpt128GB worker +wc-bmbp48GB worker) chained over a TB5 fabric.wc-smbp ↔ wc-smbptis a direct TB5 cable;wc-smbp ↔ wc-bmbpruns through one iVANKY Fusiondock Ultra, which is the hub leg that exercises the nested-_itemsparser. (TCP-onlywc-studioM1 Ultra is on the cluster but not on the RDMA fabric.)TB hub topology — nested-
_itemsfix: with the hub on thewc-smbp ↔ wc-bmbpleg,wc-bmbp'ssystem_profileroutput shows the peer Mac one level below an iVANKY entry that has nodomain_uuid_keyof its own. Running the new parser against livesystem_profiler -jsonoutput on each node:i.e. the old parser silently lost the hub-attached peer on bmbp's side of the connection; the new one walks through the hub and resolves the actual peer.
Hub latency / throughput vs direct TB5 (real numbers, since you asked):
wc-smbp ↔ wc-smbptdirect TB5wc-smbp ↔ wc-bmbpthrough iVANKYwc-bmbp ↔ wc-smbpreverse, hubwc-smbpt ↔ wc-smbpreverse, directreceptacle_1_tag.current_speed_keyis80 Gb/son every connected port; the hub does not knock the link down. Within run-to-run noise the hub leg is indistinguishable from a direct TB5 cable.JACCL init: held one worker engine 8s behind master startup. Master used to fail outright on the first
mx.distributed.initattempt; with the retry loop the cluster boots cleanly.RDMA placement: with both LAN (
192.168.1.0/24) and Tailscale (100.64.0.0/10) sockets advertised, master picks the LAN address for both ring hosts and JACCL coordinators (covered by the two new placement tests).Dashboard: TB4-only ports no longer show the "enable RDMA" prompt; TB5 ports still do.
Notes for reviewers
RCA-3node-shard-fix.md) was dropped from history since it's not relevant upstream, and two no-op pairs (apply.py change-then-revert and a duplicate TB4/TB5 dashboard cherry-pick) were squashed into clean diffs against current upstream main.chore: ruff/basedpyright cleanup) is SIM103/PIE810/reportPossiblyUnboundVariable resolution; no behavior change.Address PR feedback) drops_fallback_interface_ips, reorders the priority tuple, and simplifies_address_priorityper the maintainer comment thread on this PR.