Asymmetric tensor parallelism for uneven-memory clusters #2064
Open
team-wcv wants to merge 9 commits into
`system_profiler SPThunderboltDataType -json` represents transparent TB hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`, nesting the peer Mac one level deeper than a direct cable would. The previous parser only inspected the top-level `_items`, so on the side that enumerated through the dock the peer's `domain_uuid_key` was never found and the corresponding RDMA edge silently went missing. The result was an asymmetric mesh in which placement could see the link in one direction but not the other. Extend `_ConnectivityItem` to model `_items` recursively and walk the tree depth-first for the first descendant `domain_uuid_key`. The link now resolves to the actual peer endpoint regardless of how many transparent switches sit between the local receptacle and the peer Mac. Cover the new behaviour with three deterministic unit tests: the iVANKY hub case observed in production, a direct-cable sanity case, and an empty-receptacle case.
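The depth-first walk described in this commit can be sketched roughly as follows. This is a minimal illustration on plain dictionaries, not the PR's `_ConnectivityItem` model: the function name `find_peer_domain_uuid`, the `_name` values, and the sample UUIDs are all hypothetical, and only `_items` and `domain_uuid_key` come from the description above.

```python
from typing import Any, Optional


def find_peer_domain_uuid(item: dict[str, Any]) -> Optional[str]:
    """Return the first descendant `domain_uuid_key`, searching depth-first.

    Only descendants are inspected, so a key on `item` itself is ignored.
    """
    for child in item.get("_items", []):
        uuid = child.get("domain_uuid_key")
        if uuid is not None:
            return uuid
        # A transparent hub has no key of its own; look beneath it.
        nested = find_peer_domain_uuid(child)
        if nested is not None:
            return nested
    return None


# Hypothetical receptacle entries mirroring the three test cases.
direct_cable = {
    "_name": "receptacle_1",
    "_items": [{"_name": "peer-mac", "domain_uuid_key": "AAAA"}],
}
via_hub = {
    "_name": "receptacle_2",
    "_items": [
        {
            "_name": "hub",
            "_items": [{"_name": "peer-mac", "domain_uuid_key": "BBBB"}],
        }
    ],
}
empty_receptacle = {"_name": "receptacle_3"}
```

With these inputs the direct-cable and hub cases both resolve to the peer's UUID, while the empty receptacle yields `None`.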
Retry the JACCL initialization handshake when peers come online out of order, which we frequently observed when a 4-node Apple Silicon cluster booted on a Thunderbolt RDMA fabric: the master would attempt the JACCL collective before all worker engines had registered, yielding a single-shot failure that took the whole inference path with it. Add a small retry loop with backoff in `utils_mlx` so the master reattempts init for ~30s before giving up.
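A retry loop of this kind might look like the sketch below. It is an assumption-laden illustration rather than the code in `utils_mlx`: the helper name `retry_with_backoff`, the delay values, and the choice of `RuntimeError` as the retryable failure are all hypothetical. Only the ~30s budget comes from the commit message. The `sleep` and `clock` parameters exist so the loop can be tested without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    init: Callable[[], T],
    *,
    deadline_s: float = 30.0,       # total budget, per the commit message
    initial_delay_s: float = 0.5,   # hypothetical starting delay
    max_delay_s: float = 4.0,       # hypothetical cap on the delay
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `init` until it succeeds or the deadline would be exceeded."""
    start = clock()
    delay = initial_delay_s
    while True:
        try:
            return init()
        except RuntimeError:
            # Give up (re-raising the last error) if another wait would
            # push us past the deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

Returning from inside the `try` keeps the result bound on every successful path, which sidesteps the possibly-unbound-variable warning that a `break`-based loop can trigger in static analysis.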
Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s (TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric). Without this, the dashboard was prompting users to enable RDMA on TB4-only nodes that can't actually run rdma_ctl in TB5 mode.
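As a rough illustration of the threshold, assuming the link speed is already available as a single number in Gb/s (the function and constant names here are hypothetical, and how the speed is reported for asymmetric 120/40 links is not covered by this sketch):

```python
TB4_MAX_GBPS = 40.0  # TB4 tops out at 40 Gb/s


def is_tb5_link(link_speed_gbps: float) -> bool:
    """Treat a link as TB5 only when it negotiated strictly above 40 Gb/s."""
    return link_speed_gbps > TB4_MAX_GBPS
```

The comparison is strict, so a link that negotiated exactly 40 Gb/s is not classified as TB5.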
- `placement_utils._is_candidate_host_ip` / `_address_priority`: SIM103 + PIE810 ruff fixes (no behavior change).
- `utils_mlx`: scope a `# pyright: ignore[reportPossiblyUnboundVariable]` on the JACCL retry-loop return; `group` is always bound when the loop exits with `break`, but the static analyzer can't reason across the raise-on-final-attempt branch.
Keep asymmetric tensor rank 0 constrained to the largest socket-reachable node and avoid auto-upgrading tensor placements that cannot satisfy the two-node asymmetric constraints.
Motivation
Running large models across heterogeneous Apple Silicon clusters (e.g. an M5 Max 128GB paired with an M1 Ultra 192GB) is currently bottlenecked on the smaller node: equal tensor parallelism splits the model evenly, so the largest model that fits is bounded by `min(node_memory) * world_size`, even when the cluster has plenty of total RAM headroom.
This change adds asymmetric tensor parallelism: a model-shard layout that splits parameters proportionally to per-node available memory rather than evenly. It currently supports 2-node Qwen3.5 (and family), which is enough to unlock big-model launches on common heterogeneous pairs (e.g. fitting Qwen3.5-235B on a 128GB + 192GB pair where equal TP does not fit).
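The capacity argument can be checked with a few lines of arithmetic. The sketch below uses the 128GB + 192GB pair from the example above; the 280 GB model footprint is a made-up number chosen only to land between the two capacity bounds, and both function names are hypothetical.

```python
from typing import Sequence


def equal_tp_capacity_gb(node_memory_gb: Sequence[float]) -> float:
    """Upper bound on model size under an even split: min(node_memory) * world_size."""
    return min(node_memory_gb) * len(node_memory_gb)


def proportional_shares_gb(
    model_gb: float, node_memory_gb: Sequence[float]
) -> list[float]:
    """Per-node share of the model when split proportionally to memory."""
    total = sum(node_memory_gb)
    return [model_gb * m / total for m in node_memory_gb]


nodes = [128.0, 192.0]
model_gb = 280.0  # hypothetical footprint, for illustration only

even_bound = equal_tp_capacity_gb(nodes)           # 256.0: the model does not fit
shares = proportional_shares_gb(model_gb, nodes)   # [112.0, 168.0]: each share fits
```

Under the even split each node would need 140 GB, which exceeds the 128 GB node. The proportional split asks for 112 GB and 168 GB, both within their node's memory, before accounting for KV cache and activations.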
Changes
Core feature
Dashboard
Tests
Why It Works
Tensor parallelism splits the per-layer weight matrices column-wise across the world. As long as the per-rank `hidden_size` is divisible by the world's split factor (and KV heads divide cleanly when not MQA), the math is correct independent of equal vs unequal splits. The existing `Sharding.Tensor` filter in placement already enforces `hidden_size % world == 0`; the asymmetric path adds the per-rank `hidden_size_share[i]` and verifies head-divisibility per rank.
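One way to picture the per-rank shares is to split whole attention heads proportionally to memory, so every share stays a multiple of the head dimension. The sketch below is an illustration under simplifying assumptions, not the PR's load path: it assumes `hidden_size == num_heads * head_dim`, ignores KV-head grouping, and the function names and example dimensions are hypothetical.

```python
from typing import Sequence


def split_hidden_size(
    hidden_size: int, head_dim: int, node_memory_gb: Sequence[float]
) -> list[int]:
    """Split `hidden_size` into per-rank shares that are multiples of `head_dim`."""
    if hidden_size % head_dim != 0:
        raise ValueError("hidden_size must be a multiple of head_dim")
    num_heads = hidden_size // head_dim
    total = sum(node_memory_gb)
    # Round down for every rank except the last, which takes the remainder.
    heads = [int(num_heads * m / total) for m in node_memory_gb[:-1]]
    heads.append(num_heads - sum(heads))
    if any(h <= 0 for h in heads):
        raise ValueError("every rank needs at least one head")
    return [h * head_dim for h in heads]


def shares_are_valid(shares: Sequence[int], hidden_size: int, head_dim: int) -> bool:
    """Check that shares cover `hidden_size` and each holds whole heads."""
    return sum(shares) == hidden_size and all(
        s > 0 and s % head_dim == 0 for s in shares
    )


# Hypothetical dimensions: 32 heads of width 128 on a 128GB + 192GB pair.
hidden_size_share = split_hidden_size(4096, 128, [128, 192])  # [1536, 2560]
```

Here rank 0 receives 12 heads and rank 1 receives 20, so the shares sum to the full width while keeping every head intact on a single rank.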
The auto-upgrade gate is conservative: it only triggers if the equal split is >90% of the smallest node's available RAM AND the asymmetric split fits in <85% of total cluster RAM (so we leave room for KV cache and forward-pass activations). This means the auto-upgrade kicks in on the cases where the current behavior would OOM, and stays out of the way otherwise.
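The two thresholds can be expressed as a small predicate. This sketch reads "equal split" as the per-node share `model_gb / world_size` and "asymmetric split fits" as the whole model staying under 85% of summed available RAM. Both readings are assumptions, as are the function name and the explicit two-node check (added here to reflect the current PR scope).

```python
from typing import Sequence

EQUAL_SPLIT_PRESSURE = 0.90  # equal share must exceed this fraction of the smallest node
CLUSTER_HEADROOM = 0.85      # model must fit under this fraction of total RAM


def should_auto_upgrade(model_gb: float, node_available_gb: Sequence[float]) -> bool:
    """Return True when an equal split is too tight but an asymmetric one fits."""
    world_size = len(node_available_gb)
    if world_size != 2:
        # Assumed: the asymmetric path is limited to two nodes for now.
        return False
    equal_share_gb = model_gb / world_size
    equal_is_tight = equal_share_gb > EQUAL_SPLIT_PRESSURE * min(node_available_gb)
    asymmetric_fits = model_gb < CLUSTER_HEADROOM * sum(node_available_gb)
    return equal_is_tight and asymmetric_fits
```

For a 128GB + 192GB pair, a hypothetical 240 GB model passes both checks (120 GB per node exceeds 115.2 GB, and 240 GB is under 272 GB). A 100 GB model is left on the equal split, and a 300 GB model is rejected because it would leave too little headroom.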
The 2-node restriction is current-PR scope; the math generalizes, but the load path and tests are limited to N=2 for this round.
Test Plan
Automated Testing
```
src/exo/master/tests/test_master.py . [ 1%]
src/exo/master/tests/test_placement.py .......................... [ 42%]
src/exo/master/tests/test_placement_utils.py ....................... [ 79%]
src/exo/master/tests/test_topology.py ..... [ 87%]
src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py ........ [100%]
=== 63 passed in 0.11s ===
```
`uv run basedpyright` and `uv run ruff check` both clean.
Manual Testing
Hardware: 2-node pair (M5 Max 128GB + M5 Max 48GB) over Thunderbolt 5 RDMA.
Notes for reviewers