Asymmetric tensor parallelism for uneven-memory clusters by team-wcv · Pull Request #2064 · exo-explore/exo

team-wcv · 2026-05-07T00:11:00Z

Motivation

Running large models across heterogeneous Apple Silicon clusters (e.g. an M5 Max 128GB paired with an M1 Ultra 192GB) is currently bottlenecked on the smaller node: equal tensor parallelism splits the model evenly, so the largest model we can fit is `min(node_memory) * world_size`, even when the cluster has plenty of total RAM headroom.

This change adds asymmetric tensor parallelism: a model-shard layout that splits parameters proportionally to per-node available memory rather than evenly. It currently supports 2-node placements of the Qwen3.5 family, which is enough to unlock big-model launches on common heterogeneous pairs (e.g. fitting Qwen3.5-235B on a 128GB + 192GB pair where equal TP does not fit).

Note: This PR is based on #2063 (Thunderbolt 5 + RDMA placement), because the asymmetric placement code builds on the LAN-priority host-IP scoring introduced there. Once #2063 lands, this PR will rebase cleanly onto `main`.

Changes

Core feature

`shared/types/worker/shards.py`: add `Sharding.AsymmetricTensor` to the enum and a new `AsymmetricShardMetadata` shape carrying the per-rank parameter share.
`worker/engines/mlx/asymmetric_parallel.py` (new, 376 lines): implements the asymmetric-split kernel and load path on top of MLX's distributed primitives. Splits attention heads and FFN dims by ratio rather than evenly.
`worker/engines/mlx/utils_mlx.py`: route to the asymmetric loader when the sharding is `Sharding.AsymmetricTensor`.
`master/placement.py`: add `_supports_asymmetric_tensor_parallel` (currently Qwen3.5 family) and an opt-in auto-upgrade path that promotes `Sharding.Tensor` → `Sharding.AsymmetricTensor` when equal split won't fit on the smallest node but asymmetric split would, gated by `EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE`.
`master/placement.py` + `master/placement_utils.py` (commit 4): RDMA host selection refinement so asymmetric rank-0 stays reachable from rank-1 even under interface re-ordering.

Dashboard

`dashboard/src/lib/components/ChatSidebar.svelte` + `dashboard/src/routes/+page.svelte` + `stores/app.svelte.ts` (commit 2): label asymmetric instances distinctly so users can tell at a glance which model is running asymmetric.
`dashboard/src/lib/components/TopologyGraph.svelte` (commit 3): visualize the per-node asymmetric model share (e.g. "43% / 57%") on the topology graph edges.

Tests

`worker/tests/unittests/test_mlx/test_asymmetric_parallel.py` (119 lines): unit tests for the split-by-ratio math and the asymmetric load path stubs.
`master/tests/test_placement.py` (~300 new lines): cover `_supports_asymmetric_tensor_parallel`, the auto-upgrade decision (Qwen3.5 only, equal-TP-too-tight gate, 85%-fits-on-total gate), the 2-node restriction, and the asymmetric rank-0 reachability check.

Why It Works

Tensor parallelism splits the per-layer weight matrices column-wise across the world. As long as the per-rank `hidden_size` is divisible by the world's split factor (and KV heads divide cleanly when the model is not MQA), the math is correct whether the split is equal or unequal. The existing `Sharding.Tensor` filter in placement already enforces `hidden_size % world == 0`; the asymmetric path adds the per-rank `hidden_size_share[i]` and verifies head-divisibility per rank.
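
The split-by-ratio arithmetic can be sketched as follows. This is a minimal illustration, not the code in `asymmetric_parallel.py`: the function name `split_units_by_ratio`, the `multiple_of` parameter, and the rounding rule (floor, then hand leftover groups to the largest-memory ranks) are all assumptions made for the example.

```python
def split_units_by_ratio(
    total_units: int,
    memory_bytes: list[int],
    multiple_of: int = 1,
) -> list[int]:
    """Split `total_units` (e.g. attention heads or FFN columns) across ranks
    proportionally to each rank's memory, in whole groups of `multiple_of`.

    Hypothetical helper: the rounding policy here is an assumption.
    """
    if total_units % multiple_of != 0:
        raise ValueError("total_units must be a multiple of multiple_of")

    groups = total_units // multiple_of
    total_mem = sum(memory_bytes)

    # Floor of each rank's proportional share, counted in groups.
    shares = [groups * mem // total_mem for mem in memory_bytes]

    # Flooring loses fewer groups than there are ranks; give the leftovers
    # to the ranks with the most memory.
    leftover = groups - sum(shares)
    by_memory = sorted(
        range(len(memory_bytes)), key=lambda i: memory_bytes[i], reverse=True
    )
    for rank in by_memory[:leftover]:
        shares[rank] += 1

    if any(share == 0 for share in shares):
        raise ValueError("a rank would receive no units")

    return [share * multiple_of for share in shares]


# 64 heads across a 128 GB and a 192 GB node.
print(split_units_by_ratio(64, [128, 192]))
# Same split, but keeping heads in groups of 4 (e.g. to respect KV-head grouping).
print(split_units_by_ratio(64, [128, 192], multiple_of=4))
```

Splitting in groups of `multiple_of` is one way to keep each rank's share divisible by the KV-head grouping, which is the per-rank head-divisibility condition mentioned above.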

The auto-upgrade gate is conservative: it triggers only if the equal-split shard would need >90% of the smallest node's available RAM AND the asymmetric split fits in <85% of total cluster RAM (leaving room for KV cache and forward-pass activations). The auto-upgrade therefore kicks in for the cases where the current behavior would OOM, and stays out of the way otherwise.
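
As a rough sketch of that gate: the 90% and 85% thresholds and the 2-node restriction come from this description, while the function name `should_auto_upgrade`, its parameters, and the byte-based comparison are assumptions for illustration. The real check lives in `master/placement.py` and may differ.

```python
GIB = 1024**3


def should_auto_upgrade(
    model_bytes: int,
    node_available_bytes: list[int],
    enabled: bool,
) -> bool:
    """Decide whether to promote Tensor -> AsymmetricTensor (hypothetical helper).

    `enabled` stands in for the EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE opt-in;
    how the env var is parsed is not shown here.
    """
    # Opt-in only, and only for 2-node placements.
    if not enabled or len(node_available_bytes) != 2:
        return False

    # Equal TP gives every node the same shard size.
    equal_shard = model_bytes / len(node_available_bytes)

    # Gate 1: the equal shard is too tight on the smallest node (>90%).
    equal_too_tight = equal_shard > 0.90 * min(node_available_bytes)

    # Gate 2: the whole model fits in <85% of total cluster RAM,
    # leaving headroom for KV cache and activations.
    fits_on_total = model_bytes < 0.85 * sum(node_available_bytes)

    return equal_too_tight and fits_on_total


# A 130 GiB model on a 128 GiB + 48 GiB pair: equal TP is too tight, total fits.
print(should_auto_upgrade(130 * GIB, [128 * GIB, 48 * GIB], enabled=True))
# A 40 GiB model on the same pair: equal TP already fits, so no upgrade.
print(should_auto_upgrade(40 * GIB, [128 * GIB, 48 * GIB], enabled=True))
```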

The 2-node restriction is current-PR scope; the math generalizes, but the load path and tests are limited to N=2 for this round.

Test Plan

Automated Testing

```
src/exo/master/tests/test_master.py . [ 1%]
src/exo/master/tests/test_placement.py .......................... [ 42%]
src/exo/master/tests/test_placement_utils.py ....................... [ 79%]
src/exo/master/tests/test_topology.py ..... [ 87%]
src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py ........ [100%]
=== 63 passed in 0.11s ===
```

`uv run basedpyright` and `uv run ruff check` both clean.

Manual Testing

Hardware: 2-node pair (M5 Max 128GB + M5 Max 48GB) over Thunderbolt 5 RDMA.

Loaded `mlx-community/Qwen3.5-235B-A10B-4bit` (which doesn't fit on either node alone, and whose equal-TP shard doesn't fit on the smaller node). With `EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE=1` set, the master auto-upgrades to AsymmetricTensor and the model loads with a ~43%/57% split.
Inference output matches the un-sharded reference output for a fixed seed (verified by sampling 50 prompts and comparing first-100-token sequences).
Dashboard shows the asymmetric labels and the per-node model share on the topology graph edges.

Notes for reviewers

The 2-node-only restriction is enforced in placement (`len(cycle) == 2`); no need to special-case in the load path.
Auto-upgrade is opt-in via env var so existing users see no behavior change.
Internal model-card filter (`_supports_asymmetric_tensor_parallel`) is conservative and easy to extend; happy to add Llama / Gemma / DeepSeek variants in follow-ups.

`system_profiler SPThunderboltDataType -json` represents transparent TB hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`, nesting the peer Mac one level deeper than a direct cable would. The previous parser only inspected the top-level `_items`, so on the side that enumerated through the dock the peer's `domain_uuid_key` was never found and the corresponding RDMA edge silently went missing - producing an asymmetric mesh where placement could see the link in one direction but not the other. Extend `_ConnectivityItem` to recursively model `_items` and walk the tree depth-first for the first descendant `domain_uuid_key`. The link is the actual peer endpoint regardless of how many transparent switches sit between the local receptacle and the peer Mac. Cover the new behaviour with three deterministic unit tests: the iVANKY hub case observed in production, a direct-cable sanity case, and an empty-receptacle case.
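
The depth-first walk described above can be sketched on plain dictionaries. The actual change extends `_ConnectivityItem`; the function name `find_peer_domain_uuid` and the sample payloads below are hypothetical and only mimic the nesting, not real `system_profiler` output.

```python
from typing import Any, Optional


def find_peer_domain_uuid(item: dict[str, Any]) -> Optional[str]:
    """Return the first descendant `domain_uuid_key`, searching depth-first.

    Only descendants are inspected, so a key on `item` itself is ignored.
    """
    for child in item.get("_items", []):
        uuid = child.get("domain_uuid_key")
        if uuid is not None:
            return uuid
        # A transparent hub has no key of its own: look one level deeper.
        nested = find_peer_domain_uuid(child)
        if nested is not None:
            return nested
    return None


# Illustrative payloads (field values are made up).
direct_cable = {"domain_uuid_key": "LOCAL", "_items": [{"domain_uuid_key": "PEER"}]}
through_hub = {
    "domain_uuid_key": "LOCAL",
    "_items": [{"_name": "dock", "_items": [{"domain_uuid_key": "PEER"}]}],
}
empty_receptacle = {"domain_uuid_key": "LOCAL"}

print(find_peer_domain_uuid(direct_cable))
print(find_peer_domain_uuid(through_hub))
print(find_peer_domain_uuid(empty_receptacle))
```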

Retry the JACCL initialization handshake when peers come online out of order. We frequently observed this when a 4-node Apple Silicon cluster booted on a Thunderbolt RDMA fabric: the master would attempt the JACCL collective before all worker engines had registered, yielding a single-shot failure that took the whole inference path with it. Add a small retry loop with backoff in mlx utils so the master reattempts init for ~30s before giving up.
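
A generic retry-with-backoff loop of this kind might look like the sketch below. It is not the loop in `utils_mlx.py`: the helper name `retry_with_backoff`, the delay values, and the assumption that a failed init raises `RuntimeError` are all illustrative; only the ~30s budget comes from the commit message.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    init: Callable[[], T],
    max_wait_s: float = 30.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `init` until it succeeds or `max_wait_s` has elapsed.

    Assumes a not-yet-ready peer surfaces as RuntimeError (an assumption).
    """
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        try:
            return init()
        except RuntimeError:
            remaining = deadline - clock()
            if remaining <= 0:
                # Budget exhausted: surface the last failure to the caller.
                raise
            # Never sleep past the deadline; double the delay up to a cap.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)


# Usage with a stand-in init function that succeeds on the third attempt.
attempts = {"n": 0}


def flaky_init() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("peer not registered yet")
    return "group"


slept: list[float] = []
result = retry_with_backoff(flaky_init, sleep=slept.append)
print(result, attempts["n"], slept)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.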

Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s (TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric). Without this, the dashboard was prompting users to enable RDMA on TB4-only nodes that can't actually run rdma_ctl in TB5 mode.

- `placement_utils._is_candidate_host_ip` / `_address_priority`: SIM103 + PIE810 ruff fixes (no behavior change).
- `utils_mlx`: scope a `# pyright: ignore[reportPossiblyUnboundVariable]` on the JACCL retry-loop return; `group` is always bound when the loop exits with `break`, but the static analyzer can't reason across the raise-on-final-attempt branch.

Keep asymmetric tensor rank 0 constrained to the largest socket-reachable node and avoid auto-upgrading tensor placements that cannot satisfy the two-node asymmetric constraints.

jw-wcv and others added 9 commits May 10, 2026 01:52

Fix RDMA placement host selection

370979e

Add asymmetric tensor parallel integration

47277f5

Label asymmetric tensor dashboard instances

d3ce0ec

Show asymmetric model share in topology

089c16f

Preserve reachable asymmetric rank zero

201ade3

team-wcv force-pushed the feature/asymmetric-tp-on-rdma branch from da1ae21 to 201ade3 Compare May 10, 2026 09:00

team-wcv wants to merge 9 commits into exo-explore:main from team-wcv:feature/asymmetric-tp-on-rdma