Asymmetric tensor parallelism for uneven-memory clusters #2064

Open
team-wcv wants to merge 9 commits into exo-explore:main from team-wcv:feature/asymmetric-tp-on-rdma

team-wcv commented May 7, 2026

Motivation

Running large models across heterogeneous Apple Silicon clusters (e.g. an M5 Max 128GB paired with an M1 Ultra 192GB) is currently bottlenecked by the smaller node: equal tensor parallelism splits the model evenly, so the largest model that fits is bounded by `min(node_memory) * world_size` of memory, even when the cluster has plenty of total RAM headroom.

This change adds asymmetric tensor parallelism: a model-shard layout that splits parameters proportionally to per-node available memory rather than evenly. It currently supports two-node placements of the Qwen3.5 family, which is enough to unlock big-model launches on common heterogeneous pairs (e.g. fitting Qwen3.5-235B on a 128GB + 192GB pair where equal TP does not).

Note: This PR is based on #2063 (Thunderbolt 5 + RDMA placement), since the asymmetric placement code builds on the LAN-priority host-IP scoring introduced there. Once #2063 lands, this PR will rebase cleanly onto `main`.

Changes

Core feature

  • `shared/types/worker/shards.py`: add `Sharding.AsymmetricTensor` to the enum and a new `AsymmetricShardMetadata` shape carrying the per-rank parameter share.
  • `worker/engines/mlx/asymmetric_parallel.py` (new, 376 lines): implements the asymmetric-split kernel and load path on top of MLX's distributed primitives. Splits attention heads and FFN dims by ratio rather than evenly.
  • `worker/engines/mlx/utils_mlx.py`: route to the asymmetric loader when the sharding is `Sharding.AsymmetricTensor`.
  • `master/placement.py`: add `_supports_asymmetric_tensor_parallel` (currently Qwen3.5 family) and an opt-in auto-upgrade path that promotes `Sharding.Tensor` → `Sharding.AsymmetricTensor` when equal split won't fit on the smallest node but asymmetric split would, gated by `EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE`.
  • `master/placement.py` + `master/placement_utils.py` (commit 4): RDMA host selection refinement so asymmetric rank-0 stays reachable from rank-1 even under interface re-ordering.

Dashboard

  • `dashboard/src/lib/components/ChatSidebar.svelte` + `dashboard/src/routes/+page.svelte` + `stores/app.svelte.ts` (commit 2): label asymmetric instances distinctly so users can tell at a glance which model is running asymmetric.
  • `dashboard/src/lib/components/TopologyGraph.svelte` (commit 3): visualize the per-node asymmetric model share (e.g. "43% / 57%") on the topology graph edges.

Tests

  • `worker/tests/unittests/test_mlx/test_asymmetric_parallel.py` (119 lines): unit tests for the split-by-ratio math and the asymmetric load path stubs.
  • `master/tests/test_placement.py` (~300 new lines): cover `_supports_asymmetric_tensor_parallel`, the auto-upgrade decision (Qwen3.5 only, equal-TP-too-tight gate, 85%-fits-on-total gate), the 2-node restriction, and the asymmetric rank-0 reachability check.

Why It Works

Tensor parallelism splits the per-layer weight matrices column-wise across the ranks in the world. As long as the per-rank `hidden_size` is divisible by the world's split factor (and KV heads divide cleanly when the model is not MQA), the math is correct whether the split is equal or unequal. The existing `Sharding.Tensor` filter in placement already enforces `hidden_size % world == 0`; the asymmetric path adds the per-rank `hidden_size_share[i]` and verifies head divisibility per rank.
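
As a rough illustration of the split-by-ratio idea, the sketch below divides a head count between ranks in proportion to available memory while keeping every share a whole multiple of a chosen granularity. The helper `split_by_ratio`, its parameters, and the example head counts are assumptions made for illustration; they are not taken from `asymmetric_parallel.py`.

```python
from typing import List, Sequence


def split_by_ratio(total: int, weights: Sequence[float], granularity: int = 1) -> List[int]:
    """Split `total` units across ranks in proportion to `weights`.

    Every share is a positive multiple of `granularity` and the shares sum
    to `total`. Hypothetical helper, not the PR's actual implementation.
    """
    if granularity <= 0 or total % granularity != 0:
        raise ValueError("total must be a multiple of a positive granularity")
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be non-empty and positive")

    blocks = total // granularity
    weight_sum = float(sum(weights))
    ideal = [blocks * w / weight_sum for w in weights]

    # Largest-remainder rounding: floor every ideal share, then give the
    # leftover blocks to the ranks with the largest fractional parts.
    shares = [int(x) for x in ideal]
    leftover = blocks - sum(shares)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: ideal[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1

    if min(shares) == 0:
        raise ValueError("some rank would receive an empty shard")
    return [s * granularity for s in shares]


# Assumed example: 8 KV heads, 64 query heads (8 per KV group), and two
# nodes with 128 GB and 192 GB of available memory.
kv_heads = split_by_ratio(8, [128, 192])                 # [3, 5]
q_heads = split_by_ratio(64, [128, 192], granularity=8)  # [24, 40]
```

Splitting query heads with a granularity equal to the number of query heads per KV group keeps each rank's query heads aligned with the KV heads it owns, which is one way to satisfy the per-rank head-divisibility condition described above.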

The auto-upgrade gate is conservative: it triggers only if the equal split would use >90% of the smallest node's available RAM AND the asymmetric split fits in <85% of total cluster RAM (so we leave room for KV cache and forward-pass activations). This means the auto-upgrade kicks in for the cases where the current behavior would OOM, and stays out of the way otherwise.
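
The two thresholds can be written down as a small predicate. This is a minimal sketch under stated assumptions: the function name `should_auto_upgrade`, its parameters, and the byte-based inputs are hypothetical, and the real check in `master/placement.py` may measure model and node memory differently. The `enabled` flag stands in for `EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE`.

```python
from typing import Sequence

GIB = 1024 ** 3


def should_auto_upgrade(
    model_bytes: int,
    node_available_bytes: Sequence[int],
    *,
    enabled: bool = True,
    equal_tight_fraction: float = 0.90,
    total_fit_fraction: float = 0.85,
) -> bool:
    """Return True when Tensor should be promoted to AsymmetricTensor.

    Hypothetical sketch of the gate described in the PR text.
    """
    # Opt-in only, and limited to two-node placements in this PR.
    if not enabled or len(node_available_bytes) != 2:
        return False

    # Gate 1: the equal split is too tight on the smallest node.
    equal_share = model_bytes / len(node_available_bytes)
    equal_too_tight = equal_share > equal_tight_fraction * min(node_available_bytes)

    # Gate 2: the whole model still leaves headroom on the cluster total.
    fits_on_total = model_bytes < total_fit_fraction * sum(node_available_bytes)

    return equal_too_tight and fits_on_total


# Assumed numbers: a 120 GiB model on nodes with 48 GiB and 128 GiB free.
upgrade = should_auto_upgrade(120 * GIB, [48 * GIB, 128 * GIB])  # True
```

With these assumed numbers, the equal share of 60 GiB exceeds 90% of the 48 GiB node, while 120 GiB stays below 85% of the 176 GiB total, so both gates pass.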

The 2-node restriction is current-PR scope; the math generalizes, but the load path and tests are limited to N=2 for this round.

Test Plan

Automated Testing

```
src/exo/master/tests/test_master.py . [ 1%]
src/exo/master/tests/test_placement.py .......................... [ 42%]
src/exo/master/tests/test_placement_utils.py ....................... [ 79%]
src/exo/master/tests/test_topology.py ..... [ 87%]
src/exo/worker/tests/unittests/test_mlx/test_asymmetric_parallel.py ........ [100%]
=== 63 passed in 0.11s ===
```

`uv run basedpyright` and `uv run ruff check` both clean.

Manual Testing

Hardware: 2-node pair (M5 Max 128GB + M5 Max 48GB) over Thunderbolt 5 RDMA.

  • Loaded `mlx-community/Qwen3.5-235B-A10B-4bit` (which does not fit on either node alone and, under equal TP, does not fit on the smaller node). With `EXO_ENABLE_ASYMMETRIC_TP_AUTO_UPGRADE=1` set, the master auto-upgrades to AsymmetricTensor and the model loads with a ~43%/57% split.
  • Inference output matches the un-sharded reference output for a fixed seed (verified by sampling 50 prompts and comparing first-100-token sequences).
  • Dashboard shows the asymmetric labels and the per-node model share on the topology graph edges.

Notes for reviewers

  • The 2-node-only restriction is enforced in placement (`len(cycle) == 2`); no need to special-case in the load path.
  • Auto-upgrade is opt-in via env var so existing users see no behavior change.
  • Internal model-card filter (`_supports_asymmetric_tensor_parallel`) is conservative and easy to extend; happy to add Llama / Gemma / DeepSeek variants in follow-ups.

jw-wcv and others added 9 commits May 10, 2026 01:52
`system_profiler SPThunderboltDataType -json` represents transparent TB
hubs (e.g. the iVANKY Fusiondock Ultra) as an extra layer in `_items`,
nesting the peer Mac one level deeper than a direct cable would. The
previous parser only inspected the top-level `_items`, so on the side
that enumerated through the dock the peer's `domain_uuid_key` was never
found and the corresponding RDMA edge silently went missing - producing
an asymmetric mesh where placement could see the link in one direction
but not the other.

Extend `_ConnectivityItem` to recursively model `_items` and walk the
tree depth-first for the first descendant `domain_uuid_key`. The link is
the actual peer endpoint regardless of how many transparent switches sit
between the local receptacle and the peer Mac.

Cover the new behaviour with three deterministic unit tests: the iVANKY
hub case observed in production, a direct-cable sanity case, and an
empty-receptacle case.
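
The depth-first walk described in this commit could look roughly like the following. It is a sketch on plain dictionaries rather than the `_ConnectivityItem` model; only the `_items` and `domain_uuid_key` keys come from the commit message, while the function name and the sample payloads (including the `_name` field and UUID strings) are made up for illustration.

```python
from typing import Any, Dict, Optional


def first_descendant_domain_uuid(item: Dict[str, Any]) -> Optional[str]:
    """Return the first `domain_uuid_key` found below `item`, depth-first.

    The item's own key is ignored on purpose: only descendants count as
    the peer endpoint. Hypothetical helper for illustration.
    """
    for child in item.get("_items", []):
        uuid = child.get("domain_uuid_key")
        if isinstance(uuid, str) and uuid:
            return uuid
        # No key on this child (e.g. a transparent hub): look deeper.
        nested = first_descendant_domain_uuid(child)
        if nested is not None:
            return nested
    return None


# Illustrative payloads, not real system_profiler output.
direct_cable = {"domain_uuid_key": "LOCAL", "_items": [{"domain_uuid_key": "PEER"}]}
through_hub = {
    "domain_uuid_key": "LOCAL",
    "_items": [{"_name": "hub", "_items": [{"domain_uuid_key": "PEER"}]}],
}
empty_receptacle = {"domain_uuid_key": "LOCAL"}

peer_direct = first_descendant_domain_uuid(direct_cable)
peer_via_hub = first_descendant_domain_uuid(through_hub)
peer_missing = first_descendant_domain_uuid(empty_receptacle)
```

The three sample payloads mirror the three cases the commit's unit tests are said to cover: a hub in between, a direct cable, and an empty receptacle.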

Retry the JACCL initialization handshake when peers come online out of order.
We frequently observed this when a 4-node Apple Silicon cluster booted on a
Thunderbolt RDMA fabric: the master would attempt the JACCL collective
before all worker engines had registered, yielding a single-shot failure
that took the whole inference path with it.

Adds a small retry loop with backoff in mlx utils so the master will reattempt
init for ~30s before giving up.
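
A retry loop with backoff of the kind described here might be sketched as follows. The function name `init_with_retry`, the delay values, and the choice of `RuntimeError` as the retryable failure are assumptions; the commit only states that init is reattempted for roughly 30 seconds. The `sleep` and `clock` parameters exist so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def init_with_retry(
    init: Callable[[], T],
    *,
    deadline_s: float = 30.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call `init` until it succeeds or the deadline would be exceeded.

    Hypothetical sketch; the real loop in utils_mlx may differ.
    """
    start = clock()
    delay = initial_delay_s
    while True:
        try:
            return init()
        except RuntimeError:
            # Give up (re-raising the last error) if another wait would
            # push us past the deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            # Exponential backoff, capped so late joiners are still noticed.
            delay = min(delay * 2, max_delay_s)
```

A caller would pass the handshake as a zero-argument callable, for example `group = init_with_retry(lambda: connect())`, where `connect` is a placeholder for whatever performs the distributed init.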

Only treat a Thunderbolt link as TB5 when the link speed is >40 Gb/s
(TB5 negotiated speeds are 80 Gb/s symmetric or 120/40 Gb/s asymmetric).
Without this, the dashboard was prompting users to enable RDMA on
TB4-only nodes that can't actually run rdma_ctl in TB5 mode.

- placement_utils._is_candidate_host_ip / _address_priority: SIM103 + PIE810
  ruff fixes (no behavior change).
- utils_mlx: scope a # pyright: ignore[reportPossiblyUnboundVariable] on the
  JACCL retry-loop return; group is always bound when the loop exits with
  break, but the static analyzer can't reason across the raise-on-final-attempt
  branch.

Keep asymmetric tensor rank 0 constrained to the largest socket-reachable node and avoid auto-upgrading tensor placements that cannot satisfy the two-node asymmetric constraints.

@team-wcv team-wcv force-pushed the feature/asymmetric-tp-on-rdma branch from da1ae21 to 201ade3 Compare May 10, 2026 09:00