fix: prefer 192.168.100.x subnet for TB4 ring — explicit priority 0 beats bridge0 aliases#2099

Open
mpuodziukas-labs wants to merge 3 commits into
exo-explore:mainfrom
mpuodziukas-labs:fix/tb4-ring-ip-priority-192-168-100

@mpuodziukas-labs

Problem

On macOS with a direct Thunderbolt 4 link between two Apple Silicon nodes, ring TCP connections fail with EHOSTUNREACH (errno 65) even though the peer is pingable.

Root cause: The _find_ip_prioritised function in placement_utils.py selects ring IPs by interface type. TB4 creates two IP aliases on the same physical interface:

  • 192.168.100.x — the dedicated raw-TB P2P subnet, assigned to en1 directly
  • 192.168.2.x — a compatibility alias, also on en1 but routed via bridge0

Both IPs have interface type maybe_ethernet, so they tie in the priority sort. When the bridge0 alias wins the tiebreak, TCP connections from spawned ring subprocesses get EHOSTUNREACH: macOS routes subprocess traffic differently from the parent process's traffic, and the bridge route doesn't resolve for the subprocesses.
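The tie can be illustrated with a minimal sketch. The priority table, the IP-to-type mapping, and the helper names below are hypothetical stand-ins, not the actual contents of placement_utils.py. The point is only that a key based on interface type alone cannot separate the two aliases, so the winner depends on input order.

```python
# Hypothetical priority table and IP-to-type mapping, for illustration only.
priority = {"maybe_ethernet": 1, "wifi": 2}
ip_to_type = {
    "192.168.100.2": "maybe_ethernet",  # raw TB4 P2P alias
    "192.168.2.2": "maybe_ethernet",    # alias routed via bridge0
}


def type_only_key(ip: str) -> int:
    # Rank purely by interface type; unknown types sort last.
    return priority.get(ip_to_type.get(ip, "unknown"), 4)


def pick(ips: list[str]) -> str:
    # min() returns the first item with the smallest key,
    # so tied candidates are resolved by input order alone.
    return min(ips, key=type_only_key)


first = pick(["192.168.100.2", "192.168.2.2"])
second = pick(["192.168.2.2", "192.168.100.2"])
```

With identical keys, `first` and `second` differ only because the candidates were listed in a different order.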

Fix

Add an explicit prefix check for 192.168.100.0/24 (the standard TB4 P2P subnet) and assign it priority 0, so it is always the first choice for ring connections. All other interface types shift up by 1.

def _ring_ip_key(ip: str) -> int:
    # 192.168.100.0/24 is the dedicated raw-TB4 subnet. macOS classifies
    # it as maybe_ethernet via the en1 bridge, but ring TCP reliably
    # works here (verified <0.5ms RTT). Give it explicit priority 0 so it
    # always beats bridge0 aliases (192.168.2.x) which cause EHOSTUNREACH
    # in spawned ring subprocesses.
    if ip.startswith("192.168.100."):
        return 0
    return priority.get(ip_to_type.get(ip, "unknown"), 5)
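A self-contained sketch of the same idea follows, so the ordering can be checked in isolation. The priority table, IP-to-type mapping, and candidate addresses are hypothetical. The subnet test uses the standard-library `ipaddress` module instead of the `startswith` check in the PR; that is a substitution made for this sketch, not what the change itself does.

```python
import ipaddress

# The dedicated TB4 P2P subnet named in the PR description.
TB4_SUBNET = ipaddress.ip_network("192.168.100.0/24")

# Hypothetical values, already shifted up by 1 to leave room for priority 0.
priority = {"maybe_ethernet": 1, "wifi": 2}
ip_to_type = {
    "192.168.100.2": "maybe_ethernet",
    "192.168.2.2": "maybe_ethernet",
    "10.0.0.7": "wifi",
}


def ring_ip_key(ip: str) -> int:
    # Any address inside the TB4 subnet gets priority 0, ahead of every type.
    if ipaddress.ip_address(ip) in TB4_SUBNET:
        return 0
    # Otherwise fall back to the type-based priority; unknown types sort last.
    return priority.get(ip_to_type.get(ip, "unknown"), 5)


candidates = ["192.168.2.2", "10.0.0.7", "192.168.100.2"]
ordered = sorted(candidates, key=ring_ip_key)
```

One trade-off of this variant: `ipaddress.ip_address` raises `ValueError` on a malformed string, whereas a plain prefix check silently falls through to the type-based priority.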

Verified

  • M1 Max ↔ M4 Mini over TB4, 192.168.100.1/2 subnet
  • Ring TCP connections succeed (previously failing with errno 65)
  • Inference working: mlx-community/Qwen3-0.6B-4bit → single-node and ring mode

Also included

  • fix: default backends to [MlxMetal]. ModelCard.backends has been required since #2071 (Add node backends to model cards), but old TOML cards don't have the field. Defaulting to [MlxMetal] prevents a ValidationError on startup when loading custom model cards.
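The defaulting behaviour can be sketched on a plain dictionary standing in for a parsed TOML card. This is not the PR's implementation: the helper name, the `model_id` key, the string form of the backend, and the `SomeOtherBackend` value are all assumptions made for illustration.

```python
def with_default_backends(card: dict) -> dict:
    """Return a copy of a parsed model card with `backends` filled in if absent."""
    patched = dict(card)
    # Only legacy cards lack the field; an explicit value is left untouched.
    patched.setdefault("backends", ["MlxMetal"])
    return patched


# Hypothetical legacy card written before the field became required.
legacy = {"model_id": "mlx-community/Qwen3-0.6B-4bit"}
# Hypothetical newer card that already declares a backend.
modern = {"model_id": "mlx-community/Qwen3-0.6B-4bit", "backends": ["SomeOtherBackend"]}

patched_legacy = with_default_backends(legacy)
patched_modern = with_default_backends(modern)
```

Applying the default before validation means legacy cards load unchanged, while cards that already set `backends` keep their own value.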

mpuodziukas and others added 3 commits May 18, 2026 09:03
…eats bridge0 aliases

192.168.100.0/24 is the dedicated raw-TB4 P2P subnet (en1 direct, <0.5ms RTT).
macOS classifies it as maybe_ethernet which tied with LAN, causing bridge0 routes
(192.168.2.x) to win via tiebreak and produce EHOSTUNREACH in ring subprocesses.
Explicit prefix check gives TB4 subnet priority 0, all other types shift up by 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…acy TOML cards missing backends field

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>