feat: Add Native CUDA Support for Linux GPU Inference by Winston-9527 · Pull Request #2103 · exo-explore/exo

Winston-9527 · 2026-05-22T05:03:55Z

feat: Add Native CUDA Support for Linux GPU Inference

Summary

This PR enables exo to run on NVIDIA GPUs on Linux by fixing Metal-specific assumptions in the MLX inference path. It is the native-code complement to PR #2053 (Docker support): it provides Linux CUDA deployment without requiring Docker.

Key Achievement: exo now supports heterogeneous clusters with both macOS (Metal) and Linux (CUDA) nodes working together.

Problem Statement

Currently, exo on Linux only runs on CPU. When using mlx-cuda, several issues prevent GPU inference:

Metal-only wired limit: set_wired_limit_for_model() assumes Metal is always available and crashes on CUDA
Metal Fast Synch: MLX_METAL_FAST_SYNCH is unconditionally set, causing warnings on Linux
MLX API mismatch: mlx-lm expects mx.new_thread_local_stream which doesn't exist in Linux CUDA MLX
CUDA library loading: Runner subprocess can't find libcublasLt.so.13 even when parent process has it in LD_LIBRARY_PATH

Solution Overview

Changes Made

1. CUDA Compatibility Patches (`src/exo/worker/engines/mlx/patches/cuda_compat.py`)

New file providing API shims for Linux CUDA MLX
Patches mx.new_thread_local_stream → mx.new_stream (Linux CUDA exposes different API)
Only applied on non-Darwin platforms, no-op on macOS

2. Platform-gated Metal Settings (`src/exo/worker/runner/bootstrap.py`)

MLX_METAL_FAST_SYNCH only set on macOS (sys.platform == "darwin")
On Linux, logs "skipped (non-Darwin platform)" instead of setting Metal env var
Added automatic LD_LIBRARY_PATH setup for runner subprocess to find CUDA libraries
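
The bootstrap diff quoted in this PR covers only the fast-synch gating, so the `LD_LIBRARY_PATH` change itself is not visible here. As a rough sketch of the idea (not the PR's code), a helper along the following lines could build the environment for the runner subprocess. The name `build_runner_env` and its parameters are hypothetical, and the directories are supplied by the caller.

```python
import os
from collections.abc import Mapping, Sequence


def build_runner_env(
    base_env: Mapping[str, str],
    cuda_lib_dirs: Sequence[str],
) -> dict[str, str]:
    """Return a copy of base_env with cuda_lib_dirs prepended to LD_LIBRARY_PATH.

    Hypothetical helper for illustration only; it is not part of exo.
    """
    env = dict(base_env)
    existing = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    # Put the CUDA directories first, keep the original order, drop duplicates.
    merged: list[str] = []
    for path in [*cuda_lib_dirs, *existing]:
        if path not in merged:
            merged.append(path)
    if merged:
        env["LD_LIBRARY_PATH"] = os.pathsep.join(merged)
    return env
```

The result would be passed as `env=` when the runner process is started. The dynamic loader reads `LD_LIBRARY_PATH` when a process starts, which is why the variable has to be present in the child's environment at launch.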

3. Backend-aware Wired Limit (`src/exo/worker/engines/mlx/utils_mlx.py`)

set_wired_limit_for_model() now checks mx.metal.is_available() first
If Metal unavailable but CUDA available, logs "CUDA backend active — skipping Metal wired limit"
No-op on CPU-only systems
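
The three cases above amount to a small decision function. The sketch below models only that ordering; it is not the code in `utils_mlx.py`. In exo the first flag would come from `mx.metal.is_available()`, as stated above. How CUDA availability is detected is not shown in this PR, so it is passed in as a plain boolean, and the names `WiredLimitAction` and `choose_wired_limit_action` are hypothetical.

```python
from enum import Enum


class WiredLimitAction(Enum):
    SET_METAL_LIMIT = "set_metal_limit"  # Metal present: apply the wired limit
    SKIP_CUDA = "skip_cuda"              # CUDA present: log and skip
    NOOP_CPU = "noop_cpu"                # neither: do nothing


def choose_wired_limit_action(
    metal_available: bool, cuda_available: bool
) -> WiredLimitAction:
    """Pick the wired-limit behaviour for the active backend."""
    # Metal is checked first, matching the order described in the PR.
    if metal_available:
        return WiredLimitAction.SET_METAL_LIMIT
    if cuda_available:
        return WiredLimitAction.SKIP_CUDA
    return WiredLimitAction.NOOP_CPU
```

Keeping the decision separate from the MLX calls makes it testable on machines that have neither backend.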

4. Patch Registration (`src/exo/worker/engines/mlx/patches/__init__.py`)

Registers apply_cuda_compat_patches() in the patch application chain

Compatibility Impact

Zero breaking changes:

All changes are additive or platform-gated
macOS behavior unchanged (Metal path still works exactly as before)
CPU-only Linux still works (CUDA checks are conditional)
Existing model cards already declare MlxCuda backend support

Testing Evidence

Environment

Linux Node: NVIDIA GeForce RTX 3090, CUDA 13.1, Driver 590.48.01
macOS Node: MacBook Pro M5 Pro (Metal)
Network: Local network + Tailscale (100.85.176.124)

Test Results

| Test | Result | Evidence |
| --- | --- | --- |
| MLX basic GPU operations | ✅ Pass | Unit tests |
| MLX unit tests (16 tests) | ✅ All pass | pytest |
| exo service startup | ✅ Success | Logs |
| Qwen3-0.6B-8bit inference (Linux CUDA) | ✅ GPU accelerated | Screenshot 1 |
| Qwen3-0.6B-8bit inference (Mac Metal) | ✅ GPU accelerated | Screenshot 2 |
| Cross-platform cluster (Linux + macOS) | ✅ Working | Screenshot 3 |
| Multi-model backend support | ✅ All models show MlxCuda backend | API response |

Performance Benchmarks

| Configuration | TTFT | TPS | Backend |
| --- | --- | --- | --- |
| Linux RTX 3090 (Single) | 919 ms | 188.6 tok/s | MLX Ring (CUDA) |
| Mac M5 Pro (Single) | 418 ms | 179.7 tok/s | MLX Ring (Metal) |
| Mac + Linux (Joint) | 920 ms | 69.3 tok/s | MLX Ring (Metal + CUDA) |

Performance Analysis:

Single-device throughput is similar on both platforms (188.6 tok/s on the RTX 3090, 179.7 tok/s on the M5 Pro), although TTFT on Linux is about twice the Mac's (919 ms vs. 418 ms).
Cross-platform inference is slower, as expected, because of:
1. Network communication overhead (Tailscale VPN)
2. Pipeline parallelism inefficiency with small models (0.6B)
3. Heterogeneous device speed mismatch
This is typical for distributed inference of small models; larger models (70B+) are expected to scale better, but that was not measured here.
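
To put the degradation in numbers, the short calculation below uses only the figures from the benchmark table. The helper name `relative_throughput` is just for illustration.

```python
# Figures copied from the benchmark table (Qwen3-0.6B-8bit).
single_tps = {"Linux RTX 3090": 188.6, "Mac M5 Pro": 179.7}
joint_tps = 69.3


def relative_throughput(joint: float, single: float) -> float:
    """Joint throughput as a fraction of one device's single-node throughput."""
    return joint / single


for name, tps in single_tps.items():
    ratio = relative_throughput(joint_tps, tps)
    print(f"{name}: joint run reaches {ratio:.1%} of single-device TPS")
```

The joint configuration reaches roughly 37% of the RTX 3090's single-device throughput and roughly 39% of the M5 Pro's, so for this model either node alone is faster than the pair.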

Screenshots

Screenshot 1: Linux CUDA Single-Device

- Model: Qwen3-0.6B-8bit
- Backend: MLX Ring (CUDA)
- Performance: 188.6 tok/s

Screenshot 2: macOS Metal Single-Device

- Model: Qwen3-0.6B-8bit
- Backend: MLX Ring (Metal)
- Performance: 179.7 tok/s

Screenshot 3: Cross-Platform Cluster (macOS + Linux)

- Model: Qwen3-0.6B-8bit
- Backend: MLX Ring (Metal + CUDA)
- Nodes: MacBook Pro + Linux RTX 3090
- Performance: 69.3 tok/s (network-bound, expected for small model)

Usage

Prerequisites

Install mlx-cuda: uv sync --extra mlx-cuda13 (or mlx-cuda12)
Ensure CUDA libraries are available (ollama bundled libs work)
Set CUDA_HOME for kernel compilation

Launch

export CUDA_HOME=/path/to/nvidia/cu13
export LD_LIBRARY_PATH=/usr/local/lib/ollama/mlx_cuda_v13:/usr/local/lib/ollama/cuda_v13:$LD_LIBRARY_PATH
LD_PRELOAD=/path/to/libmlx.so uv run exo -v
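
Before launching, it can help to confirm that the library named in the problem statement (`libcublasLt.so.13`) is actually reachable through `LD_LIBRARY_PATH`. The following is a minimal preflight sketch, not part of exo. The function name `find_shared_library` is hypothetical, and it inspects only the directory list it is given, not the `ldconfig` cache or any rpath.

```python
from __future__ import annotations

import os
from pathlib import Path


def find_shared_library(name: str, search_path: str) -> Path | None:
    """Return the first file called `name` in an os.pathsep-separated directory list."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_shared_library("libcublasLt.so.13", os.environ.get("LD_LIBRARY_PATH", ""))` returns the matching path, or `None` if no listed directory contains the file.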

Checklist

Changes are minimal and focused (4 files, +49/-27 lines)
macOS compatibility preserved
No breaking changes
Lint checks pass (uv run ruff check)
Tested on real hardware (RTX 3090 + M5 Pro)
Cross-platform cluster verified
Screenshots provided as evidence

This commit enables exo to run on NVIDIA GPUs on Linux by fixing Metal-specific assumptions in the MLX inference path.

Changes:
- Add CUDA compatibility shim for mlx-lm's new_thread_local_stream API (Linux CUDA MLX exposes new_stream instead)
- Gate MLX_METAL_FAST_SYNCH env var to macOS only, preventing warnings on Linux
- Make set_wired_limit_for_model handle CUDA backends gracefully by checking mx.metal.is_available() first
- Add automatic LD_LIBRARY_PATH setup in runner bootstrap for CUDA libraries (libcublasLt.so.13, etc.)

Compatibility:
- Zero breaking changes: all modifications are platform-gated
- macOS Metal path unchanged
- CPU-only Linux still works
- Enables heterogeneous clusters (macOS Metal + Linux CUDA)

Tested on:
- Linux: NVIDIA RTX 3090, CUDA 13.1, Driver 590.48.01
- macOS: MacBook Pro M5 Pro (Metal)
- Verified cross-platform cluster inference with Qwen3-0.6B-8bit

Refs: PR exo-explore#2053 (Docker support). This commit provides the native code changes needed for Linux CUDA deployment without requiring Docker.

Copilot

Pull request overview

This PR aims to enable MLX-based GPU inference on Linux NVIDIA (via mlx-cuda) by removing Metal-only assumptions in the MLX runner path and introducing a small CUDA-compatibility shim so heterogeneous macOS (Metal) + Linux (CUDA) clusters can run together.

Changes:

Gate MLX_METAL_FAST_SYNCH configuration to macOS-only in the runner bootstrap.
Adjust MLX wired-limit behavior to be backend-aware (Metal vs CUDA) in utils_mlx.py.
Add and register a CUDA compatibility patch that shims mx.new_thread_local_stream via mx.new_stream on non-Darwin platforms.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/exo/worker/runner/bootstrap.py` | Platform-gates Metal-specific env var (`MLX_METAL_FAST_SYNCH`) and adjusts logging on non-Darwin. |
| `src/exo/worker/engines/mlx/utils_mlx.py` | Makes wired-limit handling conditional on Metal availability and logs when CUDA is active. |
| `src/exo/worker/engines/mlx/patches/cuda_compat.py` | Adds CUDA-side API shim for `mlx-lm` stream creation expectations. |
| `src/exo/worker/engines/mlx/patches/__init__.py` | Registers the new CUDA compat patch in the MLX patch chain. |

    fast_synch_override = os.environ.get("EXO_FAST_SYNCH")
-    if fast_synch_override == "false":
-        os.environ["MLX_METAL_FAST_SYNCH"] = "0"
+    if sys.platform == "darwin":
+        if fast_synch_override == "false":
+            os.environ["MLX_METAL_FAST_SYNCH"] = "0"
+        else:
+            os.environ["MLX_METAL_FAST_SYNCH"] = "1"
+        logger.info(f"Fast synch flag: {os.environ['MLX_METAL_FAST_SYNCH']}")
    else:
-        os.environ["MLX_METAL_FAST_SYNCH"] = "1"
-
-    logger.info(f"Fast synch flag: {os.environ['MLX_METAL_FAST_SYNCH']}")
+        logger.info("Fast synch flag: skipped (non-Darwin platform)")


+def apply_cuda_compat_patches() -> None:
+    """Apply MLX CUDA compatibility patches.
+
+    These patches are only applied on Linux systems where MLX uses the CUDA backend.
+    They are no-ops on macOS or CPU-only Linux.
+    """
+    if sys.platform == "darwin":
+        return
+
+    # mlx-lm expects new_thread_local_stream, but Linux CUDA MLX exposes new_stream.
+    # Patch mx to provide the expected API.
+    if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
+        mx.new_thread_local_stream = mx.new_stream  # type: ignore[attr-defined]
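
The shim's behaviour can be exercised without MLX installed by running the same check against a stand-in object. This is only a sketch: `apply_shim` and `fake_mx` are hypothetical names, and `fake_mx` stands in for the real `mlx.core` module.

```python
import sys
from types import SimpleNamespace
from typing import Any


def apply_shim(mx_like: Any, platform: str = sys.platform) -> bool:
    """Alias new_thread_local_stream to new_stream when only the latter exists.

    Returns True if the alias was installed on this call.
    """
    if platform == "darwin":
        return False
    if not hasattr(mx_like, "new_thread_local_stream") and hasattr(mx_like, "new_stream"):
        mx_like.new_thread_local_stream = mx_like.new_stream
        return True
    return False


# Stand-in for mlx.core that exposes only new_stream.
fake_mx = SimpleNamespace(new_stream=lambda device: f"stream-on-{device}")

assert apply_shim(fake_mx, platform="linux") is True
assert fake_mx.new_thread_local_stream("gpu") == "stream-on-gpu"
assert apply_shim(fake_mx, platform="linux") is False  # second call changes nothing
```

The alias only satisfies the attribute lookup. Whether `new_stream` behaves the same as the API that mlx-lm expects is not something this sketch establishes.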


Winston-9527 · 2026-05-22T05:22:34Z

Acknowledgments

This PR builds upon the MLX CUDA backend recently introduced by Apple's MLX team. Special thanks to Cheng (@zcbenz) and the Apple MLX team for their extensive work on bringing CUDA support to MLX, which made this cross-platform GPU inference possible.

What This PR Adds

While MLX CUDA backend provides the foundation for running MLX on NVIDIA GPUs, exo (the distributed inference framework) had several Metal-specific assumptions that prevented it from working with MLX CUDA. This PR fixes those assumptions, enabling:

✅ Linux nodes with NVIDIA GPUs to join exo clusters
✅ Heterogeneous clusters (macOS Metal + Linux CUDA)
✅ Zero breaking changes to existing macOS users

The MLX CUDA backend is the engine; this PR is the adapter that makes exo work with that engine on Linux.

…#2103) Cherry-picked from Winston-9527/feature/linux-cuda-support (05e9811) Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering) Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 22, 2026 05:03

Copilot started reviewing on behalf of Winston-9527 May 22, 2026 05:04

Copilot AI reviewed May 22, 2026

Juliuscply mentioned this pull request Jun 1, 2026

feat: 3-node heterogeneous cluster support (Metal + CUDA) #2129

Winston-9527 wants to merge 1 commit into exo-explore:main from Winston-9527:feature/linux-cuda-support
