feat: Add Native CUDA Support for Linux GPU Inference #2103

Open. Winston-9527 wants to merge 1 commit into exo-explore:main from Winston-9527:feature/linux-cuda-support

Winston-9527 commented May 22, 2026

feat: Add Native CUDA Support for Linux GPU Inference

Summary

This PR enables exo to run on NVIDIA GPUs on Linux by fixing Metal-specific assumptions in the MLX inference path. This is the native code complement to PR #2053 (Docker support) — providing Linux CUDA deployment without requiring Docker.

Key Achievement: exo now supports heterogeneous clusters with both macOS (Metal) and Linux (CUDA) nodes working together.

Problem Statement

Currently, exo on Linux only runs on CPU. When using mlx-cuda, several issues prevent GPU inference:

  1. Metal-only wired limit: set_wired_limit_for_model() assumes Metal is always available and crashes on CUDA
  2. Metal Fast Synch: MLX_METAL_FAST_SYNCH is unconditionally set, causing warnings on Linux
  3. MLX API mismatch: mlx-lm expects mx.new_thread_local_stream which doesn't exist in Linux CUDA MLX
  4. CUDA library loading: Runner subprocess can't find libcublasLt.so.13 even when parent process has it in LD_LIBRARY_PATH

Solution Overview

Changes Made

1. CUDA Compatibility Patches (src/exo/worker/engines/mlx/patches/cuda_compat.py)

  • New file providing API shims for Linux CUDA MLX
  • Aliases mx.new_thread_local_stream to mx.new_stream (the Linux CUDA build of MLX exposes a different API)
  • Only applied on non-Darwin platforms, no-op on macOS
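
The guard behind this shim can be exercised without MLX installed. The sketch below applies the same `hasattr` check to a stand-in object and shows that a second call is a no-op. The function name `alias_missing_stream_api` and the `SimpleNamespace` stand-ins are hypothetical and exist only for illustration; they are not part of exo or MLX.

```python
import sys
from types import SimpleNamespace


def alias_missing_stream_api(mx_module, platform=sys.platform):
    """Alias new_thread_local_stream to new_stream when only the latter exists.

    Returns True if the alias was installed, False if nothing was changed.
    """
    if platform == "darwin":
        return False  # macOS: leave the module untouched
    if hasattr(mx_module, "new_thread_local_stream"):
        return False  # expected API already present
    if not hasattr(mx_module, "new_stream"):
        return False  # nothing available to alias to
    mx_module.new_thread_local_stream = mx_module.new_stream
    return True


# Stand-in for a build that only exposes new_stream (an assumption for this demo).
fake_mx = SimpleNamespace(new_stream=lambda device=None: ("stream", device))

first = alias_missing_stream_api(fake_mx, platform="linux")
second = alias_missing_stream_api(fake_mx, platform="linux")  # already patched
```

Taking the module as a parameter keeps the check testable on machines without a GPU. Whether the two stream functions behave identically under threading is not something this sketch verifies.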

2. Platform-gated Metal Settings (src/exo/worker/runner/bootstrap.py)

  • MLX_METAL_FAST_SYNCH only set on macOS (sys.platform == "darwin")
  • On Linux, logs "skipped (non-Darwin platform)" instead of setting Metal env var
  • Added automatic LD_LIBRARY_PATH setup for runner subprocess to find CUDA libraries
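
The PR text does not show how the runner's LD_LIBRARY_PATH is assembled, so the following is only a minimal sketch of one way to do it. `build_runner_env` is a hypothetical helper, not a function from exo. It copies an environment and prepends any CUDA library directories that exist on disk and are not already listed.

```python
import os


def build_runner_env(cuda_lib_dirs, base_env=None):
    """Return a copy of the environment with CUDA dirs prepended to LD_LIBRARY_PATH.

    Directories that do not exist, or that are already on the path, are skipped.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    extra = [d for d in cuda_lib_dirs if os.path.isdir(d) and d not in current]
    if extra:
        env["LD_LIBRARY_PATH"] = os.pathsep.join(extra + current)
    return env
```

The resulting dictionary could be passed as `env=` to `subprocess.Popen` so that the dynamic loader sees the path when the child process starts.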

3. Backend-aware Wired Limit (src/exo/worker/engines/mlx/utils_mlx.py)

  • set_wired_limit_for_model() now checks mx.metal.is_available() first
  • If Metal unavailable but CUDA available, logs "CUDA backend active — skipping Metal wired limit"
  • No-op on CPU-only systems
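
The three cases above reduce to a small decision function. This is a sketch of the branching only, with hypothetical names (`WiredLimitAction`, `decide_wired_limit_action`). In exo the first flag would come from mx.metal.is_available(); how the PR detects CUDA is not shown here, so the second flag is left as a plain boolean.

```python
from enum import Enum


class WiredLimitAction(Enum):
    SET_METAL_LIMIT = "set_metal_limit"  # Metal present: apply the wired limit
    SKIP_CUDA = "skip_cuda"              # CUDA present: log and skip
    SKIP_CPU = "skip_cpu"                # no GPU backend: do nothing


def decide_wired_limit_action(metal_available: bool, cuda_available: bool) -> WiredLimitAction:
    """Pick the wired-limit behaviour from backend availability flags."""
    if metal_available:
        return WiredLimitAction.SET_METAL_LIMIT
    if cuda_available:
        return WiredLimitAction.SKIP_CUDA
    return WiredLimitAction.SKIP_CPU
```

Checking Metal first means the macOS path is chosen whenever Metal is available, which matches the claim that existing behavior there is unchanged.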

4. Patch Registration (src/exo/worker/engines/mlx/patches/__init__.py)

  • Registers apply_cuda_compat_patches() in the patch application chain

Compatibility Impact

Zero breaking changes:

  • All changes are additive or platform-gated
  • macOS behavior unchanged (Metal path still works exactly as before)
  • CPU-only Linux still works (CUDA checks are conditional)
  • Existing model cards already declare MlxCuda backend support

Testing Evidence

Environment

  • Linux Node: NVIDIA GeForce RTX 3090, CUDA 13.1, Driver 590.48.01
  • macOS Node: MacBook Pro M5 Pro (Metal)
  • Network: Local network + Tailscale (100.85.176.124)

Test Results

| Test | Result | Evidence |
| --- | --- | --- |
| MLX basic GPU operations | ✅ Pass | Unit tests |
| MLX unit tests (16 tests) | ✅ All pass | pytest |
| exo service startup | ✅ Success | Logs |
| Qwen3-0.6B-8bit inference (Linux CUDA) | ✅ GPU accelerated | Screenshot 1 |
| Qwen3-0.6B-8bit inference (Mac Metal) | ✅ GPU accelerated | Screenshot 2 |
| Cross-platform cluster (Linux + macOS) | ✅ Working | Screenshot 3 |
| Multi-model backend support | ✅ All models show MlxCuda backend | API response |

Performance Benchmarks

| Configuration | TTFT | TPS | Backend |
| --- | --- | --- | --- |
| Linux RTX 3090 (Single) | 919 ms | 188.6 tok/s | MLX Ring (CUDA) |
| Mac M5 Pro (Single) | 418 ms | 179.7 tok/s | MLX Ring (Metal) |
| Mac + Linux (Joint) | 920 ms | 69.3 tok/s | MLX Ring (Metal + CUDA) |

Performance Analysis:

  • Single-device throughput is similar on both platforms (188.6 tok/s on the RTX 3090, 179.7 tok/s on the M5 Pro); TTFT on the RTX 3090 is about twice that of the M5 Pro (919 ms vs 418 ms)
  • Cross-platform throughput drops to 69.3 tok/s, which is expected given:
    1. Network communication overhead (Tailscale VPN)
    2. Pipeline parallelism inefficiency with small models (0.6B)
    3. Heterogeneous device speed mismatch
  • This is typical for distributed inference of small models; larger models (70B+) would be expected to scale better, though that was not measured here
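
As a rough back-of-the-envelope check on the benchmark figures, the per-token time can be split into compute and everything else. The sketch below assumes the model's layers are divided evenly between the two nodes and that a single request's decode steps run sequentially through both. Neither assumption is stated in the PR, so the result is an estimate, not a measurement.

```python
# Throughput figures from the benchmark table (tokens per second).
LINUX_TPS = 188.6
MAC_TPS = 179.7
JOINT_TPS = 69.3


def ms_per_token(tps: float) -> float:
    """Convert tokens per second into milliseconds per token."""
    return 1000.0 / tps


# Assumption: an even layer split, so each node contributes half of its
# single-device per-token time.
compute_ms = 0.5 * ms_per_token(LINUX_TPS) + 0.5 * ms_per_token(MAC_TPS)
joint_ms = ms_per_token(JOINT_TPS)

# Whatever is left over is attributed to network and coordination overhead.
overhead_ms = joint_ms - compute_ms
relative_to_best_single = JOINT_TPS / max(LINUX_TPS, MAC_TPS)

print(f"joint: {joint_ms:.2f} ms/token, compute: {compute_ms:.2f} ms/token")
print(f"estimated overhead: {overhead_ms:.2f} ms/token")
print(f"joint throughput vs best single node: {relative_to_best_single:.0%}")
```

Under these assumptions most of the joint per-token time is overhead, not compute, which is consistent with the network-bound explanation given above.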

Screenshots

Screenshot 1: Linux CUDA Single-Device

[Screenshot]

  • Model: Qwen3-0.6B-8bit
  • Backend: MLX Ring (CUDA)
  • Performance: 188.6 tok/s

Screenshot 2: macOS Metal Single-Device

[Screenshot]

  • Model: Qwen3-0.6B-8bit
  • Backend: MLX Ring (Metal)
  • Performance: 179.7 tok/s

Screenshot 3: Cross-Platform Cluster (macOS + Linux)

[Screenshot]

  • Model: Qwen3-0.6B-8bit
  • Backend: MLX Ring (Metal + CUDA)
  • Nodes: MacBook Pro + Linux RTX 3090
  • Performance: 69.3 tok/s (network-bound, expected for small model)

Usage

Prerequisites

  1. Install mlx-cuda: uv sync --extra mlx-cuda13 (or mlx-cuda12)
  2. Ensure CUDA libraries are available (ollama bundled libs work)
  3. Set CUDA_HOME for kernel compilation

Launch

export CUDA_HOME=/path/to/nvidia/cu13
export LD_LIBRARY_PATH=/usr/local/lib/ollama/mlx_cuda_v13:/usr/local/lib/ollama/cuda_v13:$LD_LIBRARY_PATH
LD_PRELOAD=/path/to/libmlx.so uv run exo -v
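
Before launching, it can help to confirm that the CUDA library named in the problem statement is reachable through LD_LIBRARY_PATH. The helper below is a hypothetical pre-flight check, not part of exo. It only inspects the directories listed in the variable and does not consult the system loader cache or default library directories.

```python
import os


def find_shared_library(name, search_path=None):
    """Return the first path to `name` found in an LD_LIBRARY_PATH-style string.

    Returns None when no listed directory contains the file.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    location = find_shared_library("libcublasLt.so.13")
    print(location or "libcublasLt.so.13 not found on LD_LIBRARY_PATH")
```

A None result does not prove the library is missing, since the loader may still find it elsewhere; it only means the explicit path entries do not contain it.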

Related

Checklist

  • Changes are minimal and focused (4 files, +49/-27 lines)
  • macOS compatibility preserved
  • No breaking changes
  • Lint checks pass (uv run ruff check)
  • Tested on real hardware (RTX 3090 + M5 Pro)
  • Cross-platform cluster verified
  • Screenshots provided as evidence

This commit enables exo to run on NVIDIA GPUs on Linux by fixing
Metal-specific assumptions in the MLX inference path.

Changes:
- Add CUDA compatibility shim for mlx-lm's new_thread_local_stream API
  (Linux CUDA MLX exposes new_stream instead)
- Gate MLX_METAL_FAST_SYNCH env var to macOS only, preventing warnings
  on Linux
- Make set_wired_limit_for_model handle CUDA backends gracefully by
  checking mx.metal.is_available() first
- Add automatic LD_LIBRARY_PATH setup in runner bootstrap for CUDA
  libraries (libcublasLt.so.13, etc.)

Compatibility:
- Zero breaking changes - all modifications are platform-gated
- macOS Metal path unchanged
- CPU-only Linux still works
- Enables heterogeneous clusters (macOS Metal + Linux CUDA)

Tested on:
- Linux: NVIDIA RTX 3090, CUDA 13.1, Driver 590.48.01
- macOS: MacBook Pro M5 Pro (Metal)
- Verified cross-platform cluster inference with Qwen3-0.6B-8bit

Refs: PR exo-explore#2053 (Docker support) - this provides the native code changes
needed for Linux CUDA deployment without requiring Docker.
Copilot AI review requested due to automatic review settings May 22, 2026 05:03

Copilot AI left a comment

Pull request overview

This PR aims to enable MLX-based GPU inference on Linux NVIDIA (via mlx-cuda) by removing Metal-only assumptions in the MLX runner path and introducing a small CUDA-compatibility shim so heterogeneous macOS (Metal) + Linux (CUDA) clusters can run together.

Changes:

  • Gate MLX_METAL_FAST_SYNCH configuration to macOS-only in the runner bootstrap.
  • Adjust MLX wired-limit behavior to be backend-aware (Metal vs CUDA) in utils_mlx.py.
  • Add and register a CUDA compatibility patch that shims mx.new_thread_local_stream via mx.new_stream on non-Darwin platforms.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/exo/worker/runner/bootstrap.py | Platform-gates Metal-specific env var (MLX_METAL_FAST_SYNCH) and adjusts logging on non-Darwin. |
| src/exo/worker/engines/mlx/utils_mlx.py | Makes wired-limit handling conditional on Metal availability and logs when CUDA is active. |
| src/exo/worker/engines/mlx/patches/cuda_compat.py | Adds CUDA-side API shim for mlx-lm stream creation expectations. |
| src/exo/worker/engines/mlx/patches/__init__.py | Registers the new CUDA compat patch in the MLX patch chain. |

Comment on lines 54 to +62 (src/exo/worker/runner/bootstrap.py)

  fast_synch_override = os.environ.get("EXO_FAST_SYNCH")
- if fast_synch_override == "false":
-     os.environ["MLX_METAL_FAST_SYNCH"] = "0"
+ if sys.platform == "darwin":
+     if fast_synch_override == "false":
+         os.environ["MLX_METAL_FAST_SYNCH"] = "0"
+     else:
+         os.environ["MLX_METAL_FAST_SYNCH"] = "1"
+     logger.info(f"Fast synch flag: {os.environ['MLX_METAL_FAST_SYNCH']}")
  else:
-     os.environ["MLX_METAL_FAST_SYNCH"] = "1"
-
- logger.info(f"Fast synch flag: {os.environ['MLX_METAL_FAST_SYNCH']}")
+     logger.info("Fast synch flag: skipped (non-Darwin platform)")

Comment on lines +12 to +24 (src/exo/worker/engines/mlx/patches/cuda_compat.py)

def apply_cuda_compat_patches() -> None:
    """Apply MLX CUDA compatibility patches.

    These patches are only applied on Linux systems where MLX uses the CUDA backend.
    They are no-ops on macOS or CPU-only Linux.
    """
    if sys.platform == "darwin":
        return

    # mlx-lm expects new_thread_local_stream, but Linux CUDA MLX exposes new_stream.
    # Patch mx to provide the expected API.
    if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
        mx.new_thread_local_stream = mx.new_stream  # type: ignore[attr-defined]


Winston-9527 commented May 22, 2026

Acknowledgments

This PR builds upon the MLX CUDA backend recently introduced by Apple's MLX team. Special thanks to Cheng (@zcbenz) and the Apple MLX team for their extensive work on bringing CUDA support to MLX, which made this cross-platform GPU inference possible.

What This PR Adds

While the MLX CUDA backend provides the foundation for running MLX on NVIDIA GPUs, exo (the distributed inference framework) had several Metal-specific assumptions that prevented it from working with that backend. This PR fixes those assumptions, enabling:

  • ✅ Linux nodes with NVIDIA GPUs to join exo clusters
  • ✅ Heterogeneous clusters (macOS Metal + Linux CUDA)
  • ✅ Zero breaking changes to existing macOS users

The MLX CUDA backend is the engine; this PR is the adapter that makes exo work with that engine on Linux.

Juliuscply pushed a commit to Juliuscply/exo that referenced this pull request May 31, 2026
…#2103)

Cherry-picked from Winston-9527/feature/linux-cuda-support (05e9811)

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>