Add CUDA Docker image support by cameronbergh · Pull Request #2053 · exo-explore/exo

cameronbergh · 2026-05-05T19:30:08Z

Summary

- Add a CUDA 13 Docker image for Linux/NVIDIA hosts.
- Build the Rust Python bindings and dashboard in Docker stages.
- Install Exo with the existing cuda13 extra instead of changing default Linux dependencies.
- Add compose/just helpers and an entrypoint that exposes NVIDIA pip-package libraries to MLX.

Notes

This ports the useful Docker pieces from #1317 onto current main without making CUDA the default Linux install path. Non-CUDA Linux users should continue to use the existing CPU/default paths.

Validation

- ruff check .
- basedpyright --project pyproject.toml
- cd dashboard && npm install && npm run build
- pytest src → 410 passed, 1 skipped, 187 deselected
- Linux/NVIDIA host: docker build -t exo:cuda13-pr1317 . completed successfully
- Linux/NVIDIA host: MLX CUDA smoke test passed inside the image when NVIDIA devices and libcuda.so.1 were exposed manually:
  - default Device(gpu, 0)
  - array [2, 3, 4]
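The smoke test script itself is not included in the PR. A minimal sketch that would produce output of this shape could look like the following; the function name `mlx_smoke_test` is hypothetical, and the inputs (`[1, 2, 3]` plus 1) are an assumption, since only the result is reported.

```python
def mlx_smoke_test():
    """Return (device description, result list), or None if MLX is not installed."""
    try:
        import mlx.core as mx
    except ImportError:
        return None  # MLX missing: nothing to test in this environment

    # Assumed inputs: the PR only reports the result, array [2, 3, 4].
    result = mx.array([1, 2, 3]) + 1
    mx.eval(result)  # MLX is lazy; force the computation to actually run
    return str(mx.default_device()), result.tolist()


if __name__ == "__main__":
    outcome = mlx_smoke_test()
    if outcome is None:
        print("mlx is not installed")
    else:
        device, values = outcome
        print("default", device)
        print("array", values)
```

Run inside the container, a GPU device in the first line would indicate that MLX picked the CUDA backend rather than falling back to CPU.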

Known follow-up

The test host's Docker daemon does not currently have the NVIDIA Container Toolkit/CDI configured, so docker run --gpus all ... fails with:

failed to discover GPU vendor from CDI: no known GPU vendor found

Manual device/library exposure verified that the image itself can import and execute MLX on the RTX 3060 GPU.
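To tell a misconfigured container runtime apart from a broken image, it can help to check what the process actually sees. The sketch below uses only the standard library; the function name `cuda_visibility` is hypothetical, and the `/dev/nvidia*` glob assumes the usual Linux device-node naming.

```python
import ctypes
import glob


def cuda_visibility():
    """Report which NVIDIA device nodes and driver library this process can see."""
    report = {
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "libcuda_loadable": False,
    }
    try:
        ctypes.CDLL("libcuda.so.1")  # the driver library named in this PR
        report["libcuda_loadable"] = True
    except OSError:
        pass  # library is not exposed to this process or container
    return report


if __name__ == "__main__":
    print(cuda_visibility())
```

An empty `device_nodes` list or `libcuda_loadable: False` inside the container points at the Docker/CDI setup rather than at the image contents.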

pulmhealthagent-ai · 2026-05-05T22:56:25Z

Tested this PR on an NVIDIA DGX Spark running ARM64 Ubuntu 24.04.

First, thanks for putting this together. This is the first path I’ve found that gets EXO close to running on the DGX Spark CUDA stack.

Environment

NVIDIA DGX Spark / Grace Blackwell (GB10) / ARM64
Ubuntu 24.04
Docker with --gpus all
CUDA 13 image from this PR
EXO_LIBP2P_NAMESPACE=pulmhealth_cuda_test
NFS-mounted model directory mounted into /root/.local/share/exo/models

Build result

The image built successfully:

docker build -t exo:cuda13-pr2053 .

Runtime issue 1: resources/dashboard path

When running the image directly, EXO failed with:

FileNotFoundError: Unable to locate resources. Did you clone the repo properly?

I was able to get past that by bind-mounting the repo separately and setting EXO_RESOURCES_DIR:

docker run --rm -it \
  --gpus all \
  --network host \
  -e EXO_LIBP2P_NAMESPACE=pulmhealth_cuda_test \
  -e EXO_RESOURCES_DIR=/repo \
  -v "$PWD":/repo \
  -v /mnt/exo-models/models:/root/.local/share/exo/models \
  exo:cuda13-pr2053-patched

This suggests the runtime image may not include or locate the dashboard/resources path correctly.
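For illustration only, an environment-variable override of this kind is often resolved along the following lines. This is not exo's actual lookup code: the function `locate_resources`, the candidate list, and the assumption that `EXO_RESOURCES_DIR` names a directory containing `resources/` are all hypothetical.

```python
import os
from pathlib import Path


def locate_resources(candidates, env=None):
    """Return the first existing <base>/resources directory.

    Assumption: EXO_RESOURCES_DIR, when set, names a base directory that
    contains resources/ and is searched before the built-in candidates.
    """
    env = os.environ if env is None else env
    override = env.get("EXO_RESOURCES_DIR")
    search = ([Path(override)] if override else []) + [Path(c) for c in candidates]
    for base in search:
        resources = base / "resources"
        if resources.is_dir():
            return resources
    raise FileNotFoundError(
        "Unable to locate resources; searched: " + ", ".join(str(p) for p in search)
    )
```

Listing the searched paths in the error message makes a missing bind mount or an incomplete image easier to diagnose than a generic failure.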

Runtime issue 2: mlx_lm expects new_thread_local_stream

After getting past the resources issue, the runner repeatedly crashed on Linux/CUDA with:

AttributeError: module 'mlx.core' has no attribute 'new_thread_local_stream'

Relevant trace:

File "/app/.venv/lib/python3.13/site-packages/mlx_lm/generate.py", line 226, in <module>
    generation_stream = mx.new_thread_local_stream(mx.default_device())

Inside the container:

import mlx.core as mx

print(hasattr(mx, "new_thread_local_stream"))
# False

print([x for x in dir(mx) if "stream" in x.lower()])
# ['Stream', 'StreamContext', 'default_stream', 'new_stream', 'set_default_stream', 'stream']

So the installed Linux/CUDA mlx.core exposes new_stream, but not new_thread_local_stream, while the mlx-lm branch used by this PR expects new_thread_local_stream.

Temporary compatibility shim

This shim allowed mlx_lm to import successfully inside the container:

import mlx.core as mx

if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
    mx.new_thread_local_stream = mx.new_stream

import mlx_lm

I then patched src/exo/worker/runner/bootstrap.py immediately before importing/applying the MLX patches:

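try:
    import mlx.core as mx

    # Alias the missing API only when it is absent, so an MLX build that
    # already ships new_thread_local_stream is left untouched.
    if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
        mx.new_thread_local_stream = mx.new_stream
except Exception:
    # Best effort: a failure in the shim must not stop the runner from starting.
    pass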

from exo.worker.engines.mlx.patches import apply_mlx_patches

After rebuilding the image with that shim, the previous new_thread_local_stream import crash appears to be resolved.
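The same idea can be written as a small function that is testable without MLX installed. This is a sketch, not the code in the PR: `ensure_thread_local_stream` is a hypothetical name, `fake_mx` is a stand-in object, and whether `new_stream` has the same per-thread semantics as `new_thread_local_stream` is not confirmed by anything in this thread.

```python
import types


def ensure_thread_local_stream(mx):
    """Alias new_thread_local_stream to new_stream when only the latter exists.

    Returns True if the alias was installed, False if nothing was changed.
    """
    if hasattr(mx, "new_thread_local_stream"):
        return False  # the real API is present; leave it alone
    if not hasattr(mx, "new_stream"):
        return False  # nothing to alias to
    mx.new_thread_local_stream = mx.new_stream
    return True


# Demonstration with a stand-in for mlx.core (not the real module).
fake_mx = types.SimpleNamespace(new_stream=lambda device: ("stream", device))
patched = ensure_thread_local_stream(fake_mx)
print("alias installed:", patched)
```

Returning a flag lets the caller log when the shim is active, which makes it easier to notice once a newer wheel renders it unnecessary.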

Suggested fixes

Potential fixes to consider:

- Align the Linux/CUDA mlx, mlx-cuda-13, and mlx-lm pins so mlx_lm does not call APIs missing from Linux/CUDA MLX.
- Add a small Linux/CUDA compatibility shim for new_thread_local_stream if new_stream is the intended equivalent.
- Ensure dashboard/resources are included in the runtime image, or document the required EXO_RESOURCES_DIR and bind-mount pattern.
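When checking whether the pins line up, a quick way to see what is actually installed in the image is to query package metadata. The sketch below assumes the distribution names match the ones mentioned above (`mlx`, `mlx-cuda-13`, `mlx-lm`); the function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(names=("mlx", "mlx-cuda-13", "mlx-lm")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```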

I’m continuing to test this with multiple DGX Spark nodes in an isolated namespace and can validate another branch or image if helpful.

cameronbergh · 2026-05-05T23:01:29Z

Holding this as draft until we validate with a real Qwen 4B model on Mac MLX, Linux CUDA, and a mixed Mac+CUDA cluster.

cameronbergh · 2026-05-05T23:25:36Z

Validation update for mlx-community/Qwen3.5-4B-4bit:

- Mac MLX-only: passed via /bench/chat/completions
  - prompt TPS: 55.097565061709865
  - generation TPS: 129.76622721094307
  - prompt/generation tokens: 17 / 32
  - peak memory: 2722856562 bytes
- Linux CUDA Docker-only: passed via /bench/chat/completions in exo:cuda13-pr1317
  - prompt TPS: 2.217056995804178
  - generation TPS: 11.91972019735572
  - prompt/generation tokens: 17 / 32
  - peak memory: 2805840141 bytes
  - note: needed free VRAM on the RTX 3060 host; an existing llama-server process was occupying most GPU memory and initially caused cudaMallocAsync(&data, size, stream) failed: out of memory.
- Mixed Mac MLX + Linux CUDA: passed via /bench/chat/completions with a 2-node MlxRing pipeline instance split across Mac Studio + RTX 3060 host
  - prompt TPS: 2.292403738276393
  - generation TPS: 55.165733318081294
  - prompt/generation tokens: 17 / 32
  - peak memory: 1601602326 bytes
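Since the CUDA run initially failed because another process held most of the VRAM, a pre-flight check of free GPU memory may save a confusing out-of-memory error. The sketch below shells out to `nvidia-smi` with its CSV query flags; the function names are hypothetical, and rows that do not report numeric memory are skipped.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        try:
            index, used, total = (int(field.strip()) for field in line.split(","))
        except ValueError:
            continue  # e.g. a device that reports a non-numeric memory value
        gpus.append({"index": index, "used_mib": used, "free_mib": total - used})
    return gpus


def query_gpu_memory():
    """Return per-GPU memory, or an empty list if nvidia-smi is unavailable."""
    try:
        done = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(done.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"GPU {gpu['index']}: {gpu['free_mib']} MiB free")
```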

Also fixed two Docker issues discovered during validation:

- Added resources/ to the image so model-card/resource loading works in Docker.
- Added a temporary Docker-local compatibility shim for the current Linux CUDA MLX wheel: mlx-lm expects mx.new_thread_local_stream, while the Linux CUDA MLX wheel exposes the equivalent mx.new_stream API.

cameronbergh · 2026-05-05T23:54:45Z

Follow-up validation with a longer prompt and longer generation (230 prompt tokens / 512 generated tokens):

- Mac MLX-only: passed via /bench/chat/completions
  - response id: ab733326-606e-45c2-aec6-ce4a8d809e5a
  - prompt TPS: 28.24285388749967
  - generation TPS: 139.84676721721408
  - peak memory: 3249673297 bytes
  - elapsed: 11.948140457971022s
- Linux CUDA using both RTX 3060s: passed via a 2-node MlxRing pipeline instance, one Docker EXO node per GPU (/dev/nvidia0 and /dev/nvidia1)
  - response id: b0277cb9-cdff-4a26-975b-7b3f74270e25
  - prompt TPS: 21.523230252187886
  - generation TPS: 25.82012615634045
  - peak memory: 2241211860 bytes
  - elapsed: 46.736752146855s
  - nvidia-smi during/after load showed both GPUs active: GPU0 about 2937 MiB, GPU1 about 2906 MiB used.
- Mixed Mac MLX + Linux CUDA: passed via a 2-node MlxRing pipeline instance split across Mac Studio and one RTX 3060 Docker EXO node
  - response id: 724bb700-791c-445f-b00d-fb1021da3613
  - prompt TPS: 26.14653022620217
  - generation TPS: 58.75482148222831
  - peak memory: 2074232420 bytes
  - elapsed: 28.841311666183174s

Note: the earlier validation comment used a very short 17 prompt token / 32 generated token request, so these longer numbers are the more meaningful validation signal.
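As a rough cross-check of the longer runs, the reported rates can be turned back into per-phase durations and compared with the elapsed time. This sketch assumes each TPS figure is simply tokens divided by seconds spent in that phase, which the benchmark output does not state; the names `RUNS` and `phase_seconds` are illustrative.

```python
# (prompt TPS, generation TPS, elapsed seconds) as reported for the longer runs
RUNS = {
    "mac_mlx": (28.24285388749967, 139.84676721721408, 11.948140457971022),
    "linux_cuda_2x3060": (21.523230252187886, 25.82012615634045, 46.736752146855),
    "mixed_mac_cuda": (26.14653022620217, 58.75482148222831, 28.841311666183174),
}
PROMPT_TOKENS, GENERATED_TOKENS = 230, 512


def phase_seconds(prompt_tps, generation_tps,
                  prompt_tokens=PROMPT_TOKENS, generated_tokens=GENERATED_TOKENS):
    """Return (prefill seconds, decode seconds) implied by the two rates."""
    return prompt_tokens / prompt_tps, generated_tokens / generation_tps


for name, (prompt_tps, generation_tps, elapsed) in RUNS.items():
    prefill, decode = phase_seconds(prompt_tps, generation_tps)
    unaccounted = elapsed - prefill - decode
    print(f"{name}: prefill {prefill:.2f}s, decode {decode:.2f}s, "
          f"unaccounted {unaccounted:.2f}s")
```

Under that assumption, the two phases account for nearly all of the Mac-only elapsed time, while the CUDA and mixed runs each leave more than ten seconds outside the two phases. The figures alone do not say where that time goes.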

This commit enables exo to run on NVIDIA GPUs on Linux by fixing Metal-specific assumptions in the MLX inference path.

Changes:
- Add CUDA compatibility shim for mlx-lm's new_thread_local_stream API (Linux CUDA MLX exposes new_stream instead)
- Gate MLX_METAL_FAST_SYNCH env var to macOS only, preventing warnings on Linux
- Make set_wired_limit_for_model handle CUDA backends gracefully by checking mx.metal.is_available() first
- Add automatic LD_LIBRARY_PATH setup in runner bootstrap for CUDA libraries (libcublasLt.so.13, etc.)

Compatibility:
- Zero breaking changes: all modifications are platform-gated
- macOS Metal path unchanged
- CPU-only Linux still works
- Enables heterogeneous clusters (macOS Metal + Linux CUDA)

Tested on:
- Linux: NVIDIA RTX 3090, CUDA 13.1, Driver 590.48.01
- macOS: MacBook Pro M5 Pro (Metal)
- Verified cross-platform cluster inference with Qwen3-0.6B-8bit

Refs: PR exo-explore#2053 (Docker support). This provides the native code changes needed for Linux CUDA deployment without requiring Docker.
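The platform gating described in that commit message can be sketched as a pure function that builds the environment for a runner process. This is not the code from the referenced commit: `runner_env` is a hypothetical name, the value `"1"` for MLX_METAL_FAST_SYNCH is assumed, and the `nvidia/*/lib` layout under site-packages is an assumption about how the NVIDIA pip packages are installed.

```python
import glob
import os
import sys


def runner_env(base_env, platform=sys.platform, site_packages=None):
    """Return a copy of base_env with platform-gated MLX settings applied."""
    env = dict(base_env)
    if platform == "darwin":
        # Metal-only flag: set on macOS and nowhere else (value assumed).
        env["MLX_METAL_FAST_SYNCH"] = "1"
    elif platform.startswith("linux") and site_packages:
        # Assumed layout: <site-packages>/nvidia/<component>/lib/*.so
        lib_dirs = sorted(glob.glob(os.path.join(site_packages, "nvidia", "*", "lib")))
        existing = env.get("LD_LIBRARY_PATH")
        parts = lib_dirs + ([existing] if existing else [])
        if parts:
            env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env
```

On Linux the dynamic loader reads LD_LIBRARY_PATH when a process starts, so an environment like this has to be passed to a child process or set before exec (for example in a container entrypoint); changing `os.environ` in an already running interpreter does not change where its loader searches.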

cameronbergh marked this pull request as draft May 5, 2026 23:01

cameronbergh force-pushed the fix-pr-1317-current branch 3 times, most recently from 7b4fa84 to a864666, May 5, 2026 23:12

Add CUDA Docker image support

23e0de2

cameronbergh force-pushed the fix-pr-1317-current branch from a864666 to 23e0de2, May 5, 2026 23:14

Winston-9527 mentioned this pull request May 20, 2026: feat: add native CUDA support for Linux GPU inference #2100 (Closed)

Winston-9527 mentioned this pull request May 22, 2026: feat: Add Native CUDA Support for Linux GPU Inference #2103 (Open, 7 tasks)

cameronbergh wants to merge 1 commit into exo-explore:main from cameronbergh:fix-pr-1317-current
