Add CUDA Docker image support #2053

Draft
cameronbergh wants to merge 1 commit into exo-explore:main from cameronbergh:fix-pr-1317-current

@cameronbergh

Summary

  • add a CUDA 13 Docker image for Linux/NVIDIA hosts
  • build the Rust Python bindings and dashboard in Docker stages
  • install Exo with the existing cuda13 extra instead of changing default Linux dependencies
  • add compose/just helpers and an entrypoint that exposes NVIDIA pip-package libraries to MLX
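
The last bullet mentions an entrypoint that makes NVIDIA's pip-installed shared libraries visible to MLX. A minimal sketch of that idea follows, assuming the wheels unpack their shared objects under `site-packages/nvidia/<component>/lib`. That layout and both helper names are assumptions for illustration and are not taken from the PR's entrypoint.

```python
import os
from pathlib import Path


def nvidia_lib_dirs(site_packages: Path) -> list[str]:
    """Return nvidia/*/lib directories that actually contain shared objects."""
    dirs = []
    for lib_dir in sorted(site_packages.glob("nvidia/*/lib")):
        # Skip empty directories so LD_LIBRARY_PATH stays short.
        if lib_dir.is_dir() and any(lib_dir.glob("*.so*")):
            dirs.append(str(lib_dir))
    return dirs


def prepend_ld_library_path(env: dict[str, str], lib_dirs: list[str]) -> dict[str, str]:
    """Return a copy of env with lib_dirs placed ahead of any existing entries."""
    updated = dict(env)
    existing = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    merged = lib_dirs + [p for p in existing if p not in lib_dirs]
    if merged:
        updated["LD_LIBRARY_PATH"] = os.pathsep.join(merged)
    return updated
```

The dynamic loader reads `LD_LIBRARY_PATH` when a process starts, so an entrypoint would need to export the merged value before launching the Python process rather than setting it from inside that process.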

Notes

This ports the useful Docker pieces from #1317 onto current main without making CUDA the default Linux install path. Non-CUDA Linux users should continue to use the existing CPU/default paths.

Validation

  • ruff check .
  • basedpyright --project pyproject.toml
  • cd dashboard && npm install && npm run build
  • pytest src → 410 passed, 1 skipped, 187 deselected
  • Linux/NVIDIA host: docker build -t exo:cuda13-pr1317 . completed successfully
  • Linux/NVIDIA host MLX CUDA smoke test passed inside the image when NVIDIA devices and libcuda.so.1 were exposed manually:
    • default Device(gpu, 0)
    • array [2, 3, 4]

Known follow-up

The test host's Docker daemon does not currently have NVIDIA Container Toolkit/CDI configured, so docker run --gpus all ... fails with:

failed to discover GPU vendor from CDI: no known GPU vendor found

Manual device/library exposure verified that the image itself can import and execute MLX on the RTX 3060 GPU.
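
A smoke test of the kind reported above could look roughly like the following. The exact computation that produced `[2, 3, 4]` is not shown in this thread, so adding 1 to `[1, 2, 3]` is an assumption, and `mlx_smoke_test` is a hypothetical helper name.

```python
def mlx_smoke_test():
    """Run a tiny MLX computation.

    Returns (device_description, values), or None when MLX cannot be imported.
    """
    try:
        import mlx.core as mx
    except (ImportError, OSError):
        return None
    device = mx.default_device()
    result = mx.array([1, 2, 3]) + 1  # assumed computation, see lead-in
    mx.eval(result)  # force evaluation of the lazy graph
    return str(device), result.tolist()


if __name__ == "__main__":
    outcome = mlx_smoke_test()
    if outcome is None:
        print("mlx is not importable in this environment")
    else:
        print("default", outcome[0])
        print("array", outcome[1])
```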

@pulmhealthagent-ai

Tested this PR on an NVIDIA DGX Spark running ARM64 Ubuntu 24.04.

First, thanks for putting this together. This is the first path I’ve found that gets EXO close to running on the DGX Spark CUDA stack.

Environment

  • NVIDIA DGX Spark / Grace Blackwell (GB10) / ARM64
  • Ubuntu 24.04
  • Docker with --gpus all
  • CUDA 13 image from this PR
  • EXO_LIBP2P_NAMESPACE=pulmhealth_cuda_test
  • NFS-mounted model directory mounted into /root/.local/share/exo/models

Build result

The image built successfully:

docker build -t exo:cuda13-pr2053 .

Runtime issue 1: resources/dashboard path

When running the image directly, EXO failed with:

FileNotFoundError: Unable to locate resources. Did you clone the repo properly?

I was able to get past that by bind-mounting the repo separately and setting EXO_RESOURCES_DIR:

docker run --rm -it \
  --gpus all \
  --network host \
  -e EXO_LIBP2P_NAMESPACE=pulmhealth_cuda_test \
  -e EXO_RESOURCES_DIR=/repo \
  -v "$PWD":/repo \
  -v /mnt/exo-models/models:/root/.local/share/exo/models \
  exo:cuda13-pr2053-patched

This suggests the runtime image may not include or locate the dashboard/resources path correctly.
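
The workaround relies on EXO_RESOURCES_DIR taking precedence over whatever default lookup EXO performs. How EXO actually resolves the directory is not shown in this thread, so the following is only a hypothetical sketch of an "environment override first, then fallbacks" lookup. The function name `locate_resources` and the fallback list are assumptions; only the variable name and the error message come from the report above.

```python
from pathlib import Path


def locate_resources(env: dict[str, str], fallbacks: list[Path]) -> Path:
    """Return the first existing directory, preferring EXO_RESOURCES_DIR."""
    override = env.get("EXO_RESOURCES_DIR")
    candidates = ([Path(override)] if override else []) + list(fallbacks)
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        "Unable to locate resources. Did you clone the repo properly?"
    )
```

In a real process the first argument would be `dict(os.environ)`. Under this scheme an image that copies the resources to one of the fallback locations would not need the bind mount at all.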

Runtime issue 2: mlx_lm expects new_thread_local_stream

After getting past the resources issue, the runner repeatedly crashed on Linux/CUDA with:

AttributeError: module 'mlx.core' has no attribute 'new_thread_local_stream'

Relevant trace:

File "/app/.venv/lib/python3.13/site-packages/mlx_lm/generate.py", line 226, in <module>
    generation_stream = mx.new_thread_local_stream(mx.default_device())

Inside the container:

import mlx.core as mx

print(hasattr(mx, "new_thread_local_stream"))
# False

print([x for x in dir(mx) if "stream" in x.lower()])
# ['Stream', 'StreamContext', 'default_stream', 'new_stream', 'set_default_stream', 'stream']

So the installed Linux/CUDA mlx.core exposes new_stream, but not new_thread_local_stream, while the mlx-lm branch used by this PR expects new_thread_local_stream.

Temporary compatibility shim

This shim allowed mlx_lm to import successfully inside the container:

import mlx.core as mx

if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
    mx.new_thread_local_stream = mx.new_stream

import mlx_lm

I then patched src/exo/worker/runner/bootstrap.py immediately before importing/applying the MLX patches:

try:
    import mlx.core as mx

    if not hasattr(mx, "new_thread_local_stream") and hasattr(mx, "new_stream"):
        mx.new_thread_local_stream = mx.new_stream
except Exception:
    pass

from exo.worker.engines.mlx.patches import apply_mlx_patches

After rebuilding the image with that shim, the previous new_thread_local_stream import crash appears to be resolved.
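
The same shim can be written as a small function that is easy to exercise without MLX installed. This is a sketch only: `install_stream_shim` is a hypothetical name, and the stand-in object below merely imitates a module that exposes `new_stream` but not `new_thread_local_stream`.

```python
from types import SimpleNamespace


def install_stream_shim(mx) -> bool:
    """Alias new_thread_local_stream to new_stream when only the latter exists.

    Returns True if the alias was installed, False if nothing was changed.
    """
    if hasattr(mx, "new_thread_local_stream") or not hasattr(mx, "new_stream"):
        return False
    mx.new_thread_local_stream = mx.new_stream
    return True


# Stand-in for a Linux/CUDA mlx.core build that only exposes new_stream.
fake_mx = SimpleNamespace(new_stream=lambda device: ("stream", device))
installed = install_stream_shim(fake_mx)
```

Returning a flag makes it possible to log when the alias is active. The alias hands back an ordinary stream, so it is only a safe substitute if `new_stream` really is the intended equivalent on the CUDA backend, which this thread does not establish.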

Suggested fixes

Potential fixes to consider:

  • Align the Linux/CUDA mlx, mlx-cuda-13, and mlx-lm pins so mlx_lm does not call APIs missing from Linux/CUDA MLX.
  • Add a small Linux/CUDA compatibility shim for new_thread_local_stream if new_stream is the intended equivalent.
  • Ensure dashboard/resources are included in the runtime image, or document the required EXO_RESOURCES_DIR and bind-mount pattern.

I’m continuing to test this with multiple DGX Spark nodes in an isolated namespace and can validate another branch or image if helpful.

cameronbergh marked this pull request as draft on May 5, 2026 23:01
@cameronbergh
Author

Holding this as draft until we validate with a real Qwen 4B model on Mac MLX, Linux CUDA, and a mixed Mac+CUDA cluster.

cameronbergh force-pushed the fix-pr-1317-current branch 3 times, most recently from 7b4fa84 to a864666, on May 5, 2026 23:12
cameronbergh force-pushed the fix-pr-1317-current branch from a864666 to 23e0de2 on May 5, 2026 23:14
@cameronbergh
Author

Validation update for mlx-community/Qwen3.5-4B-4bit:

  • Mac MLX-only: passed via /bench/chat/completions
    • prompt TPS: 55.097565061709865
    • generation TPS: 129.76622721094307
    • prompt/generation tokens: 17 / 32
    • peak memory: 2722856562 bytes
  • Linux CUDA Docker-only: passed via /bench/chat/completions in exo:cuda13-pr1317
    • prompt TPS: 2.217056995804178
    • generation TPS: 11.91972019735572
    • prompt/generation tokens: 17 / 32
    • peak memory: 2805840141 bytes
    • note: needed free VRAM on the RTX 3060 host; an existing llama-server process was occupying most GPU memory and initially caused cudaMallocAsync(&data, size, stream) failed: out of memory.
  • Mixed Mac MLX + Linux CUDA: passed via /bench/chat/completions with a 2-node MlxRing pipeline instance split across Mac Studio + RTX 3060 host
    • prompt TPS: 2.292403738276393
    • generation TPS: 55.165733318081294
    • prompt/generation tokens: 17 / 32
    • peak memory: 1601602326 bytes
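
The out-of-memory note in the Linux CUDA result suggests checking free VRAM before starting a node. One way to do that is to parse `nvidia-smi` CSV output, sketched below. The helper names are hypothetical, and `query_gpu_memory` assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' rows (MiB) from nvidia-smi CSV output, one row per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def free_mib(rows: list[tuple[int, int]]) -> list[int]:
    """Return free memory in MiB for each GPU."""
    return [total - used for used, total in rows]


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for used and total memory per GPU."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_gpu_memory(output)
```

A launcher could compare `free_mib(query_gpu_memory())` against the expected model footprint and refuse to start when another process, such as the llama-server mentioned above, is holding most of the memory.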

Also fixed two Docker issues discovered during validation:

  • Added resources/ to the image so model-card/resource loading works in Docker.
  • Added a temporary Docker-local compatibility shim for the current Linux CUDA MLX wheel: mlx-lm expects mx.new_thread_local_stream, while the Linux CUDA MLX wheel exposes the equivalent mx.new_stream API.

@cameronbergh
Author

Follow-up validation with a longer prompt and longer generation (230 prompt tokens / 512 generated tokens):

  • Mac MLX-only: passed via /bench/chat/completions
    • response id: ab733326-606e-45c2-aec6-ce4a8d809e5a
    • prompt TPS: 28.24285388749967
    • generation TPS: 139.84676721721408
    • peak memory: 3249673297 bytes
    • elapsed: 11.948140457971022s
  • Linux CUDA using both RTX 3060s: passed via a 2-node MlxRing pipeline instance, one Docker EXO node per GPU (/dev/nvidia0 and /dev/nvidia1)
    • response id: b0277cb9-cdff-4a26-975b-7b3f74270e25
    • prompt TPS: 21.523230252187886
    • generation TPS: 25.82012615634045
    • peak memory: 2241211860 bytes
    • elapsed: 46.736752146855s
    • nvidia-smi during/after load showed both GPUs active: GPU0 about 2937 MiB, GPU1 about 2906 MiB used.
  • Mixed Mac MLX + Linux CUDA: passed via a 2-node MlxRing pipeline instance split across Mac Studio and one RTX 3060 Docker EXO node
    • response id: 724bb700-791c-445f-b00d-fb1021da3613
    • prompt TPS: 26.14653022620217
    • generation TPS: 58.75482148222831
    • peak memory: 2074232420 bytes
    • elapsed: 28.841311666183174s

Note: the earlier validation comment used a very short 17 prompt token / 32 generated token request, so these longer numbers are the more meaningful validation signal.
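
As a rough consistency check on the longer run, the reported rates can be converted back into time. This assumes each TPS figure is simply tokens divided by the time spent in that phase, which the thread does not state. The dictionary keys are labels chosen for this sketch.

```python
PROMPT_TOKENS = 230
GENERATED_TOKENS = 512

# (prompt TPS, generation TPS, reported elapsed seconds) from the comment above.
RUNS = {
    "mac": (28.24285388749967, 139.84676721721408, 11.948140457971022),
    "cuda_2gpu": (21.523230252187886, 25.82012615634045, 46.736752146855),
    "mixed": (26.14653022620217, 58.75482148222831, 28.841311666183174),
}


def token_time_seconds(prompt_tps: float, generation_tps: float) -> float:
    """Time implied by the two rates alone: prompt phase plus generation phase."""
    return PROMPT_TOKENS / prompt_tps + GENERATED_TOKENS / generation_tps


for name, (prompt_tps, generation_tps, elapsed) in RUNS.items():
    implied = token_time_seconds(prompt_tps, generation_tps)
    print(f"{name}: implied {implied:.2f}s, elapsed {elapsed:.2f}s, "
          f"unaccounted {elapsed - implied:.2f}s")
```

Under that assumption the Mac run's rates account for roughly 11.8 s of its 11.9 s elapsed time, while the two pipeline runs leave a larger gap (roughly 30.5 s of 46.7 s, and 17.5 s of 28.8 s). The thread does not say what the elapsed figure covers, so the gap should not be attributed to any specific cause without further measurement.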

Winston-9527 pushed a commit to Winston-9527/exo that referenced this pull request May 22, 2026
This commit enables exo to run on NVIDIA GPUs on Linux by fixing
Metal-specific assumptions in the MLX inference path.

Changes:
- Add CUDA compatibility shim for mlx-lm's new_thread_local_stream API
  (Linux CUDA MLX exposes new_stream instead)
- Gate MLX_METAL_FAST_SYNCH env var to macOS only, preventing warnings
  on Linux
- Make set_wired_limit_for_model handle CUDA backends gracefully by
  checking mx.metal.is_available() first
- Add automatic LD_LIBRARY_PATH setup in runner bootstrap for CUDA
  libraries (libcublasLt.so.13, etc.)

Compatibility:
- Zero breaking changes - all modifications are platform-gated
- macOS Metal path unchanged
- CPU-only Linux still works
- Enables heterogeneous clusters (macOS Metal + Linux CUDA)

Tested on:
- Linux: NVIDIA RTX 3090, CUDA 13.1, Driver 590.48.01
- macOS: MacBook Pro M5 Pro (Metal)
- Verified cross-platform cluster inference with Qwen3-0.6B-8bit

Refs: PR exo-explore#2053 (Docker support) - this provides the native code changes
needed for Linux CUDA deployment without requiring Docker.
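
The platform gating described in the commit message above can be sketched as follows. The function names and the value "1" for MLX_METAL_FAST_SYNCH are assumptions for illustration. The referenced commit's actual code is not shown in this thread.

```python
import sys
from types import SimpleNamespace


def runner_env(base_env: dict[str, str], platform: str = sys.platform) -> dict[str, str]:
    """Return a copy of base_env, adding the Metal-only variable on macOS alone."""
    env = dict(base_env)
    if platform == "darwin":
        env["MLX_METAL_FAST_SYNCH"] = "1"  # value is an assumption
    return env


def metal_available(mx) -> bool:
    """True only when the module exposes a metal backend that reports availability."""
    metal = getattr(mx, "metal", None)
    return bool(metal is not None and metal.is_available())


# Stand-ins for mlx.core on a CUDA host and on a Metal host.
cuda_like = SimpleNamespace(metal=SimpleNamespace(is_available=lambda: False))
metal_like = SimpleNamespace(metal=SimpleNamespace(is_available=lambda: True))
```

A wired-limit helper would call `metal_available(mx)` first and return early on CUDA, which keeps the macOS path unchanged while avoiding Metal-specific calls elsewhere.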