[BUG]: cuda.bindings.nvml doesn't find libnvidia-ml.so.1 in situations where pynvml (ctypes) does #2189

@mdboom

Description

Type of Bug

Runtime Error

Component

cuda.pathfinder

Describe the bug

In a CI environment that worked with pynvml.py, migrating to cuda.bindings.nvml fails because the libnvidia-ml.so.1 library cannot be found.

rapidsai/ucxx#640 (comment)

How to Reproduce

I'm not sure what is different about the CI environment in ucxx yet.

Expected behavior

cuda.bindings.nvml should find the underlying .so in every case where pynvml.py did.

For reference, here is what pynvml.py does:

def _LoadNvmlLibrary():
    '''
    Load the library if it isn't loaded already
    '''
    global nvmlLib

    if (nvmlLib == None):
        # lock to ensure only one caller loads the library
        libLoadLock.acquire()

        try:
            # ensure the library still isn't loaded
            if (nvmlLib == None):
                try:
                    if (sys.platform[:3] == "win"):
                        # cdecl calling convention
                        try:
                            # Check for nvml.dll in System32 first for DCH drivers
                            nvmlLib = CDLL(os.path.join(os.getenv("WINDIR", "C:/Windows"), "System32/nvml.dll"))
                        except OSError as ose:
                            # If nvml.dll is not found in System32, it should be in ProgramFiles
                            # load nvml.dll from %ProgramFiles%/NVIDIA Corporation/NVSMI/nvml.dll
                            nvmlLib = CDLL(os.path.join(os.getenv("ProgramFiles", "C:/Program Files"), "NVIDIA Corporation/NVSMI/nvml.dll"))
                    else:
                        # assume linux
                        nvmlLib = CDLL("libnvidia-ml.so.1")
                except OSError as ose:
                    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
                if (nvmlLib == None):
                    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
        finally:
            # lock is always freed
            libLoadLock.release()
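
To narrow down what differs in the CI environment, a small probe along these lines might help. It is only a sketch and not part of pynvml or cuda.bindings: `probe_shared_library` is a hypothetical helper name, and it assumes a Linux host where `/proc/self/maps` is readable. It loads the library with a bare soname through ctypes, as the Linux branch above does, and then reports which file the dynamic loader actually mapped.

```python
import ctypes
import os
import sys


def probe_shared_library(soname):
    """Try to load `soname` via ctypes.CDLL and report what happened.

    Hypothetical diagnostic helper; the name and the report layout are
    illustrative assumptions, not an existing API.
    """
    report = {"soname": soname, "loaded": False, "resolved_path": None, "error": None}
    try:
        # Same call as the Linux branch of pynvml's loader: a bare soname,
        # resolved by the system dynamic loader's normal search order.
        ctypes.CDLL(soname)
    except OSError as exc:
        report["error"] = str(exc)
        return report
    report["loaded"] = True

    if sys.platform.startswith("linux"):
        # /proc/self/maps lists the files mapped into this process with
        # symlinks resolved, so the versioned file name usually differs from
        # the soname. Match on the part of the name up to ".so" instead.
        stem = os.path.basename(soname).split(".so", 1)[0] + ".so"
        with open("/proc/self/maps") as maps:
            for line in maps:
                parts = line.split(None, 5)
                if len(parts) == 6:
                    path = parts[5].strip()
                    if os.path.basename(path).startswith(stem):
                        report["resolved_path"] = path
                        break
    return report


if __name__ == "__main__":
    print(probe_shared_library("libnvidia-ml.so.1"))
    # LD_LIBRARY_PATH is one of the inputs to the loader's search.
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH"))
```

Running this in the failing CI job and comparing `resolved_path` with the locations cuda.bindings.nvml searches could show which directory is being missed.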

Operating System

rockylinux8

nvidia-smi output

No response

Labels

P0 (High priority - Must do!), bug (Something isn't working), cuda.pathfinder (Everything related to the cuda.pathfinder module)
