ENH limit threads for C-Libraries dynamically by tomMoral · Pull Request #135 · joblib/loky

tomMoral · 2018-06-07T18:18:26Z

No description provided.

tomMoral · 2018-06-08T06:14:31Z

After investigating, I could not find a way to dynamically set the number of threads used by Accelerate on OSX. I know it is not supported by numpy anymore, but the only possible option seems to be the environment variable VECLIB_MAXIMUM_THREADS, which must be set before loading numpy. We can set it in the joblib initializer if needed, but we re-scale the size of the pool dynamically, which an environment variable cannot track.
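A minimal sketch of the environment-variable route described above. The helper name `cap_threads_via_env` and its `environ` parameter are hypothetical and only serve the illustration; the one thing taken from the discussion is that the variable has to be in place before numpy is imported.

```python
import os


def cap_threads_via_env(n_threads, var_names=("VECLIB_MAXIMUM_THREADS",),
                        environ=None):
    """Set thread-limit environment variables (hypothetical helper).

    This only has an effect on libraries that are loaded afterwards,
    so it must run before `import numpy` in the worker process.
    """
    environ = os.environ if environ is None else environ
    for name in var_names:
        environ[name] = str(n_threads)
    return {name: environ[name] for name in var_names}


# Usage: call this first, then import the numerical library.
settings = cap_threads_via_env(1, environ={})
print(settings)
```

Because the value is read once at library load time, this cannot follow a pool that is resized later.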

ogrisel · 2018-06-08T09:06:23Z

There might be some dynamic API to control the Grand Central Dispatch runtime of macOS but I am not sure. Anyway I think it's much less used than MKL, OpenBLAS and OpenMP in our cross-platform ecosystem. Let's keep macOS specific things for later PRs if users request it.

ogrisel

Some comments but otherwise LGTM!

ogrisel · 2018-06-08T09:28:22Z

We might also want to add a CI entry (on travis) that pip installs mkl and then forces the MKL tests.

https://pypi.org/project/mkl/

This does not require importing numpy.

The MKL download is 220MB though, so I would not make it a default test dependency, only for one of the build matrix entries.

ogrisel · 2018-06-08T09:31:54Z

Also, a general style nitpick. I don't like the various OpenBLAS / openMP / MKL mixed case in Python variables, function names and constants. I would rather use openblas / openmp / mkl (or the all-uppercase versions) consistently in our code :)

tomMoral · 2018-06-08T14:51:27Z

I am starting to think we should not support non-POSIX platforms for this feature: it is a nightmare with OSX and Windows, as ctypes is not finding the library, and it is not fun to debug...

We could state that we do not support dynamic scaling for these platforms, and fall back to {OMP,MKL,NUMEXPR,VECLIB}_NUM_THREADS=1

ogrisel · 2018-06-09T05:52:54Z

I am starting to think we should not support non-POSIX platforms for this feature: it is a nightmare with OSX and Windows, as ctypes is not finding the library, and it is not fun to debug...
We could state that we do not support dynamic scaling for these platforms, and fall back to {OMP,MKL,NUMEXPR,VECLIB}_NUM_THREADS=1

That's an option, at least for a start. In this case *_get_max_threads and *_set_max_threads should explicitly raise NotImplementedError on non-POSIX platforms.
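A minimal sketch of such a guard, assuming the check is made on `os.name`. The function name `require_posix` is hypothetical. Note that `os.name` reports "posix" on macOS as well, so excluding OSX would need a stricter check (for example on `sys.platform`).

```python
import os


def require_posix(feature, os_name=None):
    """Raise NotImplementedError when not running on a POSIX platform.

    `os_name` is only there to make the guard easy to exercise; by
    default the real `os.name` is used ("posix" on Linux and macOS,
    "nt" on Windows).
    """
    os_name = os.name if os_name is None else os_name
    if os_name != "posix":
        raise NotImplementedError(
            "%s is not supported on platform %r" % (feature, os_name)
        )


# Usage: call the guard at the top of each getter / setter.
try:
    require_posix("dynamic thread limits", os_name="nt")
except NotImplementedError as exc:
    print(exc)
```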

ogrisel

Some more comments.

ogrisel

A potential limitation of this code is that we can only call the *_s/get_num_threads functions once the Python package that relies on those C runtimes has been imported in the Python process.

For instance with OpenBLAS, you can observe the following:

$ python -c "from loky.backend.utils import get_thread_limits; print(get_thread_limits())"
{'openblas': None, 'openmp_intel': None, 'openmp_gnu': 12, 'mkl': None}
$ python -c "import numpy; from loky.backend.utils import get_thread_limits; print(get_thread_limits())"
{'openblas': 12, 'openmp_intel': None, 'openmp_gnu': 12, 'mkl': None}

Both in loky and joblib it's hard to tell which Python packages should actually be imported in the process workers ahead of time: it depends on the tasks being scheduled for execution.

This means that the best time to run the helper function is after the task has been unpickled on the worker (triggering compiled module imports) but before actually executing the function.
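The ordering described above can be sketched as follows. This is an illustration only, not loky's worker loop: `run_pickled_task` and the `before_call` hook are hypothetical names, and the task is assumed to be pickled as a `(func, args, kwargs)` tuple.

```python
import pickle


def run_pickled_task(payload, before_call):
    """Unpickle a task, run a hook, then execute the task.

    Unpickling imports the modules that define `func` (and therefore
    loads any compiled extension they depend on), so the hook runs at
    a point where the C libraries can be found in the process.
    """
    func, args, kwargs = pickle.loads(payload)
    before_call()  # e.g. apply the thread limits here
    return func(*args, **kwargs)


calls = []
payload = pickle.dumps((len, ([1, 2, 3],), {}))
result = run_pickled_task(payload, lambda: calls.append("limits applied"))
print(result, calls)
```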

What I do not understand in the above output is why GNU OpenMP could be detected: I did not import any compiled extensions built with the -fopenmp gcc compiler flag and I doubt that the Python interpreter itself is built with this flag.

ogrisel

Here is another batch of comments and questions:

tomMoral · 2018-06-11T15:30:24Z

This means that the best time to run the helper function is after the task has been unpickled on the worker (triggering compiled module imports) but before actually executing the function.

I think that if we do not find the loaded library, we can fall back to setting the env variable that controls the maximal number of threads. So the initializer should be something like:

def initializer(n_threads):
    dynamic_threadpool_size = limit_threads_clibs(n_threads)
    for clib, dynamic in dynamic_threadpool_size:
        if not dynamic:
            os.environ[CLIB_ENV_VARIABLES[clib]] = str(n_threads)

This way, we would have the correct behavior on all platforms without needing to predict which library will be used.

The rest of the API is for rescaling the threadpool, and in this case it only needs to change the number of threads of the libraries that are already loaded.
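A self-contained sketch of the fallback step in that initializer. The mapping below and the `apply_env_fallback` name are assumptions for illustration; the result of the dynamic limiting step is assumed to be a dict mapping each library name to whether it could be limited dynamically.

```python
import os

# Assumed mapping from library name to its thread-limit variable.
CLIB_ENV_VARIABLES = {
    "openblas": "OPENBLAS_NUM_THREADS",
    "mkl": "MKL_NUM_THREADS",
    "openmp": "OMP_NUM_THREADS",
}


def apply_env_fallback(n_threads, dynamic_status, environ=None):
    """Set the env variable for every library not limited dynamically.

    `dynamic_status` maps a library name to True when its threadpool
    was already resized at runtime, and to False when the library was
    not found in the process (so only the env variable can help).
    """
    environ = os.environ if environ is None else environ
    for clib, dynamic in dynamic_status.items():
        if not dynamic:
            environ[CLIB_ENV_VARIABLES[clib]] = str(n_threads)
    return environ


env = apply_env_fallback(
    2, {"openblas": True, "mkl": False, "openmp": False}, environ={}
)
print(env)
```

Libraries that were limited dynamically are left alone, so the variable is only written for those that may be loaded later.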

ogrisel · 2018-06-12T07:15:09Z

Indeed, good point.

ogrisel · 2018-06-12T07:27:13Z

Could you try to add a travis CI entry with anaconda numpy + MKL and another with conda-forge numpy + openblas?

tomMoral · 2018-06-12T21:57:23Z

Ok, the dynamic library loading is now working on OSX.
I just need to see if listDll can do the job for Windows and re-implement everything to have a mechanism that loops over all loaded libraries for the different functions.
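For comparison, here is one way to list the libraries loaded in a process on Linux, by parsing the text of /proc/self/maps. This is a Linux-only illustration and not the mechanism discussed in this thread for OSX or Windows; the function name `loaded_library_paths` is hypothetical.

```python
def loaded_library_paths(maps_text):
    """Return the unique file paths mapped in a process.

    `maps_text` is the content of /proc/<pid>/maps, where each line
    has the form: address perms offset dev inode [pathname].
    """
    paths = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Keep only mappings backed by a file on disk.
        if len(fields) == 6 and fields[5].startswith("/"):
            if fields[5] not in paths:
                paths.append(fields[5])
    return paths


sample = (
    "7f1c2a000000-7f1c2a200000 r-xp 00000000 08:01 131 /usr/lib/libopenblas.so.0\n"
    "7f1c2a200000-7f1c2a400000 rw-p 00200000 08:01 131 /usr/lib/libopenblas.so.0\n"
    "7ffd1b000000-7ffd1b021000 rw-p 00000000 00:00 0 [stack]\n"
    "7f1c2b000000-7f1c2b001000 rw-p 00000000 00:00 0\n"
)
print(loaded_library_paths(sample))
```

On a Linux machine the function can be fed with `open("/proc/self/maps").read()`.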

tomMoral · 2018-06-18T11:38:09Z

Ok, I just found this SO answer which shows how to use Psapi and kernel32 to loop through all loaded modules. The openblas test is now working on Windows 🎉
I will improve the tests.

ogrisel

Some more refactoring suggestions:

ogrisel · 2019-03-24T16:46:35Z

+        for name, info in SUPPORTED_IMPLEMENTATION.items():
+            if self.starts_with_any(module_name, info['filename_prefixes']):
+                return name, info['library']
+        return None, None


_is_supported_implementation could be renamed to _get_library_info_for_path and just return the full info dict with the name and the APIs info in it or None.

To make this simpler we could rewrite:

SUPPORTED_IMPLEMENTATION = {
    "openmp_intel": {
        "filename_prefixes": ("libiomp",),
        "internal_api": "openmp",
        "user_api": "openmp",
    },
    ...
}

to:

SUPPORTED_IMPLEMENTATION = [
    {
        "name": "openmp_intel",
        "filename_prefixes": ("libiomp",),
        "internal_api": "openmp",
        "user_api": "openmp",
    },
    ...
]
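With the list form, the lookup suggested above could look like the sketch below. The "openblas" entry and its field values are assumptions added to make the example non-trivial; only the "openmp_intel" entry comes from the comment.

```python
import os

SUPPORTED_IMPLEMENTATIONS = [
    {
        "name": "openmp_intel",
        "filename_prefixes": ("libiomp",),
        "internal_api": "openmp",
        "user_api": "openmp",
    },
    {   # Assumed entry, for illustration only.
        "name": "openblas",
        "filename_prefixes": ("libopenblas",),
        "internal_api": "openblas",
        "user_api": "blas",
    },
]


def get_library_info_for_path(path):
    """Return the info dict whose prefixes match the file name, or None."""
    filename = os.path.basename(path).lower()
    for info in SUPPORTED_IMPLEMENTATIONS:
        # str.startswith accepts a tuple of candidate prefixes.
        if filename.startswith(info["filename_prefixes"]):
            return info
    return None


print(get_library_info_for_path("/usr/lib/libiomp5.so"))
```

Returning the whole dict keeps the name and the API information together, so callers do not need a second lookup.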

Should we keep the different names openmp_msvc, openmp_gnu,... or merge them as openmp with multiple filename_prefixes?

I think the latter makes more sense with our current implementation.

I thought about that. I am fine with both as long as we are not able to retrieve the version number.

Seems good. In the case of nested parallelism (BLAS inside prange), setting the number of threads for the inner BLAS can be done through the "blas" user API, even if BLAS is linked to OpenMP. So there should never be a case where we need to explicitly restrict one OpenMP runtime and not the other.

I am fine with both as long as we are not able to retrieve the version number.

OpenMP does not expose its version, so I would not worry about that.

OpenMP, no, but the individual OpenMP runtime libraries might have a version introspection API.

None of them does. Retrieving the version is a real pain :)
What you have to do is use the _OPENMP preprocessor macro in a program to get its value. This value is a date, not a version number. Finally, you have to use a date-to-version map to find the version matching that date...
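The date-to-version step can be sketched as a simple lookup. The table lists the `_OPENMP` values published with the OpenMP specifications up to 5.0; the function name is hypothetical, and obtaining the macro value itself still requires compiling a small C program against the runtime.

```python
# _OPENMP macro value (yyyymm) -> OpenMP specification version.
OPENMP_DATE_TO_VERSION = {
    200505: "2.5",
    200805: "3.0",
    201107: "3.1",
    201307: "4.0",
    201511: "4.5",
    201811: "5.0",
}


def openmp_version_from_macro(macro_value):
    """Return the spec version for an _OPENMP value, or None if unknown."""
    return OPENMP_DATE_TO_VERSION.get(macro_value)


print(openmp_version_from_macro(201511))
```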

ogrisel

As discussed IRL, I agree we should further get rid of the wrapper class itself and just use a bunch of stateless functions to get and set the limits + the context manager.
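A minimal sketch of that shape, assuming a stateless getter and setter are available for a given library. The names `thread_limit`, `get_limit` and `set_limit` are hypothetical, and a plain dict stands in for the C runtime.

```python
from contextlib import contextmanager


@contextmanager
def thread_limit(get_limit, set_limit, n_threads):
    """Temporarily set a thread limit and restore the previous one."""
    previous = get_limit()
    set_limit(n_threads)
    try:
        yield previous
    finally:
        # Restore even if the body raises.
        set_limit(previous)


# Stand-in for a C runtime: a dict holding the current limit.
state = {"num_threads": 12}


def get_limit():
    return state["num_threads"]


def set_limit(n):
    state["num_threads"] = n


with thread_limit(get_limit, set_limit, 1):
    inside = get_limit()
after = get_limit()
print(inside, after)
```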

rgommers · 2019-03-26T13:01:10Z

FYI, here is a recently opened PR adding a similar get/set for the number of threads in numpy: numpy/numpy#13136

ogrisel

+1 for extraction as https://github.com/joblib/threadpoolctl and closing this PR :)

tomMoral force-pushed the PR_limit_threads_omp branch from d177ba6 to 1c48f8c Compare June 7, 2018 22:42

tomMoral force-pushed the PR_limit_threads_omp branch from 672be0e to 321428a Compare June 8, 2018 07:57

ogrisel reviewed Jun 8, 2018

tomMoral force-pushed the PR_limit_threads_omp branch from 74e4b43 to 35f44e4 Compare June 8, 2018 14:28

ogrisel reviewed Jun 9, 2018

Comment thread tests/_openmp_test_helper/parallel_sum.pyx Outdated

ogrisel reviewed Jun 9, 2018

Comment thread tests/conftest.py Outdated

ogrisel reviewed Jun 11, 2018

Comment thread loky/backend/utils.py Outdated

ogrisel reviewed Jun 11, 2018

tomMoral changed the title from "ENH limit threads for OMP, MKL and OpenBLAS" to "[WIP] ENH limit threads for OMP, MKL and OpenBLAS" Jun 12, 2018

ogrisel reviewed Jun 13, 2018

Comment thread loky/backend/utils.py Outdated

ogrisel reviewed Jun 13, 2018

Comment thread loky/backend/utils.py Outdated

ogrisel reviewed Jun 13, 2018

Comment thread loky/backend/utils.py Outdated

tomMoral force-pushed the PR_limit_threads_omp branch 6 times, most recently from 6bf5489 to 302926c Compare June 19, 2018 14:04

ogrisel reviewed Mar 20, 2019

Comment thread loky/backend/spawn.py Outdated

tomMoral added 4 commits March 24, 2019 13:39

CLN ci/appveyor blank lines

ff20241

CLN rename {openblas/mkl}-test-noskip -> -present

3483cf3

CLN rewrite the API

59f84e3

CLN spawn from useless changed

d8ee56e

ogrisel reviewed Mar 24, 2019

CLN refactor user/intern-api

893a4d4

rth mentioned this pull request Mar 25, 2019

[WIP] Add helpers for BLAS and OpenMP libs scikit-learn/scikit-learn#13297

Closed

tomMoral added 5 commits March 25, 2019 18:07

ENH merge the openMP implementations

ca162a5

FIX test mkl win32

ad6ce6e

CLN some improvments in the coments and re-organisation

5845326

FIX skip test on extra mkl implementation

c472b23

CLN some typo and renaming

2ef93dc

ogrisel reviewed Mar 26, 2019
