Skip to content

ENH limit threads for C-Libraries dynamically#135

Closed
tomMoral wants to merge 61 commits into
masterfrom
PR_limit_threads_omp
Closed

ENH limit threads for C-Libraries dynamically#135
tomMoral wants to merge 61 commits into
masterfrom
PR_limit_threads_omp

Conversation

@tomMoral

@tomMoral tomMoral commented Jun 7, 2018

Copy link
Copy Markdown
Contributor

No description provided.

@tomMoral tomMoral force-pushed the PR_limit_threads_omp branch from d177ba6 to 1c48f8c Compare June 7, 2018 22:42
@tomMoral

tomMoral commented Jun 8, 2018

Copy link
Copy Markdown
Contributor Author

After investigating, I could not find a way to dynamically set the number of threads used by Accelerate on OSX. I know it is not supported by numpy anymore but
The only possible option seems to be an environment variable VECLIB_MAXIMUM_THREADS, general that must be used before loading numpy. We can set it in the joblib initializer if needed but we re-scale dynamically the size of the pool.

@tomMoral tomMoral force-pushed the PR_limit_threads_omp branch from 672be0e to 321428a Compare June 8, 2018 07:57
@ogrisel

ogrisel commented Jun 8, 2018

Copy link
Copy Markdown
Contributor

There might be some dynamic API to control the Grand Central Dispatch runtime of macOS but I am not sure. Anyway I think it's much less used than MKL, OpenBLAS and OpenMP in our cross-platform ecosystem. Let's keep macOS specific things for later PRs if users request it.

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments but otherwise LGTM!

Comment thread tests/conftest.py Outdated
Comment thread tests/conftest.py Outdated
Comment thread tests/test_loky_backend.py Outdated
Comment thread tests/test_loky_backend.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread tests/test_loky_backend.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread continuous_integration/travis/runtests.sh Outdated
@ogrisel

ogrisel commented Jun 8, 2018

Copy link
Copy Markdown
Contributor

We might also want to add a CI entry (on travis) that pip install mkl and then force the MKL tests.

https://pypi.org/project/mkl/

This does not require importing numpy.

The MKL download is 220MB though so I would not put it as a default test dependency, only for one of the build matrix entry.

@ogrisel

ogrisel commented Jun 8, 2018

Copy link
Copy Markdown
Contributor

Also, a general style nitpick. I don't like the various OpenBLAS / openMP / MKL mixed case in python variables, function names and constant. I would rather use openblas / openmp / mkl (or the all-uppercase versions) consistently in our code :)

@tomMoral tomMoral force-pushed the PR_limit_threads_omp branch from 74e4b43 to 35f44e4 Compare June 8, 2018 14:28
@tomMoral

tomMoral commented Jun 8, 2018

Copy link
Copy Markdown
Contributor Author

I am starting to think we should not support non-POSIX for this feature, it is a nightmare with OSX and windows as ctypes is not finding the library and it is not fun to debug...

We could state that we do not support dynamic scaling for these platforms, and fallback to {OMP,MKL,NUMEXPR,VECLIB}_NUM_THREADS=1

Comment thread tests/_openmp_test_helper/parallel_sum.pyx Outdated
Comment thread tests/conftest.py Outdated
@ogrisel

ogrisel commented Jun 9, 2018

Copy link
Copy Markdown
Contributor

I am starting to think we should not support non-POSIX for this feature, it is a nightmare with OSX and windows as ctypes is not finding the library and it is not fun to debug...
We could state that we do not support dynamic scaling for these platforms, and fallback to {OMP,MKL,NUMEXPR,VECLIB}_NUM_THREADS=1

That's an option, at least for a start. In this case *_set_max_threads and *_set_max_threads should explicitly raise NotImplementedError on non-posix platforms.

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread tests/_openmp_test_helper/parallel_sum.pyx Outdated
Comment thread tests/test_loky_backend.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread tests/test_loky_module.py Outdated

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential limitation of this code is that we can only call the *_s/get_num_threads function once the Python package that relies on those C-runtimes has been important in the Python process.

For instance in OpenBLAS, you can observe the following:

$ python -c "from loky.backend.utils import get_thread_limits; print(get_thread_limits())"
{'openblas': None, 'openmp_intel': None, 'openmp_gnu': 12, 'mkl': None}
$ python -c "import numpy; from loky.backend.utils import get_thread_limits; print(get_thread_limits())"
{'openblas': 12, 'openmp_intel': None, 'openmp_gnu': 12, 'mkl': None}

Both in loky and joblib it's hard to tell which Python packages should actually be imported in the process workers ahead of time: it depends on the tasks being scheduled for execution.

This means that the best time to runtime helper function is after the task has been unpickled on the worker (triggering compiled module imports) but before actually executing the function.

What I do not understand in the above code is why then GNU OpenMP could be detected: I did not important any compiled extensions built with the -fopenmp gcc compiler flag and I doubt that the Python interpreter it-self is built with this flag.

Comment thread loky/backend/utils.py Outdated

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another batch of comments and questions:

Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
@tomMoral

tomMoral commented Jun 11, 2018

Copy link
Copy Markdown
Contributor Author

This means that the best time to runtime helper function is after the task has been unpickled on the worker (triggering compiled module imports) but before actually executing the function.

I think that if we do not find the loaded library, we can fallback to setting the env variable that controls the maximal number of thread. So the initializer should be something like:

def initializer(n_threads):
    dynamic_threadpool_size = limit_threads_clibs(n_threads)
    for clib, dynamic in dynamic_threadpool_size:
        if not dynamic:
            os.environ[CLIB_ENV_VARIABLES[clib]] = str(n_threads)

This way, we would have the correct behavior without needing to predict which library will be used, on all platform.

The rest of the API is for rescaling the threadpool and in this case, it only needs to change the number of loaded library.

@ogrisel

ogrisel commented Jun 12, 2018

Copy link
Copy Markdown
Contributor

Indeed, good point.

@ogrisel

ogrisel commented Jun 12, 2018

Copy link
Copy Markdown
Contributor

Could you try to add a travis CI entry with anaconda numpy + MKL and another with conda-forge numpy + openblas?

@tomMoral

tomMoral commented Jun 12, 2018

Copy link
Copy Markdown
Contributor Author

Ok the dynamic library loading is now working in OSX.
I just need to see if listDll can do the job for windows and re-implement everything to have a mechanism that loops over all loaded library for the different functions

@tomMoral tomMoral changed the title ENH limit threads for OMP, MKL and OpenBLAS [WIP] ENH limit threads for OMP, MKL and OpenBLAS Jun 12, 2018
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
Comment thread loky/backend/utils.py Outdated
@tomMoral

Copy link
Copy Markdown
Contributor Author

Ok I just found this SO answer which show how to use Psapi and kernel32 to loop through all loaded modules. The openblas test is now working on windows 🎉
I will improve the tests.

@tomMoral tomMoral force-pushed the PR_limit_threads_omp branch 6 times, most recently from 6bf5489 to 302926c Compare June 19, 2018 14:04
Comment thread loky/backend/spawn.py Outdated

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more refactoring suggestions:

Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
for name, info in SUPPORTED_IMPLEMENTATION.items():
if self.starts_with_any(module_name, info['filename_prefixes']):
return name, info['library']
return None, None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_is_supported_implementation could be renamed to _get_library_info_for_path and just return the full info dict with the name and the APIs info in it or None.

To make this simpler we could rewrite:

SUPPORTED_IMPLEMENTATION = {
    "openmp_intel": {
        "filename_prefixes": ("libiomp",),
        "internal_api": "openmp"
        "user_api": "openmp",
    },
    ...
}

to:

SUPPORTED_IMPLEMENTATION = [
    {
        "name": "openmp_intel",
        "filename_prefixes": ("libiomp",),
        "internal_api": "openmp"
        "user_api": "openmp",
    },
    ...
]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the different names openmp_msvc, openmp_gnu,... or merge them as openmp with multiple filename_prefix?

I think the later makes more sense with our current implementation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that. I am fine with both as long as we are not able to retrieve the version number.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good. In case of nested parallelism BLAS inside prange, setting the number of threads for the inner BLAS can be done through the "blas" user-api, even if BLAS is linked to OpenMP. So there should never be a case where we need to explicitly restrict one OpenMP and not the other.

I am fine with both as long as we are not able to retrieve the version number.

OpenMP does not exposes it's version so I would not worry about that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenMP no but the individual openmp runtime library might have a version introspection API.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of them does. Retrieving the version is a real pain :)
What you have to do is use the _OPENMP preprocessor macro in a program to get it's value. This value is a date, not a version number. Finally you have to use a map date/version to match the version from the date...

Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated
Comment thread loky/backend/_thread_pool_limiters.py Outdated

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed IRL, I agree we should further get rid of the wrapper class itself and just use a bunch of stateless functions to get and set the limits + the context manager.

@rgommers

Copy link
Copy Markdown

FYI here is a recently open PR for adding a similar get/set number of threads for numpy: numpy/numpy#13136

@tomMoral tomMoral requested a review from ogrisel March 27, 2019 08:59

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for extraction as https://github.com/joblib/threadpoolctl and closing this PR :)

Comment thread tests/test_threadpool_limits.py
@tomMoral tomMoral closed this Mar 27, 2019
@tomMoral tomMoral deleted the PR_limit_threads_omp branch September 20, 2019 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants