Replace manual ref counting with shared_ptr/weak_ptr#1193
Conversation
654bc62 to
6216fbd
Compare
6216fbd to
4727383
Compare
|
The domain/device cleanup_resources() prints INFO messages like: This is for example in : int nccl_net_ofi_rdma_domain_t::cleanup_resources()
{
if (!this->ep_table.empty()) {
NCCL_OFI_INFO(NCCL_NET, "%zu RDMA endpoints still active at domain cleanup",
this->ep_table.size());
err_code = this->release_all_ep();
if (err_code != 0) {
NCCL_OFI_WARN("Cleanup of RDMA domain failed. RC: %d, ERROR: %s",
err_code, fi_strerror(-err_code));
ret = -EINVAL;
}
}
With the old manual ref counting, these indicated endpoints that hadn't been properly released. With the new weak_ptr design, these just mean there are stale non-owning cache entries in the table and are completely harmless. So the messages are misleading now since they suggest a problem where none exists. Need input on what to do: 1- Either remove the check or What do you prefer? Update: we discussed offline and made changes to the PR. |
5c2a78a to
6b34d20
Compare
bwbarrett
left a comment
There was a problem hiding this comment.
Love to see all the code removal. This seems way simpler.
9dfbc1f to
a89e943
Compare
bwbarrett
left a comment
There was a problem hiding this comment.
I'd probably squash the 4 commits. I probably would have done the GIN EP pointer cleanup in a second commit, but your GIN EP pointer cleanup ALSO has a bunch of other things in it, which really should just be part of the first commit.
Given where you are, i'd just squash rather than try to rework some of the commits to make more sense.
a89e943 to
ab416df
Compare
|
@bwbarrett squash done. |
4f088c6 to
1c7243e
Compare
Replace manual reference counting between domain, endpoint, and communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as non-owning caches, comms hold shared_ptr<ep>, endpoints hold shared_ptr<domain>. Destruction cascades automatically via the shared_ptr destructor chain without requiring locks. This eliminates release_ep(), release_domain(), release_device(), cleanup_resources(), the skip_lock and force_cleanup parameters, the called_cleanup_resources flag, and the domain/device lock interlock during destruction. Cleanup logic is inlined directly into destructors. The plugin destructor uses delete instead of release_device(). GIN ep_holder and gin_listen_comm hold shared_ptr<ep> by const reference. get_domain_impl merged into get_domain. device_lock released before calling domain->get_ep(). Expired weak_ptr entries purged on cache miss to prevent unbounded table growth. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
1c7243e to
f37ed1b
Compare
NVTX tracing is not built by CI, so build breakages under --with-nvtx=... have been landing undetected. For example, the shared_ptr refactor in PR aws#1193 left an implicit pointer cast in tracing_impl/nvtx.h that no CI job caught. Install the cuda-nvtx package in the CUDA-enabled CI container images so NVTX headers are available at configure time, driven by a new ENABLE_NVTX build arg. Parameterise the CUDA toolkit version as a CUDA_VERSION build arg (default 12-6) so cuda-cudart-dev, cuda-crt, and cuda-nvtx stay on the same release. Introduce a new tracing matrix value lttng-nvtx in matrix-config.json. Because NVTX requires CUDA, exclude rules pin lttng-nvtx to cuda only and pin plain lttng to neuron only; the resulting matrix produces the same number of containers as before, with cuda images retagged from -lttng- to -lttng-nvtx-. update-containers.yaml derives ENABLE_TRACING and ENABLE_NVTX from whether the tracing value contains lttng or nvtx. This change only rebuilds the container images. A follow-up PR will pass --with-nvtx=... to configure in distcheck.yaml and point the PR CI jobs at the new -lttng-nvtx- tag. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
NVTX tracing is not built by CI, so build breakages under --with-nvtx=... have been landing undetected. For example, the shared_ptr refactor in PR aws#1193 left an implicit pointer cast in tracing_impl/nvtx.h that no CI job caught. Extend the tracing-enabled Ubuntu images to install cuda-nvtx in addition to liblttng-ust-dev whenever ENABLE_CUDA is also set. The two tracing backends are independent in the plugin (separate HAVE_LIBLTTNG_UST and HAVE_NVTX_TRACING guards, each tracepoint macro expands to calls into both) so they coexist cleanly in a single container and a single build. Neuron images still get just lttng because NVTX requires the CUDA toolkit. Parameterise the CUDA toolkit release as a CUDA_VERSION build arg (default 12-6) so cuda-cudart-dev, cuda-crt, and cuda-nvtx all stay on the same release. No matrix, workflow, or tag changes: the existing 'lttng' tracing variant now also brings in NVTX on CUDA containers. A follow-up PR will add --with-nvtx=... to the configure line in distcheck.yaml so the plugin actually gets built with NVTX in CI. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
Description of changes:
Replace manual reference counting between domain, endpoint, and communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as non-owning caches, comms hold shared_ptr, endpoints hold shared_ptr. Destruction cascades automatically via the shared_ptr destructor chain without requiring locks.
This eliminates release_ep(), release_domain(), the skip_lock and force_cleanup parameters, and the domain/device lock interlock during destruction. Destructors now call cleanup_resources() directly instead of relying on the manual release path.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.