Replace manual ref counting with shared_ptr/weak_ptr by bibrakc · Pull Request #1193 · aws/aws-ofi-nccl

bibrakc · 2026-04-13T23:33:21Z

Description of changes:

Replace manual reference counting between domain, endpoint, and communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as non-owning caches, comms hold shared_ptr, endpoints hold shared_ptr. Destruction cascades automatically via the shared_ptr destructor chain without requiring locks.

This eliminates release_ep(), release_domain(), the skip_lock and force_cleanup parameters, and the domain/device lock interlock during destruction. Destructors now call cleanup_resources() directly instead of relying on the manual release path.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

bibrakc · 2026-04-14T22:13:46Z

The domain/device cleanup_resources() prints INFO messages like:

NET/OFI 1 RDMA endpoints still active at domain cleanup
NET/OFI 1 RDMA domains still active at close

This is for example in :

int nccl_net_ofi_rdma_domain_t::cleanup_resources()
{


if (!this->ep_table.empty()) {
		NCCL_OFI_INFO(NCCL_NET, "%zu RDMA endpoints still active at domain cleanup",
					 this->ep_table.size());
		err_code = this->release_all_ep();
		if (err_code != 0) {
			NCCL_OFI_WARN("Cleanup of RDMA domain failed. RC: %d, ERROR: %s",
				       err_code, fi_strerror(-err_code));
			ret = -EINVAL;
		}
	}

With the old manual ref counting, these indicated endpoints that hadn't been properly released. With the new weak_ptr design, these just mean there are stale non-owning cache entries in the table and are completely harmless. So the messages are misleading now since they suggest a problem where none exists.

Need input on what to do:

1- Either remove the check or
2- Change the message.

What do you prefer?
@bwbarrett @rauteric

Update: we discussed offline and made changes to the PR.

bwbarrett

Love to see all the code removal. This seems way simpler.

bwbarrett

I'd probably squash the 4 commits. I probably would have done the GIN EP pointer cleanup in a second commit, but your GIN EP pointer cleanup ALSO has a bunch of other things in it, which really should just be part of the first commit.

Given where you are, i'd just squash rather than try to rework some of the commits to make more sense.

bibrakc · 2026-04-17T19:33:57Z

@bwbarrett squash done.

Replace manual reference counting between domain, endpoint, and communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as non-owning caches, comms hold shared_ptr<ep>, endpoints hold shared_ptr<domain>. Destruction cascades automatically via the shared_ptr destructor chain without requiring locks. This eliminates release_ep(), release_domain(), release_device(), cleanup_resources(), the skip_lock and force_cleanup parameters, the called_cleanup_resources flag, and the domain/device lock interlock during destruction. Cleanup logic is inlined directly into destructors. The plugin destructor uses delete instead of release_device(). GIN ep_holder and gin_listen_comm hold shared_ptr<ep> by const reference. get_domain_impl merged into get_domain. device_lock released before calling domain->get_ep(). Expired weak_ptr entries purged on cache miss to prevent unbounded table growth. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>

NVTX tracing is not built by CI, so build breakages under --with-nvtx=... have been landing undetected. For example, the shared_ptr refactor in PR aws#1193 left an implicit pointer cast in tracing_impl/nvtx.h that no CI job caught. Install the cuda-nvtx package in the CUDA-enabled CI container images so NVTX headers are available at configure time, driven by a new ENABLE_NVTX build arg. Parameterise the CUDA toolkit version as a CUDA_VERSION build arg (default 12-6) so cuda-cudart-dev, cuda-crt, and cuda-nvtx stay on the same release. Introduce a new tracing matrix value lttng-nvtx in matrix-config.json. Because NVTX requires CUDA, exclude rules pin lttng-nvtx to cuda only and pin plain lttng to neuron only; the resulting matrix produces the same number of containers as before, with cuda images retagged from -lttng- to -lttng-nvtx-. update-containers.yaml derives ENABLE_TRACING and ENABLE_NVTX from whether the tracing value contains lttng or nvtx. This change only rebuilds the container images. A follow-up PR will pass --with-nvtx=... to configure in distcheck.yaml and point the PR CI jobs at the new -lttng-nvtx- tag. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>

NVTX tracing is not built by CI, so build breakages under --with-nvtx=... have been landing undetected. For example, the shared_ptr refactor in PR aws#1193 left an implicit pointer cast in tracing_impl/nvtx.h that no CI job caught. Extend the tracing-enabled Ubuntu images to install cuda-nvtx in addition to liblttng-ust-dev whenever ENABLE_CUDA is also set. The two tracing backends are independent in the plugin (separate HAVE_LIBLTTNG_UST and HAVE_NVTX_TRACING guards, each tracepoint macro expands to calls into both) so they coexist cleanly in a single container and a single build. Neuron images still get just lttng because NVTX requires the CUDA toolkit. Parameterise the CUDA toolkit release as a CUDA_VERSION build arg (default 12-6) so cuda-cudart-dev, cuda-crt, and cuda-nvtx all stay on the same release. No matrix, workflow, or tag changes: the existing 'lttng' tracing variant now also brings in NVTX on CUDA containers. A follow-up PR will add --with-nvtx=... to the configure line in distcheck.yaml so the plugin actually gets built with NVTX in CI. Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 4 times, most recently from 654bc62 to 6216fbd Compare April 14, 2026 19:32

bibrakc marked this pull request as ready for review April 14, 2026 19:53

bibrakc requested review from a team and bwbarrett as code owners April 14, 2026 19:53

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from 6216fbd to 4727383 Compare April 14, 2026 20:40

bibrakc requested review from akkart-aws and rauteric April 14, 2026 21:50

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 2 times, most recently from 5c2a78a to 6b34d20 Compare April 15, 2026 00:47

bwbarrett reviewed Apr 15, 2026

View reviewed changes

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 3 times, most recently from 9dfbc1f to a89e943 Compare April 15, 2026 23:06

bwbarrett reviewed Apr 17, 2026

View reviewed changes

Comment thread include/nccl_ofi.h

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from a89e943 to ab416df Compare April 17, 2026 19:25

rauteric reviewed Apr 20, 2026

View reviewed changes

Comment thread src/rdma/gin/nccl_ofi_gin_api.cpp

Comment thread src/nccl_ofi_net.cpp Outdated

Comment thread src/nccl_ofi_net.cpp Outdated

Comment thread src/nccl_ofi_net.cpp

Comment thread src/nccl_ofi_net.cpp

Comment thread src/nccl_ofi_rdma.cpp Outdated

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 2 times, most recently from 4f088c6 to 1c7243e Compare April 20, 2026 18:51

bwbarrett previously approved these changes Apr 20, 2026

View reviewed changes

Comment thread include/nccl_ofi.h Outdated

bibrakc dismissed bwbarrett’s stale review via f37ed1b April 20, 2026 21:37

bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from 1c7243e to f37ed1b Compare April 20, 2026 21:37

rauteric approved these changes Apr 20, 2026

View reviewed changes

bwbarrett approved these changes Apr 20, 2026

View reviewed changes

bibrakc merged commit dbe6612 into aws:master Apr 20, 2026
71 checks passed

bibrakc mentioned this pull request May 4, 2026

nvtx: fix missing .get() on shared_ptr cast in RECV_SEGMENT_COMPLETE tracepoint #1221

Open

bibrakc mentioned this pull request May 5, 2026

ci: add NVTX to CI container images for CUDA builds #1224

Open

Conversation

bibrakc commented Apr 13, 2026

Uh oh!

bibrakc commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bwbarrett left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bwbarrett left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bibrakc commented Apr 17, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bibrakc commented Apr 14, 2026 •

edited

Loading