Skip to content

Replace manual ref counting with shared_ptr/weak_ptr#1193

Merged
bibrakc merged 1 commit intoaws:masterfrom
bibrakc:cpp-refactor/shared-ptr-ref-counting
Apr 20, 2026
Merged

Replace manual ref counting with shared_ptr/weak_ptr#1193
bibrakc merged 1 commit intoaws:masterfrom
bibrakc:cpp-refactor/shared-ptr-ref-counting

Conversation

@bibrakc
Copy link
Copy Markdown
Contributor

@bibrakc bibrakc commented Apr 13, 2026

Description of changes:

Replace manual reference counting between domain, endpoint, and communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as non-owning caches, comms hold shared_ptr, endpoints hold shared_ptr. Destruction cascades automatically via the shared_ptr destructor chain without requiring locks.

This eliminates release_ep(), release_domain(), the skip_lock and force_cleanup parameters, and the domain/device lock interlock during destruction. Destructors now call cleanup_resources() directly instead of relying on the manual release path.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 4 times, most recently from 654bc62 to 6216fbd Compare April 14, 2026 19:32
@bibrakc bibrakc marked this pull request as ready for review April 14, 2026 19:53
@bibrakc bibrakc requested review from a team and bwbarrett as code owners April 14, 2026 19:53
@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from 6216fbd to 4727383 Compare April 14, 2026 20:40
@bibrakc bibrakc requested review from akkart-aws and rauteric April 14, 2026 21:50
@bibrakc
Copy link
Copy Markdown
Contributor Author

bibrakc commented Apr 14, 2026

The domain/device cleanup_resources() prints INFO messages like:

NET/OFI 1 RDMA endpoints still active at domain cleanup
NET/OFI 1 RDMA domains still active at close

This is for example in :

int nccl_net_ofi_rdma_domain_t::cleanup_resources()
{


if (!this->ep_table.empty()) {
		NCCL_OFI_INFO(NCCL_NET, "%zu RDMA endpoints still active at domain cleanup",
					 this->ep_table.size());
		err_code = this->release_all_ep();
		if (err_code != 0) {
			NCCL_OFI_WARN("Cleanup of RDMA domain failed. RC: %d, ERROR: %s",
				       err_code, fi_strerror(-err_code));
			ret = -EINVAL;
		}
	}

With the old manual ref counting, these indicated endpoints that hadn't been properly released. With the new weak_ptr design, these just mean there are stale non-owning cache entries in the table and are completely harmless. So the messages are misleading now since they suggest a problem where none exists.

Need input on what to do:

1- Either remove the check or
2- Change the message.

What do you prefer?
@bwbarrett @rauteric

Update: we discussed offline and made changes to the PR.

@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 2 times, most recently from 5c2a78a to 6b34d20 Compare April 15, 2026 00:47
Copy link
Copy Markdown
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love to see all the code removal. This seems way simpler.

Comment thread include/rdma/gin/nccl_ofi_gin.h Outdated
Comment thread include/rdma/gin/nccl_ofi_gin.h Outdated
Comment thread include/rdma/gin/nccl_ofi_gin_resources.h Outdated
Comment thread include/nccl_ofi.h
Comment thread include/nccl_ofi.h
Comment thread src/nccl_ofi_net.cpp Outdated
Comment thread src/nccl_ofi_net.cpp
Comment thread src/nccl_ofi_net.cpp Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
Comment thread src/nccl_ofi_rdma.cpp Outdated
@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 3 times, most recently from 9dfbc1f to a89e943 Compare April 15, 2026 23:06
Copy link
Copy Markdown
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably squash the 4 commits. I probably would have done the GIN EP pointer cleanup in a second commit, but your GIN EP pointer cleanup ALSO has a bunch of other things in it, which really should just be part of the first commit.

Given where you are, i'd just squash rather than try to rework some of the commits to make more sense.

Comment thread include/nccl_ofi.h
@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from a89e943 to ab416df Compare April 17, 2026 19:25
@bibrakc
Copy link
Copy Markdown
Contributor Author

bibrakc commented Apr 17, 2026

@bwbarrett squash done.

Comment thread src/rdma/gin/nccl_ofi_gin_api.cpp
Comment thread src/nccl_ofi_net.cpp Outdated
Comment thread src/nccl_ofi_net.cpp Outdated
Comment thread src/nccl_ofi_net.cpp
Comment thread src/nccl_ofi_net.cpp
Comment thread src/nccl_ofi_rdma.cpp Outdated
@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch 2 times, most recently from 4f088c6 to 1c7243e Compare April 20, 2026 18:51
bwbarrett
bwbarrett previously approved these changes Apr 20, 2026
Comment thread include/nccl_ofi.h Outdated
Replace manual reference counting between domain, endpoint, and
communicator with shared_ptr/weak_ptr. Tables hold weak_ptr as
non-owning caches, comms hold shared_ptr<ep>, endpoints hold
shared_ptr<domain>. Destruction cascades automatically via the
shared_ptr destructor chain without requiring locks.

This eliminates release_ep(), release_domain(), release_device(),
cleanup_resources(), the skip_lock and force_cleanup parameters,
the called_cleanup_resources flag, and the domain/device lock
interlock during destruction. Cleanup logic is inlined directly
into destructors. The plugin destructor uses delete instead of
release_device().

GIN ep_holder and gin_listen_comm hold shared_ptr<ep> by const
reference. get_domain_impl merged into get_domain. device_lock
released before calling domain->get_ep(). Expired weak_ptr
entries purged on cache miss to prevent unbounded table growth.

Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
@bibrakc bibrakc force-pushed the cpp-refactor/shared-ptr-ref-counting branch from 1c7243e to f37ed1b Compare April 20, 2026 21:37
@bibrakc bibrakc merged commit dbe6612 into aws:master Apr 20, 2026
71 checks passed
bibrakc added a commit to bibrakc/aws-ofi-nccl that referenced this pull request May 5, 2026
NVTX tracing is not built by CI, so build breakages under
--with-nvtx=... have been landing undetected.  For example, the
shared_ptr refactor in PR aws#1193 left an implicit pointer cast in
tracing_impl/nvtx.h that no CI job caught.

Install the cuda-nvtx package in the CUDA-enabled CI container
images so NVTX headers are available at configure time, driven by
a new ENABLE_NVTX build arg.  Parameterise the CUDA toolkit version
as a CUDA_VERSION build arg (default 12-6) so cuda-cudart-dev,
cuda-crt, and cuda-nvtx stay on the same release.

Introduce a new tracing matrix value lttng-nvtx in matrix-config.json.
Because NVTX requires CUDA, exclude rules pin lttng-nvtx to cuda only
and pin plain lttng to neuron only; the resulting matrix produces the
same number of containers as before, with cuda images retagged from
-lttng- to -lttng-nvtx-.  update-containers.yaml derives ENABLE_TRACING
and ENABLE_NVTX from whether the tracing value contains lttng or nvtx.

This change only rebuilds the container images.  A follow-up PR will
pass --with-nvtx=... to configure in distcheck.yaml and point the
PR CI jobs at the new -lttng-nvtx- tag.

Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
bibrakc added a commit to bibrakc/aws-ofi-nccl that referenced this pull request May 5, 2026
NVTX tracing is not built by CI, so build breakages under
--with-nvtx=... have been landing undetected.  For example, the
shared_ptr refactor in PR aws#1193 left an implicit pointer cast in
tracing_impl/nvtx.h that no CI job caught.

Extend the tracing-enabled Ubuntu images to install cuda-nvtx in
addition to liblttng-ust-dev whenever ENABLE_CUDA is also set.  The
two tracing backends are independent in the plugin (separate
HAVE_LIBLTTNG_UST and HAVE_NVTX_TRACING guards, each tracepoint
macro expands to calls into both) so they coexist cleanly in a
single container and a single build.  Neuron images still get just
lttng because NVTX requires the CUDA toolkit.

Parameterise the CUDA toolkit release as a CUDA_VERSION build arg
(default 12-6) so cuda-cudart-dev, cuda-crt, and cuda-nvtx all stay
on the same release.

No matrix, workflow, or tag changes: the existing 'lttng' tracing
variant now also brings in NVTX on CUDA containers.  A follow-up
PR will add --with-nvtx=... to the configure line in distcheck.yaml
so the plugin actually gets built with NVTX in CI.

Signed-off-by: Bibrak Qamar Chandio <bibracha@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants