Predictable raft::resources#3052
Conversation
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR changes RAFT resource storage to shared atomic cells, updates resource helper initialization to use default factories, revises multi-GPU and NCCL resource setup, changes memory snapshot handling, and raises C++/CUDA language requirements to C++20. ChangesRAFT resource storage and core semantics
Estimated code review effort: 4 (Complex) | ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/include/raft/core/resource/device_memory_resource.hpp (1)
255-323: 📐 Maintainability & Code Quality | 🟠 Major | 🏗️ Heavy liftHIGH: Public workspace setter signatures now hard-break const-callers.
Changing
set_workspace_resource,set_workspace_to_pool_resource,set_workspace_to_global_resource, andset_large_workspace_resourcefromresources const&toresources&is a source-breaking change in a public header, and I don't see deprecated bridge overloads or migration shims here. Downstream code that forwards a const handle into these helpers will stop compiling immediately.As per path instructions, "public C++ APIs should be stateless and lightweight wrappers" and "breaking C++ API changes require deprecation over releases when possible".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/device_memory_resource.hpp` around lines 255 - 323, The public workspace setter APIs now require a mutable resources handle, which breaks existing const-callers. Restore compatibility in the workspace helper functions set_workspace_resource, set_workspace_to_pool_resource, set_workspace_to_global_resource, and set_large_workspace_resource by adding const-friendly overloads or a non-breaking bridge that forwards from resources const& without changing existing call sites, while keeping the new behavior internally where mutation is needed.Source: Path instructions
cpp/include/raft/core/resource/sub_comms.hpp (1)
54-67: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftHIGH:
set_subcommmutates shared state through a const handle.Line 59 keeps
resources const&, but the function inserts into a sharedunordered_mapwith no synchronization. Under the new shared-cell copy semantics that mutation propagates across copies and can race withget_subcommor anotherset_subcomm, which is at odds with the documented "different instances of raft::resources are safe" expectation.As per path instructions, "different instances of raft::resources are safe", and the PR objective says copies "can be accessed and modified independently without propagating changes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/sub_comms.hpp` around lines 54 - 67, `set_subcomm` is mutating shared `resources` state through a const handle, which can propagate across copies and race with `get_subcomm` or another `set_subcomm`. Update the `set_subcomm` path in `sub_comms.hpp` so sub-communicator updates are isolated per `resources` instance rather than writing directly through shared state; use the existing `resources`/`sub_comms_resource` APIs to make the mutation copy-local and add synchronization or ownership separation as needed. Keep the fix centered around `set_subcomm`, `get_subcomm`, and `sub_comms_resource_factory` so the “different instances are safe” and independent-copy semantics are preserved.Source: Path instructions
🧹 Nitpick comments (2)
cpp/tests/core/handle.cpp (1)
354-360: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueStrengthen the propagation assertion (optional).
stream_a == stream_bis trivially satisfied because both resolve to the defaultcuda_stream_per_thread, so it doesn't actually prove the concrete resource is shared via the cell. Consider explicitly setting a non-default stream onabefore the lazy-init path, or rely solely onhas_resource_factory(b)as the propagation signal. Not a defect — thehas_resource_factorycheck on Line 358 already validates the core behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/core/handle.cpp` around lines 354 - 360, The propagation test in the resource lazy-init path is too weak because `resource::get_cuda_stream(a)` and `resource::get_cuda_stream(b)` can both return the default `cuda_stream_per_thread`, so the equality check in the `handle.cpp` test does not prove shared state. Update the test around `resource::get_cuda_stream` and `has_resource_factory` by either setting a non-default CUDA stream on `a` before lazy initialization and then asserting `b` sees the same concrete stream, or remove the trivial `ASSERT_EQ(stream_a, stream_b)` and rely on the `has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW)` assertion as the propagation check.cpp/include/raft/core/resource/cublaslt_handle.hpp (1)
49-50: 🚀 Performance & Scalability | 🔵 TrivialAvoid allocating the default factory on every
get_cublaslt_handle()call.
ensure_default_factory()still receives a freshly constructedcublaslt_resource_factoryeach time, so repeated lookups pay an unnecessary heap allocation before the CAS. Have you considered a lazy-builder overload or an earlyhas_resource_factory()check?🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/cublaslt_handle.hpp` around lines 49 - 50, The get_cublaslt_handle() path is still creating a new cublaslt_resource_factory on every call before ensuring the default factory. Update the cublaslt_handle.hpp logic around get_cublaslt_handle() and ensure_default_factory() so the default factory is only constructed lazily when no factory is already present, using an early has_resource_factory() check or a lazy-builder overload. Keep the resource lookup behavior the same, but avoid the repeated heap allocation before the CAS.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/raft/core/resource/comms.hpp`:
- Line 61: The public resource setter API was changed from const to non-const,
which is a breaking change for existing callers using const resources handles.
Update the relevant public-facing documentation and migration/upgrade notes to
explicitly call out the `set_comms` signature change in `resources` and the
similar setter changes in this PR, and include guidance for downstream users to
update their call sites accordingly. Use the `set_comms` function and any
related setters as the symbols to reference in the release notes so the API
change is easy to find.
In `@cpp/include/raft/core/resource/cublas_handle.hpp`:
- Around line 62-64: The cublas handle access path in cublas_handle.hpp is
mutating a shared CUBLAS_HANDLE cell by rebinding its stream on every call,
which breaks safety when copied resources diverge via CUDA_STREAM_VIEW. Update
the cublas handle setup/access logic around cublas_resource_factory and
get_cublas_handle so each raft::resources instance keeps stream ordering
isolated, rather than reusing and retargeting the same handle across copies.
Ensure the handle’s stream is bound per-resource or per-stream-view in a way
that does not affect sibling copies and preserves correct ordering with
get_cuda_stream().
In `@cpp/include/raft/core/resource/cusolver_dn_handle.hpp`:
- Around line 68-69: get_cusolver_dn_handle currently returns a cached
cusolverDnHandle_t without reattaching it to the current stream, so after
set_cuda_stream() the handle may still target an old stream. Update
get_cusolver_dn_handle to call cusolverDnSetStream(...) on the retrieved handle
using get_cuda_stream(res) before returning it, and apply the same rebinding
pattern to get_cusparse_handle and get_cusolver_sp_handle so all stream-bound
solver handles stay aligned with the active raft::resources stream.
In `@cpp/include/raft/core/resource/multi_gpu.hpp`:
- Around line 55-58: The public root-rank API change in get_root_rank and
set_root_rank needs a temporary deprecated compatibility bridge to avoid
breaking downstream callers. Add a deprecated shim in the multi_gpu resource API
that preserves the previous mutable-reference/caller behavior while forwarding
to the new implementation, and make sure set_root_rank still accepts existing
const resources usage through a migration wrapper. Include Doxygen on any new
public API in resources/multi_gpu.hpp and mark the old entry points deprecated
with a clear migration path before removing them.
In `@cpp/include/raft/core/resource/nccl_comm.hpp`:
- Around line 78-80: `get_resource` in `nccl_comm.hpp` is using
`has_resource_factory(NCCL_COMM)` as an initialization gate, which races under
the shared-cell model and can return a partially initialized communicator vector
or run `_init_nccl_comms(res)` twice. Make `get_resource` (and the NCCL_COMM
path in `ensure_default_factory`/`nccl_comm_resource_factory`) rely on a single
atomic initialization mechanism so that only one caller publishes the factory
and performs `_init_nccl_comms(res)`, while concurrent callers on the same const
handle safely observe the fully initialized state.
In `@cpp/include/raft/core/resources.hpp`:
- Line 81: The public resource update API in resources.hpp has a breaking
constness change, so add a compatibility/deprecation path instead of only
exposing the new mutable form. Keep add_resource_factory and the related set_*
wrappers locatable as the affected symbols, and provide transitional overloads
or forwarding methods that preserve the old public surface where possible while
steering callers to the new API. For the public header, add Doxygen for any new
or transitional APIs and mark the breaking replacements with deprecation
warnings so downstream users get a clear migration path.
- Around line 100-104: `ensure_default_factory` currently indexes `cells_` with
`rtype` without validating the resource type, so a factory returning `LAST_KEY`
can access past the vector. Update `ensure_default_factory` in `resources.hpp`
to mirror the `add_resource_factory` guard for
`resource::resource_type::LAST_KEY` before touching `cells_`, and only perform
the `cells_[rtype]->factory.compare_exchange_strong(...)` path for valid
resource types.
---
Outside diff comments:
In `@cpp/include/raft/core/resource/device_memory_resource.hpp`:
- Around line 255-323: The public workspace setter APIs now require a mutable
resources handle, which breaks existing const-callers. Restore compatibility in
the workspace helper functions set_workspace_resource,
set_workspace_to_pool_resource, set_workspace_to_global_resource, and
set_large_workspace_resource by adding const-friendly overloads or a
non-breaking bridge that forwards from resources const& without changing
existing call sites, while keeping the new behavior internally where mutation is
needed.
In `@cpp/include/raft/core/resource/sub_comms.hpp`:
- Around line 54-67: `set_subcomm` is mutating shared `resources` state through
a const handle, which can propagate across copies and race with `get_subcomm` or
another `set_subcomm`. Update the `set_subcomm` path in `sub_comms.hpp` so
sub-communicator updates are isolated per `resources` instance rather than
writing directly through shared state; use the existing
`resources`/`sub_comms_resource` APIs to make the mutation copy-local and add
synchronization or ownership separation as needed. Keep the fix centered around
`set_subcomm`, `get_subcomm`, and `sub_comms_resource_factory` so the “different
instances are safe” and independent-copy semantics are preserved.
---
Nitpick comments:
In `@cpp/include/raft/core/resource/cublaslt_handle.hpp`:
- Around line 49-50: The get_cublaslt_handle() path is still creating a new
cublaslt_resource_factory on every call before ensuring the default factory.
Update the cublaslt_handle.hpp logic around get_cublaslt_handle() and
ensure_default_factory() so the default factory is only constructed lazily when
no factory is already present, using an early has_resource_factory() check or a
lazy-builder overload. Keep the resource lookup behavior the same, but avoid the
repeated heap allocation before the CAS.
In `@cpp/tests/core/handle.cpp`:
- Around line 354-360: The propagation test in the resource lazy-init path is
too weak because `resource::get_cuda_stream(a)` and
`resource::get_cuda_stream(b)` can both return the default
`cuda_stream_per_thread`, so the equality check in the `handle.cpp` test does
not prove shared state. Update the test around `resource::get_cuda_stream` and
`has_resource_factory` by either setting a non-default CUDA stream on `a` before
lazy initialization and then asserting `b` sees the same concrete stream, or
remove the trivial `ASSERT_EQ(stream_a, stream_b)` and rely on the
`has_resource_factory(resource::resource_type::CUDA_STREAM_VIEW)` assertion as
the propagation check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 38c5e204-a599-42f7-a81d-cdc0784e4e52
📒 Files selected for processing (29)
cpp/CMakeLists.txtcpp/include/raft/core/device_resources.hppcpp/include/raft/core/handle.hppcpp/include/raft/core/memory_stats_resources.hppcpp/include/raft/core/memory_tracking_resources.hppcpp/include/raft/core/resource/comms.hppcpp/include/raft/core/resource/cublas_handle.hppcpp/include/raft/core/resource/cublaslt_handle.hppcpp/include/raft/core/resource/cuda_stream.hppcpp/include/raft/core/resource/cuda_stream_pool.hppcpp/include/raft/core/resource/cusolver_dn_handle.hppcpp/include/raft/core/resource/cusolver_sp_handle.hppcpp/include/raft/core/resource/cusparse_handle.hppcpp/include/raft/core/resource/custom_resource.hppcpp/include/raft/core/resource/detail/stream_sync_event.hppcpp/include/raft/core/resource/device_id.hppcpp/include/raft/core/resource/device_memory_resource.hppcpp/include/raft/core/resource/device_properties.hppcpp/include/raft/core/resource/managed_memory_resource.hppcpp/include/raft/core/resource/multi_gpu.hppcpp/include/raft/core/resource/nccl_comm.hppcpp/include/raft/core/resource/pinned_memory_resource.hppcpp/include/raft/core/resource/resource_types.hppcpp/include/raft/core/resource/stream_view.hppcpp/include/raft/core/resource/sub_comms.hppcpp/include/raft/core/resource/thrust_policy.hppcpp/include/raft/core/resources.hppcpp/internal/CMakeLists.txtcpp/tests/core/handle.cpp
18b01f9 to
34c4bfa
Compare
…tion of multiple independent comm cliques.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/include/raft/core/resource/sub_comms.hpp (1)
14-23: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftUnsynchronized shared sub-comm state
raft::resourcescopies now share this cell, butcommunicators_is still read and mutated without any locking. That makesset_subcomm()/get_subcomm()on copied handles race on the samestd::unordered_map, which can corrupt the container or crash.Have you considered guarding this state or switching
set_subcomm()to the same replace-the-cell pattern used byset_comms()?🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/sub_comms.hpp` around lines 14 - 23, `sub_comms_resource` currently exposes a shared `communicators_` map through `get_resource()` with no synchronization, so copied `raft::resources` handles can race when `set_subcomm()` and `get_subcomm()` touch the same state. Fix this by making the sub-comm state safe for concurrent access in the `sub_comms_resource`/`set_subcomm()`/`get_subcomm()` path, either by guarding `communicators_` with the same locking approach used elsewhere or by switching `set_subcomm()` to a replace-the-cell pattern instead of mutating the shared `std::unordered_map` in place.Source: Path instructions
♻️ Duplicate comments (5)
cpp/include/raft/core/resource/cublaslt_handle.hpp (1)
47-51: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftSame eager-allocation pattern as other resource getters.
get_cublaslt_handlenow heap-allocates ashared_ptr<cublaslt_resource_factory>on every call rather than only whenCUBLASLT_HANDLEis unregistered. See the primary write-up oncpp/include/raft/core/resource/cuda_stream.hpp(lines 60-64).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/cublaslt_handle.hpp` around lines 47 - 51, The get_cublaslt_handle helper eagerly creates a new cublaslt_resource_factory on every call, matching the same allocation issue seen in the other resource getter. Update get_cublaslt_handle to only construct and register the factory when CUBLASLT_HANDLE is not already present, using the existing resources APIs and the cublaslt_resource_factory symbol to preserve lazy initialization.cpp/include/raft/core/resource/pinned_memory_resource.hpp (1)
56-63: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftFactory heap-allocated on every call, even when resource already exists.
res.ensure_default_factory(std::make_shared<pinned_memory_resource_factory>())builds and heap-allocates ashared_ptr<pinned_memory_resource_factory>unconditionally on every invocation ofget_pinned_memory_resource_ref, regardless of whetherPINNED_MEMORY_RESOURCEis already registered — the argument is evaluated eagerly beforeensure_default_factorycan decide whether it's needed. The priorhas_resource_factory/add_resource_factorypattern only allocated the factory on the (rare) first call. See the related discussion oncpp/include/raft/core/resource/cuda_stream.hpp(lines 60-64) for the primary write-up; same pattern repeats here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/pinned_memory_resource.hpp` around lines 56 - 63, The get_pinned_memory_resource_ref helper eagerly allocates a pinned_memory_resource_factory on every call because ensure_default_factory is passed a freshly constructed shared_ptr unconditionally. Update get_pinned_memory_resource_ref to use the same lazy registration pattern as the cuda_stream resource helper: check whether the PINNED_MEMORY_RESOURCE factory already exists before creating and registering pinned_memory_resource_factory, so the heap allocation only happens on the first call.cpp/include/raft/core/resource/cusolver_dn_handle.hpp (1)
66-70: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftSame eager-allocation pattern as other resource getters.
Same concern as flagged in
cpp/include/raft/core/resource/cuda_stream.hpp(lines 60-64):get_cusolver_dn_handlenow heap-allocates a factory shared_ptr (plus a nested one viaget_cuda_stream(res)) on every call, not just the first.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/cusolver_dn_handle.hpp` around lines 66 - 70, The get_cusolver_dn_handle helper is eagerly allocating a new cusolver_dn_resource_factory and a new CUDA stream-backed factory on every call instead of only initializing once. Update get_cusolver_dn_handle to follow the same lazy-caching pattern used by the other resource getters, so the factory setup happens only when the resource is missing and reused afterward; use the existing resources / ensure_default_factory / get_resource flow to avoid repeated heap allocations.cpp/include/raft/core/resource/cusparse_handle.hpp (1)
58-62: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftSame eager-allocation pattern as other resource getters.
get_cusparse_handleheap-allocates a factory shared_ptr (plus a nested one viaget_cuda_stream(res)) on every call now, not just on first registration. See the primary write-up oncpp/include/raft/core/resource/cuda_stream.hpp(lines 60-64).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/cusparse_handle.hpp` around lines 58 - 62, The get_cusparse_handle accessor is eagerly allocating a new cusparse_resource_factory on every call instead of only when the resource is first registered. Update get_cusparse_handle to follow the same lazy registration pattern as the other resource getters, reusing the existing factory once present and only constructing the cusparse_resource_factory (and calling get_cuda_stream(res)) when the CUSPARSE_HANDLE resource is missing.cpp/include/raft/core/resource/thrust_policy.hpp (1)
54-59: 🚀 Performance & Scalability | 🟠 Major | 🏗️ Heavy liftSame eager-allocation-on-every-call pattern, plus a second nested allocation.
get_thrust_policyallocates ashared_ptr<thrust_policy_resource_factory>on every call regardless of whetherTHRUST_POLICYis already registered, and the constructor argumentget_cuda_stream(res)triggers its own independentensure_default_factoryallocation forCUDA_STREAM_VIEWinternally — so a singleget_thrust_policycall now does up to 2 heap allocations even in steady state, versus none previously. See the primary comment oncpp/include/raft/core/resource/cuda_stream.hpp(lines 60-64).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/raft/core/resource/thrust_policy.hpp` around lines 54 - 59, get_thrust_policy currently does eager work on every call: it always creates a thrust_policy_resource_factory and also forces get_cuda_stream(res), which may allocate again even when THRUST_POLICY already exists. Update get_thrust_policy to first check whether the THRUST_POLICY resource is already registered and only call ensure_default_factory when it is missing, reusing the existing resource afterward. Keep the fix localized around get_thrust_policy and its use of thrust_policy_resource_factory/get_cuda_stream so steady-state calls avoid unnecessary heap allocations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/raft/core/resource/cuda_stream.hpp`:
- Around line 60-64: The hot path in get_cuda_stream() eagerly constructs a new
cuda_stream_resource_factory on every call because ensure_default_factory is
passed a freshly created shared_ptr by value. Update get_cuda_stream() to avoid
creating the factory unless it is actually needed, and apply the same lazy
pattern to the other resource getter helpers that delegate through
get_cuda_stream(res) so they do not incur repeated heap allocations after
initialization.
---
Outside diff comments:
In `@cpp/include/raft/core/resource/sub_comms.hpp`:
- Around line 14-23: `sub_comms_resource` currently exposes a shared
`communicators_` map through `get_resource()` with no synchronization, so copied
`raft::resources` handles can race when `set_subcomm()` and `get_subcomm()`
touch the same state. Fix this by making the sub-comm state safe for concurrent
access in the `sub_comms_resource`/`set_subcomm()`/`get_subcomm()` path, either
by guarding `communicators_` with the same locking approach used elsewhere or by
switching `set_subcomm()` to a replace-the-cell pattern instead of mutating the
shared `std::unordered_map` in place.
---
Duplicate comments:
In `@cpp/include/raft/core/resource/cublaslt_handle.hpp`:
- Around line 47-51: The get_cublaslt_handle helper eagerly creates a new
cublaslt_resource_factory on every call, matching the same allocation issue seen
in the other resource getter. Update get_cublaslt_handle to only construct and
register the factory when CUBLASLT_HANDLE is not already present, using the
existing resources APIs and the cublaslt_resource_factory symbol to preserve
lazy initialization.
In `@cpp/include/raft/core/resource/cusolver_dn_handle.hpp`:
- Around line 66-70: The get_cusolver_dn_handle helper is eagerly allocating a
new cusolver_dn_resource_factory and a new CUDA stream-backed factory on every
call instead of only initializing once. Update get_cusolver_dn_handle to follow
the same lazy-caching pattern used by the other resource getters, so the factory
setup happens only when the resource is missing and reused afterward; use the
existing resources / ensure_default_factory / get_resource flow to avoid
repeated heap allocations.
In `@cpp/include/raft/core/resource/cusparse_handle.hpp`:
- Around line 58-62: The get_cusparse_handle accessor is eagerly allocating a
new cusparse_resource_factory on every call instead of only when the resource is
first registered. Update get_cusparse_handle to follow the same lazy
registration pattern as the other resource getters, reusing the existing factory
once present and only constructing the cusparse_resource_factory (and calling
get_cuda_stream(res)) when the CUSPARSE_HANDLE resource is missing.
In `@cpp/include/raft/core/resource/pinned_memory_resource.hpp`:
- Around line 56-63: The get_pinned_memory_resource_ref helper eagerly allocates
a pinned_memory_resource_factory on every call because ensure_default_factory is
passed a freshly constructed shared_ptr unconditionally. Update
get_pinned_memory_resource_ref to use the same lazy registration pattern as the
cuda_stream resource helper: check whether the PINNED_MEMORY_RESOURCE factory
already exists before creating and registering pinned_memory_resource_factory,
so the heap allocation only happens on the first call.
In `@cpp/include/raft/core/resource/thrust_policy.hpp`:
- Around line 54-59: get_thrust_policy currently does eager work on every call:
it always creates a thrust_policy_resource_factory and also forces
get_cuda_stream(res), which may allocate again even when THRUST_POLICY already
exists. Update get_thrust_policy to first check whether the THRUST_POLICY
resource is already registered and only call ensure_default_factory when it is
missing, reusing the existing resource afterward. Keep the fix localized around
get_thrust_policy and its use of thrust_policy_resource_factory/get_cuda_stream
so steady-state calls avoid unnecessary heap allocations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b3c8bea1-cce3-4baa-9e8a-37fe515192df
📒 Files selected for processing (30)
cpp/.clang-formatcpp/CMakeLists.txtcpp/include/raft/core/device_resources.hppcpp/include/raft/core/handle.hppcpp/include/raft/core/memory_stats_resources.hppcpp/include/raft/core/memory_tracking_resources.hppcpp/include/raft/core/resource/comms.hppcpp/include/raft/core/resource/cublas_handle.hppcpp/include/raft/core/resource/cublaslt_handle.hppcpp/include/raft/core/resource/cuda_stream.hppcpp/include/raft/core/resource/cuda_stream_pool.hppcpp/include/raft/core/resource/cusolver_dn_handle.hppcpp/include/raft/core/resource/cusolver_sp_handle.hppcpp/include/raft/core/resource/cusparse_handle.hppcpp/include/raft/core/resource/custom_resource.hppcpp/include/raft/core/resource/detail/stream_sync_event.hppcpp/include/raft/core/resource/device_id.hppcpp/include/raft/core/resource/device_memory_resource.hppcpp/include/raft/core/resource/device_properties.hppcpp/include/raft/core/resource/managed_memory_resource.hppcpp/include/raft/core/resource/multi_gpu.hppcpp/include/raft/core/resource/nccl_comm.hppcpp/include/raft/core/resource/pinned_memory_resource.hppcpp/include/raft/core/resource/resource_types.hppcpp/include/raft/core/resource/stream_view.hppcpp/include/raft/core/resource/sub_comms.hppcpp/include/raft/core/resource/thrust_policy.hppcpp/include/raft/core/resources.hppcpp/internal/CMakeLists.txtcpp/tests/core/handle.cpp
…to avoid overheads of creating and discarding the factories
…to avoid overheads of creating and discarding the factories
…to avoid overheads of creating and discarding the factories
…to avoid overheads of creating and discarding the factories
|
Note for reviewers: in the last few commits I've addressed coderabbitai comments, mostly minimizing the diff. I don't expect this could affect the state of any downstream projects (full local build with dependencies is in progress, I'll report if I see any failures). |
Make the
raft::resourcesresource initialization semantics more predictable.raft::resourceshandle construction: all copies of theraft::resourceshandle point to the same resources (not a breaking change, fixes the re-initialization issue).set_resourcechanges to non-const semantics (breaking change).Before: all
set_xxxresource-updating calls were operating on const handleNow: all
set_xxxresources require a non-const handle.Thread-safety
First and foremost, thread-safety of using a specific resource depends on that resource. Here we discuss the thread-safety of using
raft::resourceshandle itself (get_resource and set_resource functions).Accessing the same resource by const ref
Updates (set_resource) are not possible (a user should copy a handle and modify the new one). All concurrent get_resource calls are atomic and safe, even if they initialize the factories and resources under the hood. The worst can happen is the same resource being initialized concurrently in two threads but only one being stored in the handle (the other one is discarded).
Accessing the same resource by non-const ref
Using the same object by non-const ref from multiple threads is always unsafe.
Accessing copies of the same resource
The resources and factories are updated atomically. Modifying any resource doesn't propagate to the copies. Accessing a resource while another threads access or modifies the same resource via another handle is thus safe.
Implementation details
The PR adds one more layer of indirection (one extra shared_ptr) to each resource, which may cause an extra runtime overhead. This is unavoidable if we want to allow lazy-initialized resources back-propagate across handles.
On the other hand, the PR removes the handle mutex in favor of C++20
std::atomic<shared_ptr>, which may reduce the runtime overheads a little bit.Breaking changes
raft::resourceshandles.std::atomic<shared_ptr>requires enforcing C++20 for all dependent libraries. An alternative would be to put a mutex perraft::resource_cell.