gin: GDAKI reuses shared proxy APIs for MR reg, close, progress, finalize #1201
anshumang merged 1 commit into aws:master on Apr 24, 2026
Wire all 11 shared plugin APIs (init, devices, listen, connect, regMrSym, regMrSymDmaBuf, deregMrSym, closeColl, closeListen, ginProgress, finalize) directly into nccl_ofi_gin_gdaki_plugin at compile time rather than patching the plugin struct at runtime. Reuses the proxy implementations verbatim because these APIs operate on shared types (nccl_ofi_rdma_gin_put_comm etc.) produced by connect() in both modes.

Changes:
- Un-static the 11 shared functions in nccl_ofi_gin_api.cpp so they have external linkage.
- Forward-declare them in nccl_ofi_gin_gdaki.h.
- nccl_ofi_gin_gdaki_plugin now sets all 11 slots to the shared functions; only createContext, destroyContext, get_properties, and queryLastError remain GDAKI-specific.
- Drop the 4 runtime .init/.devices/.listen/.connect assignments from nccl_ofi_gin_api.cpp's GDAKI init branch; keep the memcpy that swaps the exported plugin symbol.

Remaining GDAKI-only stubs: createContext, destroyContext (pair tracked separately), regMrSym/regMrSymDmaBuf; those return ncclInternalError until real GDAKI MR registration is wired up.

Tested on p5en cluster (EFA, libfabric 2.2, CUDA 12.8, 1/2 node).
bhasunit approved these changes on Apr 24, 2026
mozarhua approved these changes on Apr 24, 2026
Summary
Wire GDAKI plugin APIs that are structurally identical to the proxy APIs
directly to the proxy implementations at compile time, instead of patching
the plugin struct at runtime. Rebased onto current master (post-#1197 and
#1195) to use the v13 GIN API.
Design
11 plugin APIs (init, devices, listen, connect, regMrSym, regMrSymDmaBuf, deregMrSym, closeColl, closeListen, ginProgress, finalize) operate on shared types (nccl_ofi_rdma_gin_put_comm etc.) produced by connect() in both proxy and GDAKI modes. Their semantics are identical across modes, so GDAKI reuses the proxy implementations directly.
Concretely:
- Un-static the 11 shared functions in nccl_ofi_gin_api.cpp so they have external linkage.
- Forward-declare them in nccl_ofi_gin_gdaki.h.
- nccl_ofi_gin_gdaki_plugin now sets all 11 slots to the shared functions at compile time. GDAKI-specific stubs (createContext, destroyContext, get_properties, queryLastError) stay in nccl_ofi_gin_gdaki.cpp.
- Drop the 4 runtime .init/.devices/.listen/.connect assignments from the GDAKI init branch in nccl_ofi_gin_api.cpp; keep the memcpy that swaps the exported plugin symbol.
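The wiring described above can be sketched in a single file as follows. This is a minimal illustration under stated assumptions, not the plugin's code: gin_plugin_sketch, the shared_* and *_create_context functions, exported_plugin_sketch, and select_gdaki_sketch are hypothetical names, the struct has only three function slots instead of the real v13 layout, and the return value 3 is a placeholder. It shows two tables whose shared slots are filled in by their initializers, plus a memcpy that swaps the exported table.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, heavily reduced stand-in for the GIN plugin table.
// The real plugin struct has more slots and different signatures.
struct gin_plugin_sketch {
    const char *name;
    int (*init)();
    int (*devices)(int *ndev);
    int (*createContext)();
};

// Shared implementations. They are not `static`, so they have external
// linkage and another translation unit could name them in an initializer.
int shared_init() { return 0; }
int shared_devices(int *ndev) { *ndev = 1; return 0; }

// Mode-specific slots (placeholder bodies).
int proxy_create_context() { return 0; }
int gdaki_create_context() { return 3; }  // placeholder error code

// Both tables are fully populated by their initializers; nothing is
// assigned to individual slots later.
const gin_plugin_sketch proxy_plugin_sketch = {
    "proxy", shared_init, shared_devices, proxy_create_context};
const gin_plugin_sketch gdaki_plugin_sketch = {
    "gdaki", shared_init, shared_devices, gdaki_create_context};

// Stand-in for the exported plugin symbol; it starts out as the proxy table.
gin_plugin_sketch exported_plugin_sketch = {
    "proxy", shared_init, shared_devices, proxy_create_context};

// The only runtime step left: overwrite the exported table as a whole.
void select_gdaki_sketch() {
    std::memcpy(&exported_plugin_sketch, &gdaki_plugin_sketch,
                sizeof(exported_plugin_sketch));
}
```

After select_gdaki_sketch() runs, the exported table still points at the same shared_init and shared_devices functions as the proxy table; only the mode-specific slot differs. Copying the whole table also means there is no window in which some slots are patched and others are not.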
Remaining stubs
- createContext, destroyContext: tracked separately
- queryLastError: writes *hasError = false; paired with createContext
- regMrSym, regMrSymDmaBuf: return ncclInternalError until real GDAKI MR registration is wired up
- iput, iputSignal, iget, iflush, test: permanently nullptr in GDAKI mode (GPU posts work directly)
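The stub behaviour listed above could look roughly like the sketch below. Everything here is an assumption made for illustration: the sketch_result enum stands in for NCCL's result type, gdaki_slots_sketch is a hypothetical four-slot table, the function signatures are invented, and the success return from the queryLastError stub is a guess (the list only states that it writes *hasError = false).

```cpp
#include <cassert>
#include <cstddef>

// Placeholder result codes standing in for NCCL's result type.
enum sketch_result { sketch_success = 0, sketch_internal_error = 3 };

// Hypothetical reduced table with two stubbed slots and two host-side
// data-path slots; signatures are invented for this sketch.
struct gdaki_slots_sketch {
    sketch_result (*regMrSym)(void *comm, void *data, std::size_t size,
                              void **mhandle);
    sketch_result (*queryLastError)(void *ctx, bool *hasError);
    sketch_result (*iput)();
    sketch_result (*test)();
};

// Stub: registration is not implemented yet, so report an internal error.
sketch_result gdaki_reg_mr_stub(void *, void *, std::size_t, void **) {
    return sketch_internal_error;
}

// Stub: always reports "no error". Returning success is an assumption.
sketch_result gdaki_query_last_error_stub(void *, bool *hasError) {
    *hasError = false;
    return sketch_success;
}

// Host-side data-path slots stay null because the GPU posts work directly.
const gdaki_slots_sketch gdaki_slots = {
    gdaki_reg_mr_stub, gdaki_query_last_error_stub, nullptr, nullptr};
```

A caller that treats a null slot as "not supported on the host path" has to check the pointer before calling it, which is the practical consequence of leaving those slots as nullptr rather than pointing them at error-returning stubs.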
Testing
Built on p5en cluster (EFA, libfabric 2.2, CUDA 12.8).
alltoall_perf from nccl-tests (NCCL v2.29.2):
- -R 2 -D 3 (pure GIN): "GDAKI mode enabled" on every rank
- -R 2 -D 4 (hybrid GIN)
- -R 2 -D 3 (pure GIN)
- -R 2 -D 4 (hybrid GIN)

NCCL 2.29 exercises the v11 fallback path; the 4-config run confirms the compile-time wiring does not regress the non-GDAKI proxy mode.
Repo's own GIN functional test (tests/functional/gin), which uses ncclGin_v13_t directly, with OFI_NCCL_GIN_GDAKI=1:
- "gin: GDAKI mode enabled (OFI_NCCL_GIN_GDAKI=1)" on both ranks
- "NCCL-GIN device used … is Libfabric_GDAKI": confirms the ncclGinPlugin_v13 symbol was replaced by nccl_ofi_gin_gdaki_plugin
- "gin GDAKI: createContext not yet implemented (nSignals=64, nCounters=0)" fired on both ranks with real ncclGinConfig_v13_t values
ncclInternalError)

Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.