gin: add GDAKI plugin with stub implementations#1197
Open
anshumang wants to merge 2 commits intoaws:masterfrom
Add a complete GDAKI ncclGin_v11_t vtable with stub implementations, following the IB pattern where GDAKI and Proxy have separate vtables (ncclGinIbGdaki vs ncclGinIbProxy in NCCL's gin.cc).

New files:
- include/rdma/gin/gdaki/nccl_ofi_gin_gdaki.h - GDAKI vtable + enabled check
- src/rdma/gin/gdaki/nccl_ofi_gin_gdaki.cpp - all 18 vtable function stubs

The GDAKI vtable has:
- getProperties: reports NCCL_NET_DEVICE_GIN_GDAKI
- createContext: returns ncclInternalError (not yet implemented)
- regMrSym/regMrSymDmaBuf: returns ncclInternalError (not yet implemented)
- destroyContext/deregMrSym: no-op stubs
- ginProgress/queryLastError: no-op stubs
- closeColl/closeListen/finalize: no-op stubs
- iput/iputSignal/test: nullptr (GPU-direct, not CPU-mediated)
- init/devices/listen/connect: nullptr (copied from proxy at init time)

When OFI_NCCL_GIN_GDAKI=1 is set, init() morphs the exported ncclGinPlugin_v11 vtable by copying shared functions from proxy into the GDAKI vtable, then memcpy-ing it over the exported symbol.

Tested on p5en cluster (1 and 2 node, 8 and 16 ranks):
- alltoall_perf -R 2 -D 3 (pure GIN)
- alltoall_perf -R 2 -D 4 (hybrid GIN)

All 4 configs: createContext called on every rank, NCCL WARN 'createContext not yet implemented', clean exit code 3.
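The vtable morphing described above can be sketched in isolation. This is a minimal illustration only, not the code in this PR: `sketch_gin_vtable_t`, `sketch_result_t`, `exported_plugin`, and `morph_to_gdaki` are hypothetical stand-ins, and the struct has five entries rather than the 18 of the real `ncclGin_v11_t`. The error value 3 is assumed here only so that it lines up with the exit code reported in the testing notes.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical result codes; not the real ncclResult_t.
enum sketch_result_t { SKETCH_SUCCESS = 0, SKETCH_INTERNAL_ERROR = 3 };

// Hypothetical, heavily reduced vtable; the real one has different signatures.
struct sketch_gin_vtable_t {
    const char *name;
    sketch_result_t (*init)(void);
    sketch_result_t (*createContext)(void **ctx);
    sketch_result_t (*destroyContext)(void *ctx);
    sketch_result_t (*iput)(void *ctx);
};

// Proxy (CPU-mediated) entries: stand-ins that always succeed.
static sketch_result_t proxy_init(void) { return SKETCH_SUCCESS; }
static sketch_result_t proxy_create_context(void **ctx) { *ctx = nullptr; return SKETCH_SUCCESS; }
static sketch_result_t proxy_destroy_context(void *) { return SKETCH_SUCCESS; }
static sketch_result_t proxy_iput(void *) { return SKETCH_SUCCESS; }

// GDAKI stubs: one "not yet implemented" error and one no-op.
static sketch_result_t gdaki_create_context(void **) { return SKETCH_INTERNAL_ERROR; }
static sketch_result_t gdaki_destroy_context(void *) { return SKETCH_SUCCESS; }

static const sketch_gin_vtable_t proxy_vtable = {
    "proxy", proxy_init, proxy_create_context, proxy_destroy_context, proxy_iput
};

// init starts as nullptr and is filled from the proxy vtable at init time;
// iput stays nullptr because the CPU does not mediate puts in this mode.
static sketch_gin_vtable_t gdaki_vtable = {
    "gdaki", nullptr, gdaki_create_context, gdaki_destroy_context, nullptr
};

// Stand-in for the exported plugin symbol; it starts out as the proxy vtable.
sketch_gin_vtable_t exported_plugin = proxy_vtable;

// Copy the shared entry from proxy into the GDAKI vtable, then overwrite
// the exported symbol with the completed GDAKI vtable.
void morph_to_gdaki()
{
    gdaki_vtable.init = proxy_vtable.init;
    std::memcpy(&exported_plugin, &gdaki_vtable, sizeof(exported_plugin));
}
```

After `morph_to_gdaki()` runs, a caller holding the address of `exported_plugin` sees the GDAKI entries without re-resolving the symbol, which is presumably why the PR overwrites the exported struct in place instead of exporting a second symbol.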
bhasunit
reviewed
Apr 20, 2026
    return env != nullptr && atoi(env) != 0;
}

static ncclResult_t nccl_ofi_gin_gdaki_getProperties(int dev, ncclNetProperties_v11_t *props)
Contributor
Can we please rename to nccl_ofi_gin_gdaki_get_properties.
Author
Discussed. As this is more widely used, skipping.
bhasunit
reviewed
Apr 20, 2026
Contributor
I don't think we need a separate gdaki folder inside gin since the filenames already have gdaki in them.
bhasunit
reviewed
Apr 20, 2026
bool nccl_ofi_gin_gdaki_enabled()
{
    const char *env = getenv("OFI_NCCL_GIN_GDAKI");
Contributor
If you want to use environment variables then please use https://github.com/aws/aws-ofi-nccl/blob/master/include/nccl_ofi_param.h
Author
Done.
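For reference, the behaviour of the raw `getenv` check in the hunk above can be sketched as follows. This is a minimal sketch of the pre-review predicate only, generalized over the variable name; `env_flag_enabled` is a hypothetical helper, and nothing here represents the interface of `nccl_ofi_param.h`.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper mirroring the predicate in the diff hunk:
// the flag is on only if the variable is set and atoi() yields non-zero.
static bool env_flag_enabled(const char *name)
{
    const char *env = std::getenv(name);
    return env != nullptr && std::atoi(env) != 0;
}
```

One consequence of using `atoi` is that non-numeric values such as `true` parse as 0 and leave the feature disabled, while any non-zero integer enables it.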
c6dd5c0 to
7a7d887
- Move files from rdma/gin/gdaki/ to rdma/gin/ (bhasunit)
- Rename getProperties to get_properties (bhasunit)
- Use OFI_NCCL_PARAM instead of raw getenv (bhasunit)
bhasunit
approved these changes
Apr 20, 2026
Summary

Add GDAKI (GPUDirect Async) stub vtable for the GIN plugin, enabling GDAKI mode selection via `OFI_NCCL_GIN_GDAKI=1`. This follows the IB pattern in NCCL's `gin.cc` where GDAKI and Proxy have separate vtables (`ncclGinIbGdaki` vs `ncclGinIbProxy`).

Design

A new `rdma/gin/gdaki/` subdirectory contains the GDAKI vtable with stub implementations for all `ncclGin_v11_t` function pointers:
- `getProperties`: reports `NCCL_NET_DEVICE_GIN_GDAKI`
- `createContext`, `regMrSym`, `regMrSymDmaBuf`: return `ncclInternalError` (not yet implemented)
- `destroyContext`, `deregMrSym`, `closeColl`, `closeListen`, `ginProgress`, `queryLastError`, `finalize`: no-op stubs
- `iput`, `iputSignal`, `test`: `nullptr` (no CPU involvement in GDAKI mode)
- `init`, `devices`, `listen`, `connect`: copied from the proxy vtable at init time

When `OFI_NCCL_GIN_GDAKI=1` is set, `init()` morphs the exported `ncclGinPlugin_v11` vtable by copying shared functions from proxy into the GDAKI vtable, then `memcpy`-ing it over the exported symbol.

Testing

Tested on p5en cluster with `alltoall_perf` from nccl-tests (NCCL v2.29.2, 8× H200 per node):
- `-R 2 -D 3` (pure GIN): `createContext` called, NCCL WARN, clean exit 3
- `-R 2 -D 4` (hybrid GIN)
- `-R 2 -D 3` (pure GIN)
- `-R 2 -D 4` (hybrid GIN)

All 4 configs show `createContext not yet implemented` on every rank and exit cleanly. Without `OFI_NCCL_GIN_GDAKI=1`, the plugin behaves identically to before (proxy mode).

Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.