gin: add GDAKI plugin with stub implementations #1197

Open
anshumang wants to merge 2 commits into aws:master from anshumang:gdaki-stubs

@anshumang

Summary

Add a GDAKI (GPUDirect Async Kernel-Initiated) stub vtable for the GIN plugin, enabling GDAKI mode selection via OFI_NCCL_GIN_GDAKI=1. This follows the IB pattern in NCCL's gin.cc, where GDAKI and Proxy have separate vtables (ncclGinIbGdaki vs ncclGinIbProxy).
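
For readers who want to see what the mode check amounts to, here is a minimal sketch of reading OFI_NCCL_GIN_GDAKI. The helper names `sketch_flag_is_set` and `sketch_gdaki_enabled` are hypothetical and are not the functions in this PR. The sketch uses `strtol` rather than `atoi` so that values such as `1abc` are rejected, which is an assumption about the desired behavior, not something the PR specifies.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: interpret an environment value as a boolean flag.
// Unset, empty, non-numeric, or zero values all count as "disabled".
bool sketch_flag_is_set(const char *value)
{
    if (value == nullptr || *value == '\0') {
        return false;
    }

    char *end = nullptr;
    long parsed = std::strtol(value, &end, 10);
    assert(end != nullptr); // strtol always stores the stop position

    // Reject trailing characters such as "1abc".
    if (*end != '\0') {
        return false;
    }
    return parsed != 0;
}

// Hypothetical wrapper: check the variable named in this PR.
bool sketch_gdaki_enabled()
{
    return sketch_flag_is_set(std::getenv("OFI_NCCL_GIN_GDAKI"));
}
```

With this sketch, `OFI_NCCL_GIN_GDAKI=1` selects GDAKI mode, while an unset variable or `OFI_NCCL_GIN_GDAKI=0` leaves the plugin in proxy mode.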

Design

A new rdma/gin/gdaki/ subdirectory contains the GDAKI vtable with stub implementations for all ncclGin_v11_t function pointers:

  • getProperties: reports NCCL_NET_DEVICE_GIN_GDAKI
  • createContext, regMrSym, regMrSymDmaBuf: return ncclInternalError (not yet implemented)
  • destroyContext, deregMrSym, closeColl, closeListen, ginProgress, queryLastError, finalize: no-op stubs
  • iput, iputSignal, test: nullptr (no CPU involvement in GDAKI mode)
  • init, devices, listen, connect: copied from the proxy vtable at init time
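
The stub categories in the list above can be illustrated with a reduced vtable. This is only a sketch: `sketch_gin_vtable_t`, `sketch_result_t`, and the function signatures are hypothetical stand-ins, and the real ncclGin_v11_t has more members with different signatures.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical result codes standing in for ncclSuccess / ncclInternalError.
enum sketch_result_t { sketchSuccess, sketchInternalError };

// Hypothetical, reduced vtable with one member per stub category.
struct sketch_gin_vtable_t {
    const char *name;
    sketch_result_t (*init)(void);                // filled in from proxy later
    sketch_result_t (*createContext)(void **ctx); // not yet implemented
    sketch_result_t (*destroyContext)(void *ctx); // no-op stub
    sketch_result_t (*iput)(void *ctx);           // nullptr: no CPU path
};

// "Not yet implemented" stub: warn, hand back no context, report an error.
static sketch_result_t gdaki_create_context(void **ctx)
{
    assert(ctx != nullptr);
    std::fprintf(stderr, "createContext not yet implemented\n");
    *ctx = nullptr;
    return sketchInternalError;
}

// No-op stub: nothing was created, so there is nothing to tear down.
static sketch_result_t gdaki_destroy_context(void *)
{
    return sketchSuccess;
}

sketch_gin_vtable_t sketch_gdaki_vtable = {
    "sketch-gdaki",
    nullptr,               // init: copied from the proxy vtable at init time
    gdaki_create_context,
    gdaki_destroy_context,
    nullptr,               // iput: left null in GDAKI mode
};
```

Leaving `iput` null rather than pointing it at an error stub lets a caller distinguish an operation that has no CPU path from one that is merely unimplemented.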

When OFI_NCCL_GIN_GDAKI=1 is set, init() morphs the exported ncclGinPlugin_v11 vtable by copying shared functions from proxy into the GDAKI vtable, then memcpy-ing it over the exported symbol.
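
The morphing step described above could look roughly like the following. All names here (`sketch_vtable_t`, `sketch_exported`, `sketch_gdaki`, `sketch_morph_if_gdaki`) are hypothetical, the vtable is reduced to a few members, and the enable flag is passed in as a parameter instead of being read from the environment.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Hypothetical, reduced vtable; the real one has 18 function pointers.
struct sketch_vtable_t {
    const char *name;
    int (*init)(void);
    int (*devices)(int *ndev);
    int (*createContext)(void);
    int (*iput)(void);
};

static int proxy_init(void) { return 0; }
static int proxy_devices(int *ndev) { *ndev = 1; return 0; }
static int proxy_create_context(void) { return 0; }
static int proxy_iput(void) { return 0; }
static int gdaki_create_context(void) { return 3; } // stub: reports an error

// The exported symbol starts out as the proxy vtable.
sketch_vtable_t sketch_exported = {
    "proxy", proxy_init, proxy_devices, proxy_create_context, proxy_iput,
};

// The GDAKI vtable leaves the shared entries null until init time.
sketch_vtable_t sketch_gdaki = {
    "gdaki", nullptr, nullptr, gdaki_create_context, nullptr,
};

// Returns true if the exported vtable was replaced by the GDAKI one.
bool sketch_morph_if_gdaki(bool gdaki_enabled)
{
    if (!gdaki_enabled) {
        return false;
    }

    // Step 1: copy the shared functions from proxy into the GDAKI vtable.
    assert(sketch_exported.init != nullptr);
    sketch_gdaki.init = sketch_exported.init;
    sketch_gdaki.devices = sketch_exported.devices;

    // Step 2: overwrite the exported symbol in one shot.
    static_assert(std::is_trivially_copyable<sketch_vtable_t>::value,
                  "memcpy requires a trivially copyable vtable");
    std::memcpy(&sketch_exported, &sketch_gdaki, sizeof(sketch_exported));
    return true;
}
```

Copying the shared entries first matters: if the memcpy ran before step 1, the exported symbol would lose its `init` and `devices` pointers.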

Testing

Tested on p5en cluster with alltoall_perf from nccl-tests (NCCL v2.29.2, 8× H200 per node):

| Config | Flags | Result |
| --- | --- | --- |
| 1-node, 8 ranks | -R 2 -D 3 (pure GIN) | createContext called, NCCL WARN, clean exit 3 |
| 1-node, 8 ranks | -R 2 -D 4 (hybrid GIN) | same |
| 2-node, 16 ranks | -R 2 -D 3 (pure GIN) | same, both nodes |
| 2-node, 16 ranks | -R 2 -D 4 (hybrid GIN) | same, both nodes |

All four configurations log the NCCL WARN message "createContext not yet implemented" on every rank and exit cleanly with code 3. Without OFI_NCCL_GIN_GDAKI=1, the plugin behaves exactly as before (proxy mode).

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Add a complete GDAKI ncclGin_v11_t vtable with stub implementations,
following the IB pattern where GDAKI and Proxy have separate vtables
(ncclGinIbGdaki vs ncclGinIbProxy in NCCL's gin.cc).

New files:
  include/rdma/gin/gdaki/nccl_ofi_gin_gdaki.h  - GDAKI vtable + enabled check
  src/rdma/gin/gdaki/nccl_ofi_gin_gdaki.cpp    - all 18 vtable function stubs

The GDAKI vtable has:
- getProperties: reports NCCL_NET_DEVICE_GIN_GDAKI
- createContext: returns ncclInternalError (not yet implemented)
- regMrSym/regMrSymDmaBuf: returns ncclInternalError (not yet implemented)
- destroyContext/deregMrSym: no-op stubs
- ginProgress/queryLastError: no-op stubs
- closeColl/closeListen/finalize: no-op stubs
- iput/iputSignal/test: nullptr (GPU-direct, not CPU-mediated)
- init/devices/listen/connect: nullptr (copied from proxy at init time)

When OFI_NCCL_GIN_GDAKI=1 is set, init() morphs the exported
ncclGinPlugin_v11 vtable by copying shared functions from proxy into
the GDAKI vtable, then memcpy-ing it over the exported symbol.

Tested on p5en cluster (1 and 2 node, 8 and 16 ranks):
  alltoall_perf -R 2 -D 3  (pure GIN)
  alltoall_perf -R 2 -D 4  (hybrid GIN)
All 4 configs: createContext called on every rank, NCCL WARN
'createContext not yet implemented', clean exit code 3.
return env != nullptr && atoi(env) != 0;
}

static ncclResult_t nccl_ofi_gin_gdaki_getProperties(int dev, ncclNetProperties_v11_t *props)
Contributor

Can we please rename to nccl_ofi_gin_gdaki_get_properties.

Author

Discussed. As this is more widely used, skipping.

Contributor

@bhasunit bhasunit left a comment

I don't think we need a separate gdaki folder inside gin since the filenames already have gdaki in them.

Comment thread src/rdma/gin/nccl_ofi_gin_api.cpp

bool nccl_ofi_gin_gdaki_enabled()
{
const char *env = getenv("OFI_NCCL_GIN_GDAKI");
Contributor

If you want to use environment variables then please use https://github.com/aws/aws-ofi-nccl/blob/master/include/nccl_ofi_param.h

Author

Done.

@anshumang
Author

> I don't think we need a separate gdaki folder inside gin since the filenames already have gdaki in them.

Done.

@anshumang anshumang force-pushed the gdaki-stubs branch 2 times, most recently from c6dd5c0 to 7a7d887 April 20, 2026 21:21
@anshumang anshumang marked this pull request as ready for review April 20, 2026 22:26
@anshumang anshumang requested review from a team and bwbarrett as code owners April 20, 2026 22:26
- Move files from rdma/gin/gdaki/ to rdma/gin/ (bhasunit)
- Rename getProperties to get_properties (bhasunit)
- Use OFI_NCCL_PARAM instead of raw getenv (bhasunit)
@anshumang anshumang changed the title from "gin: add GDAKI vtable with stub implementations in rdma/gin/gdaki/" to "gin: add GDAKI plugin with stub implementations" Apr 20, 2026