gin: GDAKI reuses shared proxy APIs for MR reg, close, progress, finalize #1201

Merged: anshumang merged 1 commit into aws:master from anshumang:gdaki-other-apis on Apr 24, 2026
anshumang commented Apr 21, 2026

Summary

Wire GDAKI plugin APIs that are structurally identical to the proxy APIs
directly to the proxy implementations at compile time, instead of patching
the plugin struct at runtime. Rebased onto current master (post-#1197 and
#1195) to use the v13 GIN API.

Design

11 plugin APIs — init, devices, listen, connect, regMrSym,
regMrSymDmaBuf, deregMrSym, closeColl, closeListen, ginProgress,
finalize — operate on shared types (nccl_ofi_rdma_gin_put_comm etc.)
produced by connect() in both proxy and GDAKI modes. Their semantics are
identical across modes, so GDAKI reuses the proxy implementations directly.
Concretely:

  • Un-static the 11 shared functions in nccl_ofi_gin_api.cpp so they have
    external linkage.
  • Forward-declare them in nccl_ofi_gin_gdaki.h.
  • nccl_ofi_gin_gdaki_plugin now sets all 11 slots to the shared functions
    at compile time. GDAKI-specific stubs (createContext, destroyContext,
    get_properties, queryLastError) stay in nccl_ofi_gin_gdaki.cpp.
  • Drop the 4 runtime .init/.devices/.listen/.connect assignments from
    the GDAKI init branch in nccl_ofi_gin_api.cpp; keep the memcpy that
    swaps the exported plugin symbol.
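
The wiring described in the bullets above can be sketched as follows. This is a minimal illustration, not the code from the PR: `gin_plugin_sketch_t`, the `shared_gin_*` functions, and `gdaki_create_context_stub` are hypothetical stand-ins, and the real `ncclGin_v13_t` has more slots with different signatures. The point is only that once the shared functions have external linkage and are forward-declared, the GDAKI table can be filled in by a constant initializer instead of by assignments at runtime.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, trimmed-down stand-ins for the NCCL result codes.
using result_t = int;
constexpr result_t kSuccess = 0;
constexpr result_t kInternalError = 3;  // same numeric value as ncclInternalError

// Hypothetical, trimmed-down plugin table (assumption: the real struct is larger).
struct gin_plugin_sketch_t {
    const char *name;
    result_t (*init)();
    result_t (*devices)(int *ndev);
    result_t (*finalize)();
    result_t (*createContext)();
};

// What a header such as nccl_ofi_gin_gdaki.h would carry: forward declarations
// of the shared functions.
result_t shared_gin_init();
result_t shared_gin_devices(int *ndev);
result_t shared_gin_finalize();

// What the shared API file would carry: the definitions, no longer `static`,
// so other translation units can link against them.
result_t shared_gin_init() { return kSuccess; }
result_t shared_gin_devices(int *ndev) { *ndev = 1; return kSuccess; }
result_t shared_gin_finalize() { return kSuccess; }

// A GDAKI-specific stub that stays private to the GDAKI file.
static result_t gdaki_create_context_stub() { return kInternalError; }

// Every slot is set by the initializer; nothing is patched at runtime.
constexpr gin_plugin_sketch_t gdaki_plugin_sketch = {
    /* name          */ "Libfabric_GDAKI",
    /* init          */ shared_gin_init,
    /* devices       */ shared_gin_devices,
    /* finalize      */ shared_gin_finalize,
    /* createContext */ gdaki_create_context_stub,
};
```

Because the table is `constexpr`, a missing or mistyped slot shows up as a compile or link error rather than as a null pointer discovered during init.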

Remaining stubs

  • createContext, destroyContext — tracked separately
  • queryLastError — writes *hasError = false; paired with createContext
  • regMrSym, regMrSymDmaBuf — return ncclInternalError until real GDAKI
    MR registration is wired up
  • iput, iputSignal, iget, iflush, test — permanently nullptr in
    GDAKI mode (GPU posts work directly)
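
As a rough illustration of the stub behavior listed above, the sketch below uses hypothetical types and signatures (`gin_ops_sketch_t`, `sketch_result_t`, and the argument lists are assumptions, not the v13 GIN API). It shows a `createContext` stub that reports an internal error, a `queryLastError` stub that writes `*hasError = false`, and a data-path slot left as `nullptr`. The return value of the `queryLastError` stub and the caller-side guard are also assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical result codes; 3 matches the numeric value of ncclInternalError.
enum sketch_result_t { sketchSuccess = 0, sketchInternalError = 3 };

// Hypothetical, trimmed-down table with one stubbed pair and one data-path slot.
struct gin_ops_sketch_t {
    sketch_result_t (*createContext)(int nSignals, int nCounters, void **ctx);
    sketch_result_t (*queryLastError)(void *ctx, bool *hasError);
    sketch_result_t (*iput)(void *comm);  // left nullptr in GDAKI mode
};

// Not implemented yet: hand back no context and report an internal error.
static sketch_result_t gdaki_create_context_stub(int /*nSignals*/, int /*nCounters*/,
                                                 void **ctx) {
    *ctx = nullptr;
    return sketchInternalError;
}

// With no context there is no error state to report, so always clear the flag.
static sketch_result_t gdaki_query_last_error_stub(void * /*ctx*/, bool *hasError) {
    *hasError = false;
    return sketchSuccess;  // assumption: the stub itself succeeds
}

constexpr gin_ops_sketch_t gdaki_ops_sketch = {
    gdaki_create_context_stub,
    gdaki_query_last_error_stub,
    nullptr,  // iput: the GPU posts work directly, so the host never calls this
};

// Hypothetical caller-side guard: treat an absent slot as unsupported.
inline sketch_result_t post_put_if_supported(const gin_ops_sketch_t &ops, void *comm) {
    if (ops.iput == nullptr) {
        return sketchInternalError;
    }
    return ops.iput(comm);
}
```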

Testing

Built on p5en cluster (EFA, libfabric 2.2, CUDA 12.8).

alltoall_perf from nccl-tests (NCCL v2.29.2):

| Config | Flags | Result |
|---|---|---|
| 1-node, 8 ranks | -R 2 -D 3 (pure GIN) | clean exit 0, GDAKI mode enabled on every rank |
| 1-node, 8 ranks | -R 2 -D 4 (hybrid GIN) | same |
| 2-node, 16 ranks | -R 2 -D 3 (pure GIN) | same, both nodes |
| 2-node, 16 ranks | -R 2 -D 4 (hybrid GIN) | same, both nodes |

NCCL 2.29 exercises the v11 fallback path; the 4-config run confirms the
compile-time wiring does not regress the non-GDAKI proxy mode.

Repo's own GIN functional test (tests/functional/gin), which uses
ncclGin_v13_t directly:

  • 2 ranks, OFI_NCCL_GIN_GDAKI=1
  • gin: GDAKI mode enabled (OFI_NCCL_GIN_GDAKI=1) on both ranks
  • NCCL-GIN device used … is Libfabric_GDAKI — confirms ncclGinPlugin_v13
    symbol was replaced by nccl_ofi_gin_gdaki_plugin
  • gin GDAKI: createContext not yet implemented (nSignals=64, nCounters=0)
    fired on both ranks with real ncclGinConfig_v13_t values
  • Clean exit 3 (expected — test returns ncclInternalError)
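
The symbol replacement that the log lines above confirm can be pictured with the sketch below. It is an illustration under stated assumptions, not the plugin's code: `plugin_sketch_t`, `exported_plugin_sketch`, `gdaki_table_sketch`, the placeholder proxy name, and the direct `getenv` check are all hypothetical (the real plugin may read `OFI_NCCL_GIN_GDAKI` through its own parameter machinery). It shows the one runtime step that remains after this change: copying the fully wired GDAKI table over the exported plugin struct when GDAKI mode is requested.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical, trimmed-down plugin table (trivially copyable, so memcpy is valid).
struct plugin_sketch_t {
    const char *name;
    int (*init)();
};

int shared_init_sketch();

// Stand-in for the exported symbol that the loader looks up (ncclGinPlugin_v13
// in the PR). The proxy name here is a placeholder, not the real device name.
plugin_sketch_t exported_plugin_sketch = {"proxy (placeholder)", shared_init_sketch};

// Stand-in for the GDAKI table, already fully populated by its initializer.
const plugin_sketch_t gdaki_table_sketch = {"Libfabric_GDAKI", shared_init_sketch};

// Overwrite the exported table in one shot when the setting equals "1".
// Returns true if the swap happened.
bool swap_if_requested(const char *setting) {
    if (setting == nullptr || std::strcmp(setting, "1") != 0) {
        return false;
    }
    std::memcpy(&exported_plugin_sketch, &gdaki_table_sketch,
                sizeof(exported_plugin_sketch));
    return true;
}

// Shared init: the GDAKI branch only swaps the table; it no longer assigns
// individual slots such as .init or .connect.
int shared_init_sketch() {
    swap_if_requested(std::getenv("OFI_NCCL_GIN_GDAKI"));
    return 0;
}
```

Passing the setting as a parameter keeps the swap testable without touching the process environment.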

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Wire all 11 shared plugin APIs (init, devices, listen, connect, regMrSym,
regMrSymDmaBuf, deregMrSym, closeColl, closeListen, ginProgress, finalize)
directly into nccl_ofi_gin_gdaki_plugin at compile time rather than
patching the plugin struct at runtime. Reuses the proxy implementations
verbatim because these APIs operate on shared types (nccl_ofi_rdma_gin_put_comm
etc.) produced by connect() in both modes.

Changes:
- Un-static the 11 shared functions in nccl_ofi_gin_api.cpp so they have
  external linkage.
- Forward-declare them in nccl_ofi_gin_gdaki.h.
- nccl_ofi_gin_gdaki_plugin now sets all 11 slots to the shared functions;
  only createContext, destroyContext, get_properties, and queryLastError
  remain GDAKI-specific.
- Drop the 4 runtime .init/.devices/.listen/.connect assignments from
  nccl_ofi_gin_api.cpp's GDAKI init branch; keep the memcpy that swaps
  the exported plugin symbol.

Remaining GDAKI-only stubs: createContext, destroyContext (pair tracked
separately), regMrSym/regMrSymDmaBuf — those return ncclInternalError
until real GDAKI MR registration is wired up.

Tested on p5en cluster (EFA, libfabric 2.2, CUDA 12.8, 1/2 node).
anshumang marked this pull request as ready for review April 23, 2026 23:07
anshumang requested review from a team and bwbarrett as code owners April 23, 2026 23:07
anshumang merged commit e1efa22 into aws:master Apr 24, 2026
71 checks passed