[Store] Refactor accelerator device registry and staging copies#2583

Open
Aionw wants to merge 8 commits into kvcache-ai:main from Aionw:codex/accelerator-device-rfc

Conversation

Aionw (Collaborator) commented Jun 23, 2026

Description

Closes #2582.

This PR introduces a Store-side AcceleratorDevice abstraction for local
accelerator memory operations. Vendor-specific pointer query, context switch,
copy, and pinned-host allocation are moved behind per-vendor device
implementations, while Store call sites use the registry's available
accelerator list and pointer-based dispatch.
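
To make the shape of such an abstraction concrete, here is a minimal sketch. Only the name AcceleratorDevice comes from this PR; every method name and signature below is an assumption made for illustration, and FakeAcceleratorDevice is a stand-in backed by ordinary host memory so the example builds without any vendor SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical interface: method names and signatures are illustrative,
// not the ones defined in the PR.
class AcceleratorDevice {
public:
    virtual ~AcceleratorDevice() = default;
    virtual std::string Vendor() const = 0;
    // Pointer query: does this vendor's runtime own `ptr`?
    virtual bool OwnsPointer(const void* ptr) const = 0;
    // Context switch: make `device_id` current for the calling thread.
    virtual bool SetCurrentDevice(int device_id) = 0;
    virtual int CurrentDeviceId() const = 0;
    virtual bool Copy(void* dst, const void* src, std::size_t bytes) = 0;
    virtual void* AllocPinnedHost(std::size_t bytes) = 0;
    virtual void FreePinnedHost(void* ptr) = 0;
};

// Stand-in vendor: a fixed host array plays the role of device memory.
class FakeAcceleratorDevice final : public AcceleratorDevice {
public:
    std::string Vendor() const override { return "fake"; }
    bool OwnsPointer(const void* ptr) const override {
        const auto p = reinterpret_cast<std::uintptr_t>(ptr);
        const auto base = reinterpret_cast<std::uintptr_t>(arena_);
        return ptr != nullptr && p >= base && p < base + sizeof(arena_);
    }
    bool SetCurrentDevice(int device_id) override {
        if (device_id != 0) return false;  // the fake exposes a single device
        current_ = device_id;
        return true;
    }
    int CurrentDeviceId() const override { return current_; }
    bool Copy(void* dst, const void* src, std::size_t bytes) override {
        if (dst == nullptr || src == nullptr) return false;
        std::memcpy(dst, src, bytes);
        return true;
    }
    void* AllocPinnedHost(std::size_t bytes) override { return std::malloc(bytes); }
    void FreePinnedHost(void* ptr) override { std::free(ptr); }
    void* DeviceArena() { return arena_; }

private:
    unsigned char arena_[256] = {};
    int current_ = 0;
};

// Call sites depend only on the interface, never on a vendor macro.
bool CopyThroughInterface(AcceleratorDevice& dev, void* dst, const void* src,
                          std::size_t bytes) {
    assert(dst != nullptr && src != nullptr);
    return dev.Copy(dst, src, bytes);
}
```

With this layout, adding a vendor means adding one subclass; the call sites that go through CopyThroughInterface do not change.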

The latest revision also simplifies the accelerator registry around static
device registration:

  • removes the public RegisterAcceleratorDevice entrypoint and keeps
    registration available only through static AcceleratorDeviceRegistrar
    instances
  • stores registered devices in a simple list instead of maintaining a separate
    vendor-indexed table
  • caches available devices with an atomic immutable snapshot so the common
    RuntimeAccelerators(false) path avoids taking a mutex after initialization
  • keeps ensure=true as the explicit refresh path
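
The registration and caching scheme in the list above can be sketched roughly as follows. AcceleratorDeviceRegistrar and RuntimeAccelerators are names from this PR; the Device struct, the probe callback, the g_fake_present flag, and the choice to publish snapshots through an atomic raw pointer while retaining old snapshots are all assumptions for illustration, not the PR's actual implementation.

```cpp
#include <atomic>
#include <cassert>
#include <list>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct Device {
    std::string vendor;
    bool (*probe)();  // hypothetical: returns true when the hardware is usable
};
using DeviceList = std::vector<const Device*>;

class AcceleratorRegistry {
public:
    static AcceleratorRegistry& Instance() {
        static AcceleratorRegistry registry;  // safe to use during static init
        return registry;
    }

    // ensure=false: return the cached snapshot without locking once it exists.
    // ensure=true: re-probe every registered device and publish a new snapshot.
    const DeviceList& RuntimeAccelerators(bool ensure) {
        if (!ensure) {
            const DeviceList* snap = current_.load(std::memory_order_acquire);
            if (snap != nullptr) return *snap;
        }
        std::lock_guard<std::mutex> lock(mu_);
        if (!ensure) {  // another thread may have published while we waited
            const DeviceList* snap = current_.load(std::memory_order_acquire);
            if (snap != nullptr) return *snap;
        }
        auto fresh = std::make_unique<DeviceList>();
        for (const Device& d : devices_) {
            if (d.probe()) fresh->push_back(&d);
        }
        const DeviceList* published = fresh.get();
        // Old snapshots are retained so earlier readers never dangle.
        snapshots_.push_back(std::move(fresh));
        current_.store(published, std::memory_order_release);
        return *published;
    }

private:
    friend class AcceleratorDeviceRegistrar;
    void Add(Device device) {
        assert(device.probe != nullptr);
        std::lock_guard<std::mutex> lock(mu_);
        devices_.push_back(std::move(device));  // std::list keeps addresses stable
    }

    std::mutex mu_;
    std::list<Device> devices_;
    std::vector<std::unique_ptr<const DeviceList>> snapshots_;
    std::atomic<const DeviceList*> current_{nullptr};
};

// The only way to register: construct a registrar with static storage.
class AcceleratorDeviceRegistrar {
public:
    AcceleratorDeviceRegistrar(std::string vendor, bool (*probe)()) {
        AcceleratorRegistry::Instance().Add(Device{std::move(vendor), probe});
    }
};

bool g_fake_present = true;  // hypothetical availability switch for the demo
bool ProbeFake() { return g_fake_present; }
static AcceleratorDeviceRegistrar g_fake_registrar{"fake", &ProbeFake};
```

Retaining every snapshot trades a small amount of memory per refresh for a lock-free read path; that is reasonable only if ensure=true is called rarely.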

It also tightens runtime copy behavior:

  • skips null pointer queries in FindDeviceForPointer
  • removes the unused IsDevicePointer wrapper
  • reuses FindDeviceForPointer results at D2H staging sites to avoid querying
    the same pointer twice
  • passes explicit copy directions for H2D, D2H, and D2D copies instead of
    relying on kAuto
  • fixes Ascend current-device reporting to return the logical device id
  • keeps pinned-host allocation fallback in PinnedBufferPool
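
A rough sketch of the first few points in that list, under stated assumptions: FindDeviceForPointer and kAuto are names from this PR, but the signatures, the CopyDirection enum values other than kAuto, FakeDevice, and StageToHost are hypothetical. The fake device counts pointer queries so the effect of the null check and of reusing the lookup result is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class CopyDirection { kAuto, kHostToDevice, kDeviceToHost, kDeviceToDevice };

// Hypothetical device backed by host memory; counts pointer queries.
struct FakeDevice {
    unsigned char arena[64] = {};
    mutable int pointer_queries = 0;

    bool Owns(const void* ptr) const {
        ++pointer_queries;  // stands in for a driver call
        const auto p = reinterpret_cast<std::uintptr_t>(ptr);
        const auto base = reinterpret_cast<std::uintptr_t>(arena);
        return p >= base && p < base + sizeof(arena);
    }

    bool Copy(void* dst, const void* src, std::size_t bytes, CopyDirection dir) {
        assert(dst != nullptr && src != nullptr);
        if (dir == CopyDirection::kAuto) {
            // An automatic direction has to classify both pointers first.
            (void)Owns(dst);
            (void)Owns(src);
        }
        std::memcpy(dst, src, bytes);
        return true;
    }
};

// Null pointers are rejected before any driver query is issued.
FakeDevice* FindDeviceForPointer(FakeDevice& dev, const void* ptr) {
    if (ptr == nullptr) return nullptr;
    return dev.Owns(ptr) ? &dev : nullptr;
}

// D2H staging: one lookup, whose result is reused for the copy itself.
bool StageToHost(FakeDevice& dev, void* host_dst, const void* src,
                 std::size_t bytes) {
    if (host_dst == nullptr || src == nullptr) return false;
    FakeDevice* owner = FindDeviceForPointer(dev, src);
    if (owner == nullptr) {
        std::memcpy(host_dst, src, bytes);  // plain host-to-host copy
        return true;
    }
    return owner->Copy(host_dst, src, bytes, CopyDirection::kDeviceToHost);
}
```

In this sketch a staged D2H copy costs exactly one pointer query; passing kAuto instead would add two more inside Copy.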

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Mooncake PG (mooncake-pg)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • Common (mooncake-common)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement
  • Other

How Has This Been Tested?

Test commands:

Tested with unit tests (UT).

Test results:

  • Unit tests pass
  • Integration tests pass (if applicable)
  • Manual testing done

runtime_accelerator_test passes with 9/9 tests.

Note: targeted pre-commit was run, but the project mooncake-code-format
hook cannot complete in this environment because clang-format-20 is not
installed. The other relevant hooks for the changed files passed.

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used (specify below)

This PR was implemented with AI assistance.

AI generated the implementation. Human review covered 100% of the diff.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request refactors platform-specific accelerator memory management and copy operations into an object-oriented AcceleratorDevice abstraction and registry pattern, replacing direct preprocessor macro checks. The review feedback identifies several critical issues: a restriction in PinnedBufferPool that limits pinned host memory allocation to single-accelerator environments and causes performance degradation in multi-accelerator setups; a compilation error in ascend_accelerator_device.cpp when USE_ASCEND_DIRECT is defined due to a pointer type mismatch; and performance overhead in the CUDA, HIP, and Ascend implementations from repeatedly querying device counts in the hot path, which should instead be cached statically.
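
The last suggestion, caching the device count instead of querying it on every call, could look something like the sketch below. QueryDeviceCountFromDriver is a hypothetical stand-in for a vendor runtime call, and the returned value of 4 is arbitrary; the call counter exists only to show that the query runs once.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> g_driver_calls{0};

// Hypothetical stand-in for a vendor "get device count" call.
int QueryDeviceCountFromDriver() {
    g_driver_calls.fetch_add(1, std::memory_order_relaxed);
    return 4;  // arbitrary value for the demo
}

// A function-local static is initialized once, thread-safely, since C++11.
int CachedDeviceCount() {
    static const int count = QueryDeviceCountFromDriver();
    assert(count >= 0);
    return count;
}
```

The trade-off is that a cached count will not reflect devices that appear or disappear after the first call.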

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread mooncake-store/include/pinned_buffer_pool.h Outdated
Comment thread mooncake-store/src/device/ascend_accelerator_device.cpp Outdated
Comment thread mooncake-store/src/device/cuda_like_accelerator_device.cpp Outdated
Comment thread mooncake-store/src/device/hip_accelerator_device.cpp Outdated
Comment thread mooncake-store/src/device/ascend_accelerator_device.cpp Outdated
Aionw force-pushed the codex/accelerator-device-rfc branch 4 times, most recently from 04d3c38 to 9eccb30, on June 25, 2026 07:53
Aionw force-pushed the codex/accelerator-device-rfc branch from 9eccb30 to 27b2346 on June 25, 2026 08:02
Aionw (Collaborator, Author) commented Jun 25, 2026

/gemini review

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces a unified device abstraction layer (AcceleratorDevice, AcceleratorRegistry, and RuntimeAccelerator) to replace platform-specific conditional compilation across the codebase, updating PinnedBufferPool, Client, FileStorage, and MemcpyWorkerPool to use this new interface. The review feedback highlights several critical improvements: resolving an inconsistency in AscendAcceleratorDevice where CurrentDeviceId() returns a physical ID instead of a logical ID, adding a nullptr check in FindDeviceForPointer to prevent driver errors, optimizing CopyMaybeAccelerator by passing explicit copy directions, refactoring move constructors in PinnedHostBuffer and Buffer to use member initializers, avoiding self-move-assignment in PinnedBufferPool, and handling duplicate device registrations in AcceleratorRegistry consistently.
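
Two of these points, move constructors written with member initializers and a guard against self-move-assignment, are sketched below. PinnedHostBufferSketch is a hypothetical class that uses malloc in place of a real pinned-host allocator; it is not the PinnedHostBuffer from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Hypothetical owning buffer; std::malloc stands in for pinned allocation.
class PinnedHostBufferSketch {
public:
    explicit PinnedHostBufferSketch(std::size_t bytes)
        : data_(std::malloc(bytes)), size_(data_ != nullptr ? bytes : 0) {
        assert(bytes > 0);
    }
    ~PinnedHostBufferSketch() { std::free(data_); }

    PinnedHostBufferSketch(const PinnedHostBufferSketch&) = delete;
    PinnedHostBufferSketch& operator=(const PinnedHostBufferSketch&) = delete;

    // Member initializers take ownership and reset the source in one step.
    PinnedHostBufferSketch(PinnedHostBufferSketch&& other) noexcept
        : data_(std::exchange(other.data_, nullptr)),
          size_(std::exchange(other.size_, 0)) {}

    PinnedHostBufferSketch& operator=(PinnedHostBufferSketch&& other) noexcept {
        if (this != &other) {  // self-move must not free the live buffer
            std::free(data_);
            data_ = std::exchange(other.data_, nullptr);
            size_ = std::exchange(other.size_, 0);
        }
        return *this;
    }

    void* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    void* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Without the `this != &other` check, a self-move would free the buffer and then copy the dangling pointer back into the same object.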

Comment thread mooncake-store/src/device/ascend_accelerator_device.cpp
Comment thread mooncake-store/src/device/runtime_accelerator.cpp
Comment thread mooncake-store/src/device/runtime_accelerator.cpp
Comment thread mooncake-store/include/pinned_host_buffer.h Outdated
Comment thread mooncake-store/include/pinned_buffer_pool.h Outdated
Comment thread mooncake-store/include/pinned_buffer_pool.h
Comment thread mooncake-store/src/device/accelerator_registry.cpp
Aionw force-pushed the codex/accelerator-device-rfc branch from 9512d7f to cf737bb on June 26, 2026 07:05
Aionw changed the title from "Refactor store accelerator device memory operations" to "[Store] Refactor accelerator device registry and staging copies" on Jun 26, 2026
Aionw force-pushed the codex/accelerator-device-rfc branch from cf737bb to 0def852 on June 26, 2026 07:36
…evice-rfc

# Conflicts:
#	mooncake-store/src/CMakeLists.txt
Aionw marked this pull request as ready for review on June 26, 2026 08:03
Aionw requested review from XucSh, YiXR, stmatengss and ykwd as code owners on June 26, 2026 08:03
codecov-commenter commented Jun 26, 2026

Development

Successfully merging this pull request may close these issues.

[RFC]: Mooncake Store AcceleratorDevice Abstraction

2 participants