
[TE] Improve transfer metadata reliability for node replacement and memory-region changes #2634

Open
alogfans wants to merge 6 commits into kvcache-ai:main from alogfans:dev/fix-metadata-for-replacement


@alogfans alogfans commented Jun 26, 2026

Collaborator

Description

This PR strengthens Transfer Engine metadata reliability by introducing a single segment-level metadata_version and using it as the freshness signal for cached metadata and derived RDMA resources.

Segment names and segment IDs can be reused after node replacement, and memory registration changes can also invalidate rkeys, address ranges, and buffer availability. With this change, every published local segment metadata update bumps metadata_version, allowing readers and RDMA workers to detect that cached descriptors or endpoints may be stale.
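The freshness check described above can be pictured with a minimal sketch. This is not the PR's implementation: `SegmentDesc`, `LocalSegment`, `SegmentCache`, and `refreshIfStale` are hypothetical names, and treating any version mismatch as stale is an assumption, since the description does not state the exact comparison used.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified stand-in for a segment descriptor.
struct SegmentDesc {
    std::string name;
    uint64_t metadata_version = 0;  // bumped on every published update
};

// Hypothetical publisher-side state for the local segment.
class LocalSegment {
   public:
    explicit LocalSegment(std::string name) { desc_.name = std::move(name); }

    // Each published change (for example a buffer being registered or
    // removed) bumps the version and hands out an immutable snapshot.
    std::shared_ptr<const SegmentDesc> publish() {
        ++desc_.metadata_version;
        return std::make_shared<const SegmentDesc>(desc_);
    }

   private:
    SegmentDesc desc_;
};

// Hypothetical reader-side cache keyed by segment name.
class SegmentCache {
   public:
    // Replaces the cached entry when its version differs from `latest`.
    // Returns true if the cache was updated, meaning that resources derived
    // from the old entry (rkeys, endpoints) should be rebuilt by the caller.
    bool refreshIfStale(const std::shared_ptr<const SegmentDesc>& latest) {
        assert(latest != nullptr);
        auto it = cache_.find(latest->name);
        if (it != cache_.end() &&
            it->second->metadata_version == latest->metadata_version) {
            return false;  // cached copy is still fresh
        }
        cache_[latest->name] = latest;
        return true;
    }

    // Returns the cached version, or 0 if the segment is not cached.
    uint64_t cachedVersion(const std::string& name) const {
        auto it = cache_.find(name);
        return it == cache_.end() ? 0 : it->second->metadata_version;
    }

   private:
    std::unordered_map<std::string, std::shared_ptr<const SegmentDesc>> cache_;
};
```

Because the name alone cannot distinguish a replaced node from the original, the version is what tells a reader to drop the old entry in this sketch.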

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Mooncake PG (mooncake-pg)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • Common (mooncake-common)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Performance improvement
  • Other

How Has This Been Tested?

Test commands:

# Example: bash scripts/run_ci_test.sh

Test results:

  • Unit tests pass
  • Integration tests pass (if applicable)
  • Manual testing done (describe below)

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used (specify below)

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a metadata version reliability mechanism to the Transfer Engine to handle node replacements and dynamic memory registrations/deregistrations. It adds a monotonic metadata_version to segment descriptors and a lifecycle state to buffer descriptors, enabling version-based cache invalidation and two-phase memory deregistration. Additionally, RDMA workers now track peer metadata versions to invalidate stale endpoints and rail states. The review feedback highlights two critical issues in transfer_metadata.cpp where null checks are missing for cached segment descriptors in lookupSegmentCacheByName and lookupSegmentCacheByID, the latter of which could lead to a null pointer dereference and application crash.
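For illustration, the kind of guard the review asks for might look like the sketch below. It is an assumption-laden stand-in, not the code in transfer_metadata.cpp: `SegmentCacheByID`, `put`, and `lookupVersion` are hypothetical names, and the real `lookupSegmentCacheByID` signature is not shown in this conversation.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified descriptor; the real one carries more fields.
struct SegmentDesc {
    uint64_t metadata_version = 0;
};

using SegmentID = uint64_t;

// Hypothetical ID-keyed cache whose entries may hold a null pointer,
// for example after a failed or invalidated fetch.
class SegmentCacheByID {
   public:
    void put(SegmentID id, std::shared_ptr<SegmentDesc> desc) {
        cache_[id] = std::move(desc);
    }

    // Writes the cached version to `version_out` and returns true only when
    // the entry exists AND holds a non-null descriptor. Without the null
    // check, the dereference below would crash on a null entry.
    bool lookupVersion(SegmentID id, uint64_t& version_out) const {
        auto it = cache_.find(id);
        if (it == cache_.end() || !it->second) return false;
        version_out = it->second->metadata_version;
        return true;
    }

   private:
    std::unordered_map<SegmentID, std::shared_ptr<SegmentDesc>> cache_;
};
```

Treating a null entry the same as a cache miss lets the caller fall back to refetching metadata instead of crashing.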

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread mooncake-transfer-engine/src/transfer_metadata.cpp Outdated
Comment thread mooncake-transfer-engine/src/transfer_metadata.cpp
@github-actions github-actions Bot added documentation Improvements or additions to documentation run-ci Transfer Engine labels Jun 26, 2026