[TE] Improve transfer metadata reliability for node replacement and memory-region changes #2634
alogfans wants to merge 6 commits
Code Review
This pull request introduces a metadata version reliability mechanism to the Transfer Engine to handle node replacements and dynamic memory registrations/deregistrations. It adds a monotonic metadata_version to segment descriptors and a lifecycle state to buffer descriptors, enabling version-based cache invalidation and two-phase memory deregistration. Additionally, RDMA workers now track peer metadata versions to invalidate stale endpoints and rail states. The review feedback highlights two critical issues in transfer_metadata.cpp where null checks are missing for cached segment descriptors in lookupSegmentCacheByName and lookupSegmentCacheByID, the latter of which could lead to a null pointer dereference and application crash.
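The null-check concern raised in the review can be illustrated with a minimal sketch. This is not the code in transfer_metadata.cpp: `SegmentDescSketch`, `SegmentCacheSketch`, `lookupByID`, and `versionOf` are hypothetical names, and the sketch assumes only that a cache slot can exist while holding a null descriptor pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a cached segment descriptor.
struct SegmentDescSketch {
    std::string name;
    uint64_t metadata_version;
};

// Hypothetical cache keyed by segment ID. A slot may hold a null pointer,
// for example after an invalidation or a failed fetch (an assumption here).
class SegmentCacheSketch {
public:
    void put(uint64_t id, std::shared_ptr<SegmentDescSketch> desc) {
        by_id_[id] = std::move(desc);
    }

    // Returns nullptr both when the ID is unknown and when the slot is null,
    // so callers have a single failure case to handle.
    std::shared_ptr<SegmentDescSketch> lookupByID(uint64_t id) const {
        auto it = by_id_.find(id);
        if (it == by_id_.end() || !it->second) return nullptr;
        return it->second;
    }

    // Reads the cached version only after the null check has passed.
    bool versionOf(uint64_t id, uint64_t& out) const {
        auto desc = lookupByID(id);
        if (!desc) return false;  // guard: never dereference a null descriptor
        out = desc->metadata_version;
        return true;
    }

private:
    std::unordered_map<uint64_t, std::shared_ptr<SegmentDescSketch>> by_id_;
};
```

Folding "missing" and "null" into one return value keeps the guard in one place, at the cost of not distinguishing the two cases. Whether the real lookup functions should distinguish them is a design decision for the PR.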
Description
This PR strengthens Transfer Engine metadata reliability by introducing a single segment-level metadata_version and using it as the freshness signal for cached metadata and derived RDMA resources.
Segment names and segment IDs can be reused after node replacement, and memory registration changes can also invalidate rkeys, address ranges, and buffer availability. With this change, every published local segment metadata update bumps metadata_version, allowing readers and RDMA workers to detect that cached descriptors or endpoints may be stale.
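As a rough illustration of the freshness check described above, the sketch below bumps a version on every publish and drops a cached entry when the observed version no longer matches. All names (`LocalSegmentSketch`, `VersionedCacheSketch`, `publish`, `validate`) are hypothetical and not taken from the PR. Treating any mismatch as stale, rather than only a newer version, is an assumption made here so that a replaced node whose counter restarted would also be caught.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical publisher side: each published metadata update bumps the version.
struct LocalSegmentSketch {
    uint64_t metadata_version = 0;
    uint64_t publish() { return ++metadata_version; }
};

// Hypothetical cached view held by a reader or RDMA worker.
struct CachedSegmentSketch {
    uint64_t metadata_version;
    uint32_t rkey;  // stands in for any state derived from the descriptor
};

class VersionedCacheSketch {
public:
    // Records state fetched from the metadata store at a given version.
    void store(const std::string& name, uint64_t version, uint32_t rkey) {
        cache_[name] = CachedSegmentSketch{version, rkey};
    }

    // Returns true if a cached entry exists and matches the observed version.
    // On a mismatch the entry is erased so the caller refetches it.
    bool validate(const std::string& name, uint64_t observed_version) {
        auto it = cache_.find(name);
        if (it == cache_.end()) return false;
        if (it->second.metadata_version != observed_version) {
            cache_.erase(it);  // stale: rkeys or address ranges may have changed
            return false;
        }
        return true;
    }

    bool contains(const std::string& name) const {
        return cache_.count(name) != 0;
    }

private:
    std::unordered_map<std::string, CachedSegmentSketch> cache_;
};
```

In this sketch a reader that cached "node-a" at one version and later observes a different version sees `validate` return false and must refetch before reusing any derived endpoint state.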
Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-pg
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-common
- mooncake-rl
Type of Change
How Has This Been Tested?
Test commands:
# Example: bash scripts/run_ci_test.sh
Test results:
Checklist
- ./scripts/code_format.sh
- pre-commit run --all-files and all hooks pass
AI Assistance Disclosure