Skip to content

[Store] Fix source refcnt leak in CopyEnd/MoveEnd on invalid source#2628

Open
he-yufeng wants to merge 1 commit into
kvcache-ai:mainfrom
he-yufeng:fix/copyend-moveend-refcnt-leak-invalid-source
Open

[Store] Fix source refcnt leak in CopyEnd/MoveEnd on invalid source#2628
he-yufeng wants to merge 1 commit into
kvcache-ai:mainfrom
he-yufeng:fix/copyend-moveend-refcnt-leak-invalid-source

Conversation

@he-yufeng

@he-yufeng he-yufeng commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

CopyStart increments the source replica's refcnt to protect it from eviction while the copy is in flight:

// CopyStart
source->inc_refcnt();

CopyEnd is expected to release it. The success path does, at the end:

// CopyEnd, success path
source->dec_refcnt();

But the error path that fires when the source becomes invalid mid-transfer returns early without decrementing:

auto source = metadata.GetReplicaByID(source_id);
if (source == nullptr || !source->is_completed() ||
    source->has_invalid_mem_handle()) {
    LOG(ERROR) << ... << ", copy source becomes invalid during data transfer";
    EraseReplicasWithCacheTotalAccounting(metadata, ...);  // erases the *targets*
    accessor.EraseReplicationTask();
    if (!metadata.IsValid()) {
        accessor.Erase();
    }
    return tl::make_unexpected(ErrorCode::REPLICA_IS_GONE);  // source refcnt never released
}

When source is non-null but has_invalid_mem_handle() is true (its segment was unmounted during the transfer) or !is_completed(), the inc_refcnt() from CopyStart is never balanced. The cleanup in this branch erases the copy targets (task.replica_ids), not the source, so the source replica stays in metadata with an elevated refcnt and can never be evicted.

MoveEnd has the identical pattern against MoveStart's inc_refcnt().

Fix

Release the refcnt on the error path for a non-null source, mirroring what CopyRevoke / MoveRevoke already do:

// CopyRevoke (existing)
if (source == nullptr) {
    LOG(WARNING) << ... << ", copy source not found during revoke";
} else {
    source->dec_refcnt();
}

The added guard only decrements when source != nullptr, so the genuine source == nullptr case (nothing to release) is unaffected, and the success path is unchanged.

Verification

The replica refcnt isn't observable through the public MasterService API (the leak only surfaces later as a source replica that can't be evicted), so I haven't added a focused unit test that would be reliable without reaching into internals. The change is a two-line symmetry fix that makes the error path balance the inc_refcnt() exactly like the success path and the existing *Revoke paths; it relies on CI for the build and the existing master-service tests for regression coverage.

CopyStart/MoveStart call source->inc_refcnt() to pin the source replica
during the transfer, and the CopyEnd/MoveEnd success paths release it with
dec_refcnt(). The error path taken when the source becomes invalid
mid-transfer (source non-null but has_invalid_mem_handle(), or
!is_completed()) returns early without decrementing, so the inc_refcnt()
is never balanced. The branch erases the copy/move targets, not the
source, so the source replica stays pinned and can never be evicted.

Release the refcnt for a non-null source on the error path, mirroring the
existing CopyRevoke/MoveRevoke handling.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a potential resource leak in MasterService::CopyEnd and MasterService::MoveEnd by ensuring that the reference count of the source replica is decremented when the source becomes invalid during data transfer. This prevents the source replica from remaining pinned and unable to be evicted. There are no review comments, so no additional feedback is provided.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/master_service.cpp 50.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@LujhCoconut LujhCoconut left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A clean fix. LGTM.

@Icedcoco

Copy link
Copy Markdown
Collaborator

LGTM. This balances the CopyStart/MoveStart source refcnt on the invalid-source End paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants