Skip to content

[server] Fix table deletion stuck when StopReplicaRequest send fails#3359

Open
lilei1128 wants to merge 1 commit into
apache:mainfrom
lilei1128:fix
Open

[server] Fix table deletion stuck when StopReplicaRequest send fails#3359
lilei1128 wants to merge 1 commit into
apache:mainfrom
lilei1128:fix

Conversation

@lilei1128
Copy link
Copy Markdown

Purpose

Linked issue: close #xxx

When a StopReplicaRequest send fails with a network-level throwable,
the coordinator had already transitioned replicas to ReplicaDeletionStarted
but silently dropped the error. This permanently stuck table deletion because:

  • isEligibleForDeletion() blocks tables with any replica in ReplicaDeletionStarted
  • resumeDeletions() only triggers from processDeleteReplicaResponseReceived,
    which is only reached when a DeleteReplicaResponseReceivedEvent is emitted
  • processDeadTabletServer() skips replicas that are toBeDeleted

Brief change log

Fix: when sendStopRequest() fails with a throwable, emit a
DeleteReplicaResponseReceivedEvent with error results for the
delete-flagged buckets (delete=true && deleteRemote=true). This feeds
into the existing retry/give-up logic in processDeleteReplicaResponseReceived,
which retries up to DELETE_TRY_TIMES (5) and then force-marks replicas as
ReplicaDeletionSuccessful, allowing deletion to complete.

Non-deletion stop replicas (delete=false) continue to silently ignore
send failures as before, since they are best-effort.

Tests

Add testDeleteTableWithSendFailure() to verify that deletion completes
even when all StopReplicaRequest sends fail at the network level (i.e.,
CompletableFuture.failedFuture, not a response-level error code). The
existing testDeleteReplicaStateChange() only covered response-level errors.

API and Format

Documentation

@lilei1128
Copy link
Copy Markdown
Author

close #3357

@gyang94
Copy link
Copy Markdown
Contributor

gyang94 commented May 27, 2026

Hi @lilei1128 , thanks for your contribution to fix this issue. Glad to see you are interested in this task.

In your PR, it is a good thinking to reuse the existing retry mechanism to solve this problem. However, the current retry loop doesn't truly solve the problem. The retry loop has no delay between attempts, so it degrades to "fail 5 times in milliseconds → pretend success → data residue on disk."

I'd like to solve this problem by introducing "per-TS sender" + "ReplicaDeletionIneligible state":

  • Per-TS sender thread gives each TabletServer an independent request queue with backoff-aware retry. A temporarily unreachable TS doesn't burn through retry attempts instantly — the sender sleeps and retries over a meaningful time window.
  • ReplicaDeletionIneligible state breaks the current all-or-nothing dilemma. Instead of force-marking unreachable replicas as successfully deleted, they enter Ineligible — which (1) unblocks isEligibleForDeletion (since Ineligible ≠ DeletionStarted, resumeDeletions is no longer stuck), and (2) preserves the truth that deletion hasn't happened yet. When the TS comes back online, replicas transition Ineligible → OfflineReplica → ReplicaDeletionStarted and get genuinely deleted.

If you are interested, welcome to review this PR, leave any comments there, or help to push code to fix based on it.

@lilei1128
Copy link
Copy Markdown
Author

lilei1128 commented May 27, 2026

Hi @lilei1128 , thanks for your contribution to fix this issue. Glad to see you are interested in this task.

In your PR, it is a good thinking to reuse the existing retry mechanism to solve this problem. However, the current retry loop doesn't truly solve the problem. The retry loop has no delay between attempts, so it degrades to "fail 5 times in milliseconds → pretend success → data residue on disk."

I'd like to solve this problem by introducing "per-TS sender" + "ReplicaDeletionIneligible state":

  • Per-TS sender thread gives each TabletServer an independent request queue with backoff-aware retry. A temporarily unreachable TS doesn't burn through retry attempts instantly — the sender sleeps and retries over a meaningful time window.
  • ReplicaDeletionIneligible state breaks the current all-or-nothing dilemma. Instead of force-marking unreachable replicas as successfully deleted, they enter Ineligible — which (1) unblocks isEligibleForDeletion (since Ineligible ≠ DeletionStarted, resumeDeletions is no longer stuck), and (2) preserves the truth that deletion hasn't happened yet. When the TS comes back online, replicas transition Ineligible → OfflineReplica → ReplicaDeletionStarted and get genuinely deleted.

If you are interested, welcome to review this PR, leave any comments there, or help to push code to fix based on it.

Thank you so much for your detailed and insightful review! I really appreciate you taking the time to explain the flaws in the current retry approach and sharing the better design direction.

I fully agree with your points: the existing retry loop without delay can only cause rapid failures and leave data residues on disk, which doesn’t solve the root problem. The solution you proposed — combining per-TS sender with backoff-aware retry and ReplicaDeletionIneligible state — is elegant and thorough. It not only unblocks the deletion workflow but also keeps the correct state transition logic, which perfectly fixes the all-or-nothing dilemma.

I’m very interested in this design. I will take a close look at your PR.
Thanks again for your guidance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants