[server] Fix table deletion stuck when StopReplicaRequest send fails by lilei1128 · Pull Request #3359 · apache/fluss

lilei1128 · 2026-05-22T03:46:26Z

Purpose

Linked issue: close #xxx

When a StopReplicaRequest send fails with a network-level throwable,
the coordinator had already transitioned replicas to ReplicaDeletionStarted
but silently dropped the error. This permanently stuck table deletion because:

isEligibleForDeletion() blocks tables with any replica in ReplicaDeletionStarted
resumeDeletions() only triggers from processDeleteReplicaResponseReceived,
which is only reached when a DeleteReplicaResponseReceivedEvent is emitted
processDeadTabletServer() skips replicas that are toBeDeleted

Brief change log

Fix: when sendStopRequest() fails with a throwable, emit a
DeleteReplicaResponseReceivedEvent with error results for the
delete-flagged buckets (delete=true && deleteRemote=true). This feeds
into the existing retry/give-up logic in processDeleteReplicaResponseReceived,
which retries up to DELETE_TRY_TIMES (5) and then force-marks replicas as
ReplicaDeletionSuccessful, allowing deletion to complete.

Non-deletion stop replicas (delete=false) continue to silently ignore
send failures as before, since they are best-effort.

Tests

Add testDeleteTableWithSendFailure() to verify that deletion completes
even when all StopReplicaRequest sends fail at the network level (i.e.,
CompletableFuture.failedFuture, not a response-level error code). The
existing testDeleteReplicaStateChange() only covered response-level errors.

API and Format

Documentation

lilei1128 · 2026-05-22T03:47:51Z

close #3357

gyang94 · 2026-05-27T13:21:37Z

Hi @lilei1128 , thanks for your contribution to fix this issue. Glad to see you are interested in this task.

In your PR, it is a good thinking to reuse the existing retry mechanism to solve this problem. However, the current retry loop doesn't truly solve the problem. The retry loop has no delay between attempts, so it degrades to "fail 5 times in milliseconds → pretend success → data residue on disk."

I'd like to solve this problem by introducing "per-TS sender" + "ReplicaDeletionIneligible state":

Per-TS sender thread gives each TabletServer an independent request queue with backoff-aware retry. A temporarily unreachable TS doesn't burn through retry attempts instantly — the sender sleeps and retries over a meaningful time window.
ReplicaDeletionIneligible state breaks the current all-or-nothing dilemma. Instead of force-marking unreachable replicas as successfully deleted, they enter Ineligible — which (1) unblocks isEligibleForDeletion (since Ineligible ≠ DeletionStarted, resumeDeletions is no longer stuck), and (2) preserves the truth that deletion hasn't happened yet. When the TS comes back online, replicas transition Ineligible → OfflineReplica → ReplicaDeletionStarted and get genuinely deleted.

If you are interested, welcome to review this PR, leave any comments there, or help to push code to fix based on it.

lilei1128 · 2026-05-27T13:31:00Z

Hi @lilei1128 , thanks for your contribution to fix this issue. Glad to see you are interested in this task.

In your PR, it is a good thinking to reuse the existing retry mechanism to solve this problem. However, the current retry loop doesn't truly solve the problem. The retry loop has no delay between attempts, so it degrades to "fail 5 times in milliseconds → pretend success → data residue on disk."

I'd like to solve this problem by introducing "per-TS sender" + "ReplicaDeletionIneligible state":

Per-TS sender thread gives each TabletServer an independent request queue with backoff-aware retry. A temporarily unreachable TS doesn't burn through retry attempts instantly — the sender sleeps and retries over a meaningful time window.

ReplicaDeletionIneligible state breaks the current all-or-nothing dilemma. Instead of force-marking unreachable replicas as successfully deleted, they enter Ineligible — which (1) unblocks isEligibleForDeletion (since Ineligible ≠ DeletionStarted, resumeDeletions is no longer stuck), and (2) preserves the truth that deletion hasn't happened yet. When the TS comes back online, replicas transition Ineligible → OfflineReplica → ReplicaDeletionStarted and get genuinely deleted.

If you are interested, welcome to review this PR, leave any comments there, or help to push code to fix based on it.

Thank you so much for your detailed and insightful review! I really appreciate you taking the time to explain the flaws in the current retry approach and sharing the better design direction.

I fully agree with your points: the existing retry loop without delay can only cause rapid failures and leave data residues on disk, which doesn’t solve the root problem. The solution you proposed — combining per-TS sender with backoff-aware retry and ReplicaDeletionIneligible state — is elegant and thorough. It not only unblocks the deletion workflow but also keeps the correct state transition logic, which perfectly fixes the all-or-nothing dilemma.

I’m very interested in this design. I will take a close look at your PR.
Thanks again for your guidance!

[server] Fix table deletion stuck when StopReplicaRequest send fails

a143907

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[server] Fix table deletion stuck when StopReplicaRequest send fails#3359

[server] Fix table deletion stuck when StopReplicaRequest send fails#3359
lilei1128 wants to merge 1 commit into
apache:mainfrom
lilei1128:fix

lilei1128 commented May 22, 2026

Uh oh!

lilei1128 commented May 22, 2026

Uh oh!

gyang94 commented May 27, 2026

Uh oh!

lilei1128 commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lilei1128 commented May 22, 2026

Purpose

Brief change log

Tests

API and Format

Documentation

Uh oh!

lilei1128 commented May 22, 2026

Uh oh!

gyang94 commented May 27, 2026

Uh oh!

lilei1128 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lilei1128 commented May 27, 2026 •

edited

Loading