[server] fix stop-replica deletion stuck when TabletServer is offline#3369

Closed
gyang94 wants to merge 1 commit into apache:main from gyang94:fix-stop-replicas-failed

Contributor

gyang94 commented May 24, 2026

Purpose

Linked issue: close #3357

Brief change log

Previously, when a stopReplica(delete=true) RPC failed (e.g., because the TabletServer was offline or a network error occurred), the affected replicas stayed in ReplicaDeletionStarted permanently, which blocked deletion of the table or partition indefinitely.

This PR aligns the deletion flow with Kafka's approach by introducing a ReplicaDeletionIneligible state. When a stopReplica RPC fails to send or the target TabletServer goes offline, affected replicas are transitioned to ReplicaDeletionIneligible and the owning table/partition is marked ineligible for deletion. When the TabletServer reconnects, the ineligible mark is cleared and deletion is automatically retried.
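
For illustration only, the transitions described above can be sketched as a small state holder. This is a simplified model, not the code in this PR: the class `TableDeletionSketch`, its method names, and the enum constants are hypothetical, replicas are keyed by a TabletServer id, and the table-level ineligible mark is derived from replica states rather than stored separately.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of one table's replica deletion; not Fluss's actual classes. */
class TableDeletionSketch {
    enum ReplicaState { DELETION_STARTED, DELETION_SUCCESSFUL, DELETION_INELIGIBLE }

    // Assumption: one replica per TabletServer id, for brevity.
    private final Map<Integer, ReplicaState> stateByServer = new HashMap<>();

    /** Deletion begins: a stopReplica(delete=true) is sent for every replica. */
    void startDeletion(int... serverIds) {
        for (int id : serverIds) {
            stateByServer.put(id, ReplicaState.DELETION_STARTED);
        }
    }

    /** The TabletServer acknowledged the stopReplica request. */
    void onStopReplicaSucceeded(int serverId) {
        stateByServer.replace(serverId, ReplicaState.DELETION_SUCCESSFUL);
    }

    /** The RPC failed to send, or the TabletServer went offline. */
    void onStopReplicaFailed(int serverId) {
        if (stateByServer.get(serverId) == ReplicaState.DELETION_STARTED) {
            stateByServer.put(serverId, ReplicaState.DELETION_INELIGIBLE);
        }
    }

    /** The TabletServer reconnected; returns true if deletion should be retried. */
    boolean onServerOnline(int serverId) {
        if (stateByServer.get(serverId) == ReplicaState.DELETION_INELIGIBLE) {
            stateByServer.put(serverId, ReplicaState.DELETION_STARTED);
            return true;
        }
        return false;
    }

    /** The table is ineligible for deletion while any replica is ineligible. */
    boolean isTableIneligible() {
        return stateByServer.containsValue(ReplicaState.DELETION_INELIGIBLE);
    }

    /** Deletion can finish once every replica reports success. */
    boolean isDeletionComplete() {
        return !stateByServer.isEmpty()
                && stateByServer.values().stream()
                        .allMatch(s -> s == ReplicaState.DELETION_SUCCESSFUL);
    }

    ReplicaState stateOf(int serverId) {
        return stateByServer.get(serverId);
    }
}
```

In this sketch a failed replica no longer sits in the started state with nothing to move it forward: the failure is recorded explicitly, and the reconnect event is what puts the replica back into the retry path.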

Tests

API and Format

Documentation

gyang94 changed the title from "fix: stop-replica-failed" to "[server] fix stop-replica-failed" May 24, 2026
gyang94 force-pushed the fix-stop-replicas-failed branch 3 times, most recently from 496b0d8 to 9e76737 May 25, 2026 10:10
gyang94 force-pushed the fix-stop-replicas-failed branch from 9e76737 to 5fac1ab May 27, 2026 10:43
gyang94 changed the title from "[server] fix stop-replica-failed" to "[server] fix stop-replica deletion stuck when TabletServer is offline" May 27, 2026
gyang94 closed this May 27, 2026


Development

Successfully merging this pull request may close these issues.

[server] Table deletion stuck permanently when StopReplica request fails
