PHOENIX-7845 ReplicationLogGroup initialization resilience to standby cluster unavailability #2466
Conversation
… cluster unavailability
…eption

- Replace RuntimeException wrapping in get() with UncheckedIOException to avoid misclassifying unrelated RuntimeExceptions with IOException causes
- Bound peer shard manager creation with configurable timeout (default 10s) via CompletableFuture.get() to prevent blocking the disruptor handler thread on peer NN outage; TimeoutException triggers SAF degradation
- Consolidate peer shard manager into a single lazy synchronized accessor on ReplicationLogGroup; remove per-component caching from forwarder
- Cancel the in-flight future on timeout to release the ForkJoinPool thread
- Add test for timeout-triggered SAF degradation
apurtell
left a comment
+1 with some minor comments
```java
try {
  peerShardManager = future.get(peerInitTimeoutMs, TimeUnit.MILLISECONDS);
  return peerShardManager;
} catch (UncheckedIOException e) {
```
Nit. future.get() won't throw UncheckedIOException. When the supplier throws unchecked, the future fails and Future.get() wraps the cause in ExecutionException. You can unwrap the UncheckedIOException from ExecutionException where you need it.
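A minimal, self-contained sketch of the unwrapping the reviewer describes (class and method names here are illustrative, not from the patch): the supplier's UncheckedIOException surfaces as the *cause* of the ExecutionException thrown by `Future.get()`, so the handler can recover the original IOException from it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PeerInitSketch {

    static Object createPeerShardManager() {
        // Simulate the supplier failing with an unchecked IO wrapper.
        throw new UncheckedIOException(new IOException("peer NN unreachable"));
    }

    static Object initPeer(long timeoutMs) throws IOException {
        CompletableFuture<Object> future =
            CompletableFuture.supplyAsync(PeerInitSketch::createPeerShardManager);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof UncheckedIOException) {
                // Recover the original IOException for proper classification.
                throw ((UncheckedIOException) cause).getCause();
            }
            throw new IOException("Failed to create peer shard manager", cause);
        } catch (TimeoutException | InterruptedException e) {
            throw new IOException("Peer shard manager init timed out or interrupted", e);
        }
    }

    public static void main(String[] args) {
        try {
            initPeer(1000);
            System.out.println("unexpected success");
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```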
```java
}
  throw new IOException("Failed to create peer shard manager", e.getCause());
} catch (TimeoutException e) {
  future.cancel(true);
```
future.cancel(true) does not actually free the worker thread. (Per CompletableFuture Javadoc, mayInterruptIfRunning has no effect i.e. interrupts are not used to control processing.) The stall is moved to the worker pool, which is better, but the stall still holds resources the entire time the underlying HDFS call is blocked. Every subsequent call to this method spawns another supplyAsync task while the previous one is still hung, so this issue can compound and potentially exhaust the worker pool.
I asked the robot about this and it recommended: Use a dedicated ExecutorService (single-thread, bounded queue) so future.cancel(true) can actually interrupt, and so leakage is bounded.
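A sketch of that suggestion, under assumed names (nothing here is from the patch): submit the blocking peer-init call to a dedicated single-thread executor with a bounded queue. Unlike CompletableFuture, the Future returned by `ExecutorService.submit()` honors `cancel(true)` by interrupting the worker, and the bounded queue caps how many stalled attempts can pile up.

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedPeerInit {

    // One worker, at most one queued attempt; further submits are rejected.
    private static final ExecutorService PEER_INIT_POOL = new ThreadPoolExecutor(
        1, 1, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(1),
        new ThreadPoolExecutor.AbortPolicy());

    static Object initPeer(Callable<Object> createPeerShardManager, long timeoutMs)
            throws IOException {
        Future<Object> future;
        try {
            future = PEER_INIT_POOL.submit(createPeerShardManager);
        } catch (RejectedExecutionException e) {
            throw new IOException("Previous peer init attempt still pending", e);
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // actually interrupts the worker thread here
            throw new IOException("Timed out creating peer shard manager", e);
        } catch (ExecutionException e) {
            throw new IOException("Failed to create peer shard manager", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted creating peer shard manager", e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulate a hung HDFS call; the 100 ms timeout triggers cancel(true).
            initPeer(() -> { Thread.sleep(60_000); return new Object(); }, 100);
        } catch (IOException e) {
            System.out.println("degraded: " + e.getMessage());
        }
        PEER_INIT_POOL.shutdownNow();
    }
}
```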
```java
if (cached != null) {
  return cached;
}
synchronized (this) {
```
With both the disruptor handler and the forwarder thread involved, the mutual exclusion here may block for longer than 10s: a thread arriving second queues behind the first thread's full peer-init attempt before starting its own.
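One way to bound that wait (purely illustrative, not something the patch does) is to replace the synchronized block with `ReentrantLock.tryLock` and a timeout, so a caller waits at most its own budget for the lock instead of stacking behind another thread's full 10s attempt:

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class BoundedLazyInit {

    private final ReentrantLock initLock = new ReentrantLock();
    private volatile Object peerShardManager; // cached once created

    Object getOrInit(Supplier<Object> factory, long lockWaitMs) throws IOException {
        Object cached = peerShardManager;
        if (cached != null) {
            return cached;
        }
        boolean locked = false;
        try {
            locked = initLock.tryLock(lockWaitMs, TimeUnit.MILLISECONDS);
            if (!locked) {
                // Another thread is mid-init; fail fast rather than stall.
                throw new IOException("Peer init in progress, lock not acquired");
            }
            if (peerShardManager == null) {
                peerShardManager = factory.get();
            }
            return peerShardManager;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted waiting for peer init lock", e);
        } finally {
            if (locked) {
                initLock.unlock();
            }
        }
    }
}
```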
- Remove dead catch(UncheckedIOException) block; future.get() wraps supplier exceptions in ExecutionException, not UncheckedIOException
- Unwrap UncheckedIOException inside ExecutionException handler to recover the original IOException
- Cache the in-flight CompletableFuture to prevent spawning unbounded async tasks when peer HDFS is unavailable; reuse the pending future on timeout, retry only after exceptional completion
- Remove no-op future.cancel(true), which has no effect on CompletableFuture (mayInterruptIfRunning is ignored)
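A minimal sketch of the in-flight-future caching described in this commit message, with illustrative field and method names (not the patch's actual code): keep one CompletableFuture for the peer shard manager; on timeout, keep the pending future cached and re-check it on the next call instead of spawning a new supplyAsync task; start a fresh attempt only after the cached one completes exceptionally.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class CachedPeerInit {

    private volatile CompletableFuture<Object> pending;

    synchronized Object getPeerShardManager(Supplier<Object> factory, long timeoutMs)
            throws IOException {
        CompletableFuture<Object> f = pending;
        if (f == null || f.isCompletedExceptionally()) {
            // No attempt in flight (or the last one failed): start a new one.
            f = CompletableFuture.supplyAsync(factory);
            pending = f;
        }
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Leave the pending future cached; the next call reuses it.
            throw new IOException("Peer shard manager not ready", e);
        } catch (ExecutionException e) {
            throw new IOException("Failed to create peer shard manager", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted creating peer shard manager", e);
        }
    }
}
```

With this shape, repeated timeouts while the peer is down never add a second async task; only a failed (exceptionally completed) attempt triggers a retry.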
Summary
- Moved peer shard manager initialization from `ReplicationLogGroup.init()` to individual mode consumers (`SyncModeImpl`, `SyncAndForwardModeImpl`, `ReplicationLogDiscoveryForwarder`) and the `updateModeOnFailure()` path
- Changed `ReplicationLogGroup.get()` to throw `IOException` instead of `RuntimeException`, so callers in `IndexRegionObserver` get proper error classification (not misclassified as `IndexBuildingFailureException`)
- Replaced `createStandbyLog()`/`createFallbackLog()` with a single `createReplicationLog(shardManager)` factory

Test plan
- `testInitDegradesToSafWhenPeerUnavailable` — peer unavailable at startup → SAF mode, writes succeed
- `testInitFailsWhenLocalUnavailable` — local FS unavailable → init fails with IOException
- `testForwarderRetriesPeerCreation` — forwarder retries peer shard manager on next round after initial failure
- `ReplicationLogGroupTest` tests pass
- `ReplicationLogDiscoveryForwarderTest` tests pass
- `ReplicationLogTest` tests pass