Skip to content

[fix][bk] Fix NPE in IsolatedBookieEnsemblePlacementPolicy when policy class does not match#25825

Open
Shawyeok wants to merge 2 commits into
apache:masterfrom
Shawyeok:hotfix/fix-npe-in-isolated-bookie-ensemble-placement-policy
Open

[fix][bk] Fix NPE in IsolatedBookieEnsemblePlacementPolicy when policy class does not match#25825
Shawyeok wants to merge 2 commits into
apache:masterfrom
Shawyeok:hotfix/fix-npe-in-isolated-bookie-ensemble-placement-policy

Conversation

@Shawyeok
Copy link
Copy Markdown
Contributor

@Shawyeok Shawyeok commented May 19, 2026

Fixes NPE in IsolatedBookieEnsemblePlacementPolicy

Motivation

A NullPointerException occurs in getExcludedBookiesWithIsolationGroups when getIsolationGroup() returns a Pair whose left value is null. This happens when the EnsemblePlacementPolicyConfig carries a policy class name that does not equal IsolatedBookieEnsemblePlacementPolicy — in that case the if block that calls pair.setLeft() / pair.setRight() is skipped entirely, leaving both sides of the MutablePair as null.

This bug was observed during the upgrade from the 2.8 branch to the 3.0 branch. The same defect may also exist in master.

Full stack trace from production:

2026-05-18T13:28:40.756Z [BookKeeperClientWorker-OrderedExecutor-35-0] ERROR org.apache.bookkeeper.common.util.SingleThreadExecutor - Error while running task: Cannot invoke "java.util.Set.contains(Object)" because the return value of "org.apache.commons.lang3.tuple.Pair.getLeft()" is null
java.lang.NullPointerException: Cannot invoke "java.util.Set.contains(Object)" because the return value of "org.apache.commons.lang3.tuple.Pair.getLeft()" is null
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.getExcludedBookiesWithIsolationGroups(IsolatedBookieEnsemblePlacementPolicy.java:192)
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.getExcludedBookies(IsolatedBookieEnsemblePlacementPolicy.java:141)
    at org.apache.pulsar.bookie.rackawareness.IsolatedBookieEnsemblePlacementPolicy.replaceBookie(IsolatedBookieEnsemblePlacementPolicy.java:127)
    at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:316)
    at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:69)
    at org.apache.bookkeeper.client.ReadOnlyLedgerHandle.handleBookieFailure(ReadOnlyLedgerHandle.java:222)
    at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:353)
    at org.apache.bookkeeper.proto.BookieClientImpl.completeAdd(BookieClientImpl.java:287)
    at org.apache.bookkeeper.proto.BookieClientImpl.access$200(BookieClientImpl.java:79)
    at org.apache.bookkeeper.proto.BookieClientImpl$ChannelReadyForAddEntryCallback.lambda$operationComplete$0(BookieClientImpl.java:405)
    at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.safeRunTask(SingleThreadExecutor.java:137)
    at org.apache.bookkeeper.common.util.SingleThreadExecutor.run(SingleThreadExecutor.java:107)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:840)

This error causes topics to time out during loading. Since there is no automatic recovery path, the only way to restore service is by restarting the broker.

Modifications

Two complementary fixes:

  1. getIsolationGroup(): Ensure not returning pair of null: add compatibility with ZkIsolatedBookieEnsemblePlacementPolicy and fallback to defaultIsolationGroups if met a mis-match policy class.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added testReplaceBookieWithNonMatchingPolicyClassShouldNotThrowNPE to reproduce the exact production failure: calling replaceBookie with custom metadata whose EnsemblePlacementPolicyConfig policy class does not match IsolatedBookieEnsemblePlacementPolicy, which previously caused the NPE.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

}
for (String group : allGroups) {
Set<String> bookiesInGroup = allGroupsBookieMapping.get(group).keySet();
if (!primaryIsolationGroup.contains(group)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the problematic behavior for the non-matching policyClass case. After this PR, getIsolationGroup() returns empty primary/secondary groups when the metadata policyClass is not
IsolatedBookieEnsemblePlacementPolicy. With primaryIsolationGroup empty, this condition is true for every configured group, so all grouped bookies are added to excludedBookies below. That avoids the NPE but can
still make replaceBookie fail with BKNotEnoughBookiesException. A non-matching policyClass should be ignored or fall back to defaultIsolationGroups, not be treated as an empty isolation group.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shawyeok Please check this. I wonder if there's a test case that would cover the explained scenario?

Copy link
Copy Markdown
Contributor Author

@Shawyeok Shawyeok May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@void-ptr974
Good point! The tricky part is that handling of empty isolation group in IsolatedBookieEnsemblePlacementPolicy. Now it as treated as negative for all bookies.

Currently, we can only call the getExcludedBookiesWithIsolationGroups method before resolving any possible pair of null scenarios.

Copy link
Copy Markdown
Contributor Author

@Shawyeok Shawyeok May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a change of fallback to use defaultIsolationGroups when policy class doesn't match. Feel free to take a look. @void-ptr974 @lhotari

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…y class does not match

MutablePair was initialized without default values, leaving left/right as null
when the ensemble placement policy class name did not match. This caused a
NullPointerException when getLeft().contains() was called.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Shawyeok Shawyeok force-pushed the hotfix/fix-npe-in-isolated-bookie-ensemble-placement-policy branch from eb5e766 to 35f3158 Compare May 22, 2026 06:55
Copy link
Copy Markdown
Contributor

@void-ptr974 void-ptr974 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing the fallback behavior.

@Shawyeok
Copy link
Copy Markdown
Contributor Author

/pulsarbot rerun-failure-checks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants