
Fix telemetry HTTP client socket leak preventing CRaC checkpoint #1333

Open

gopalldb wants to merge 1 commit into databricks:main from gopalldb:fix/telemetry-http-client-socket-leak

Conversation

@gopalldb (Collaborator)

Summary

Fixes #1325 (follow-up to #1233). After Connection.close(), delayed telemetry flush tasks could re-create TELEMETRY HTTP clients that were never closed, leaking TCP sockets and preventing CRaC checkpoint.

Two cross-thread race conditions fixed:

  1. TelemetryClient re-creation: getTelemetryClient() after closeTelemetryClient() created an orphaned TelemetryClient with a periodic flush scheduler that nobody closes. Fixed by tracking closed connection UUIDs and returning NoopTelemetryClient, and by reordering closeTelemetryClient() to export pending collector events before closing the client.

  2. HTTP client re-creation: getClient(ctx, TELEMETRY) after removeClient(ctx) re-created a DatabricksHttpClient via computeIfAbsent that nobody closes. Fixed by adding closeConnection() which permanently marks a connection as closed, causing getClient() to return null.

Files changed

  • TelemetryClientFactory.java: added closedConnectionUuids guard, reordered close sequence
  • DatabricksHttpClientFactory.java: added closedConnections guard, new closeConnection() method
  • DatabricksConnection.java: use closeConnection() instead of removeClient()
  • TelemetryPushClient.java: null guard for getClient() return value
  • TelemetryHttpClientLeakTest.java: 3 reproduction/verification tests
  • RCA_SOCKET_LEAK_TELEMETRY_HTTP_CLIENT.md: full RCA with reproduction and verification plan

Test results

  • 3085 unit tests pass, 0 failures, 0 errors
  • 3 new tests specifically reproduce and verify the fix for both race conditions

Test plan

  • TelemetryHttpClientLeakTest#testGetTelemetryClientAfterCloseReCreatesClient — verifies NoopTelemetryClient returned after close
  • TelemetryHttpClientLeakTest#testGetClientReturnsNullAfterCloseConnection — verifies null returned from HTTP client factory after close
  • TelemetryHttpClientLeakTest#testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient — verifies pending collector events don't cause re-creation
  • All 152 existing telemetry tests pass
  • Full unit test suite (3085 tests) passes
  • Manual verification with CRaC-enabled JDK (0 sockets after Connection.close())

This pull request was AI-assisted by Isaac.

…abricks#1325)

Root cause: After Connection.close(), delayed telemetry flush tasks could
re-create TELEMETRY HTTP clients and TelemetryClients that were never
closed, leaking TCP sockets and preventing CRaC checkpoint.

Two race conditions fixed:

1. TelemetryClientFactory: getTelemetryClient() after closeTelemetryClient()
   re-created an orphaned TelemetryClient. Fixed by tracking closed
   connection UUIDs and returning NoopTelemetryClient, and by reordering
   closeTelemetryClient() to export pending collector events before
   closing the client.

2. DatabricksHttpClientFactory: getClient(ctx, TELEMETRY) after
   removeClient(ctx) re-created a leaked DatabricksHttpClient via
   computeIfAbsent. Fixed by adding closeConnection() which permanently
   marks a connection as closed, causing getClient() to return null.

Co-authored-by: Isaac
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
@gopalldb requested a review from samikshya-db on March 27, 2026 at 06:27
LOGGER.debug(
"Rejecting getClient() for closed connection {} with type {}",
context.getConnectionUuid(),
type);

[F1] TOCTOU race — the PR does not actually close the race it claims to fix (High)

The guard reads closedConnections.contains(uuid) then falls through to instances.computeIfAbsent(...). These are two separate operations on two separate maps with no mutual exclusion. Interleaving:

  1. Thread A (delayed TelemetryPushTask on the 10-thread pool): evaluates closedConnections.contains(uuid) → false.
  2. Thread B (DatabricksConnection.close()): runs closeConnection() — adds UUID, wipes instances.
  3. Thread A resumes computeIfAbsent → creates a new DatabricksHttpClient(TELEMETRY) → leaks a TCP socket.

This is exactly the race the RCA describes for the old code. The new code narrows the window but does not close it.

Fix — fuse guard and map op atomically:

String uuid = context.getConnectionUuid();
return instances.compute(
    getClientKey(uuid, type),
    (k, existing) -> {
      if (existing != null) return existing;
      if (uuid != null && closedConnections.contains(uuid)) return null;
      return new DatabricksHttpClient(context, type);
    });

Apply the same pattern in TelemetryClientFactory.getTelemetryClient().

Flagged independently by 4 reviewers (language, security, performance, devils-advocate).

* Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from
* re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard,
* delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets.
* See GitHub issue #1325.

[F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)

closedConnections is a process-wide ConcurrentHashMap.newKeySet() that is append-only for the JVM lifetime — no TTL, no cap, no eviction, no removal path. TelemetryClientFactory.closedConnectionUuids has the same shape and clears only in reset() (test-only).

Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. At 100 closes/sec (realistic for HikariCP under load) that is ~24 kB/s, roughly 2 GB/day. This hits exactly the long-lived CRaC workloads this PR targets.

Fix options (in preference order):

  1. Bound via Caffeine Cache with expireAfterWrite(5-10 minutes) — the race window is seconds, not JVM lifetime.
  2. Flip to an allowlist: insert the UUID into an open-connection set in getClient() on first use and remove it in closeConnection(). Bounded to live connections, and solves F1's TOCTOU for free (see the sketch after this list).
  3. Encode the tombstone in the existing instances map.
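
Option 2 in minimal, standalone form; class, field, and method names here are illustrative sketches, not the driver's actual code:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of the allowlist idea only; the real factory's types and fields differ.
final class OpenConnectionAllowlistSketch {
  private final Set<String> openUuids = ConcurrentHashMap.newKeySet();
  private final Map<String, Object> clients = new ConcurrentHashMap<>();

  void onConnectionOpen(String uuid) {
    openUuids.add(uuid); // membership is bounded by the number of live connections
  }

  Object getClient(String uuid) {
    if (uuid == null) {
      return null;
    }
    // compute() fuses the allowlist check with creation, which also removes F1's TOCTOU window
    return clients.compute(
        uuid,
        (k, existing) -> {
          if (existing != null) {
            return existing;
          }
          return openUuids.contains(uuid) ? new Object() /* stand-in for a real client */ : null;
        });
  }

  void closeConnection(String uuid) {
    if (uuid == null) {
      return;
    }
    openUuids.remove(uuid); // the set shrinks again when the connection closes
    clients.remove(uuid); // the real code would also close the removed client here
  }
}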

Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture).

// getClient(ctx, TELEMETRY) after removeClient(ctx) has already run.
String connectionUuid = context.getConnectionUuid();
if (connectionUuid != null && closedConnections.contains(connectionUuid)) {
LOGGER.debug(

[F3] getClient() silently widened to nullable return — 7 unverified callers (High)

Pre-PR getClient() was non-null. Post-PR it can return null. Only TelemetryPushClient.pushEvent() was updated. Verified call sites that store or dereference without a null check:

  • DatabricksThriftAccessor.java:662 (passed to the DatabricksHttpTTransport ctor)
  • DatabricksTokenFederationProvider.java:74 (stored in this.hc)
  • AzureMSICredentialProvider.java:43 (stored in this.httpClient)
  • PrivateKeyClientCredentialProvider.java:27 (stored)
  • DatabricksDriverFeatureFlagsContext.java:92 (runs on a background scheduler; execute() is caught by catch(Exception) at TRACE → silent failure of FF refresh)
  • ArrowStreamResult.java:53 and 153 (ctor param → NPE later in chunk download)

None of these call sites null-checks the return value.

No @Nullable annotation despite the codebase using javax.annotation.Nullable elsewhere (SqlParameter.java:10, DatabricksSession.java:102). Contract change is invisible to callers.

Fix (preferred): return a rejecting sentinel IDatabricksHttpClient whose execute() throws ConnectionClosedException (symmetric to NoopTelemetryClient), sketched below. All callers stay safe by default.
Alternative: split into getClient() (throws) + getClientOrNull() (used only by TelemetryPushClient), or add @Nullable + Javadoc and null-check every caller.
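
A standalone sketch of the preferred sentinel option. IDatabricksHttpClient's real methods are not visible in this diff, so a minimal interface shape is assumed here purely for illustration:

import java.io.IOException;
import org.apache.http.ConnectionClosedException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpUriRequest;

// Assumed, simplified interface shape; the driver's real IDatabricksHttpClient may differ.
interface HttpClientSketch {
  CloseableHttpResponse execute(HttpUriRequest request) throws IOException;
}

// Rejecting sentinel: callers always get a non-null client, and any use after close fails loudly.
final class ClosedConnectionHttpClient implements HttpClientSketch {
  static final ClosedConnectionHttpClient INSTANCE = new ClosedConnectionHttpClient();

  private ClosedConnectionHttpClient() {}

  @Override
  public CloseableHttpResponse execute(HttpUriRequest request) throws IOException {
    throw new ConnectionClosedException("HTTP client requested after connection close");
  }
}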

Flagged by 6 reviewers.


/**
* Permanently closes all HTTP clients for the given connection and prevents new ones from being
* created. This should be called from DatabricksConnection.close() to prevent delayed

[F5] closeConnection() NPEs on null UUID — inconsistent with getClient()'s null guard (High)

ConcurrentHashMap.newKeySet() rejects null elements — closedConnections.add(null) throws NPE. Yet the new guard in getClient() in the same file (if (connectionUuid != null && closedConnections.contains(connectionUuid))) explicitly null-checks, implying null UUIDs are considered possible. The two paths disagree on the invariant.

Same issue in TelemetryClientFactory.closeTelemetryClient at the closedConnectionUuids.add(connectionUuid) calls inside the computeIfPresent lambdas — no null-guard.

Fix: Pick one invariant. Either (a) assert non-null in closeConnection/closeTelemetryClient and document the constraint on IDatabricksConnectionContext.getConnectionUuid(), or (b) mirror the getClient guard:

String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);

Flagged by 2 reviewers (architecture, language).

* flushed through the existing client. See GitHub issue #1325.
*/
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
String key = TelemetryHelper.keyOf(connectionContext);

[F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High)

The new ordering runs exportAllPendingTelemetryDetails() BEFORE the two computeIfPresent blocks that (a) mark UUID closed, (b) close the client, (c) remove the holder.

If exportAllPendingTelemetryDetails() or the exportTelemetryLog it calls throws any unchecked exception, execution exits closeTelemetryClient and propagates up to DatabricksConnection.close(). Consequences:

  • TelemetryClient stays in the holder map with its periodic flush scheduler alive
  • closedConnectionUuids never updated → future getTelemetryClient() won't return Noop
  • DatabricksHttpClientFactory.closeConnection() on the next line of DatabricksConnection.close() never runs

One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer).

Fix:

  1. Mark UUIDs closed unconditionally up-front (before export).
  2. Run holder cleanup.
  3. Run exportAllPendingTelemetryDetails last, wrapped in try/catch that logs and swallows (see the sketch after this list).
  4. In DatabricksConnection.close() wrap the cleanup chain in try/finally so closeConnection(ctx) always runs even if earlier steps throw.
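
A standalone sketch of that ordering; all names are illustrative and do not match the driver's actual fields:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: tombstone first, cleanup second, best-effort export last.
final class FailSafeCloseSketch {
  private final Set<String> closedUuids = ConcurrentHashMap.newKeySet();
  private final Map<String, AutoCloseable> holders = new ConcurrentHashMap<>();

  boolean isClosed(String uuid) {
    return uuid != null && closedUuids.contains(uuid); // what a getTelemetryClient() guard would read
  }

  void closeTelemetryClient(String uuid) {
    if (uuid == null) {
      return;
    }
    closedUuids.add(uuid); // (1) mark closed up-front, before anything that can throw
    holders.computeIfPresent( // (2) holder cleanup runs before the export, so it cannot be skipped
        uuid,
        (k, client) -> {
          try {
            client.close();
          } catch (Exception e) {
            System.err.println("client close failed, continuing: " + e);
          }
          return null; // removes the holder entry
        });
    try {
      exportPendingEvents(uuid); // (3) best-effort export last; an exception no longer aborts cleanup
    } catch (RuntimeException e) {
      System.err.println("export on close failed, swallowing: " + e);
    }
  }

  private void exportPendingEvents(String uuid) {
    // stand-in for exportAllPendingTelemetryDetails()
  }
}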

Flagged by ops reviewer.

"LEAK BUG (issue #1325): getTelemetryClient() after closeTelemetryClient() "
+ "created a new TelemetryClient instead of returning NoopTelemetryClient. "
+ "This orphaned client will never be closed, and its periodic flush creates "
+ "TELEMETRY HTTP clients that leak TCP sockets.");

[F4] Tests are single-threaded; do not reproduce the cross-thread race (High)

The RCA describes a race between main thread (DatabricksConnection.close()) and pool thread (delayed TelemetryPushTask.run() on the 10-thread telemetryExecutorService). All three tests run sequentially on the JUnit main thread — no CountDownLatch/CyclicBarrier/real executor/parallel stress.

  • Test #1 (testGetTelemetryClientAfterCloseReCreatesClient) is tautological: closedConnectionUuids.add(uuid) runs inside closeTelemetryClient first, so a sequential getTelemetryClient call trivially returns NoopTelemetryClient. Cannot fail against a partial revert of the fix.
  • Test #3 mocks TelemetryHelper.exportTelemetryLog with a lambda that calls getTelemetryClient(ctx). But the real exportTelemetryLog reads from DatabricksThreadContextHolder.getConnectionContext() — a thread-local that Connection.close() clears. The real re-creation vector is TelemetryPushClient.pushEvent() capturing ctx at ctor time (verified in source at TelemetryPushClient.java:47-49) — which no test covers.

Fix — add a real cross-thread test:

@Test
void testRaceBetweenCloseAndDelayedPushTask() throws Exception {
  CountDownLatch blockInPush = new CountDownLatch(1);
  CountDownLatch closeDone = new CountDownLatch(1);
  // Spy push client blocks inside pushEvent until main thread calls closeConnection
  // Submit flush task on real pool
  // Main thread: await blockInPush entered, then closeConnection(ctx)
  // Release latch; assert instances map has no TELEMETRY entry for uuid after join
}

Delete or rewrite Tests #1 and #3 so they would fail against a partial revert of the fix.

Flagged by 2 reviewers (test, devils-advocate).

}
}

/**

[F7] No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution (Medium)

TelemetryClientFactory.reset() (in this PR at the hunk around line 243) clears closedConnectionUuids. DatabricksHttpClientFactory has no equivalent. It's a JVM singleton and closedConnections is an instance field.

Consequence: any test that calls closeConnection(ctx) permanently adds UUID to closedConnections for the rest of the test JVM. A later test reusing the same UUID gets null from getClient() with no symptom except "my test fails mysteriously when run after another test."

Fix: Add @VisibleForTesting void reset() on DatabricksHttpClientFactory that clears both instances and closedConnections. Call in @BeforeEach/@AfterEach of TelemetryHttpClientLeakTest.

k -> new DatabricksHttpClient(context, type));
}

/**

[F8] removeClient(ctx) is a public footgun with zero production callers (Medium)

After this PR, the single-argument removeClient(IDatabricksConnectionContext) overload has exactly one caller in the repo: DatabricksHttpClientTest.java:240 (a test). All production call sites were migrated to closeConnection() by this PR.

The RCA explicitly identifies removeClient's "allows re-creation via computeIfAbsent" behavior as the root cause of #1325. Shipping two nearly-identically-named public methods where one silently re-enables the bug this PR fixes is a future-maintainer trap — the name "removeClient" reads as the obvious choice on IDE autocomplete.

Fix: Delete the single-argument overload and migrate the test; or mark it @VisibleForTesting / @Deprecated with a pointer to closeConnection. At minimum, rename it to signal intent (e.g., evictClientsForReconnect).

Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat).

*
* <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient()
* from re-creating a TelemetryClient during or after the close sequence. Pending
* TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are

[F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium)

New ordering: (1) removeCollector, (2) exportAllPendingTelemetryDetails() which calls getTelemetryClient(ctx) through the factory, (3) computeIfPresent removes holder and marks UUID closed.

Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):

  • Thread A starts step 2's export loop.
  • Thread B completes step 3 first (removes holder, marks UUID closed).
  • Thread A's next getTelemetryClient(ctx) now returns NoopTelemetryClient → events silently dropped.
  • Or a concurrent re-open races in between, and step 2's getTelemetryClient creates a new TelemetryClient (the exact original bug).

Old ordering was order-insensitive because holder-remove+close was atomic under computeIfPresent.

Fix: Mark UUID closed BEFORE export, and export via a directly-held TelemetryClient reference (either look it up from the holder map first, or have the removal return the client). Don't route the export back through the factory.
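
A standalone sketch of that shape, where the removal hands back the live client (names are illustrative; the driver's real holder map and client type differ):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch only.
final class CloseViaHeldReferenceSketch {
  interface FlushableClient extends AutoCloseable {
    void exportPending();
  }

  private final Set<String> closedUuids = ConcurrentHashMap.newKeySet();
  private final Map<String, FlushableClient> holders = new ConcurrentHashMap<>();

  void close(String uuid) throws Exception {
    if (uuid == null) {
      return;
    }
    closedUuids.add(uuid); // mark closed before exporting, so a racing lookup sees the tombstone
    FlushableClient client = holders.remove(uuid); // removal hands back the live client (or null)
    if (client != null) {
      try {
        client.exportPending(); // export through the held reference, never back through the factory
      } finally {
        client.close();
      }
    }
  }
}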

Flagged by architecture reviewer.

DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
DatabricksHttpClientFactory.getInstance().removeClient(connectionContext);
DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);

[F10] No regression test for the Connection.close() post-condition (Medium)

This one-line change alters observable behavior: after connection.close(), DatabricksHttpClientFactory.getInstance().getClient(ctx, TELEMETRY) now returns null (previously always returned a new client). Telemetry pushes submitted during close are now silently dropped.

No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered.

Fix: Add one focused assertion in the existing DatabricksConnection test class:

@Test
void getClientReturnsNullAfterConnectionClose() {
  DatabricksConnection conn = ...;
  conn.close();
  assertNull(DatabricksHttpClientFactory.getInstance()
      .getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}

Flagged by test reviewer.

.getClient(connectionContext, HttpClientType.TELEMETRY);
if (httpClient == null) {
// Connection was closed — HTTP client factory rejected the request to prevent socket leaks.
LOGGER.debug("Skipping telemetry push: connection has been closed");

[F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium)

Two invisibility paths added by this PR:

  1. DatabricksHttpClientFactory.getClient — rejection logged at LOGGER.debug("Rejecting getClient() for closed connection {} with type {}"). Below default production log level. If a non-TELEMETRY caller hits this path due to a future bug or code change, operators never see it.
  2. TelemetryPushClient.pushEvent on null — LOGGER.debug("Skipping telemetry push: connection has been closed"). In aggressive close scenarios (shutdown, failover, pool eviction), every dropped push is invisible. No counter. Operators cannot answer "is telemetry working?"

Fix:

  • Raise the getClient rejection log to WARN for non-TELEMETRY types (preserves DEBUG for the expected path).
  • Add an AtomicLong droppedOnClosedCount on the factory; emit a periodic aggregated WARN when > 0, or expose it via JMX/existing metrics (see the sketch after this list).
  • Include Thread.currentThread().getName() in the rejection log for diagnosis.
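
A minimal sketch of the counter idea (field and method names are illustrative; the real code would use the driver's LOGGER rather than stderr):

import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch: count drops and emit an aggregated warning instead of one DEBUG line per drop.
final class DroppedTelemetryCounterSketch {
  private final AtomicLong droppedOnClosedCount = new AtomicLong();

  void onRejectedGetClient(String connectionUuid, String clientType) {
    long dropped = droppedOnClosedCount.incrementAndGet();
    if (dropped == 1 || dropped % 1000 == 0) { // aggregate: first drop, then every 1000th
      System.err.printf(
          "Dropped %d request(s) for closed connections; last uuid=%s type=%s thread=%s%n",
          dropped, connectionUuid, clientType, Thread.currentThread().getName());
    }
  }

  long droppedCount() {
    return droppedOnClosedCount.get(); // could be exposed via JMX or the driver's existing metrics
  }
}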

Flagged by 3 reviewers (ops, agent-compat, maintainability).

* here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector
* exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This
* prevents the socket leak described in GitHub issue #1325.
*/

[F13] Duplicated registry pattern across two factories (Medium)

closedConnectionUuids here and closedConnections in DatabricksHttpClientFactory are the same concept implemented twice with different names. Both are ConcurrentHashMap.newKeySet() of connection UUIDs, both checked before compute* in the same way, both added on close. Neither is bounded.

Fix: Extract a ClosedConnectionRegistry helper in com.databricks.jdbc.common with markClosed(uuid) / isClosed(uuid) / clear(). Back it with a bounded cache (Caffeine, e.g. expireAfterWrite(5 min) or max ~10K entries LRU). Both factories hold a reference (or share a singleton). Solves the duplication AND F2's memory growth in one change.
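
A standalone sketch of such a registry, assuming Caffeine is (or could be made) available as a dependency; names and bounds are illustrative:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Hedged sketch of a shared, bounded closed-connection registry.
final class ClosedConnectionRegistry {
  private final Cache<String, Boolean> closed =
      Caffeine.newBuilder()
          .expireAfterWrite(Duration.ofMinutes(10)) // the race window is seconds, not JVM lifetime
          .maximumSize(10_000) // hard cap as a second safety bound
          .build();

  void markClosed(String connectionUuid) {
    if (connectionUuid != null) {
      closed.put(connectionUuid, Boolean.TRUE);
    }
  }

  boolean isClosed(String connectionUuid) {
    return connectionUuid != null && closed.getIfPresent(connectionUuid) != null;
  }

  void clear() {
    closed.invalidateAll(); // test-only reset
  }
}

Both factories would hold a reference to a shared instance of this registry, replacing closedConnectionUuids and closedConnections.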

Flagged by 3 reviewers (maintainability, agent-compat, ops).

@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) {

public IDatabricksHttpClient getClient(

[F16] @Nullable annotation + irreversibility are undocumented (Low/Medium)

The codebase already uses javax.annotation.Nullable (e.g., SqlParameter.java:10, DatabricksSession.java:102,110, IDatabricksSession.java:21,24). This PR silently widens getClient to a nullable return without annotation or Javadoc update. Agents and humans adding future callers will omit the null check — see F3 for the consequence.

Also document that closeConnection is irreversible — the UUID is blacklisted for the lifetime of the factory. The field comment says "permanently closed" but the public method Javadoc only says "permanently closes all HTTP clients" — reads as "closes them once," not "blacklists UUID forever."

Fix:

/**
 * @return the HTTP client, or {@code null} if {@link #closeConnection} has been
 *         called for this context. Callers MUST null-check when the call can
 *         race with connection close (e.g., async tasks).
 */
@Nullable
public IDatabricksHttpClient getClient(
    IDatabricksConnectionContext context, HttpClientType type) { ... }

Add to closeConnection Javadoc: "This is irreversible — the UUID is blacklisted for the lifetime of this factory. Do not call until you're certain the context will never be used again." Same on TelemetryClientFactory.closeTelemetryClient.

@@ -0,0 +1,208 @@
# RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325)

[F17] RCA doc in docs/ is unprecedented and point-in-time (Low)

docs/ currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.

Issues:

  • Sets precedent without an established convention (no docs/rca/README.md policy, no numbering scheme).
  • Hardcodes line numbers ("line 421", "line 172") that will rot on the next edit to DatabricksConnection.close().
  • Content is point-in-time and will never be updated.
  • Duplicates what should live in the PR description / JIRA / issue #1325.

Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g., docs/rca/NNN-title.md + a README.md policy).

Flagged by 3 reviewers (maintainability, agent-compat, language).

*/
@ExtendWith(MockitoExtension.class)
@MockitoSettings(strictness = Strictness.LENIENT)
public class TelemetryHttpClientLeakTest {

[F20] Test hygiene: Strictness.LENIENT, unused stub, empty catch (Low)

  • The @MockitoSettings(strictness = Strictness.LENIENT) annotation silently masks unused-stub bugs — combined with broad any() matchers, tests can pass while stubbing the wrong method. Remove LENIENT and fix any resulting unused-stub errors.
  • when(ctx.getHost()).thenReturn(host) is unused; the production keyOf path uses getHostForOAuth. Would fail under strict Mockito. Remove.
  • The empty catch around when(ctx.getHostUrl()).thenReturn(...) silently swallows checked exceptions during setup. Use doReturn("https://" + host).when(ctx).getHostUrl() to bypass the checked exception cleanly.

@vikrantpuppala (Collaborator)

Code Review Squad Report

Merge Safety Score: 10/100 — CRITICAL RISK
Reviewers: 9/9 delivered (security, architecture, language, ops, performance, test, maintainability, agent-compat, devils-advocate)

Executive Summary

Targets a real cross-thread race from issue #1325 and the defense-in-depth intent is sound. But the implementation has fundamental issues that defeat the fix:

  1. The headline guard does not close the race it claims to fix: the contains() check sits outside computeIfAbsent (flagged by 4 reviewers) — see inline F1.
  2. One unchecked exception from exportAllPendingTelemetryDetails() skips ALL cleanup — F6.
  3. closeConnection() NPEs on null UUID — asymmetric with getClient()'s null guard — F5.
  4. Bounded socket leak traded for unbounded heap leak — two append-only process-wide sets; hits the long-running CRaC workloads this PR targets — F2.
  5. Tests are strictly single-threaded — don't exercise the cross-thread race; two of three are tautological or mis-mocked — F4.
  6. getClient() silently widened to nullable — 7 existing callers are not null-safe — F3.

Recommend blocking merge until F1, F2, F3, F4, F5, F6 are addressed.

Findings by Severity

High (blocker) — posted as inline comments

  • F1 TOCTOU race in getClient() / getTelemetryClient() (4 reviewers)
  • F2 Unbounded heap growth in closed-UUID sets (6 reviewers)
  • F3 getClient() silently nullable; 7 unsafe callers (6 reviewers)
  • F4 Tests don't reproduce the cross-thread race (2 reviewers)
  • F5 closeConnection() NPEs on null UUID (2 reviewers)
  • F6 Reordered close is not fail-safe — one exception re-opens the leak (ops)

Medium — posted as inline comments

  • F7 No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution
  • F8 removeClient(ctx) is a public footgun with zero production callers
  • F9 Reordered close has a concurrency regression under concurrent close of same UUID
  • F10 No regression test for Connection.close() post-condition
  • F12 Rejection log at DEBUG; no metric for dropped telemetry
  • F13 Duplicated registry pattern across two factories

Low — summarized here (not posted inline to reduce noise)

  • F11 The close-ordering fix itself is untested. A future refactor reverting the reorder would pass every test because the guard alone would catch it.
  • F14 TelemetryClientFactory.getTelemetryClient() runs isTelemetryAllowedForConnection() (which can reach into feature-flag infra + HTTP) before the closedConnectionUuids.contains check. Move the closed-UUID check to be the first statement.
  • F15 closedConnectionUuids.add(uuid) happens inside computeIfPresent — which runs AFTER the collector export. During export the UUID isn't marked yet, so events can still submit pool flushes that hit F1's race in the HTTP factory.
  • F16 Missing @Nullable on getClient + irreversibility of closeConnection undocumented (posted inline).
  • F17 RCA doc in docs/ is unprecedented and point-in-time (posted inline).
  • F18 Naming inconsistency: closedConnectionUuids vs closedConnections — subsumed by F13.
  • F19 closeConnection() on DatabricksHttpClientFactory reads as "close a JDBC Connection". Consider markConnectionClosed() / onConnectionClosed().
  • F20 Test hygiene: Strictness.LENIENT, empty catch, unused getHost stub (posted inline).
  • F21 context.getConnectionUuid() called twice per getClient — cache to a local.
  • F22 Issue #1325 rationale repeated ~5× across files. One authoritative comment + one-liners referencing it.
  • F23 Close-latency regression: collector export now runs synchronously on close() before client close. Performance reviewer concluded net latency is unchanged; flag for follow-up if telemetry endpoint ever degrades.
  • F24 TelemetryClient.close() cancels the periodic flush task AFTER flush(true).get(). If the scheduler is mid-dispatch when cancel runs, a late TelemetryPushTask can still land on the pool and hit F1's race.

Verified but No Action

  • Close-sequence reorder does NOT add an extra sync HTTP round-trip (performance reviewer confirmed).
  • No security concerns on auth/TLS/secrets/data exposure.
  • DatabricksDriverFeatureFlagsContextFactory.removeInstance DOES call shutdown() cancelling its scheduler (verified) — the feature-flag scheduled-thread NPE concern from F3 is narrow, not broad.

Suggested Minimum Path to Merge

  1. F1: fuse guard + map op into a single compute(...) (both factories).
  2. F5: null-guard closedConnections.add(uuid) / closedConnectionUuids.add(uuid) (or assert non-null with a documented invariant).
  3. F6: mark UUIDs closed up-front; wrap exportAllPendingTelemetryDetails in try/catch-swallow; wrap DatabricksConnection.close() cleanup chain in try/finally.
  4. F2: replace sets with a bounded Caffeine cache OR flip to an allowlist of open UUIDs (solves F1 + F2 in one change).
  5. F3: return a rejecting IDatabricksHttpClient sentinel (symmetric with NoopTelemetryClient) so all callers stay safe.
  6. F4: add one real cross-thread test with CountDownLatch; rewrite or delete the tautological Test #1 and the mis-mocked Test #3.

Review generated by Code Review Squad (9 parallel specialized reviewers). Feedback welcome in #code-review-squad-feedback.
