208 changes: 208 additions & 0 deletions docs/RCA_SOCKET_LEAK_TELEMETRY_HTTP_CLIENT.md
@@ -0,0 +1,208 @@
# RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325)
[F17] RCA doc in docs/ is unprecedented and point-in-time (Low)

docs/ currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.

Issues:

  • Sets precedent without an established convention (no docs/rca/README.md policy, no numbering scheme).
  • Hardcodes line numbers ("line 421", "line 172") that will rot on the next edit to DatabricksConnection.close().
  • Content is point-in-time and will never be updated.
  • Duplicates what should live in the PR description / JIRA / issue #1325 ([BUG] Leaked Socket prevents CRaC checkpointing).

Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g., docs/rca/NNN-title.md + a README.md policy).

Flagged by 3 reviewers (maintainability, agent-compat, language).


## Problem Statement

After `Connection.close()`, a `DatabricksHttpClient` with type `TELEMETRY` can remain in
`DatabricksHttpClientFactory.instances`, keeping a TCP socket open indefinitely. This prevents
CRaC (Coordinated Restore at Checkpoint) from completing because CRaC requires all sockets to be
closed before a checkpoint can be taken.

**Reporter**: @jnd77 (follow-up to #1233)
**Affected versions**: 3.x (confirmed on 3.3.1)
**Symptom**: Intermittent — depends on timing of telemetry flush tasks relative to connection close.

## Root Cause

The bug is a **cross-thread race condition** in the connection close path involving two independent
mechanisms that can re-create HTTP clients after they've been removed.

### Close Sequence (DatabricksConnection.close())

```
Line 421: session.close()
Line 422: TelemetryClientFactory.closeTelemetryClient(ctx)
Line 423: DatabricksClientConfiguratorManager.removeInstance(ctx)
Line 424: DatabricksDriverFeatureFlagsContextFactory.removeInstance(ctx)
Line 425: DatabricksHttpClientFactory.removeClient(ctx) // removes all HTTP clients
Line 426: DatabricksThreadContextHolder.clearAllContext()
```

### Race Condition 1: TelemetryClient re-creation after close

Inside `TelemetryClientFactory.closeTelemetryClient()`, the ordering was:

1. **Remove the TelemetryClientHolder** from the map via `computeIfPresent` -> calls
`TelemetryClient.close()` -> `flush(true).get()` (synchronous flush)
2. **Export pending TelemetryCollector events** via `collector.exportAllPendingTelemetryDetails()`

Step 2 calls `TelemetryHelper.exportTelemetryLog()` which calls
`TelemetryClientFactory.getTelemetryClient(ctx)`. Since the holder was already removed in Step 1,
`getTelemetryClient()` sees `existing == null` and **creates a brand new TelemetryClient** with a
new periodic flush scheduler. This orphaned client is never closed.
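The mechanism reduces to remove-then-get-or-create on a concurrent map. A minimal, self-contained sketch (generic names, not the driver's actual classes) of how the entry reappears:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Race Condition 1: Step 1 removes the holder, Step 2's export path
// goes back through a get-or-create lookup, and the map entry is re-created.
public class RecreateAfterRemove {
  static final ConcurrentHashMap<String, Object> holders = new ConcurrentHashMap<>();

  // Stands in for getTelemetryClient(): create a client on a map miss.
  static Object getOrCreate(String uuid) {
    return holders.computeIfAbsent(uuid, k -> new Object());
  }

  public static void main(String[] args) {
    getOrCreate("conn-1");                              // client exists
    holders.computeIfPresent("conn-1", (k, v) -> null); // Step 1: holder removed
    getOrCreate("conn-1");                              // Step 2: export re-creates it
    System.out.println(holders.containsKey("conn-1"));  // true -> orphaned client
  }
}
```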

### Race Condition 2: TELEMETRY HTTP client re-creation after removeClient

`TelemetryClient.close()` calls `flush(true).get()` which submits a `TelemetryPushTask` to the
shared 10-thread executor pool. The task calls:

```
TelemetryPushClient.pushEvent()
-> DatabricksHttpClientFactory.getClient(ctx, HttpClientType.TELEMETRY)
```

If this task executes **after** `DatabricksHttpClientFactory.removeClient(ctx)` at line 425,
`computeIfAbsent` creates a **new** `DatabricksHttpClient(TELEMETRY)` that nobody will ever close.
This leaked HTTP client holds an open TCP socket.
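The interleaving can be made deterministic with a latch standing in for scheduler delay. A self-contained sketch (again generic names, not the driver's actual classes):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of Race Condition 2: a push task already submitted to the pool runs
// only after the main thread has removed the client, so computeIfAbsent
// re-creates an entry that no close path will ever reach.
public class DelayedPushTaskRace {
  static final ConcurrentHashMap<String, Object> clients = new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    clients.computeIfAbsent("conn-1:TELEMETRY", k -> new Object());

    CountDownLatch removeDone = new CountDownLatch(1);
    Future<?> pushTask = pool.submit(() -> {
      try {
        removeDone.await(); // delayed TelemetryPushTask analogue
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      clients.computeIfAbsent("conn-1:TELEMETRY", k -> new Object()); // leak
    });

    clients.remove("conn-1:TELEMETRY"); // removeClient(ctx) on the main thread
    removeDone.countDown();
    pushTask.get();
    System.out.println(clients.containsKey("conn-1:TELEMETRY")); // true -> leaked
    pool.shutdown();
  }
}
```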

### Why it's intermittent

The reporter notes the issue is "random." This is because:
- The race window is between `flush().get()` completing on the main thread and the actual
`TelemetryPushTask.run()` executing on the pool thread
- It only triggers when there are pending telemetry events at close time
- GC pauses and CPU scheduling widen or narrow the window

### Previous fix (#1235) and why it was incomplete

PR #1235 fixed the `DatabricksClientConfiguratorManager` leak (SDK connection manager not being
closed). But it did not address:
1. The telemetry client re-creation in `closeTelemetryClient()`
2. The HTTP client re-creation via `computeIfAbsent` after `removeClient()`

## Fix

The fix addresses both race conditions with a defense-in-depth approach:

### Fix 1: TelemetryClientFactory — Prevent TelemetryClient re-creation

**File**: `TelemetryClientFactory.java`

- **Added `closedConnectionUuids` set**: Tracks connection UUIDs that have been closed.
`getTelemetryClient()` checks this set and returns `NoopTelemetryClient` for closed connections
instead of creating a new orphaned `TelemetryClient`.

- **Reordered `closeTelemetryClient()`**: Export pending `TelemetryCollector` events **BEFORE**
closing the `TelemetryClient`. This ensures the export uses the existing client (still in the
holder map) rather than triggering re-creation after the holder is removed.

- The UUID is added to `closedConnectionUuids` inside the `computeIfPresent` lambda so only
connections that actually had a telemetry client get tracked (avoids poisoning the set during
test setup/cleanup).

### Fix 2: DatabricksHttpClientFactory — Prevent HTTP client re-creation

**File**: `DatabricksHttpClientFactory.java`

- **Added `closedConnections` set**: Tracks connection UUIDs that have been permanently closed.

- **New `closeConnection()` method**: Marks the connection as permanently closed and removes all
HTTP clients. After this call, `getClient()` returns `null` for that connection, preventing
`computeIfAbsent` from creating orphaned `DatabricksHttpClient` instances.

- `removeClient()` is unchanged — it still allows re-creation for non-close use cases (e.g.,
client reset/reconnect scenarios used in tests).

### Fix 3: DatabricksConnection — Use permanent close

**File**: `DatabricksConnection.java`

- Changed `removeClient(connectionContext)` to `closeConnection(connectionContext)` to use the
permanent close semantics that prevent HTTP client re-creation.

### Fix 4: TelemetryPushClient — Null guard

**File**: `TelemetryPushClient.java`

- `pushEvent()` now handles `null` return from `getClient()` gracefully (logs and returns early)
instead of throwing NPE. This is the safety net for delayed push tasks that fire after the
connection is closed.

## Reproduction and Verification Plan

### Automated Tests (TelemetryHttpClientLeakTest.java)

Three unit tests reproduce the two race conditions:

#### Test 1: `testGetTelemetryClientAfterCloseReCreatesClient`

Reproduces Race Condition 1.

**Steps**:
1. Create a mock connection context with telemetry enabled
2. Call `getTelemetryClient(ctx)` — creates a `TelemetryClient` in the holder map
3. Call `closeTelemetryClient(ctx)` — removes the holder
4. Call `getTelemetryClient(ctx)` again (simulates what `exportAllPendingTelemetryDetails` does)
5. **Assert**: The returned client should be `NoopTelemetryClient`, not a new `TelemetryClient`

**Before fix**: Returns a new `TelemetryClient` (FAIL — orphaned client created)
**After fix**: Returns `NoopTelemetryClient` (PASS — no leak)
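A sketch of what this test plausibly looks like (the test file itself is not part of the excerpted diff; the mock helper is a hypothetical stand-in and driver imports are omitted):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Hedged sketch of Test 1; TelemetryHttpClientLeakTest itself is not shown in
// this diff, and mockTelemetryEnabledContext() stands in for whatever mock
// setup the real test uses.
class TelemetryClientLeakSketch {
  @Test
  void getTelemetryClientAfterCloseReturnsNoop() {
    IDatabricksConnectionContext ctx = mockTelemetryEnabledContext();
    TelemetryClientFactory factory = TelemetryClientFactory.getInstance();

    factory.getTelemetryClient(ctx);   // step 2: creates a TelemetryClient holder
    factory.closeTelemetryClient(ctx); // step 3: removes the holder, marks UUID closed

    // Steps 4-5: simulates a late exportAllPendingTelemetryDetails() call.
    assertTrue(factory.getTelemetryClient(ctx) instanceof NoopTelemetryClient);
  }

  private IDatabricksConnectionContext mockTelemetryEnabledContext() {
    throw new UnsupportedOperationException("stand-in for the real mock setup");
  }
}
```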

#### Test 2: `testGetClientReturnsNullAfterCloseConnection`

Reproduces Race Condition 2.

**Steps**:
1. Create a mock connection context
2. Call `DatabricksHttpClientFactory.closeConnection(ctx)` (simulates `DatabricksConnection.close()`)
3. Call `getClient(ctx, HttpClientType.TELEMETRY)` (simulates delayed `TelemetryPushTask`)
4. **Assert**: Returns `null` (not a new `DatabricksHttpClient`)

**Before fix**: Creates a new `DatabricksHttpClient` via `computeIfAbsent` (FAIL — leaked socket)
**After fix**: Returns `null` (PASS — no leak)

#### Test 3: `testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient`

End-to-end test with pending telemetry collector events.

**Steps**:
1. Create a telemetry client and record pending latency events in `TelemetryCollector`
2. Mock `exportTelemetryLog` to call `getTelemetryClient(ctx)` (simulating the real export path)
3. Call `closeTelemetryClient(ctx)` which triggers the collector export
4. **Assert**: No new `TelemetryClient` holders exist after close

### Running the tests

```bash
# Run just the leak reproduction tests
mvn test -pl jdbc-core -Dtest=TelemetryHttpClientLeakTest -Djacoco.skip=true

# Run all telemetry tests (existing + new)
mvn test -pl jdbc-core -Dtest="TelemetryClientFactoryTest,TelemetryClientTest,TelemetryPushClientTest,TelemetryCollectorManagerTest,TelemetryCollectorTest,TelemetryHelperTest,TelemetryHttpClientLeakTest" -Djacoco.skip=true

# Run full unit test suite
mvn test -pl jdbc-core -Djacoco.skip=true -Dgroups='!Jvm17PlusAndArrowToNioReflectionDisabled'
```

### Manual verification (with CRaC-enabled JDK)

Use the reporter's reproducer from issue #1233 to verify 0 sockets remain after close:

1. Build the driver: `mvn clean install -DskipTests`
2. Set environment variables:
```bash
export DATABRICKS_AUTH_TOKEN=<token>
export DATABRICKS_CONNECTION_STRING="jdbc:databricks://<host>:443/default;transportMode=http;ssl=1;httpPath=<path>;AuthMech=3;UID=token"
```
3. Run the socket leak reproducer (from issue #1233) which:
- Opens a connection, executes `SELECT 1`, closes the connection
- Calls `GlobalAsyncHttpClient.releaseClient()`
- Checks for remaining TCP sockets via `ss -tnp state established dst :443`
4. **Expected**: 0 sockets after close
5. Run the CRaC checkpoint reproducer:
- Same steps but calls `Core.checkpointRestore()` after close
- **Expected**: Checkpoint succeeds without `CheckpointOpenSocketException`

### Regression testing

The fix does not change any public API or behavior for active connections. It only prevents
resource re-creation after close. The full unit test suite (3085 tests) passes with 0 failures.

## Files Changed

| File | Change |
|------|--------|
| `TelemetryClientFactory.java` | Added `closedConnectionUuids` guard, reordered close sequence |
| `DatabricksHttpClientFactory.java` | Added `closedConnections` guard, new `closeConnection()` method |
| `DatabricksConnection.java` | Use `closeConnection()` instead of `removeClient()` |
| `TelemetryPushClient.java` | Null guard for `getClient()` return value |
| `TelemetryHttpClientLeakTest.java` | 3 reproduction tests |
DatabricksConnection.java
@@ -422,7 +422,7 @@ public void close() throws SQLException {
TelemetryClientFactory.getInstance().closeTelemetryClient(connectionContext);
DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
- DatabricksHttpClientFactory.getInstance().removeClient(connectionContext);
+ DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
[F10] No regression test for the Connection.close() post-condition (Medium)

This one-line change alters observable behavior: after connection.close(), DatabricksHttpClientFactory.getInstance().getClient(ctx, TELEMETRY) now returns null (previously always returned a new client). Telemetry pushes submitted during close are now silently dropped.

No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered.

Fix: Add one focused assertion in the existing DatabricksConnection test class:

```java
@Test
void getClientReturnsNullAfterConnectionClose() {
  DatabricksConnection conn = ...;
  conn.close();
  assertNull(DatabricksHttpClientFactory.getInstance()
      .getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}
```

Flagged by test reviewer.

DatabricksThreadContextHolder.clearAllContext();
}

DatabricksHttpClientFactory.java
@@ -8,6 +8,7 @@
import com.databricks.jdbc.log.JdbcLogger;
import com.databricks.jdbc.log.JdbcLoggerFactory;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DatabricksHttpClientFactory {
@@ -17,6 +18,14 @@ public class DatabricksHttpClientFactory {
private final ConcurrentHashMap<SimpleEntry<String, HttpClientType>, DatabricksHttpClient>
instances = new ConcurrentHashMap<>();

/**
* Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from
* re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard,
* delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets.
* See GitHub issue #1325.
[F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)

closedConnections is a process-wide ConcurrentHashMap.newKeySet() that is append-only for the JVM lifetime — no TTL, no cap, no eviction, no removal path. TelemetryClientFactory.closedConnectionUuids has the same shape and clears only in reset() (test-only).

Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. 100 closes/sec (realistic for HikariCP under load) → ~2 GB/day, on the order of 60 GB/month. This hits exactly the long-lived CRaC workloads this PR targets.

Fix options (in preference order):

  1. Bound via Caffeine Cache with expireAfterWrite(5-10 minutes) — the race window is seconds, not JVM lifetime (sketched below).
  2. Flip to an allowlist: insert UUID in getClient on first use, remove in closeConnection. Bounded to live connections and solves F1's TOCTOU for free.
  3. Encode the tombstone in the existing instances map.

Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture).
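A sketch of option 1, assuming a Caffeine dependency is acceptable (field name and bounds illustrative):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Time-bounded tombstone set: entries expire long after the seconds-wide race
// window has passed, keeping the set sized to recently closed connections.
final Cache<String, Boolean> closedConnections =
    Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10))
        .maximumSize(10_000) // hard cap as a second line of defense
        .build();

// closeConnection():   closedConnections.put(uuid, Boolean.TRUE);
// getClient() guard:   if (uuid != null && closedConnections.getIfPresent(uuid) != null) return null;
```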

*/
private final Set<String> closedConnections = ConcurrentHashMap.newKeySet();

private DatabricksHttpClientFactory() {
// Private constructor to prevent instantiation
}
@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) {

public IDatabricksHttpClient getClient(
[F16] @Nullable annotation + irreversibility are undocumented (Low/Medium)

The codebase already uses javax.annotation.Nullable (e.g., SqlParameter.java:10, DatabricksSession.java:102,110, IDatabricksSession.java:21,24). This PR silently widens getClient to a nullable return without annotation or Javadoc update. Agents and humans adding future callers will omit the null check — see F3 for the consequence.

Also document that closeConnection is irreversible — the UUID is blacklisted for the lifetime of the factory. The field comment says "permanently closed" but the public method Javadoc only says "permanently closes all HTTP clients" — reads as "closes them once," not "blacklists UUID forever."

Fix:

```java
/**
 * @return the HTTP client, or {@code null} if {@link #closeConnection} has been
 *         called for this context. Callers MUST null-check when the call can
 *         race with connection close (e.g., async tasks).
 */
@Nullable
public IDatabricksHttpClient getClient(
    IDatabricksConnectionContext context, HttpClientType type) { ... }
```

Add to closeConnection Javadoc: "This is irreversible — the UUID is blacklisted for the lifetime of this factory. Do not call until you're certain the context will never be used again." Same on TelemetryClientFactory.closeTelemetryClient.

IDatabricksConnectionContext context, HttpClientType type) {
// Prevent creating new HTTP clients for connections that have been closed.
// This guards against delayed TelemetryPushTask executions that call
// getClient(ctx, TELEMETRY) after removeClient(ctx) has already run.
String connectionUuid = context.getConnectionUuid();
if (connectionUuid != null && closedConnections.contains(connectionUuid)) {
LOGGER.debug(
[F3] getClient() silently widened to nullable return — 7 unverified callers (High)

Pre-PR getClient() was non-null. Post-PR it can return null. Only TelemetryPushClient.pushEvent() was updated. Verified call sites that store or dereference without a null check:

| File:line | Usage | Null-safe? |
|-----------|-------|------------|
| DatabricksThriftAccessor.java:662 | passed to DatabricksHttpTTransport ctor | No |
| DatabricksTokenFederationProvider.java:74 | stored in this.hc | No |
| AzureMSICredentialProvider.java:43 | stored in this.httpClient | No |
| PrivateKeyClientCredentialProvider.java:27 | stored | No |
| DatabricksDriverFeatureFlagsContext.java:92 | runs on a background scheduler; execute() caught by catch(Exception) at TRACE → silent failure of FF refresh | No |
| ArrowStreamResult.java:53, 153 | ctor param → NPE later in chunk download | No |

No @Nullable annotation despite the codebase using javax.annotation.Nullable elsewhere (SqlParameter.java:10, DatabricksSession.java:102). Contract change is invisible to callers.

Fix (preferred): return a rejecting sentinel IDatabricksHttpClient whose execute() throws ConnectionClosedException (symmetric to NoopTelemetryClient). All callers stay safe by default.
Alternative: split into getClient() (throws) + getClientOrNull() (used only by TelemetryPushClient), or add @Nullable + Javadoc and null-check every caller.

Flagged by 6 reviewers.

"Rejecting getClient() for closed connection {} with type {}",
context.getConnectionUuid(),
type);
[F1] TOCTOU race — the PR does not actually close the race it claims to fix (High)

The guard reads closedConnections.contains(uuid) then falls through to instances.computeIfAbsent(...). These are two separate operations on two separate maps with no mutual exclusion. Interleaving:

  1. Thread A (delayed TelemetryPushTask on the 10-thread pool): evaluates closedConnections.contains(uuid) → false.
  2. Thread B (DatabricksConnection.close()): runs closeConnection() — adds UUID, wipes instances.
  3. Thread A resumes computeIfAbsent → creates a new DatabricksHttpClient(TELEMETRY) → leaks a TCP socket.

This is literally the race the RCA describes on the old code. The new code narrows but does not close it.

Fix — fuse guard and map op atomically:

```java
String uuid = context.getConnectionUuid();
return instances.compute(
    getClientKey(uuid, type),
    (k, existing) -> {
      if (existing != null) return existing;
      if (uuid != null && closedConnections.contains(uuid)) return null;
      return new DatabricksHttpClient(context, type);
    });
```

Apply the same pattern in TelemetryClientFactory.getTelemetryClient().

Flagged independently by 4 reviewers (language, security, performance, devils-advocate).

return null;
}
return instances.computeIfAbsent(
getClientKey(context.getConnectionUuid(), type),
k -> new DatabricksHttpClient(context, type));
}

/**
[F8] removeClient(ctx) is a public footgun with zero production callers (Medium)

After this PR, the single-argument removeClient(IDatabricksConnectionContext) overload has exactly one caller in the repo: DatabricksHttpClientTest.java:240 (a test). All production call sites were migrated to closeConnection() by this PR.

The RCA explicitly identifies removeClient's "allows re-creation via computeIfAbsent" behavior as the root cause of #1325. Shipping two nearly-identically-named public methods where one silently re-enables the bug this PR fixes is a future-maintainer trap — the name "removeClient" reads as the obvious choice on IDE autocomplete.

Fix: Delete the single-argument overload and migrate the test; or mark it @VisibleForTesting / @Deprecated with a pointer to closeConnection, as sketched below. At minimum, rename to signal intent (e.g., evictClientsForReconnect).

Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat).
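A sketch of the minimum fix (Javadoc wording illustrative; method body as in this diff):

```java
/**
 * Removes and closes all HTTP clients for the given connection WITHOUT
 * tombstoning it; a later getClient() call re-creates the client.
 *
 * @deprecated prefer {@link #closeConnection} when tearing a connection down;
 *     this overload exists only for reset/reconnect scenarios in tests.
 */
@Deprecated
public void removeClient(IDatabricksConnectionContext context) {
  for (HttpClientType type : HttpClientType.values()) {
    removeClient(context, type);
  }
}
```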

* Removes and closes all HTTP clients for the given connection. Does NOT mark the connection as
* closed — the client can be re-created by a subsequent getClient() call.
*/
public void removeClient(IDatabricksConnectionContext context) {
for (HttpClientType type : HttpClientType.values()) {
removeClient(context, type);
}
}

/**
[F7] No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution (Medium)

TelemetryClientFactory.reset() (in this PR at the hunk around line 243) clears closedConnectionUuids. DatabricksHttpClientFactory has no equivalent. It's a JVM singleton and closedConnections is an instance field.

Consequence: any test that calls closeConnection(ctx) permanently adds UUID to closedConnections for the rest of the test JVM. A later test reusing the same UUID gets null from getClient() with no symptom except "my test fails mysteriously when run after another test."

Fix: Add @VisibleForTesting void reset() on DatabricksHttpClientFactory that clears both instances and closedConnections. Call in @BeforeEach/@AfterEach of TelemetryHttpClientLeakTest.
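A sketch of the missing hook (assumed shape; a more careful version would close each cached client before dropping it, mirroring removeClient()):

```java
@VisibleForTesting
void reset() {
  instances.clear();          // drop cached HTTP clients
  closedConnections.clear();  // clear tombstones so tests can reuse UUIDs
}
```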

* Permanently closes all HTTP clients for the given connection and prevents new ones from being
* created. This should be called from DatabricksConnection.close() to prevent delayed
[F5] closeConnection() NPEs on null UUID — inconsistent with getClient()'s null guard (High)

ConcurrentHashMap.newKeySet() rejects null elements — closedConnections.add(null) throws NPE. Yet getClient() in the same file at the new lines if (connectionUuid != null && closedConnections.contains(connectionUuid)) explicitly null-guards, implying nulls are considered possible. The two paths disagree on the invariant.

Same issue in TelemetryClientFactory.closeTelemetryClient at the closedConnectionUuids.add(connectionUuid) calls inside the computeIfPresent lambdas — no null-guard.

Fix: Pick one invariant. Either (a) assert non-null in closeConnection/closeTelemetryClient and document the constraint on IDatabricksConnectionContext.getConnectionUuid(), or (b) mirror the getClient guard:

```java
String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);
```

Flagged by 2 reviewers (architecture, language).

* TelemetryPushTask executions from creating orphaned HTTP clients (issue #1325).
*/
public void closeConnection(IDatabricksConnectionContext context) {
closedConnections.add(context.getConnectionUuid());
removeClient(context);
}

public void removeClient(IDatabricksConnectionContext context, HttpClientType type) {
DatabricksHttpClient instance =
instances.remove(getClientKey(context.getConnectionUuid(), type));
TelemetryClientFactory.java
@@ -34,6 +34,14 @@ public class TelemetryClientFactory {
@VisibleForTesting
final Map<String, TelemetryClientHolder> noauthTelemetryClientHolders = new ConcurrentHashMap<>();

/**
* Tracks connection UUIDs that have been closed. When a connection is closed, its UUID is added
* here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector
* exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This
* prevents the socket leak described in GitHub issue #1325.
*/
[F13] Duplicated registry pattern across two factories (Medium)

closedConnectionUuids here and closedConnections in DatabricksHttpClientFactory are the same concept implemented twice with different names. Both are ConcurrentHashMap.newKeySet() of connection UUIDs, both checked before compute* in the same way, both added on close. Neither is bounded.

Fix: Extract a ClosedConnectionRegistry helper in com.databricks.jdbc.common with markClosed(uuid) / isClosed(uuid) / clear(). Back it with a bounded cache (Caffeine, e.g. expireAfterWrite(5 min) or max ~10K entries LRU). Both factories hold a reference (or share a singleton). Solves the duplication AND F2's memory growth in one change.

Flagged by 3 reviewers (maintainability, agent-compat, ops).
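A sketch of the proposed helper (hypothetical class; the backing set could be swapped for the bounded Caffeine cache from F2 without touching callers):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Single registry shared by both factories instead of two parallel key sets.
public final class ClosedConnectionRegistry {
  private static final ClosedConnectionRegistry INSTANCE = new ClosedConnectionRegistry();
  private final Set<String> closedUuids = ConcurrentHashMap.newKeySet();

  public static ClosedConnectionRegistry getInstance() {
    return INSTANCE;
  }

  public void markClosed(String uuid) {
    if (uuid != null) { // tolerate null UUIDs uniformly (see F5)
      closedUuids.add(uuid);
    }
  }

  public boolean isClosed(String uuid) {
    return uuid != null && closedUuids.contains(uuid);
  }

  // Test-only reset hook.
  public void clear() {
    closedUuids.clear();
  }
}
```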

@VisibleForTesting final Set<String> closedConnectionUuids = ConcurrentHashMap.newKeySet();

private final ExecutorService telemetryExecutorService;
private ScheduledExecutorService sharedSchedulerService;

@@ -78,6 +86,14 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connectionContext) {
if (!isTelemetryAllowedForConnection(connectionContext)) {
return NoopTelemetryClient.getInstance();
}
// Prevent re-creation of TelemetryClient for connections that have been closed.
// Without this guard, code paths that call getTelemetryClient() after
// closeTelemetryClient() (e.g., TelemetryCollector.exportAllPendingTelemetryDetails
// or delayed TelemetryPushTask flush) would create an orphaned TelemetryClient
// whose periodic flush creates leaked TELEMETRY HTTP clients (issue #1325).
if (closedConnectionUuids.contains(connectionContext.getConnectionUuid())) {
return NoopTelemetryClient.getInstance();
}
DatabricksConfig databricksConfig =
TelemetryHelper.getDatabricksConfigSafely(connectionContext);
if (databricksConfig != null) {
@@ -137,41 +153,52 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connectionContext) {
/**
* Closes telemetry client for a connection. Thread-safe: computeIfPresent ensures atomic locking,
* preventing race conditions between connection removal and addition.
*
* <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient()
* from re-creating a TelemetryClient during or after the close sequence. Pending
* TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are
[F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium)

New ordering: (1) removeCollector, (2) exportAllPendingTelemetryDetails() which calls getTelemetryClient(ctx) through the factory, (3) computeIfPresent removes holder and marks UUID closed.

Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):

  • Thread A starts step 2's export loop.
  • Thread B completes step 3 first (removes holder, marks UUID closed).
  • Thread A's next getTelemetryClient(ctx) now returns NoopTelemetryClient → events silently dropped.
  • Or a concurrent re-open races in between, and step 2's getTelemetryClient creates a new TelemetryClient (the exact original bug).

Old ordering was order-insensitive because holder-remove+close was atomic under computeIfPresent.

Fix: Mark UUID closed BEFORE export, and export via a directly-held TelemetryClient reference (either look it up from the holder map first, or have the removal return the client). Don't route the export back through the factory.

Flagged by architecture reviewer.
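A sketch of that shape using the names from this diff (the client-accepting export overload is hypothetical):

```java
// Mark closed first, then export through a directly-held reference.
closedConnectionUuids.add(connectionUuid); // tombstone before any export
TelemetryClientHolder holder = telemetryClientHolders.get(key);
ITelemetryClient clientForExport = (holder != null) ? holder.client : null;
TelemetryCollector collector =
    TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null && clientForExport != null) {
  // Hypothetical overload taking an explicit client, so a concurrent close of
  // the same UUID cannot swap in NoopTelemetryClient mid-export.
  collector.exportAllPendingTelemetryDetails(clientForExport);
}
```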

* flushed through the existing client. See GitHub issue #1325.
*/
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
String key = TelemetryHelper.keyOf(connectionContext);
[F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High)

The new ordering runs exportAllPendingTelemetryDetails() BEFORE the two computeIfPresent blocks that (a) mark UUID closed, (b) close the client, (c) remove the holder.

If exportAllPendingTelemetryDetails() or the exportTelemetryLog it calls throws any unchecked exception, execution exits closeTelemetryClient and propagates up to DatabricksConnection.close(). Consequences:

  • TelemetryClient stays in the holder map with its periodic flush scheduler alive
  • closedConnectionUuids never updated → future getTelemetryClient() won't return Noop
  • DatabricksHttpClientFactory.closeConnection() on the next line of DatabricksConnection.close() never runs

One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer).

Fix:

  1. Mark UUIDs closed unconditionally up-front (before export).
  2. Run holder cleanup.
  3. Run exportAllPendingTelemetryDetails last, wrapped in try/catch that logs-and-swallows.
  4. In DatabricksConnection.close() wrap the cleanup chain in try/finally so closeConnection(ctx) always runs even if earlier steps throw (sketched below).

Flagged by ops reviewer.
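A sketch of item 4, assuming the close() shape shown in the excerpt above:

```java
public void close() throws SQLException {
  try {
    session.close();
    TelemetryClientFactory.getInstance().closeTelemetryClient(connectionContext);
    DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
  } finally {
    // Always blacklist the UUID and tear down HTTP clients, even if an
    // export-time exception propagates from the telemetry close above.
    DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
    DatabricksThreadContextHolder.clearAllContext();
  }
}
```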

String connectionUuid = connectionContext.getConnectionUuid();
// Atomically remove connection and close client if no connections remain for this key

// Export pending TelemetryCollector events BEFORE closing the TelemetryClient.
// This ensures the export uses the existing TelemetryClient (via the holder map)
// rather than triggering re-creation after the holder is removed.
TelemetryCollector collector =
TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null) {
collector.exportAllPendingTelemetryDetails();
}

// Mark the connection as closed to prevent getTelemetryClient() from re-creating a
// TelemetryClient if called by delayed flush tasks or collector exports (issue #1325).
// This is done inside computeIfPresent so it only applies to connections that actually
// had a telemetry client registered.
telemetryClientHolders.computeIfPresent(
key,
(k, holder) -> {
holder.connectionUuids.remove(connectionUuid);
closedConnectionUuids.add(connectionUuid);
if (holder.connectionUuids.isEmpty()) {
closeTelemetryClient(holder.client, "telemetry client");
return null;
}
return holder;
});
// Atomically remove connection and close client if no connections remain for this key
noauthTelemetryClientHolders.computeIfPresent(
key,
(k, holder) -> {
holder.connectionUuids.remove(connectionUuid);
closedConnectionUuids.add(connectionUuid);
if (holder.connectionUuids.isEmpty()) {
closeTelemetryClient(holder.client, "unauthenticated telemetry client");
return null;
}
return holder;
});

// Export and remove the TelemetryCollector for this connection
TelemetryCollector collector =
TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null) {
// Export any remaining telemetry before removing
collector.exportAllPendingTelemetryDetails();
}

// Clean up cached connection parameters to prevent memory leaks
TelemetryHelper.removeConnectionParameters(connectionContext.getConnectionUuid());
}
@@ -216,6 +243,7 @@ public void reset() {
// Clear the maps
telemetryClientHolders.clear();
noauthTelemetryClientHolders.clear();
closedConnectionUuids.clear();

// Clear cached connection parameters
TelemetryHelper.clearConnectionParameterCache();
TelemetryPushClient.java
@@ -47,6 +47,11 @@ public void pushEvent(TelemetryRequest request) throws Exception {
IDatabricksHttpClient httpClient =
DatabricksHttpClientFactory.getInstance()
.getClient(connectionContext, HttpClientType.TELEMETRY);
if (httpClient == null) {
// Connection was closed — HTTP client factory rejected the request to prevent socket leaks.
LOGGER.debug("Skipping telemetry push: connection has been closed");
[F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium)

Two invisibility paths added by this PR:

  1. DatabricksHttpClientFactory.getClient — rejection logged at LOGGER.debug("Rejecting getClient() for closed connection {} with type {}"). Below default production log level. If a non-TELEMETRY caller hits this path due to a future bug or code change, operators never see it.
  2. TelemetryPushClient.pushEvent on null — LOGGER.debug("Skipping telemetry push: connection has been closed"). In aggressive close scenarios (shutdown, failover, pool eviction), every dropped push is invisible. No counter. Operators cannot answer "is telemetry working?"

Fix:

  • Raise the getClient rejection log to WARN for non-TELEMETRY types (preserves DEBUG for the expected path).
  • Add an AtomicLong droppedOnClosedCount on the factory; emit a periodic aggregated WARN when > 0, or expose via JMX/existing metrics (sketched below).
  • Include Thread.currentThread().getName() in the rejection log for diagnosis.

Flagged by 3 reviewers (ops, agent-compat, maintainability).
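A sketch of the counter (field name and log wording illustrative; the logger's placeholder style matches the debug call in this diff):

```java
import java.util.concurrent.atomic.AtomicLong;

// On DatabricksHttpClientFactory:
private final AtomicLong droppedOnClosedCount = new AtomicLong();

// In getClient(), on the rejection path:
long totalDrops = droppedOnClosedCount.incrementAndGet();
if (type != HttpClientType.TELEMETRY) {
  LOGGER.warn(
      "Rejecting getClient() for closed connection {} (type {}, thread {}, totalDrops {})",
      connectionUuid, type, Thread.currentThread().getName(), totalDrops);
}
```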

return;
}
String path =
isAuthenticated
? PathConstants.TELEMETRY_PATH