208 changes: 208 additions & 0 deletions docs/RCA_SOCKET_LEAK_TELEMETRY_HTTP_CLIENT.md
@@ -0,0 +1,208 @@
# RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325)
[F17] RCA doc in docs/ is unprecedented and point-in-time (Low)

docs/ currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.

Issues:

  • Sets precedent without an established convention (no docs/rca/README.md policy, no numbering scheme).
  • Hardcodes line numbers ("line 421", "line 172") that will rot on the next edit to DatabricksConnection.close().
  • Content is point-in-time and will never be updated.
  • Duplicates what should live in the PR description / JIRA / issue #1325 ([BUG] Leaked Socket prevents CRaC checkpointing).

Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g., docs/rca/NNN-title.md + a README.md policy).

Flagged by 3 reviewers (maintainability, agent-compat, language).


## Problem Statement

After `Connection.close()`, a `DatabricksHttpClient` with type `TELEMETRY` can remain in
`DatabricksHttpClientFactory.instances`, keeping a TCP socket open indefinitely. This prevents
CRaC (Coordinated Restore at Checkpoint) from completing because CRaC requires all sockets to be
closed before a checkpoint can be taken.

**Reporter**: @jnd77 (follow-up to #1233)
**Affected versions**: 3.x (confirmed on 3.3.1)
**Symptom**: Intermittent — depends on timing of telemetry flush tasks relative to connection close.

## Root Cause

The bug is a **cross-thread race condition** in the connection close path involving two independent
mechanisms that can re-create HTTP clients after they've been removed.

### Close Sequence (DatabricksConnection.close())

```
Line 421: session.close()
Line 422: TelemetryClientFactory.closeTelemetryClient(ctx)
Line 423: DatabricksClientConfiguratorManager.removeInstance(ctx)
Line 424: DatabricksDriverFeatureFlagsContextFactory.removeInstance(ctx)
Line 425: DatabricksHttpClientFactory.removeClient(ctx) // removes all HTTP clients
Line 426: DatabricksThreadContextHolder.clearAllContext()
```

### Race Condition 1: TelemetryClient re-creation after close

Inside `TelemetryClientFactory.closeTelemetryClient()`, the ordering was:

1. **Remove the TelemetryClientHolder** from the map via `computeIfPresent` -> calls
`TelemetryClient.close()` -> `flush(true).get()` (synchronous flush)
2. **Export pending TelemetryCollector events** via `collector.exportAllPendingTelemetryDetails()`

Step 2 calls `TelemetryHelper.exportTelemetryLog()` which calls
`TelemetryClientFactory.getTelemetryClient(ctx)`. Since the holder was already removed in Step 1,
`getTelemetryClient()` sees `existing == null` and **creates a brand new TelemetryClient** with a
new periodic flush scheduler. This orphaned client is never closed.
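The mechanism reduces to remove-then-get-or-create on a concurrent map. A minimal, self-contained sketch (generic names, not the driver's actual classes) of how the entry reappears:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of Race Condition 1: Step 1 removes the holder, Step 2's export path
// goes back through a get-or-create lookup, and the map entry is re-created.
public class RecreateAfterRemove {
  static final ConcurrentHashMap<String, Object> holders = new ConcurrentHashMap<>();

  // Stands in for getTelemetryClient(): create a client on a map miss.
  static Object getOrCreate(String uuid) {
    return holders.computeIfAbsent(uuid, k -> new Object());
  }

  public static void main(String[] args) {
    getOrCreate("conn-1");                              // client exists
    holders.computeIfPresent("conn-1", (k, v) -> null); // Step 1: holder removed
    getOrCreate("conn-1");                              // Step 2: export re-creates it
    System.out.println(holders.containsKey("conn-1"));  // true -> orphaned client
  }
}
```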

### Race Condition 2: TELEMETRY HTTP client re-creation after removeClient

`TelemetryClient.close()` calls `flush(true).get()` which submits a `TelemetryPushTask` to the
shared 10-thread executor pool. The task calls:

```
TelemetryPushClient.pushEvent()
-> DatabricksHttpClientFactory.getClient(ctx, HttpClientType.TELEMETRY)
```

If this task executes **after** `DatabricksHttpClientFactory.removeClient(ctx)` at line 425,
`computeIfAbsent` creates a **new** `DatabricksHttpClient(TELEMETRY)` that nobody will ever close.
This leaked HTTP client holds an open TCP socket.
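The interleaving can be made deterministic with a latch standing in for scheduler delay. A self-contained sketch (again generic names, not the driver's actual classes):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of Race Condition 2: a push task already submitted to the pool runs
// only after the main thread has removed the client, so computeIfAbsent
// re-creates an entry that no close path will ever reach.
public class DelayedPushTaskRace {
  static final ConcurrentHashMap<String, Object> clients = new ConcurrentHashMap<>();

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    clients.computeIfAbsent("conn-1:TELEMETRY", k -> new Object());

    CountDownLatch removeDone = new CountDownLatch(1);
    Future<?> pushTask = pool.submit(() -> {
      try {
        removeDone.await(); // delayed TelemetryPushTask analogue
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      clients.computeIfAbsent("conn-1:TELEMETRY", k -> new Object()); // leak
    });

    clients.remove("conn-1:TELEMETRY"); // removeClient(ctx) on the main thread
    removeDone.countDown();
    pushTask.get();
    System.out.println(clients.containsKey("conn-1:TELEMETRY")); // true -> leaked
    pool.shutdown();
  }
}
```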

### Why it's intermittent

The reporter notes the issue is "random." This is because:
- The race window is between `flush().get()` completing on the main thread and the actual
`TelemetryPushTask.run()` executing on the pool thread
- It only triggers when there are pending telemetry events at close time
- GC pauses and CPU scheduling widen or narrow the window

### Previous fix (#1235) and why it was incomplete

PR #1235 fixed the `DatabricksClientConfiguratorManager` leak (SDK connection manager not being
closed). But it did not address:
1. The telemetry client re-creation in `closeTelemetryClient()`
2. The HTTP client re-creation via `computeIfAbsent` after `removeClient()`

## Fix

The fix addresses both race conditions with a defense-in-depth approach:

### Fix 1: TelemetryClientFactory — Prevent TelemetryClient re-creation

**File**: `TelemetryClientFactory.java`

- **Added `closedConnectionUuids` set**: Tracks connection UUIDs that have been closed.
`getTelemetryClient()` checks this set and returns `NoopTelemetryClient` for closed connections
instead of creating a new orphaned `TelemetryClient`.

- **Reordered `closeTelemetryClient()`**: Export pending `TelemetryCollector` events **BEFORE**
closing the `TelemetryClient`. This ensures the export uses the existing client (still in the
holder map) rather than triggering re-creation after the holder is removed.

- The UUID is added to `closedConnectionUuids` inside the `computeIfPresent` lambda so only
connections that actually had a telemetry client get tracked (avoids poisoning the set during
test setup/cleanup).

### Fix 2: DatabricksHttpClientFactory — Prevent HTTP client re-creation

**File**: `DatabricksHttpClientFactory.java`

- **Added `closedConnections` set**: Tracks connection UUIDs that have been permanently closed.

- **New `closeConnection()` method**: Marks the connection as permanently closed and removes all
HTTP clients. After this call, `getClient()` returns `null` for that connection, preventing
`computeIfAbsent` from creating orphaned `DatabricksHttpClient` instances.

- `removeClient()` is unchanged — it still allows re-creation for non-close use cases (e.g.,
client reset/reconnect scenarios used in tests).

### Fix 3: DatabricksConnection — Use permanent close

**File**: `DatabricksConnection.java`

- Changed `removeClient(connectionContext)` to `closeConnection(connectionContext)` to use the
permanent close semantics that prevent HTTP client re-creation.

### Fix 4: TelemetryPushClient — Null guard

**File**: `TelemetryPushClient.java`

- `pushEvent()` now handles `null` return from `getClient()` gracefully (logs and returns early)
instead of throwing NPE. This is the safety net for delayed push tasks that fire after the
connection is closed.

## Reproduction and Verification Plan

### Automated Tests (TelemetryHttpClientLeakTest.java)

Three unit tests reproduce the two race conditions:

#### Test 1: `testGetTelemetryClientAfterCloseReCreatesClient`

Reproduces Race Condition 1.

**Steps**:
1. Create a mock connection context with telemetry enabled
2. Call `getTelemetryClient(ctx)` — creates a `TelemetryClient` in the holder map
3. Call `closeTelemetryClient(ctx)` — removes the holder
4. Call `getTelemetryClient(ctx)` again (simulates what `exportAllPendingTelemetryDetails` does)
5. **Assert**: The returned client should be `NoopTelemetryClient`, not a new `TelemetryClient`

**Before fix**: Returns a new `TelemetryClient` (FAIL — orphaned client created)
**After fix**: Returns `NoopTelemetryClient` (PASS — no leak)
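A sketch of what this test plausibly looks like (the test file itself is not part of the excerpted diff; the mock helper is a hypothetical stand-in and driver imports are omitted):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;
import org.junit.jupiter.api.Test;

// Hedged sketch of Test 1; TelemetryHttpClientLeakTest itself is not shown in
// this diff, and mockTelemetryEnabledContext() stands in for whatever mock
// setup the real test uses.
class TelemetryClientLeakSketch {
  @Test
  void getTelemetryClientAfterCloseReturnsNoop() {
    IDatabricksConnectionContext ctx = mockTelemetryEnabledContext();
    TelemetryClientFactory factory = TelemetryClientFactory.getInstance();

    factory.getTelemetryClient(ctx);   // step 2: creates a TelemetryClient holder
    factory.closeTelemetryClient(ctx); // step 3: removes the holder, marks UUID closed

    // Steps 4-5: simulates a late exportAllPendingTelemetryDetails() call.
    assertTrue(factory.getTelemetryClient(ctx) instanceof NoopTelemetryClient);
  }

  private IDatabricksConnectionContext mockTelemetryEnabledContext() {
    throw new UnsupportedOperationException("stand-in for the real mock setup");
  }
}
```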

#### Test 2: `testGetClientReturnsNullAfterCloseConnection`

Reproduces Race Condition 2.

**Steps**:
1. Create a mock connection context
2. Call `DatabricksHttpClientFactory.closeConnection(ctx)` (simulates `DatabricksConnection.close()`)
3. Call `getClient(ctx, HttpClientType.TELEMETRY)` (simulates delayed `TelemetryPushTask`)
4. **Assert**: Returns `null` (not a new `DatabricksHttpClient`)

**Before fix**: Creates a new `DatabricksHttpClient` via `computeIfAbsent` (FAIL — leaked socket)
**After fix**: Returns `null` (PASS — no leak)

#### Test 3: `testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient`

End-to-end test with pending telemetry collector events.

**Steps**:
1. Create a telemetry client and record pending latency events in `TelemetryCollector`
2. Mock `exportTelemetryLog` to call `getTelemetryClient(ctx)` (simulating the real export path)
3. Call `closeTelemetryClient(ctx)` which triggers the collector export
4. **Assert**: No new `TelemetryClient` holders exist after close

### Running the tests

```bash
# Run just the leak reproduction tests
mvn test -pl jdbc-core -Dtest=TelemetryHttpClientLeakTest -Djacoco.skip=true

# Run all telemetry tests (existing + new)
mvn test -pl jdbc-core -Dtest="TelemetryClientFactoryTest,TelemetryClientTest,TelemetryPushClientTest,TelemetryCollectorManagerTest,TelemetryCollectorTest,TelemetryHelperTest,TelemetryHttpClientLeakTest" -Djacoco.skip=true

# Run full unit test suite
mvn test -pl jdbc-core -Djacoco.skip=true -Dgroups='!Jvm17PlusAndArrowToNioReflectionDisabled'
```

### Manual verification (with CRaC-enabled JDK)

Use the reporter's reproducer from issue #1233 to verify 0 sockets remain after close:

1. Build the driver: `mvn clean install -DskipTests`
2. Set environment variables:
```bash
export DATABRICKS_AUTH_TOKEN=<token>
export DATABRICKS_CONNECTION_STRING="jdbc:databricks://<host>:443/default;transportMode=http;ssl=1;httpPath=<path>;AuthMech=3;UID=token"
```
3. Run the socket leak reproducer (from issue #1233) which:
- Opens a connection, executes `SELECT 1`, closes the connection
- Calls `GlobalAsyncHttpClient.releaseClient()`
- Checks for remaining TCP sockets via `ss -tnp state established dst :443`
4. **Expected**: 0 sockets after close
5. Run the CRaC checkpoint reproducer:
- Same steps but calls `Core.checkpointRestore()` after close
- **Expected**: Checkpoint succeeds without `CheckpointOpenSocketException`

### Regression testing

The fix does not change any public API or behavior for active connections. It only prevents
resource re-creation after close. The full unit test suite (3085 tests) passes with 0 failures.

## Files Changed

| File | Change |
|------|--------|
| `TelemetryClientFactory.java` | Added `closedConnectionUuids` guard, reordered close sequence |
| `DatabricksHttpClientFactory.java` | Added `closedConnections` guard, new `closeConnection()` method |
| `DatabricksConnection.java` | Use `closeConnection()` instead of `removeClient()` |
| `TelemetryPushClient.java` | Null guard for `getClient()` return value |
| `TelemetryHttpClientLeakTest.java` | 3 reproduction tests |
DatabricksConnection.java
@@ -422,7 +422,7 @@ public void close() throws SQLException {
TelemetryClientFactory.getInstance().closeTelemetryClient(connectionContext);
DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
- DatabricksHttpClientFactory.getInstance().removeClient(connectionContext);
+ DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
[F10] No regression test for the Connection.close() post-condition (Medium)

This one-line change alters observable behavior: after connection.close(), DatabricksHttpClientFactory.getInstance().getClient(ctx, TELEMETRY) now returns null (previously always returned a new client). Telemetry pushes submitted during close are now silently dropped.

No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered.

Fix: Add one focused assertion in the existing DatabricksConnection test class:

```java
@Test
void getClientReturnsNullAfterConnectionClose() {
  DatabricksConnection conn = ...;
  conn.close();
  assertNull(DatabricksHttpClientFactory.getInstance()
      .getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}
```

Flagged by test reviewer.

DatabricksThreadContextHolder.clearAllContext();
}

DatabricksHttpClientFactory.java
@@ -8,6 +8,7 @@
import com.databricks.jdbc.log.JdbcLogger;
import com.databricks.jdbc.log.JdbcLoggerFactory;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DatabricksHttpClientFactory {
@@ -17,6 +18,14 @@ public class DatabricksHttpClientFactory {
private final ConcurrentHashMap<SimpleEntry<String, HttpClientType>, DatabricksHttpClient>
instances = new ConcurrentHashMap<>();

/**
* Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from
* re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard,
* delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets.
* See GitHub issue #1325.
[F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)

closedConnections is a process-wide ConcurrentHashMap.newKeySet() that is append-only for the JVM lifetime — no TTL, no cap, no eviction, no removal path. TelemetryClientFactory.closedConnectionUuids has the same shape and clears only in reset() (test-only).

Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. 100 closes/sec (realistic for HikariCP under load) → ~2 GB/day, on the order of 60 GB/month. This hits exactly the long-lived CRaC workloads this PR targets.

Fix options (in preference order):

  1. Bound via Caffeine Cache with expireAfterWrite(5-10 minutes) — the race window is seconds, not JVM lifetime (sketched below).
  2. Flip to an allowlist: insert UUID in getClient on first use, remove in closeConnection. Bounded to live connections and solves F1's TOCTOU for free.
  3. Encode the tombstone in the existing instances map.

Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture).
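A sketch of option 1, assuming a Caffeine dependency is acceptable (field name and bounds illustrative):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Time-bounded tombstone set: entries expire long after the seconds-wide race
// window has passed, keeping the set sized to recently closed connections.
final Cache<String, Boolean> closedConnections =
    Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(10))
        .maximumSize(10_000) // hard cap as a second line of defense
        .build();

// closeConnection():   closedConnections.put(uuid, Boolean.TRUE);
// getClient() guard:   if (uuid != null && closedConnections.getIfPresent(uuid) != null) return null;
```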

*/
private final Set<String> closedConnections = ConcurrentHashMap.newKeySet();

private DatabricksHttpClientFactory() {
// Private constructor to prevent instantiation
}
@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) {

public IDatabricksHttpClient getClient(
[F16] @Nullable annotation + irreversibility are undocumented (Low/Medium)

The codebase already uses javax.annotation.Nullable (e.g., SqlParameter.java:10, DatabricksSession.java:102,110, IDatabricksSession.java:21,24). This PR silently widens getClient to a nullable return without annotation or Javadoc update. Agents and humans adding future callers will omit the null check — see F3 for the consequence.

Also document that closeConnection is irreversible — the UUID is blacklisted for the lifetime of the factory. The field comment says "permanently closed" but the public method Javadoc only says "permanently closes all HTTP clients" — reads as "closes them once," not "blacklists UUID forever."

Fix:

```java
/**
 * @return the HTTP client, or {@code null} if {@link #closeConnection} has been
 *         called for this context. Callers MUST null-check when the call can
 *         race with connection close (e.g., async tasks).
 */
@Nullable
public IDatabricksHttpClient getClient(
    IDatabricksConnectionContext context, HttpClientType type) { ... }
```

Add to closeConnection Javadoc: "This is irreversible — the UUID is blacklisted for the lifetime of this factory. Do not call until you're certain the context will never be used again." Same on TelemetryClientFactory.closeTelemetryClient.

IDatabricksConnectionContext context, HttpClientType type) {
// Prevent creating new HTTP clients for connections that have been closed.
// This guards against delayed TelemetryPushTask executions that call
// getClient(ctx, TELEMETRY) after removeClient(ctx) has already run.
String connectionUuid = context.getConnectionUuid();
if (connectionUuid != null && closedConnections.contains(connectionUuid)) {
LOGGER.debug(
[F3] getClient() silently widened to nullable return — 7 unverified callers (High)

Pre-PR getClient() was non-null. Post-PR it can return null. Only TelemetryPushClient.pushEvent() was updated. Verified call sites that store or dereference without a null check:

| File:line | Usage | Null-safe? |
|-----------|-------|------------|
| DatabricksThriftAccessor.java:662 | passed to DatabricksHttpTTransport ctor | No |
| DatabricksTokenFederationProvider.java:74 | stored in this.hc | No |
| AzureMSICredentialProvider.java:43 | stored in this.httpClient | No |
| PrivateKeyClientCredentialProvider.java:27 | stored | No |
| DatabricksDriverFeatureFlagsContext.java:92 | runs on a background scheduler; execute() caught by catch(Exception) at TRACE → silent failure of FF refresh | No |
| ArrowStreamResult.java:53, 153 | ctor param → NPE later in chunk download | No |

No @Nullable annotation despite the codebase using javax.annotation.Nullable elsewhere (SqlParameter.java:10, DatabricksSession.java:102). Contract change is invisible to callers.

Fix (preferred): return a rejecting sentinel IDatabricksHttpClient whose execute() throws ConnectionClosedException (symmetric to NoopTelemetryClient). All callers stay safe by default.
Alternative: split into getClient() (throws) + getClientOrNull() (used only by TelemetryPushClient), or add @Nullable + Javadoc and null-check every caller.

Flagged by 6 reviewers.

"Rejecting getClient() for closed connection {} with type {}",
context.getConnectionUuid(),
type);
[F1] TOCTOU race — the PR does not actually close the race it claims to fix (High)

The guard reads closedConnections.contains(uuid) then falls through to instances.computeIfAbsent(...). These are two separate operations on two separate maps with no mutual exclusion. Interleaving:

  1. Thread A (delayed TelemetryPushTask on the 10-thread pool): evaluates closedConnections.contains(uuid) → false.
  2. Thread B (DatabricksConnection.close()): runs closeConnection() — adds UUID, wipes instances.
  3. Thread A resumes computeIfAbsent → creates a new DatabricksHttpClient(TELEMETRY) → leaks a TCP socket.

This is literally the race the RCA describes on the old code. The new code narrows but does not close it.

Fix — fuse guard and map op atomically:

```java
String uuid = context.getConnectionUuid();
return instances.compute(
    getClientKey(uuid, type),
    (k, existing) -> {
      if (existing != null) return existing;
      if (uuid != null && closedConnections.contains(uuid)) return null;
      return new DatabricksHttpClient(context, type);
    });
```

Apply the same pattern in TelemetryClientFactory.getTelemetryClient().

Flagged independently by 4 reviewers (language, security, performance, devils-advocate).

return null;
}
return instances.computeIfAbsent(
getClientKey(context.getConnectionUuid(), type),
k -> new DatabricksHttpClient(context, type));
}

/**
[F8] removeClient(ctx) is a public footgun with zero production callers (Medium)

After this PR, the single-argument removeClient(IDatabricksConnectionContext) overload has exactly one caller in the repo: DatabricksHttpClientTest.java:240 (a test). All production call sites were migrated to closeConnection() by this PR.

The RCA explicitly identifies removeClient's "allows re-creation via computeIfAbsent" behavior as the root cause of #1325. Shipping two nearly-identically-named public methods where one silently re-enables the bug this PR fixes is a future-maintainer trap — the name "removeClient" reads as the obvious choice on IDE autocomplete.

Fix: Delete the single-argument overload and migrate the test; or mark it @VisibleForTesting / @Deprecated with a pointer to closeConnection, as sketched below. At minimum, rename to signal intent (e.g., evictClientsForReconnect).

Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat).
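A sketch of the minimum fix (Javadoc wording illustrative; method body as in this diff):

```java
/**
 * Removes and closes all HTTP clients for the given connection WITHOUT
 * tombstoning it; a later getClient() call re-creates the client.
 *
 * @deprecated prefer {@link #closeConnection} when tearing a connection down;
 *     this overload exists only for reset/reconnect scenarios in tests.
 */
@Deprecated
public void removeClient(IDatabricksConnectionContext context) {
  for (HttpClientType type : HttpClientType.values()) {
    removeClient(context, type);
  }
}
```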

* Removes and closes all HTTP clients for the given connection. Does NOT mark the connection as
* closed — the client can be re-created by a subsequent getClient() call.
*/
public void removeClient(IDatabricksConnectionContext context) {
for (HttpClientType type : HttpClientType.values()) {
removeClient(context, type);
}
}

/**
[F7] No @VisibleForTesting reset() on DatabricksHttpClientFactory — test pollution (Medium)

TelemetryClientFactory.reset() (in this PR at the hunk around line 243) clears closedConnectionUuids. DatabricksHttpClientFactory has no equivalent. It's a JVM singleton and closedConnections is an instance field.

Consequence: any test that calls closeConnection(ctx) permanently adds UUID to closedConnections for the rest of the test JVM. A later test reusing the same UUID gets null from getClient() with no symptom except "my test fails mysteriously when run after another test."

Fix: Add @VisibleForTesting void reset() on DatabricksHttpClientFactory that clears both instances and closedConnections. Call in @BeforeEach/@AfterEach of TelemetryHttpClientLeakTest.
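A sketch of the missing hook (assumed shape; a more careful version would close each cached client before dropping it, mirroring removeClient()):

```java
@VisibleForTesting
void reset() {
  instances.clear();          // drop cached HTTP clients
  closedConnections.clear();  // clear tombstones so tests can reuse UUIDs
}
```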

* Permanently closes all HTTP clients for the given connection and prevents new ones from being
* created. This should be called from DatabricksConnection.close() to prevent delayed
[F5] closeConnection() NPEs on null UUID — inconsistent with getClient()'s null guard (High)

ConcurrentHashMap.newKeySet() rejects null elements — closedConnections.add(null) throws NPE. Yet getClient() in the same file at the new lines if (connectionUuid != null && closedConnections.contains(connectionUuid)) explicitly null-guards, implying nulls are considered possible. The two paths disagree on the invariant.

Same issue in TelemetryClientFactory.closeTelemetryClient at the closedConnectionUuids.add(connectionUuid) calls inside the computeIfPresent lambdas — no null-guard.

Fix: Pick one invariant. Either (a) assert non-null in closeConnection/closeTelemetryClient and document the constraint on IDatabricksConnectionContext.getConnectionUuid(), or (b) mirror the getClient guard:

```java
String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);
```

Flagged by 2 reviewers (architecture, language).

* TelemetryPushTask executions from creating orphaned HTTP clients (issue #1325).
*/
public void closeConnection(IDatabricksConnectionContext context) {
closedConnections.add(context.getConnectionUuid());
removeClient(context);
}

public void removeClient(IDatabricksConnectionContext context, HttpClientType type) {
DatabricksHttpClient instance =
instances.remove(getClientKey(context.getConnectionUuid(), type));
TelemetryClientFactory.java
@@ -34,6 +34,14 @@ public class TelemetryClientFactory {
@VisibleForTesting
final Map<String, TelemetryClientHolder> noauthTelemetryClientHolders = new ConcurrentHashMap<>();

/**
* Tracks connection UUIDs that have been closed. When a connection is closed, its UUID is added
* here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector
* exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This
* prevents the socket leak described in GitHub issue #1325.
*/
[F13] Duplicated registry pattern across two factories (Medium)

closedConnectionUuids here and closedConnections in DatabricksHttpClientFactory are the same concept implemented twice with different names. Both are ConcurrentHashMap.newKeySet() of connection UUIDs, both checked before compute* in the same way, both added on close. Neither is bounded.

Fix: Extract a ClosedConnectionRegistry helper in com.databricks.jdbc.common with markClosed(uuid) / isClosed(uuid) / clear(). Back it with a bounded cache (Caffeine, e.g. expireAfterWrite(5 min) or max ~10K entries LRU). Both factories hold a reference (or share a singleton). Solves the duplication AND F2's memory growth in one change.

Flagged by 3 reviewers (maintainability, agent-compat, ops).
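A sketch of the proposed helper (hypothetical class; the backing set could be swapped for the bounded Caffeine cache from F2 without touching callers):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Single registry shared by both factories instead of two parallel key sets.
public final class ClosedConnectionRegistry {
  private static final ClosedConnectionRegistry INSTANCE = new ClosedConnectionRegistry();
  private final Set<String> closedUuids = ConcurrentHashMap.newKeySet();

  public static ClosedConnectionRegistry getInstance() {
    return INSTANCE;
  }

  public void markClosed(String uuid) {
    if (uuid != null) { // tolerate null UUIDs uniformly (see F5)
      closedUuids.add(uuid);
    }
  }

  public boolean isClosed(String uuid) {
    return uuid != null && closedUuids.contains(uuid);
  }

  // Test-only reset hook.
  public void clear() {
    closedUuids.clear();
  }
}
```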

@VisibleForTesting final Set<String> closedConnectionUuids = ConcurrentHashMap.newKeySet();

private final ExecutorService telemetryExecutorService;
private ScheduledExecutorService sharedSchedulerService;

@@ -78,6 +86,14 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connectionContext) {
if (!isTelemetryAllowedForConnection(connectionContext)) {
return NoopTelemetryClient.getInstance();
}
// Prevent re-creation of TelemetryClient for connections that have been closed.
// Without this guard, code paths that call getTelemetryClient() after
// closeTelemetryClient() (e.g., TelemetryCollector.exportAllPendingTelemetryDetails
// or delayed TelemetryPushTask flush) would create an orphaned TelemetryClient
// whose periodic flush creates leaked TELEMETRY HTTP clients (issue #1325).
if (closedConnectionUuids.contains(connectionContext.getConnectionUuid())) {
return NoopTelemetryClient.getInstance();
}
DatabricksConfig databricksConfig =
TelemetryHelper.getDatabricksConfigSafely(connectionContext);
if (databricksConfig != null) {
@@ -137,41 +153,52 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connectionContext) {
/**
* Closes telemetry client for a connection. Thread-safe: computeIfPresent ensures atomic locking,
* preventing race conditions between connection removal and addition.
*
* <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient()
* from re-creating a TelemetryClient during or after the close sequence. Pending
* TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are
[F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium)

New ordering: (1) removeCollector, (2) exportAllPendingTelemetryDetails() which calls getTelemetryClient(ctx) through the factory, (3) computeIfPresent removes holder and marks UUID closed.

Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):

  • Thread A starts step 2's export loop.
  • Thread B completes step 3 first (removes holder, marks UUID closed).
  • Thread A's next getTelemetryClient(ctx) now returns NoopTelemetryClient → events silently dropped.
  • Or a concurrent re-open races in between, and step 2's getTelemetryClient creates a new TelemetryClient (the exact original bug).

Old ordering was order-insensitive because holder-remove+close was atomic under computeIfPresent.

Fix: Mark UUID closed BEFORE export, and export via a directly-held TelemetryClient reference (either look it up from the holder map first, or have the removal return the client). Don't route the export back through the factory.

Flagged by architecture reviewer.
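A sketch of that shape using the names from this diff (the client-accepting export overload is hypothetical):

```java
// Mark closed first, then export through a directly-held reference.
closedConnectionUuids.add(connectionUuid); // tombstone before any export
TelemetryClientHolder holder = telemetryClientHolders.get(key);
ITelemetryClient clientForExport = (holder != null) ? holder.client : null;
TelemetryCollector collector =
    TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null && clientForExport != null) {
  // Hypothetical overload taking an explicit client, so a concurrent close of
  // the same UUID cannot swap in NoopTelemetryClient mid-export.
  collector.exportAllPendingTelemetryDetails(clientForExport);
}
```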

* flushed through the existing client. See GitHub issue #1325.
*/
public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) {
String key = TelemetryHelper.keyOf(connectionContext);
[F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High)

The new ordering runs exportAllPendingTelemetryDetails() BEFORE the two computeIfPresent blocks that (a) mark UUID closed, (b) close the client, (c) remove the holder.

If exportAllPendingTelemetryDetails() or the exportTelemetryLog it calls throws any unchecked exception, execution exits closeTelemetryClient and propagates up to DatabricksConnection.close(). Consequences:

  • TelemetryClient stays in the holder map with its periodic flush scheduler alive
  • closedConnectionUuids never updated → future getTelemetryClient() won't return Noop
  • DatabricksHttpClientFactory.closeConnection() on the next line of DatabricksConnection.close() never runs

One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer).

Fix:

  1. Mark UUIDs closed unconditionally up-front (before export).
  2. Run holder cleanup.
  3. Run exportAllPendingTelemetryDetails last, wrapped in try/catch that logs-and-swallows.
  4. In DatabricksConnection.close() wrap the cleanup chain in try/finally so closeConnection(ctx) always runs even if earlier steps throw (sketched below).

Flagged by ops reviewer.
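A sketch of item 4, assuming the close() shape shown in the excerpt above:

```java
public void close() throws SQLException {
  try {
    session.close();
    TelemetryClientFactory.getInstance().closeTelemetryClient(connectionContext);
    DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext);
    DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext);
  } finally {
    // Always blacklist the UUID and tear down HTTP clients, even if an
    // export-time exception propagates from the telemetry close above.
    DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext);
    DatabricksThreadContextHolder.clearAllContext();
  }
}
```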

String connectionUuid = connectionContext.getConnectionUuid();
// Atomically remove connection and close client if no connections remain for this key

// Export pending TelemetryCollector events BEFORE closing the TelemetryClient.
// This ensures the export uses the existing TelemetryClient (via the holder map)
// rather than triggering re-creation after the holder is removed.
TelemetryCollector collector =
TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null) {
collector.exportAllPendingTelemetryDetails();
}

// Mark the connection as closed to prevent getTelemetryClient() from re-creating a
// TelemetryClient if called by delayed flush tasks or collector exports (issue #1325).
// This is done inside computeIfPresent so it only applies to connections that actually
// had a telemetry client registered.
telemetryClientHolders.computeIfPresent(
key,
(k, holder) -> {
holder.connectionUuids.remove(connectionUuid);
closedConnectionUuids.add(connectionUuid);
if (holder.connectionUuids.isEmpty()) {
closeTelemetryClient(holder.client, "telemetry client");
return null;
}
return holder;
});
// Atomically remove connection and close client if no connections remain for this key
noauthTelemetryClientHolders.computeIfPresent(
key,
(k, holder) -> {
holder.connectionUuids.remove(connectionUuid);
closedConnectionUuids.add(connectionUuid);
if (holder.connectionUuids.isEmpty()) {
closeTelemetryClient(holder.client, "unauthenticated telemetry client");
return null;
}
return holder;
});

// Export and remove the TelemetryCollector for this connection
TelemetryCollector collector =
TelemetryCollectorManager.getInstance().removeCollector(connectionContext);
if (collector != null) {
// Export any remaining telemetry before removing
collector.exportAllPendingTelemetryDetails();
}

// Clean up cached connection parameters to prevent memory leaks
TelemetryHelper.removeConnectionParameters(connectionContext.getConnectionUuid());
}
@@ -216,6 +243,7 @@ public void reset() {
// Clear the maps
telemetryClientHolders.clear();
noauthTelemetryClientHolders.clear();
closedConnectionUuids.clear();

// Clear cached connection parameters
TelemetryHelper.clearConnectionParameterCache();
TelemetryPushClient.java
@@ -47,6 +47,11 @@ public void pushEvent(TelemetryRequest request) throws Exception {
IDatabricksHttpClient httpClient =
DatabricksHttpClientFactory.getInstance()
.getClient(connectionContext, HttpClientType.TELEMETRY);
if (httpClient == null) {
// Connection was closed — HTTP client factory rejected the request to prevent socket leaks.
LOGGER.debug("Skipping telemetry push: connection has been closed");
[F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium)

Two invisibility paths added by this PR:

  1. DatabricksHttpClientFactory.getClient — rejection logged at LOGGER.debug("Rejecting getClient() for closed connection {} with type {}"). Below default production log level. If a non-TELEMETRY caller hits this path due to a future bug or code change, operators never see it.
  2. TelemetryPushClient.pushEvent on null — LOGGER.debug("Skipping telemetry push: connection has been closed"). In aggressive close scenarios (shutdown, failover, pool eviction), every dropped push is invisible. No counter. Operators cannot answer "is telemetry working?"

Fix:

  • Raise the getClient rejection log to WARN for non-TELEMETRY types (preserves DEBUG for the expected path).
  • Add an AtomicLong droppedOnClosedCount on the factory; emit a periodic aggregated WARN when > 0, or expose via JMX/existing metrics (sketched below).
  • Include Thread.currentThread().getName() in the rejection log for diagnosis.

Flagged by 3 reviewers (ops, agent-compat, maintainability).
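A sketch of the counter (field name and log wording illustrative; the logger's placeholder style matches the debug call in this diff):

```java
import java.util.concurrent.atomic.AtomicLong;

// On DatabricksHttpClientFactory:
private final AtomicLong droppedOnClosedCount = new AtomicLong();

// In getClient(), on the rejection path:
long totalDrops = droppedOnClosedCount.incrementAndGet();
if (type != HttpClientType.TELEMETRY) {
  LOGGER.warn(
      "Rejecting getClient() for closed connection {} (type {}, thread {}, totalDrops {})",
      connectionUuid, type, Thread.currentThread().getName(), totalDrops);
}
```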

return;
}
String path =
isAuthenticated
? PathConstants.TELEMETRY_PATH