-
Notifications
You must be signed in to change notification settings - Fork 39
Fix telemetry HTTP client socket leak preventing CRaC checkpoint #1333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,208 @@ | ||
| # RCA: Leaked Socket Prevents CRaC Checkpointing (Issue #1325) | ||
|
|
||
| ## Problem Statement | ||
|
|
||
| After `Connection.close()`, a `DatabricksHttpClient` with type `TELEMETRY` can remain in | ||
| `DatabricksHttpClientFactory.instances`, keeping a TCP socket open indefinitely. This prevents | ||
| CRaC (Coordinated Restore at Checkpoint) from completing because CRaC requires all sockets to be | ||
| closed before a checkpoint can be taken. | ||
|
|
||
| **Reporter**: @jnd77 (follow-up to #1233) | ||
| **Affected versions**: 3.x (confirmed on 3.3.1) | ||
| **Symptom**: Intermittent — depends on timing of telemetry flush tasks relative to connection close. | ||
|
|
||
| ## Root Cause | ||
|
|
||
| The bug is a **cross-thread race condition** in the connection close path involving two independent | ||
| mechanisms that can re-create HTTP clients after they've been removed. | ||
|
|
||
| ### Close Sequence (DatabricksConnection.close()) | ||
|
|
||
| ``` | ||
| Line 421: session.close() | ||
| Line 422: TelemetryClientFactory.closeTelemetryClient(ctx) | ||
| Line 423: DatabricksClientConfiguratorManager.removeInstance(ctx) | ||
| Line 424: DatabricksDriverFeatureFlagsContextFactory.removeInstance(ctx) | ||
| Line 425: DatabricksHttpClientFactory.removeClient(ctx) // removes all HTTP clients | ||
| Line 426: DatabricksThreadContextHolder.clearAllContext() | ||
| ``` | ||
|
|
||
| ### Race Condition 1: TelemetryClient re-creation after close | ||
|
|
||
| Inside `TelemetryClientFactory.closeTelemetryClient()`, the ordering was: | ||
|
|
||
| 1. **Remove the TelemetryClientHolder** from the map via `computeIfPresent` -> calls | ||
| `TelemetryClient.close()` -> `flush(true).get()` (synchronous flush) | ||
| 2. **Export pending TelemetryCollector events** via `collector.exportAllPendingTelemetryDetails()` | ||
|
|
||
| Step 2 calls `TelemetryHelper.exportTelemetryLog()` which calls | ||
| `TelemetryClientFactory.getTelemetryClient(ctx)`. Since the holder was already removed in Step 1, | ||
| `getTelemetryClient()` sees `existing == null` and **creates a brand new TelemetryClient** with a | ||
| new periodic flush scheduler. This orphaned client is never closed. | ||
|
|
||
| ### Race Condition 2: TELEMETRY HTTP client re-creation after removeClient | ||
|
|
||
| `TelemetryClient.close()` calls `flush(true).get()` which submits a `TelemetryPushTask` to the | ||
| shared 10-thread executor pool. The task calls: | ||
|
|
||
| ``` | ||
| TelemetryPushClient.pushEvent() | ||
| -> DatabricksHttpClientFactory.getClient(ctx, HttpClientType.TELEMETRY) | ||
| ``` | ||
|
|
||
| If this task executes **after** `DatabricksHttpClientFactory.removeClient(ctx)` at line 425, | ||
| `computeIfAbsent` creates a **new** `DatabricksHttpClient(TELEMETRY)` that nobody will ever close. | ||
| This leaked HTTP client holds an open TCP socket. | ||
|
|
||
| ### Why it's intermittent | ||
|
|
||
| The reporter notes the issue is "random." This is because: | ||
| - The race window is between `flush().get()` completing on the main thread and the actual | ||
| `TelemetryPushTask.run()` executing on the pool thread | ||
| - It only triggers when there are pending telemetry events at close time | ||
| - GC pauses and CPU scheduling widen or narrow the window | ||
|
|
||
| ### Previous fix (#1235) and why it was incomplete | ||
|
|
||
| PR #1235 fixed the `DatabricksClientConfiguratorManager` leak (SDK connection manager not being | ||
| closed). But it did not address: | ||
| 1. The telemetry client re-creation in `closeTelemetryClient()` | ||
| 2. The HTTP client re-creation via `computeIfAbsent` after `removeClient()` | ||
|
|
||
| ## Fix | ||
|
|
||
| The fix addresses both race conditions with a defense-in-depth approach: | ||
|
|
||
| ### Fix 1: TelemetryClientFactory — Prevent TelemetryClient re-creation | ||
|
|
||
| **File**: `TelemetryClientFactory.java` | ||
|
|
||
| - **Added `closedConnectionUuids` set**: Tracks connection UUIDs that have been closed. | ||
| `getTelemetryClient()` checks this set and returns `NoopTelemetryClient` for closed connections | ||
| instead of creating a new orphaned `TelemetryClient`. | ||
|
|
||
| - **Reordered `closeTelemetryClient()`**: Export pending `TelemetryCollector` events **BEFORE** | ||
| closing the `TelemetryClient`. This ensures the export uses the existing client (still in the | ||
| holder map) rather than triggering re-creation after the holder is removed. | ||
|
|
||
| - The UUID is added to `closedConnectionUuids` inside the `computeIfPresent` lambda so only | ||
| connections that actually had a telemetry client get tracked (avoids poisoning the set during | ||
| test setup/cleanup). | ||
|
|
||
| ### Fix 2: DatabricksHttpClientFactory — Prevent HTTP client re-creation | ||
|
|
||
| **File**: `DatabricksHttpClientFactory.java` | ||
|
|
||
| - **Added `closedConnections` set**: Tracks connection UUIDs that have been permanently closed. | ||
|
|
||
| - **New `closeConnection()` method**: Marks the connection as permanently closed and removes all | ||
| HTTP clients. After this call, `getClient()` returns `null` for that connection, preventing | ||
| `computeIfAbsent` from creating orphaned `DatabricksHttpClient` instances. | ||
|
|
||
| - `removeClient()` is unchanged — it still allows re-creation for non-close use cases (e.g., | ||
| client reset/reconnect scenarios used in tests). | ||
|
|
||
| ### Fix 3: DatabricksConnection — Use permanent close | ||
|
|
||
| **File**: `DatabricksConnection.java` | ||
|
|
||
| - Changed `removeClient(connectionContext)` to `closeConnection(connectionContext)` to use the | ||
| permanent close semantics that prevent HTTP client re-creation. | ||
|
|
||
| ### Fix 4: TelemetryPushClient — Null guard | ||
|
|
||
| **File**: `TelemetryPushClient.java` | ||
|
|
||
| - `pushEvent()` now handles `null` return from `getClient()` gracefully (logs and returns early) | ||
| instead of throwing NPE. This is the safety net for delayed push tasks that fire after the | ||
| connection is closed. | ||
|
|
||
| ## Reproduction and Verification Plan | ||
|
|
||
| ### Automated Tests (TelemetryHttpClientLeakTest.java) | ||
|
|
||
| Three unit tests reproduce the two race conditions: | ||
|
|
||
| #### Test 1: `testGetTelemetryClientAfterCloseReCreatesClient` | ||
|
|
||
| Reproduces Race Condition 1. | ||
|
|
||
| **Steps**: | ||
| 1. Create a mock connection context with telemetry enabled | ||
| 2. Call `getTelemetryClient(ctx)` — creates a `TelemetryClient` in the holder map | ||
| 3. Call `closeTelemetryClient(ctx)` — removes the holder | ||
| 4. Call `getTelemetryClient(ctx)` again (simulates what `exportAllPendingTelemetryDetails` does) | ||
| 5. **Assert**: The returned client should be `NoopTelemetryClient`, not a new `TelemetryClient` | ||
|
|
||
| **Before fix**: Returns a new `TelemetryClient` (FAIL — orphaned client created) | ||
| **After fix**: Returns `NoopTelemetryClient` (PASS — no leak) | ||
|
|
||
| #### Test 2: `testGetClientReturnsNullAfterCloseConnection` | ||
|
|
||
| Reproduces Race Condition 2. | ||
|
|
||
| **Steps**: | ||
| 1. Create a mock connection context | ||
| 2. Call `DatabricksHttpClientFactory.closeConnection(ctx)` (simulates `DatabricksConnection.close()`) | ||
| 3. Call `getClient(ctx, HttpClientType.TELEMETRY)` (simulates delayed `TelemetryPushTask`) | ||
| 4. **Assert**: Returns `null` (not a new `DatabricksHttpClient`) | ||
|
|
||
| **Before fix**: Creates a new `DatabricksHttpClient` via `computeIfAbsent` (FAIL — leaked socket) | ||
| **After fix**: Returns `null` (PASS — no leak) | ||
|
|
||
| #### Test 3: `testCloseTelemetryClientWithPendingCollectorEventsReCreatesClient` | ||
|
|
||
| End-to-end test with pending telemetry collector events. | ||
|
|
||
| **Steps**: | ||
| 1. Create a telemetry client and record pending latency events in `TelemetryCollector` | ||
| 2. Mock `exportTelemetryLog` to call `getTelemetryClient(ctx)` (simulating the real export path) | ||
| 3. Call `closeTelemetryClient(ctx)` which triggers the collector export | ||
| 4. **Assert**: No new `TelemetryClient` holders exist after close | ||
|
|
||
| ### Running the tests | ||
|
|
||
| ```bash | ||
| # Run just the leak reproduction tests | ||
| mvn test -pl jdbc-core -Dtest=TelemetryHttpClientLeakTest -Djacoco.skip=true | ||
|
|
||
| # Run all telemetry tests (existing + new) | ||
| mvn test -pl jdbc-core -Dtest="TelemetryClientFactoryTest,TelemetryClientTest,TelemetryPushClientTest,TelemetryCollectorManagerTest,TelemetryCollectorTest,TelemetryHelperTest,TelemetryHttpClientLeakTest" -Djacoco.skip=true | ||
|
|
||
| # Run full unit test suite | ||
| mvn test -pl jdbc-core -Djacoco.skip=true -Dgroups='!Jvm17PlusAndArrowToNioReflectionDisabled' | ||
| ``` | ||
|
|
||
| ### Manual verification (with CRaC-enabled JDK) | ||
|
|
||
| Use the reporter's reproducer from issue #1233 to verify 0 sockets remain after close: | ||
|
|
||
| 1. Build the driver: `mvn clean install -DskipTests` | ||
| 2. Set environment variables: | ||
| ```bash | ||
| export DATABRICKS_AUTH_TOKEN=<token> | ||
| export DATABRICKS_CONNECTION_STRING="jdbc:databricks://<host>:443/default;transportMode=http;ssl=1;httpPath=<path>;AuthMech=3;UID=token" | ||
| ``` | ||
| 3. Run the socket leak reproducer (from issue #1233) which: | ||
| - Opens a connection, executes `SELECT 1`, closes the connection | ||
| - Calls `GlobalAsyncHttpClient.releaseClient()` | ||
| - Checks for remaining TCP sockets via `ss -tnp state established dst :443` | ||
| 4. **Expected**: 0 sockets after close | ||
| 5. Run the CRaC checkpoint reproducer: | ||
| - Same steps but calls `Core.checkpointRestore()` after close | ||
| - **Expected**: Checkpoint succeeds without `CheckpointOpenSocketException` | ||
|
|
||
| ### Regression testing | ||
|
|
||
| The fix does not change any public API or behavior for active connections. It only prevents | ||
| resource re-creation after close. The full unit test suite (3085 tests) passes with 0 failures. | ||
|
|
||
| ## Files Changed | ||
|
|
||
| | File | Change | | ||
| |------|--------| | ||
| | `TelemetryClientFactory.java` | Added `closedConnectionUuids` guard, reordered close sequence | | ||
| | `DatabricksHttpClientFactory.java` | Added `closedConnections` guard, new `closeConnection()` method | | ||
| | `DatabricksConnection.java` | Use `closeConnection()` instead of `removeClient()` | | ||
| | `TelemetryPushClient.java` | Null guard for `getClient()` return value | | ||
| | `TelemetryHttpClientLeakTest.java` | 3 reproduction tests | | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -422,7 +422,7 @@ public void close() throws SQLException { | |
| TelemetryClientFactory.getInstance().closeTelemetryClient(connectionContext); | ||
| DatabricksClientConfiguratorManager.getInstance().removeInstance(connectionContext); | ||
| DatabricksDriverFeatureFlagsContextFactory.removeInstance(connectionContext); | ||
| DatabricksHttpClientFactory.getInstance().removeClient(connectionContext); | ||
| DatabricksHttpClientFactory.getInstance().closeConnection(connectionContext); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F10] No regression test for the This one-line change alters observable behavior: after No existing test was modified; grep shows no test asserts the new post-close invariant. The "3085 tests pass" claim only proves nothing else broke, not that the new behavior is covered. Fix: Add one focused assertion in the existing DatabricksConnection test class: @Test
void getClientReturnsNullAfterConnectionClose() {
DatabricksConnection conn = ...;
conn.close();
assertNull(DatabricksHttpClientFactory.getInstance()
.getClient(conn.getConnectionContext(), HttpClientType.TELEMETRY));
}Flagged by test reviewer. |
||
| DatabricksThreadContextHolder.clearAllContext(); | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change | |||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -8,6 +8,7 @@ | ||||||||||||||||||||||
| import com.databricks.jdbc.log.JdbcLogger; | |||||||||||||||||||||||
| import com.databricks.jdbc.log.JdbcLoggerFactory; | |||||||||||||||||||||||
| import java.io.IOException; | |||||||||||||||||||||||
| import java.util.Set; | |||||||||||||||||||||||
| import java.util.concurrent.ConcurrentHashMap; | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| public class DatabricksHttpClientFactory { | |||||||||||||||||||||||
|
|
@@ -17,6 +18,14 @@ public class DatabricksHttpClientFactory { | ||||||||||||||||||||||
| private final ConcurrentHashMap<SimpleEntry<String, HttpClientType>, DatabricksHttpClient> | |||||||||||||||||||||||
| instances = new ConcurrentHashMap<>(); | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| /** | |||||||||||||||||||||||
| * Tracks connection UUIDs for which removeClient() has been called. Prevents getClient() from | |||||||||||||||||||||||
| * re-creating HTTP clients for closed connections via computeIfAbsent. Without this guard, | |||||||||||||||||||||||
| * delayed TelemetryPushTask executions can create orphaned HTTP clients that leak TCP sockets. | |||||||||||||||||||||||
| * See GitHub issue #1325. | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F2] Unbounded heap growth — PR trades a bounded socket leak for an unbounded heap leak (High)
Per-entry cost ≈ 120 B × 2 sets ≈ ~240 B per closed connection. 100 closes/sec (realistic for HikariCP under load) → ~2 GB/month. This hits exactly the long-lived CRaC workloads this PR targets. Fix options (in preference order):
Flagged by 6 reviewers (performance, maintainability, security, ops, devils-advocate, architecture). |
|||||||||||||||||||||||
| */ | |||||||||||||||||||||||
| private final Set<String> closedConnections = ConcurrentHashMap.newKeySet(); | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| private DatabricksHttpClientFactory() { | |||||||||||||||||||||||
| // Private constructor to prevent instantiation | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
|
|
@@ -31,17 +40,42 @@ public IDatabricksHttpClient getClient(IDatabricksConnectionContext context) { | ||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| public IDatabricksHttpClient getClient( | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F16] The codebase already uses Also document that Fix: /**
* @return the HTTP client, or {@code null} if {@link #closeConnection} has been
* called for this context. Callers MUST null-check when the call can
* race with connection close (e.g., async tasks).
*/
@Nullable
public IDatabricksHttpClient getClient(
IDatabricksConnectionContext context, HttpClientType type) { ... }Add to |
|||||||||||||||||||||||
| IDatabricksConnectionContext context, HttpClientType type) { | |||||||||||||||||||||||
| // Prevent creating new HTTP clients for connections that have been closed. | |||||||||||||||||||||||
| // This guards against delayed TelemetryPushTask executions that call | |||||||||||||||||||||||
| // getClient(ctx, TELEMETRY) after removeClient(ctx) has already run. | |||||||||||||||||||||||
| String connectionUuid = context.getConnectionUuid(); | |||||||||||||||||||||||
| if (connectionUuid != null && closedConnections.contains(connectionUuid)) { | |||||||||||||||||||||||
| LOGGER.debug( | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F3] Pre-PR
No Fix (preferred): return a rejecting sentinel Flagged by 6 reviewers. |
|||||||||||||||||||||||
| "Rejecting getClient() for closed connection {} with type {}", | |||||||||||||||||||||||
| context.getConnectionUuid(), | |||||||||||||||||||||||
| type); | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F1] TOCTOU race — the PR does not actually close the race it claims to fix (High) The guard reads
This is literally the race the RCA describes on the old code. The new code narrows but does not close it. Fix — fuse guard and map op atomically: String uuid = context.getConnectionUuid();
return instances.compute(
getClientKey(uuid, type),
(k, existing) -> {
if (existing != null) return existing;
if (uuid != null && closedConnections.contains(uuid)) return null;
return new DatabricksHttpClient(context, type);
});Apply the same pattern in Flagged independently by 4 reviewers (language, security, performance, devils-advocate). |
|||||||||||||||||||||||
| return null; | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
| return instances.computeIfAbsent( | |||||||||||||||||||||||
| getClientKey(context.getConnectionUuid(), type), | |||||||||||||||||||||||
| k -> new DatabricksHttpClient(context, type)); | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| /** | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F8] After this PR, the no-arg The RCA explicitly identifies Fix: Delete the no-arg overload and migrate the test; or mark Flagged by 4 reviewers (maintainability, architecture, devils-advocate, agent-compat). |
|||||||||||||||||||||||
| * Removes and closes all HTTP clients for the given connection. Does NOT mark the connection as | |||||||||||||||||||||||
| * closed — the client can be re-created by a subsequent getClient() call. | |||||||||||||||||||||||
| */ | |||||||||||||||||||||||
| public void removeClient(IDatabricksConnectionContext context) { | |||||||||||||||||||||||
| for (HttpClientType type : HttpClientType.values()) { | |||||||||||||||||||||||
| removeClient(context, type); | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| /** | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F7] No
Consequence: any test that calls Fix: Add |
|||||||||||||||||||||||
| * Permanently closes all HTTP clients for the given connection and prevents new ones from being | |||||||||||||||||||||||
| * created. This should be called from DatabricksConnection.close() to prevent delayed | |||||||||||||||||||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F5]
Same issue in Fix: Pick one invariant. Either (a) assert non-null in String uuid = context.getConnectionUuid();
if (uuid != null) closedConnections.add(uuid);Flagged by 2 reviewers (architecture, language). |
|||||||||||||||||||||||
| * TelemetryPushTask executions from creating orphaned HTTP clients (issue #1325). | |||||||||||||||||||||||
| */ | |||||||||||||||||||||||
| public void closeConnection(IDatabricksConnectionContext context) { | |||||||||||||||||||||||
| closedConnections.add(context.getConnectionUuid()); | |||||||||||||||||||||||
| removeClient(context); | |||||||||||||||||||||||
| } | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| public void removeClient(IDatabricksConnectionContext context, HttpClientType type) { | |||||||||||||||||||||||
| DatabricksHttpClient instance = | |||||||||||||||||||||||
| instances.remove(getClientKey(context.getConnectionUuid(), type)); | |||||||||||||||||||||||
|
|
|||||||||||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,6 +34,14 @@ public class TelemetryClientFactory { | |
| @VisibleForTesting | ||
| final Map<String, TelemetryClientHolder> noauthTelemetryClientHolders = new ConcurrentHashMap<>(); | ||
|
|
||
| /** | ||
| * Tracks connection UUIDs that have been closed. When a connection is closed, its UUID is added | ||
| * here so that subsequent getTelemetryClient() calls (e.g., from delayed flush tasks or collector | ||
| * exports) return NoopTelemetryClient instead of re-creating an orphaned TelemetryClient. This | ||
| * prevents the socket leak described in GitHub issue #1325. | ||
| */ | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F13] Duplicated registry pattern across two factories (Medium)
Fix: Extract a Flagged by 3 reviewers (maintainability, agent-compat, ops). |
||
| @VisibleForTesting final Set<String> closedConnectionUuids = ConcurrentHashMap.newKeySet(); | ||
|
|
||
| private final ExecutorService telemetryExecutorService; | ||
| private ScheduledExecutorService sharedSchedulerService; | ||
|
|
||
|
|
@@ -78,6 +86,14 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connecti | |
| if (!isTelemetryAllowedForConnection(connectionContext)) { | ||
| return NoopTelemetryClient.getInstance(); | ||
| } | ||
| // Prevent re-creation of TelemetryClient for connections that have been closed. | ||
| // Without this guard, code paths that call getTelemetryClient() after | ||
| // closeTelemetryClient() (e.g., TelemetryCollector.exportAllPendingTelemetryDetails | ||
| // or delayed TelemetryPushTask flush) would create an orphaned TelemetryClient | ||
| // whose periodic flush creates leaked TELEMETRY HTTP clients (issue #1325). | ||
| if (closedConnectionUuids.contains(connectionContext.getConnectionUuid())) { | ||
| return NoopTelemetryClient.getInstance(); | ||
| } | ||
| DatabricksConfig databricksConfig = | ||
| TelemetryHelper.getDatabricksConfigSafely(connectionContext); | ||
| if (databricksConfig != null) { | ||
|
|
@@ -137,41 +153,52 @@ public ITelemetryClient getTelemetryClient(IDatabricksConnectionContext connecti | |
| /** | ||
| * Closes telemetry client for a connection. Thread-safe: computeIfPresent ensures atomic locking, | ||
| * preventing race conditions between connection removal and addition. | ||
| * | ||
| * <p>The connection UUID is added to closedConnectionUuids FIRST to prevent getTelemetryClient() | ||
| * from re-creating a TelemetryClient during or after the close sequence. Pending | ||
| * TelemetryCollector events are exported BEFORE the TelemetryClient is closed, so they are | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F9] Reordered close has a concurrency regression under concurrent close of same UUID (Medium) New ordering: (1) Single-threaded: fine. Under concurrent close of the same UUID (legal via the public API):
Old ordering was order-insensitive because holder-remove+close was atomic under Fix: Mark UUID closed BEFORE export, and export via a directly-held Flagged by architecture reviewer. |
||
| * flushed through the existing client. See GitHub issue #1325. | ||
| */ | ||
| public void closeTelemetryClient(IDatabricksConnectionContext connectionContext) { | ||
| String key = TelemetryHelper.keyOf(connectionContext); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F6] Reordered close is not fail-safe — one exception skips ALL cleanup and re-opens the entire leak (High) The new ordering runs If
One export-time exception silently re-opens the entire leak the PR set out to fix. The old ordering was actually more robust (cleanup ran first, export was a trailer). Fix:
Flagged by ops reviewer. |
||
| String connectionUuid = connectionContext.getConnectionUuid(); | ||
| // Atomically remove connection and close client if no connections remain for this key | ||
|
|
||
| // Export pending TelemetryCollector events BEFORE closing the TelemetryClient. | ||
| // This ensures the export uses the existing TelemetryClient (via the holder map) | ||
| // rather than triggering re-creation after the holder is removed. | ||
| TelemetryCollector collector = | ||
| TelemetryCollectorManager.getInstance().removeCollector(connectionContext); | ||
| if (collector != null) { | ||
| collector.exportAllPendingTelemetryDetails(); | ||
| } | ||
|
|
||
| // Mark the connection as closed to prevent getTelemetryClient() from re-creating a | ||
| // TelemetryClient if called by delayed flush tasks or collector exports (issue #1325). | ||
| // This is done inside computeIfPresent so it only applies to connections that actually | ||
| // had a telemetry client registered. | ||
| telemetryClientHolders.computeIfPresent( | ||
| key, | ||
| (k, holder) -> { | ||
| holder.connectionUuids.remove(connectionUuid); | ||
| closedConnectionUuids.add(connectionUuid); | ||
| if (holder.connectionUuids.isEmpty()) { | ||
| closeTelemetryClient(holder.client, "telemetry client"); | ||
| return null; | ||
| } | ||
| return holder; | ||
| }); | ||
| // Atomically remove connection and close client if no connections remain for this key | ||
| noauthTelemetryClientHolders.computeIfPresent( | ||
| key, | ||
| (k, holder) -> { | ||
| holder.connectionUuids.remove(connectionUuid); | ||
| closedConnectionUuids.add(connectionUuid); | ||
| if (holder.connectionUuids.isEmpty()) { | ||
| closeTelemetryClient(holder.client, "unauthenticated telemetry client"); | ||
| return null; | ||
| } | ||
| return holder; | ||
| }); | ||
|
|
||
| // Export and remove the TelemetryCollector for this connection | ||
| TelemetryCollector collector = | ||
| TelemetryCollectorManager.getInstance().removeCollector(connectionContext); | ||
| if (collector != null) { | ||
| // Export any remaining telemetry before removing | ||
| collector.exportAllPendingTelemetryDetails(); | ||
| } | ||
|
|
||
| // Clean up cached connection parameters to prevent memory leaks | ||
| TelemetryHelper.removeConnectionParameters(connectionContext.getConnectionUuid()); | ||
| } | ||
|
|
@@ -216,6 +243,7 @@ public void reset() { | |
| // Clear the maps | ||
| telemetryClientHolders.clear(); | ||
| noauthTelemetryClientHolders.clear(); | ||
| closedConnectionUuids.clear(); | ||
|
|
||
| // Clear cached connection parameters | ||
| TelemetryHelper.clearConnectionParameterCache(); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -47,6 +47,11 @@ public void pushEvent(TelemetryRequest request) throws Exception { | |
| IDatabricksHttpClient httpClient = | ||
| DatabricksHttpClientFactory.getInstance() | ||
| .getClient(connectionContext, HttpClientType.TELEMETRY); | ||
| if (httpClient == null) { | ||
| // Connection was closed — HTTP client factory rejected the request to prevent socket leaks. | ||
| LOGGER.debug("Skipping telemetry push: connection has been closed"); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [F12] Rejection log at DEBUG; no metric for dropped telemetry (Medium) Two invisibility paths added by this PR:
Fix:
Flagged by 3 reviewers (ops, agent-compat, maintainability). |
||
| return; | ||
| } | ||
| String path = | ||
| isAuthenticated | ||
| ? PathConstants.TELEMETRY_PATH | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[F17] RCA doc in
docs/is unprecedented and point-in-time (Low)docs/currently contains only enduring reference material (LOGGING.md, TESTING.md, JDBC_METHOD_INVENTORY.md, JDBC_SPEC_COVERAGE_ANALYSIS.md, features/). This adds a 208-line root-cause analysis for a single bug.Issues:
docs/rca/README.mdpolicy, no numbering scheme).DatabricksConnection.close().Fix: Delete this file and fold the content into the PR description + issue #1325 comments. If in-tree RCAs are desired, establish a convention first (e.g.,
docs/rca/NNN-title.md+ aREADME.mdpolicy).Flagged by 3 reviewers (maintainability, agent-compat, language).