Skip to content

fix(ws): break reconnect-handler Arc cycle that leaked WebSocket clients (#325)#334

Open
skyc1e wants to merge 3 commits intoPolymarket:mainfrom
skyc1e:fix/ws-reconnect-handler-arc-cycle
Open

fix(ws): break reconnect-handler Arc cycle that leaked WebSocket clients (#325)#334
skyc1e wants to merge 3 commits intoPolymarket:mainfrom
skyc1e:fix/ws-reconnect-handler-arc-cycle

Conversation

@skyc1e
Copy link
Copy Markdown

@skyc1e skyc1e commented Apr 14, 2026

Summary

Fixes #325. SubscriptionManager::start_reconnection_handler spawned a detached tokio task that held a strong Arc<SubscriptionManager> — but the task also owned a cloned ConnectionManager whose watch::Sender is what the task awaits on. The Sender couldn't close until every clone dropped, and the strong Arc inside the task kept the manager (and that Sender clone) alive indefinitely. A classic "task keeps itself alive by waiting on a signal only it can send" cycle, producing one leaked WebSocket + task + subscription manager per dropped WsClient / rtds::Client.

The same bug existed in both src/clob/ws/subscription.rs and src/rtds/subscription.rs; both are fixed here.

Fix

Smallest diff that actually breaks the cycle:

  1. start_reconnection_handler now returns JoinHandle<()> (both clob::ws and rtds). Signature is the only public-ish change; all internal callers updated.
  2. New ws::task::AbortOnDrop — thin JoinHandle<()> wrapper whose Drop calls abort(). Single helper lives alongside the cycle it's paired with, documented with the full failure mode so future readers don't have to reconstruct it.
  3. ChannelResources (clob) and ClientInner<S> (rtds) each store an AbortOnDrop next to their Arc<SubscriptionManager>. When the owning client (or a single removed channel, via deauthenticate) drops, the wrapper drops, the task is aborted, its stack locals — including the strong Arc clone — are released, and the whole graph drops normally.

Why AbortOnDrop instead of a direct Drop for ChannelResources / Drop for ClientInner:

  • ClientInner has a field-move flow (authenticate / deauthenticate) that destructures the struct. Once impl Drop is added to a struct, partial-move-and-rebuild patterns stop compiling. Wrapping the handle isolates the Drop to the wrapper and lets the outer struct stay move-friendly.
  • Keeps the "this is why we bother" rationale colocated with the primitive that does the aborting, not scattered across two unrelated subsystems.

Scope

  • src/clob/ws/subscription.rs — return JoinHandle<()>, doc the contract; add the regression test.
  • src/clob/ws/client.rsChannelResources gains reconnect_handle: AbortOnDrop (annotated #[expect(dead_code, reason = "Field held only for its Drop side effect")]).
  • src/rtds/subscription.rs — return JoinHandle<()>, doc the contract.
  • src/rtds/client.rsClientInner gains the same reconnect_handle: AbortOnDrop field; authenticate / deauthenticate thread it through.
  • src/ws/task.rs (new, pub(crate)) — the AbortOnDrop wrapper + module-level docs for the cycle.
  • src/ws/mod.rs — register the new private submodule.

No public API / type breakage. The start_reconnection_handler return type changes from () to JoinHandle<()>, but the method is pub on a type exposed primarily through the higher-level Clients, and no internal or downstream caller (that I can see) relied on the () return.

Test plan

  • New regression test clob::ws::subscription::reconnect_handler_tests::aborting_reconnect_handle_releases_subscription_manager — constructs a SubscriptionManager + calls start_reconnection_handler, confirms Arc::strong_count >= 2 while the task is live (proves the cycle pre-condition), drops both, and polls Weak::strong_count until it reaches zero (or fails after 2s with the observed count). Without this PR's fix the strong count never decrements and the test times out.
  • ws::task::tests::abort_on_drop_cancels_pending_task — pending future never completes after wrapper is dropped.
  • ws::task::tests::abort_on_drop_is_noop_for_finished_taskabort() on an already-finished task doesn't panic.
  • cargo check / cargo testI don't have a Rust toolchain available in this sandbox, so the tests above are my best-effort and the broader crate test suite is unverified from my end. Happy to iterate if CI surfaces anything.

Notes

  • Issue reporter explicitly recommended this approach ("Option 2: JoinHandle + Drop"); I followed it.
  • Scope stays narrow on purpose — no call-site rewrites, no changes to the ConnectionManager contract, no new dependencies. The existing tokio feature-gating on ws / rtds already covers the new module.
  • Same-shape fix applied to RTDS because the same code pattern has the same bug there; the reporter only filed it against clob::ws but missing the RTDS side would leave a silent leak on that path.

Note

Medium Risk
Touches WebSocket client lifecycle by changing reconnection-handler task ownership/cancellation; if mishandled it could stop automatic resubscription or abort tasks unexpectedly, but changes are localized and covered by new regression tests.

Overview
Fixes a WebSocket resource leak (issue #325) by making both CLOB and RTDS SubscriptionManager::start_reconnection_handler return a JoinHandle<()> and ensuring callers retain and cancel that task on teardown.

Introduces ws::task::AbortOnDrop (registered as ws::task) and stores it alongside each channel/client’s Arc<SubscriptionManager> (ChannelResources in CLOB WS and ClientInner in RTDS), so dropping the client/channel aborts the reconnection task and breaks the Arc/watch reference cycle.

Adds regression tests in both subscription modules plus RTDS client teardown tests to assert SubscriptionManager instances are released after dropping the abort handle/client.

Reviewed by Cursor Bugbot for commit d42a0d4. Bugbot is set up for automated code reviews on this repo. Configure here.

`SubscriptionManager::start_reconnection_handler` (both
`clob::ws::subscription` and `rtds::subscription`) spawns a detached
`tokio` task that holds a strong `Arc<SubscriptionManager>` clone. The
task also owns a cloned `ConnectionManager`, whose `watch::Sender` is
what the task awaits on for state changes. Because the `Sender` only
closes when every clone of it drops, and the strong `Arc` clone inside
the spawned task prevents the owning manager (and therefore that
`ConnectionManager` clone) from ever dropping, the task can never exit
on its own — a reference cycle that leaked the entire channel (task,
WebSocket, subscription manager) for the lifetime of the process every
time a user dropped a `WsClient`.

`start_reconnection_handler` now returns its `JoinHandle<()>`. A new
`ws::task::AbortOnDrop` wrapper (thin `JoinHandle` + `Drop` that calls
`abort()`) is stored next to the `Arc<SubscriptionManager>` inside
`clob::ws::ChannelResources` and `rtds::ClientInner`. When the owning
client drops, the wrapper drops, the task is aborted, its stack locals
(including the strong `Arc` clone) are released, and the whole graph
can drop normally.

Public API is unchanged. A regression test in `clob::ws::subscription`
constructs a `SubscriptionManager`, drops it together with the
`AbortOnDrop`, and asserts the `Weak` downgrade cannot upgrade — i.e.
the strong count reached zero. Unit tests for `AbortOnDrop` cover the
cancel-pending and already-finished paths.

Closes Polymarket#325
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 90.85366% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.89%. Comparing base (77264a4) to head (d42a0d4).

Files with missing lines Patch % Lines
src/rtds/client.rs 92.20% 6 Missing ⚠️
src/clob/ws/subscription.rs 90.00% 3 Missing ⚠️
src/rtds/subscription.rs 89.65% 3 Missing ⚠️
src/ws/task.rs 88.46% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #334      +/-   ##
==========================================
+ Coverage   85.54%   86.89%   +1.34%     
==========================================
  Files          32       33       +1     
  Lines        5167     5325     +158     
==========================================
+ Hits         4420     4627     +207     
+ Misses        747      698      -49     
Flag Coverage Δ
rust 86.89% <90.85%> (+1.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

skyc1e added 2 commits April 15, 2026 01:20
Codecov flagged the reconnect-handle plumbing on the RTDS side as
uncovered — `Client::new`, `Client::authenticate`,
`Client::deauthenticate`, and `SubscriptionManager::start_reconnection_handler`
each had zero patch coverage, so the regression fence for issue Polymarket#325
only caught the clob::ws path.

Add focused regression tests without widening the public surface:

* `rtds::subscription::reconnect_handler_tests` mirrors the clob::ws
  test: constructs an `Arc<SubscriptionManager>`, calls
  `start_reconnection_handler`, wraps in `AbortOnDrop`, drops, and
  asserts the `Weak` downgrade cannot upgrade. Proves the same Arc
  cycle is broken on the RTDS code path.
* `rtds::client::teardown_tests` covers the three `reconnect_handle`
  field sites in `Client::{new, authenticate, deauthenticate}` via
  `tokio::test`s that poll a `Weak<SubscriptionManager>` after drop.
  `authenticate` and `deauthenticate` tests also assert the manager
  is _still_ reachable mid-round-trip, ensuring the `AbortOnDrop` is
  forwarded into the new `ClientInner` rather than dropped early.
* A `#[cfg(test)]`-only `inner_subscriptions_for_test` accessor on
  `Client<S>` provides the weak-ref probe without exposing the
  private `ClientInner.subscriptions` field. Gated behind
  `#[expect(clippy::multiple_inherent_impl, ...)]` with a reason, per
  crate style.

These tests exercise the same `AbortOnDrop::new` + `Drop` paths the
existing `ws::task::tests` already cover, transitively raising
`ws/task.rs` coverage as well.
Zorak556 added a commit to Zorak556/polymarket-client-sdk that referenced this pull request Apr 17, 2026
… sites

Local fork patches in b270478 and e79fc91 made `subscribe_user_events`,
`subscribe_orders`, `subscribe_trades`, and `heartbeats_active` async
without updating the SDK's own tests/examples. The breakage was masked
by b4a5614's library compilation errors; PR Polymarket#334 fixed the lib and
exposed the stale call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WebSocket connection leak: start_reconnection_handler creates Arc refcount cycle, clients never drop

1 participant