fix(realtime): fix watchdog reconnect using stale websocket transport (#402)
Odatas wants to merge 5 commits into Danielhiversen:master
Conversation
The watchdog failed to reconnect after connection loss because `sub_manager` was not reset to `None` after `close_async()`. This caused `_create_sub_manager()` to return early, reusing the old transport with an expired token and `websocketSubscriptionUrl`.

Changes:
- realtime.py: reset `session` and `sub_manager` to `None` in a `finally` block after `close_async()` so `_create_sub_manager()` builds a fresh transport
- realtime.py: add an `on_reconnect` callback to `TibberRT`, called before each reconnect to fetch a fresh `websocketSubscriptionUrl` via `update_info()`
- realtime.py: guard the `sub_endpoint` setter to skip `sub_manager` replacement when the URL is unchanged, preventing orphaned websocket connections
- realtime.py: replace `assert` statements with explicit `RuntimeError`
- home.py: remove the redundant `update_info()` call from `rt_resubscribe()`; reconnect orchestration is now handled entirely by the watchdog
- __init__.py: pass `update_info` as the `on_reconnect` callback to `TibberRT`
- __init__.py: fix a misplaced docstring in `set_access_token()`

Fixes: home-assistant/core#162395
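The `sub_endpoint` guard described above can be sketched as follows. This is a minimal standalone illustration, not the actual pyTibber code: `SubscriptionManager` here is a stand-in for the real websocket subscription client, and the class is reduced to just the setter logic.

```python
class SubscriptionManager:
    """Stand-in for the real websocket subscription client."""

    def __init__(self, url: str) -> None:
        self.url = url


class TibberRT:
    """Minimal sketch of the guarded sub_endpoint setter."""

    def __init__(self, url: str) -> None:
        self._sub_endpoint = url
        self.sub_manager = SubscriptionManager(url)

    @property
    def sub_endpoint(self) -> str:
        return self._sub_endpoint

    @sub_endpoint.setter
    def sub_endpoint(self, url: str) -> None:
        # Skip replacement when the URL is unchanged, so an already
        # connected sub_manager is not orphaned mid-reconnect.
        if url == self._sub_endpoint:
            return
        self._sub_endpoint = url
        self.sub_manager = SubscriptionManager(url)


rt = TibberRT("wss://api.example/v1")
manager_before = rt.sub_manager
rt.sub_endpoint = "wss://api.example/v1"   # same URL: manager kept
assert rt.sub_manager is manager_before
rt.sub_endpoint = "wss://api.example/v2"   # new URL: manager replaced
assert rt.sub_manager is not manager_before
```

The design choice is that the setter is the single place where the transport is rebuilt, so making it idempotent for an unchanged URL prevents accidental teardown from unrelated `update_info()` calls.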
Add tests covering the three bug fixes:
- `sub_manager` is reset to `None` after the watchdog closes the connection
- the `on_reconnect` callback is called before `_create_sub_manager()`
- the `sub_endpoint` setter skips `sub_manager` replacement when the URL is unchanged
I'm aware of #401, which takes a different architectural approach. This PR focuses on minimal, targeted fixes to the existing watchdog, which might be usable in the short term until testing and review of the architecture change is finished.
…one after failed reconnect
Added a follow-up fix to the watchdog loop: when a reconnect fails, `sub_manager` is now reset to `None` so the next attempt builds a fresh transport. The health check block is now wrapped in error handling so a single failure cannot kill the watchdog task. This was observed in practice after the related HA Core fix. Will continue the stability monitoring.
This commit addresses two intertwined issues that caused the Tibber integration to fail and permanently drop the realtime connection during network or authentication errors:

1. Fixed HTTP retries: Added `RetryableHttpExceptionError` to the `except` block in `execute()`. Previously, retriable HTTP errors (like 504 Gateway Timeouts) bypassed the internal `retry=3` mechanism and immediately crashed the request.

2. Safe token update & realtime recovery: Refactored `set_access_token()` to use a state rollback and a `try...except...else` block. If `update_info()` fails completely, `_access_token` is reverted to its previous state, and the method raises the error without starting the watchdog. This prevents "zombie" watchdogs from looping on bad tokens and ensures downstream clients (like Home Assistant) can safely retry the update later. The realtime connection is now only restarted in the `else` block if the token validation was successful.
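The rollback in point 2 might look roughly like this. A minimal sketch with simplified names: `TokenUpdateError`, the stubbed `update_info()`, and the `watchdog_running` flag are illustrative stand-ins, not the real pyTibber API.

```python
class TokenUpdateError(RuntimeError):
    """Stand-in for the real update failure (e.g. auth error, 504)."""


class Client:
    def __init__(self, token: str) -> None:
        self._access_token = token
        self.watchdog_running = False

    def update_info(self) -> None:
        # Stub: reject an obviously bad token to simulate a 401/504.
        if self._access_token == "bad":
            raise TokenUpdateError("token rejected")

    def set_access_token(self, token: str) -> None:
        previous = self._access_token
        self._access_token = token
        try:
            self.update_info()
        except TokenUpdateError:
            # Roll back so a later retry starts from a known-good state,
            # and re-raise without starting the watchdog.
            self._access_token = previous
            raise
        else:
            # Only restart realtime once the new token is validated.
            self.watchdog_running = True


client = Client("good")
try:
    client.set_access_token("bad")
except TokenUpdateError:
    pass
assert client._access_token == "good"      # rolled back, no partial state
assert client.watchdog_running is False    # no zombie watchdog
client.set_access_token("good2")
assert client.watchdog_running is True
```

The `else` branch matters here: putting the restart inside `try` would also run it when validation raised partway through, while `else` runs only on full success.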
📝 Follow-up: I observed a critical edge case during long-term testing, specifically when the Tibber API returns transient HTTP errors (like 504 Gateway Timeouts) during a token refresh.

The Problem: Even with the improved watchdog architecture, a failure during the token refresh could permanently drop the realtime connection.

The Solution in this Commit
This commit completes the "self-healing" capability of the connection management by ensuring that network instability during configuration changes does not lead to a permanent loss of the realtime stream.

This whole PR might be obsolete now with the new architecture of #401. I will continue testing until tomorrow and, if no errors come up, advance the PR to review. But if the review of the new architecture is already close to finished, it might be best not to merge this PR after all. Either way, I will have a look at the new architecture once it's merged and check if some improvements might be smart to carry over.
Encountered no errors. 25 token renewals overnight, and not a single one failed. Testing concluded.
Update (2026-04-17): Edge case observed in production: after a server-side connection reset, the watchdog reused the stale `sub_manager`. Fix: reset `sub_manager` to `None` so the next reconnect builds a fresh transport. We only reset on a confirmed connection close.
## Problem

The Tibber realtime watchdog failed to reconnect after connection loss, causing sensors to stop updating until a manual integration reload. The root cause was two bugs in `realtime.py`:

### Bug 1 — Stale transport reused after disconnect

After `close_async()`, `sub_manager` was never reset to `None`. Since `_create_sub_manager()` guards with `if self.sub_manager is not None: return`, the watchdog reconnected using the old transport with the expired `websocketSubscriptionUrl` token, causing persistent `4403 Invalid token` errors in a retry loop.
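The early-return guard behind Bug 1 can be reproduced with a minimal sketch. Class and attribute names are simplified stand-ins for the real code; the dict plays the role of the transport.

```python
class StaleTransportDemo:
    """Shows why a never-cleared sub_manager reuses stale credentials."""

    def __init__(self) -> None:
        self.sub_manager = None
        self.token = "token-v1"

    def _create_sub_manager(self) -> None:
        # The buggy guard: once sub_manager exists it is never rebuilt,
        # even after close_async() invalidated its token.
        if self.sub_manager is not None:
            return
        self.sub_manager = {"token": self.token}

    def close_async(self) -> None:
        # Bug: the connection is torn down, but sub_manager is NOT
        # reset to None here.
        pass


rt = StaleTransportDemo()
rt._create_sub_manager()
rt.close_async()
rt.token = "token-v2"        # server issued fresh credentials
rt._create_sub_manager()     # guard returns early...
assert rt.sub_manager["token"] == "token-v1"   # ...stale token reused
```

Every reconnect then replays the expired token, which matches the persistent `4403 Invalid token` retry loop described above.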
### Bug 2 — Orphaned WebSocket on reconnect

The `sub_endpoint` setter unconditionally replaced `sub_manager` with a new unconnected client whenever called. During reconnect, `_resubscribe_homes()` → `rt_resubscribe()` → `update_info()` triggered the setter while `connect_async()` had already opened a WebSocket, orphaning that connection.

Additionally, the watchdog had no way to fetch a fresh `websocketSubscriptionUrl` before reconnecting, since `TibberRT` had no reference back to `Tibber.update_info()`.

## Fix
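Combined, the two central realtime.py changes (reset in `finally`, refresh via `on_reconnect`) can be sketched in an async illustration. Names are simplified assumptions, and the real watchdog loop's backoff and health checks are omitted.

```python
import asyncio


class Watchdog:
    def __init__(self, on_reconnect) -> None:
        self.sub_manager = None
        self._on_reconnect = on_reconnect  # e.g. Tibber.update_info

    async def close_async(self) -> None:
        try:
            pass  # real code would tear down the websocket here
        finally:
            # Fix 1: always drop the stale transport so the next
            # _create_sub_manager() builds a fresh one.
            self.sub_manager = None

    async def reconnect(self) -> None:
        # Fix 2: refresh websocketSubscriptionUrl/token BEFORE rebuilding.
        await self._on_reconnect()
        if self.sub_manager is None:
            self.sub_manager = {"fresh": True}


async def main() -> list[str]:
    events: list[str] = []

    async def fake_update_info() -> None:
        events.append("update_info")

    wd = Watchdog(fake_update_info)
    wd.sub_manager = {"fresh": False}
    await wd.close_async()
    assert wd.sub_manager is None           # stale transport dropped
    await wd.reconnect()
    assert wd.sub_manager == {"fresh": True}
    return events


events = asyncio.run(main())
assert events == ["update_info"]
```

The `finally` placement is the key detail: even if teardown raises, the stale manager is still discarded, so the next cycle cannot take the early-return path with expired credentials.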
`realtime.py`:
- reset `session` and `sub_manager` to `None` in a `finally` block after `close_async()` so `_create_sub_manager()` always builds a fresh transport with current credentials
- add an `on_reconnect` callback parameter to `TibberRT.__init__()` — called before each reconnect attempt to fetch a fresh `websocketSubscriptionUrl` via `Tibber.update_info()`
- guard the `sub_endpoint` setter to skip `sub_manager` replacement when the URL is unchanged, preventing orphaned WebSocket connections
- replace `assert` statements with explicit `RuntimeError` and proper logging so failures are visible in the log instead of silently killing the watchdog task
- add a `None` guard for `sub_manager` in the watchdog loop to handle reconnect backoff correctly

`__init__.py`:
- pass `update_info` as the `on_reconnect` callback to `TibberRT`
- fix a misplaced docstring in `set_access_token()` (it was after an early return and never executed)

`home.py`:
- remove the redundant `self._tibber_control.update_info()` call from `rt_resubscribe()` — reconnect orchestration is now handled entirely by the watchdog via the `on_reconnect` callback

## Testing
New tests in `test_realtime.py`:
- `test_watchdog_resets_sub_manager_after_close` — verifies `sub_manager` is `None` after the watchdog closes the connection, so `_create_sub_manager()` builds a fresh transport instead of reusing the stale one (Bug 1)
- `test_on_reconnect_callback_called_before_reconnect` — verifies `on_reconnect` is called before `_create_sub_manager()` so fresh credentials are used (Bug 1+2)
- `test_sub_endpoint_setter_skips_replacement_on_same_url` — verifies that setting the same URL does not replace a running `sub_manager`, preventing orphaned WebSocket connections (Bug 2)
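The ordering check in the second test could be sketched like this. A simplified stand-in, not the actual code from test_realtime.py: `FakeRT` and its methods are illustrative, and a shared call log records the order of steps.

```python
import asyncio


def test_on_reconnect_callback_called_before_reconnect() -> list[str]:
    calls: list[str] = []

    async def on_reconnect() -> None:
        calls.append("on_reconnect")

    class FakeRT:
        """Minimal stand-in recording the order of reconnect steps."""

        def __init__(self, cb) -> None:
            self._on_reconnect = cb
            self.sub_manager = None

        def _create_sub_manager(self) -> None:
            calls.append("_create_sub_manager")
            self.sub_manager = object()

        async def reconnect(self) -> None:
            await self._on_reconnect()
            self._create_sub_manager()

    rt = FakeRT(on_reconnect)
    asyncio.run(rt.reconnect())
    # Fresh credentials must be fetched before the transport is built.
    assert calls == ["on_reconnect", "_create_sub_manager"]
    return calls


result = test_on_reconnect_callback_called_before_reconnect()
```

Recording call order through a shared list keeps the test independent of any real network or token state.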