fix(realtime): fix watchdog reconnect using stale websocket transport#402

Open
Odatas wants to merge 5 commits into Danielhiversen:master from Odatas:fix/watchdog_reconnect

Conversation

@Odatas Odatas commented Apr 6, 2026

Problem

The Tibber realtime watchdog failed to reconnect after connection loss,
causing sensors to stop updating until a manual integration reload.

The root cause was two bugs in realtime.py:

Bug 1 — Stale transport reused after disconnect
After close_async(), sub_manager was never reset to None. Since
_create_sub_manager() guards with if self.sub_manager is not None: return,
the watchdog reconnected using the old transport with the expired
websocketSubscriptionUrl token, causing persistent 4403 Invalid token
errors in a retry loop.
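The Bug 1 pattern can be reduced to a minimal sketch. The class and method names below mirror the PR description (`sub_manager`, `_create_sub_manager`, `close_async`), but the bodies are illustrative stand-ins, not the real pyTibber code:

```python
# Minimal sketch of Bug 1: the guard in _create_sub_manager() plus the
# missing reset in close means the stale transport survives a "close".
class RT:
    def __init__(self):
        self.sub_manager = None

    def _create_sub_manager(self):
        # Guard: returns early if a manager already exists.
        if self.sub_manager is not None:
            return
        self.sub_manager = object()  # stand-in for a fresh transport

    def close(self):
        # Bug: sub_manager is never reset to None here, so the next
        # _create_sub_manager() call reuses the stale transport.
        pass

rt = RT()
rt._create_sub_manager()
stale = rt.sub_manager
rt.close()
rt._create_sub_manager()
assert rt.sub_manager is stale  # stale transport reused after "close"
```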

Bug 2 — Orphaned WebSocket on reconnect
The sub_endpoint setter unconditionally replaced sub_manager with a new
unconnected client whenever it was called. During reconnect, the chain
_resubscribe_homes() → rt_resubscribe() → update_info() triggered the
setter while connect_async() had already opened a WebSocket, orphaning
that connection.

Additionally, the watchdog had no way to fetch a fresh websocketSubscriptionUrl
before reconnecting, since TibberRT had no reference back to Tibber.update_info().

Fix

realtime.py

  • Reset session and sub_manager to None in a finally block after
    close_async() so _create_sub_manager() always builds a fresh transport
    with current credentials
  • Add on_reconnect callback parameter to TibberRT.__init__() — called
    before each reconnect attempt to fetch a fresh websocketSubscriptionUrl
    via Tibber.update_info()
  • Guard sub_endpoint setter to skip sub_manager replacement when URL
    is unchanged, preventing orphaned WebSocket connections
  • Replace assert statements with explicit RuntimeError and proper logging
    so failures are visible in the log instead of silently killing the watchdog task
  • Add None guard for sub_manager in the watchdog loop to handle reconnect
    backoff correctly
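The shape of the first two fixes can be sketched as follows. This is a simplified, hypothetical skeleton: `TibberRTSketch`, `fresh_url`, and the tuple stand-in for the client are inventions for illustration; only the finally-block reset and the call order (`on_reconnect` before `_create_sub_manager`) reflect the PR description:

```python
import asyncio

class TibberRTSketch:
    def __init__(self, on_reconnect=None):
        self._on_reconnect = on_reconnect  # e.g. Tibber.update_info
        self.sub_manager = None
        self.sub_endpoint = None

    async def close_async(self):
        try:
            pass  # the real code closes the websocket transport here
        finally:
            # Fix for Bug 1: always drop the stale transport so the
            # next _create_sub_manager() builds a fresh one.
            self.sub_manager = None

    async def reconnect(self):
        if self._on_reconnect is not None:
            # Fetch a fresh websocketSubscriptionUrl before reconnecting.
            await self._on_reconnect()
        self._create_sub_manager()

    def _create_sub_manager(self):
        if self.sub_manager is not None:
            return
        self.sub_manager = ("client", self.sub_endpoint)

async def fresh_url():
    # Stand-in for update_info() refreshing the subscription URL.
    rt.sub_endpoint = "wss://example/fresh"

rt = TibberRTSketch(on_reconnect=fresh_url)
asyncio.run(rt.reconnect())
assert rt.sub_manager == ("client", "wss://example/fresh")
asyncio.run(rt.close_async())
assert rt.sub_manager is None  # stale transport dropped after close
```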

__init__.py

  • Pass update_info as on_reconnect callback to TibberRT
  • Fix misplaced docstring in set_access_token() (was after early return,
    never executed)
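The docstring bug is easy to demonstrate in isolation. The function bodies below are illustrative, not the real set_access_token(); only the before/after docstring placement is the point:

```python
# Before: the string sat after the early return, so it was an unreachable
# expression statement, not the function's docstring.
def set_access_token_before(self, access_token):
    if access_token == self._access_token:
        return
    """Set access token."""  # dead statement, never a docstring

# After: the docstring leads the function body, as Python requires.
def set_access_token_after(self, access_token):
    """Set access token."""
    if access_token == self._access_token:
        return

assert set_access_token_before.__doc__ is None
assert set_access_token_after.__doc__ == "Set access token."
```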

home.py

  • Remove redundant self._tibber_control.update_info() call from
    rt_resubscribe() — reconnect orchestration is now handled entirely
    by the watchdog via the on_reconnect callback

Testing

  • All 23 unit tests passing (3 new tests added)
  • Initial Home Assistant testing — watchdog stable, data received continuously
  • Long-term stability monitoring (1 week) — in progress

New tests in test_realtime.py:

  • test_watchdog_resets_sub_manager_after_close — verifies sub_manager is
    None after watchdog closes connection so _create_sub_manager() builds
    a fresh transport instead of reusing the stale one (Bug 1)
  • test_on_reconnect_callback_called_before_reconnect — verifies on_reconnect
    is called before _create_sub_manager() so fresh credentials are used (Bug 1+2)
  • test_sub_endpoint_setter_skips_replacement_on_same_url — verifies that
    setting the same URL does not replace a running sub_manager, preventing
    orphaned WebSocket connections (Bug 2)
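The setter-guard behavior the third test checks can be sketched like this (hypothetical shapes; the actual tests in test_realtime.py will differ in detail):

```python
class RTSetterSketch:
    """Illustrative stand-in for the guarded sub_endpoint setter."""

    def __init__(self):
        self._sub_endpoint = None
        self.sub_manager = None

    @property
    def sub_endpoint(self):
        return self._sub_endpoint

    @sub_endpoint.setter
    def sub_endpoint(self, url):
        if url == self._sub_endpoint:
            # Fix for Bug 2: same URL -> keep the running sub_manager.
            return
        self._sub_endpoint = url
        self.sub_manager = object()  # fresh, unconnected client

rt = RTSetterSketch()
rt.sub_endpoint = "wss://example/sub"
running = rt.sub_manager
rt.sub_endpoint = "wss://example/sub"    # same URL: no replacement
assert rt.sub_manager is running
rt.sub_endpoint = "wss://example/other"  # new URL: replaced
assert rt.sub_manager is not running
```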

Odatas added 2 commits April 6, 2026 11:48
The watchdog failed to reconnect after connection loss because
sub_manager was not reset to None after close_async(). This caused
_create_sub_manager() to return early, reusing the old transport
with an expired token and websocketSubscriptionUrl.

Changes:
- realtime.py: reset session and sub_manager to None in finally block
  after close_async() so _create_sub_manager() builds a fresh transport
- realtime.py: add on_reconnect callback to TibberRT, called before
  each reconnect to fetch a fresh websocketSubscriptionUrl via
  update_info()
- realtime.py: guard sub_endpoint setter to skip sub_manager replacement
  when URL is unchanged, preventing orphaned websocket connections
- realtime.py: replace assert statements with explicit RuntimeError
- home.py: remove redundant update_info() call from rt_resubscribe(),
  reconnect orchestration now handled entirely by the watchdog
- __init__.py: pass update_info as on_reconnect callback to TibberRT
- __init__.py: fix misplaced docstring in set_access_token()

Fixes: home-assistant/core#162395
Add tests covering the three bug fixes:
- sub_manager is reset to None after watchdog closes connection
- on_reconnect callback is called before _create_sub_manager()
- sub_endpoint setter skips sub_manager replacement when URL is unchanged

Odatas commented Apr 6, 2026

I'm aware of #401, which takes a different architectural approach. This PR focuses on minimal, targeted fixes to the existing watchdog, which might be usable in the short term until testing and review of the architecture change are finished.


Odatas commented Apr 6, 2026

Added a follow-up fix to the watchdog loop: when _on_reconnect() fails
due to a transient error (e.g. the Tibber API returning 503/504),
sub_manager is set to None, but the loop guard if self.sub_manager is None:
continue prevented the reconnect block from being reached, causing the
watchdog to spin silently without ever retrying.

The health check block is now wrapped in if self.sub_manager: so the loop
falls through to the reconnect logic when sub_manager is None.
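The restructured control flow can be sketched as a single loop iteration. `watchdog_step`, `Dummy`, and the `events` list are inventions for illustration; only the falls-through-when-None behavior reflects the fix:

```python
def watchdog_step(rt, events):
    # Fixed shape: health check wrapped in `if rt.sub_manager:` so a
    # None sub_manager falls through to the reconnect branch instead
    # of being skipped by an early `continue`.
    if rt.sub_manager:
        events.append("health-check")
        return
    # sub_manager is None (e.g. _on_reconnect() failed with 503/504):
    # run the reconnect logic instead of spinning.
    events.append("reconnect")
    rt.sub_manager = object()

class Dummy:
    sub_manager = None

rt, events = Dummy(), []
watchdog_step(rt, events)  # no manager -> reconnect
watchdog_step(rt, events)  # manager present -> health check
assert events == ["reconnect", "health-check"]
```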

This was observed in practice after the related HA Core fix
home-assistant/core#167283 (awaiting set_access_token) was deployed. The token refresh correctly restarted the watchdog after a transient 503, but the watchdog itself would not retry on its own during the outage window. However after the Token Timeout HA Core successfully reconected and healed the watchdog.

Will continue the stability monitoring.

This commit addresses two intertwined issues that caused the Tibber integration to fail and permanently drop the realtime connection during network or authentication errors:

1. Fixed HTTP retries: Added `RetryableHttpExceptionError` to the `except` block in `execute()`. Previously, retriable HTTP errors (like 504 Gateway Timeouts) bypassed the internal `retry=3` mechanism and immediately crashed the request.
2. Safe token update & realtime recovery: Refactored `set_access_token()` to use a state rollback and a `try...except...else` block. If `update_info()` fails completely, the `_access_token` is reverted to its previous state, and the method raises the error without starting the watchdog. This prevents "zombie" watchdogs from looping on bad tokens and ensures downstream clients (like Home Assistant) can safely retry the update later. The realtime connection is now only restarted in the `else` block if the token validation was successful.

Odatas commented Apr 11, 2026

📝 Follow-up:

I observed a critical edge case during long-term testing, specifically when the Tibber API returns transient HTTP errors (like 504 Gateway Timeouts) during a token refresh.

The Problem

Even with the improved watchdog architecture, a failure during set_access_token (e.g., triggered by Home Assistant) led to a broken state:

  • Bypassed Retries: pytibber currently ignores RetryableHttpExceptionError (like 504s) in its internal retry logic, causing immediate failure even for transient issues.
  • State Corruption: The _access_token was overwritten before validation. If update_info() failed, subsequent retry attempts by the caller were blocked by the if access_token == self._access_token: return guard, leaving the integration dead until a manual restart.
  • Zombie Watchdogs: Forcing a reconnect on a failed token update risked starting a watchdog that loops indefinitely with invalid or unverified credentials.

The Solution in this Commit

  • HTTP Retries: Added RetryableHttpExceptionError to the execute() retry logic. Transient errors are now retried 3 times internally before escalating to the caller.
  • Atomic Token Update (Rollback): Implemented a state rollback in set_access_token. We now preserve the old_token and only "commit" the new one if update_info() succeeds. If it fails after all retries, we revert to the previous token and raise the exception.
  • Clean Orchestration: Used a try...except...else block to ensure realtime.reconnect() is only triggered if the token was successfully validated. This prevents the watchdog from starting with bad data.

This commit completes the "self-healing" capability of the connection management by ensuring that network instability during configuration changes does not lead to a permanent loss of the realtime stream.

This whole PR might be obsolete now with the new architecture of #401. I will continue testing until tomorrow and, if no errors come up, advance the PR to review. But if the review of the new architecture is already close to finished, it might be best not to merge this PR after all.

Either way, I will take a look at the new architecture once it is merged and check whether some of these improvements would be worth carrying over.


Odatas commented Apr 12, 2026

Encountered no errors: 25 token renewals overnight, not a single one failed. Testing concluded.

@Odatas Odatas marked this pull request as ready for review April 12, 2026 14:13

Odatas commented Apr 17, 2026

Update (2026-04-17)

Edge case observed in production: after a server-side connection reset
(ConnectionResetError), connect_async() fails on the first attempt.
Subsequent retries then receive 4403 Invalid token, likely because
Tibber invalidated the session server-side (e.g. during a rolling
deployment or server restart) in the window between us fetching the
URL and attempting to connect, a race condition.

Fix: reset sub_manager = None when 4403 is detected in the connect
block, forcing a fresh URL fetch via _on_reconnect() on the next attempt.

We only reset on 4403 specifically — transient network failures don't
invalidate the URL so a full reset there would cause unnecessary API calls.
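The 4403-only reset described above might look roughly like this. The error-string check and `handle_connect_failure` helper are assumptions for illustration; the real code inspects the websocket close reason in the connect block:

```python
def handle_connect_failure(rt, error_message):
    if "4403" in error_message:
        # Token invalidated server-side: drop the transport so the
        # next attempt fetches a fresh websocketSubscriptionUrl via
        # _on_reconnect().
        rt.sub_manager = None
    # Transient network errors keep sub_manager: the URL is still
    # valid, and a full reset would cost an unnecessary API call.

class Dummy:
    sub_manager = object()

rt = Dummy()
handle_connect_failure(rt, "connection reset by peer")
assert rt.sub_manager is not None  # transient error: no reset
handle_connect_failure(rt, "4403: Invalid token")
assert rt.sub_manager is None      # 4403: force fresh URL fetch
```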
