fix(realtime): fix watchdog reconnect using stale websocket transport#402

Open
Odatas wants to merge 5 commits into Danielhiversen:master from Odatas:fix/watchdog_reconnect

Conversation

@Odatas Odatas commented Apr 6, 2026

Problem

The Tibber realtime watchdog failed to reconnect after connection loss,
causing sensors to stop updating until a manual integration reload.

The root cause was two bugs in realtime.py:

Bug 1 — Stale transport reused after disconnect
After close_async(), sub_manager was never reset to None. Since
_create_sub_manager() guards with if self.sub_manager is not None: return,
the watchdog reconnected using the old transport with the expired
websocketSubscriptionUrl token, causing persistent 4403 Invalid token
errors in a retry loop.
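The Bug 1 pattern can be reduced to a minimal sketch. The class and method names below mirror the PR description (`sub_manager`, `_create_sub_manager`, `close_async`), but the bodies are illustrative stand-ins, not the real pyTibber code:

```python
# Minimal sketch of Bug 1: the guard in _create_sub_manager() plus the
# missing reset in close means the stale transport survives a "close".
class RT:
    def __init__(self):
        self.sub_manager = None

    def _create_sub_manager(self):
        # Guard: returns early if a manager already exists.
        if self.sub_manager is not None:
            return
        self.sub_manager = object()  # stand-in for a fresh transport

    def close(self):
        # Bug: sub_manager is never reset to None here, so the next
        # _create_sub_manager() call reuses the stale transport.
        pass

rt = RT()
rt._create_sub_manager()
stale = rt.sub_manager
rt.close()
rt._create_sub_manager()
assert rt.sub_manager is stale  # stale transport reused after "close"
```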

Bug 2 — Orphaned WebSocket on reconnect
The sub_endpoint setter unconditionally replaced sub_manager with a new
unconnected client whenever it was called. During reconnect, the chain
_resubscribe_homes() → rt_resubscribe() → update_info() triggered the
setter while connect_async() had already opened a WebSocket, orphaning
that connection.

Additionally, the watchdog had no way to fetch a fresh websocketSubscriptionUrl
before reconnecting, since TibberRT had no reference back to Tibber.update_info().

Fix

realtime.py

  • Reset session and sub_manager to None in a finally block after
    close_async() so _create_sub_manager() always builds a fresh transport
    with current credentials
  • Add on_reconnect callback parameter to TibberRT.__init__() — called
    before each reconnect attempt to fetch a fresh websocketSubscriptionUrl
    via Tibber.update_info()
  • Guard sub_endpoint setter to skip sub_manager replacement when URL
    is unchanged, preventing orphaned WebSocket connections
  • Replace assert statements with explicit RuntimeError and proper logging
    so failures are visible in the log instead of silently killing the watchdog task
  • Add None guard for sub_manager in the watchdog loop to handle reconnect
    backoff correctly
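The shape of the first two fixes can be sketched as follows. This is a simplified, hypothetical skeleton: `TibberRTSketch`, `fresh_url`, and the tuple stand-in for the client are inventions for illustration; only the finally-block reset and the call order (`on_reconnect` before `_create_sub_manager`) reflect the PR description:

```python
import asyncio

class TibberRTSketch:
    def __init__(self, on_reconnect=None):
        self._on_reconnect = on_reconnect  # e.g. Tibber.update_info
        self.sub_manager = None
        self.sub_endpoint = None

    async def close_async(self):
        try:
            pass  # the real code closes the websocket transport here
        finally:
            # Fix for Bug 1: always drop the stale transport so the
            # next _create_sub_manager() builds a fresh one.
            self.sub_manager = None

    async def reconnect(self):
        if self._on_reconnect is not None:
            # Fetch a fresh websocketSubscriptionUrl before reconnecting.
            await self._on_reconnect()
        self._create_sub_manager()

    def _create_sub_manager(self):
        if self.sub_manager is not None:
            return
        self.sub_manager = ("client", self.sub_endpoint)

async def fresh_url():
    # Stand-in for update_info() refreshing the subscription URL.
    rt.sub_endpoint = "wss://example/fresh"

rt = TibberRTSketch(on_reconnect=fresh_url)
asyncio.run(rt.reconnect())
assert rt.sub_manager == ("client", "wss://example/fresh")
asyncio.run(rt.close_async())
assert rt.sub_manager is None  # stale transport dropped after close
```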

__init__.py

  • Pass update_info as on_reconnect callback to TibberRT
  • Fix misplaced docstring in set_access_token() (was after early return,
    never executed)
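The docstring bug is easy to demonstrate in isolation. The function bodies below are illustrative, not the real set_access_token(); only the before/after docstring placement is the point:

```python
# Before: the string sat after the early return, so it was an unreachable
# expression statement, not the function's docstring.
def set_access_token_before(self, access_token):
    if access_token == self._access_token:
        return
    """Set access token."""  # dead statement, never a docstring

# After: the docstring leads the function body, as Python requires.
def set_access_token_after(self, access_token):
    """Set access token."""
    if access_token == self._access_token:
        return

assert set_access_token_before.__doc__ is None
assert set_access_token_after.__doc__ == "Set access token."
```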

home.py

  • Remove redundant self._tibber_control.update_info() call from
    rt_resubscribe() — reconnect orchestration is now handled entirely
    by the watchdog via the on_reconnect callback

Testing

  • All 23 unit tests passing (3 new tests added)
  • Initial Home Assistant testing — watchdog stable, data received continuously
  • Long-term stability monitoring (1 week) — in progress

New tests in test_realtime.py:

  • test_watchdog_resets_sub_manager_after_close — verifies sub_manager is
    None after watchdog closes connection so _create_sub_manager() builds
    a fresh transport instead of reusing the stale one (Bug 1)
  • test_on_reconnect_callback_called_before_reconnect — verifies on_reconnect
    is called before _create_sub_manager() so fresh credentials are used (Bug 1+2)
  • test_sub_endpoint_setter_skips_replacement_on_same_url — verifies that
    setting the same URL does not replace a running sub_manager, preventing
    orphaned WebSocket connections (Bug 2)
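The setter-guard behavior the third test checks can be sketched like this (hypothetical shapes; the actual tests in test_realtime.py will differ in detail):

```python
class RTSetterSketch:
    """Illustrative stand-in for the guarded sub_endpoint setter."""

    def __init__(self):
        self._sub_endpoint = None
        self.sub_manager = None

    @property
    def sub_endpoint(self):
        return self._sub_endpoint

    @sub_endpoint.setter
    def sub_endpoint(self, url):
        if url == self._sub_endpoint:
            # Fix for Bug 2: same URL -> keep the running sub_manager.
            return
        self._sub_endpoint = url
        self.sub_manager = object()  # fresh, unconnected client

rt = RTSetterSketch()
rt.sub_endpoint = "wss://example/sub"
running = rt.sub_manager
rt.sub_endpoint = "wss://example/sub"    # same URL: no replacement
assert rt.sub_manager is running
rt.sub_endpoint = "wss://example/other"  # new URL: replaced
assert rt.sub_manager is not running
```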

Odatas added 2 commits April 6, 2026 11:48
The watchdog failed to reconnect after connection loss because
sub_manager was not reset to None after close_async(). This caused
_create_sub_manager() to return early, reusing the old transport
with an expired token and websocketSubscriptionUrl.

Changes:
- realtime.py: reset session and sub_manager to None in finally block
  after close_async() so _create_sub_manager() builds a fresh transport
- realtime.py: add on_reconnect callback to TibberRT, called before
  each reconnect to fetch a fresh websocketSubscriptionUrl via
  update_info()
- realtime.py: guard sub_endpoint setter to skip sub_manager replacement
  when URL is unchanged, preventing orphaned websocket connections
- realtime.py: replace assert statements with explicit RuntimeError
- home.py: remove redundant update_info() call from rt_resubscribe(),
  reconnect orchestration now handled entirely by the watchdog
- __init__.py: pass update_info as on_reconnect callback to TibberRT
- __init__.py: fix misplaced docstring in set_access_token()

Fixes: home-assistant/core#162395
Add tests covering the three bug fixes:
- sub_manager is reset to None after watchdog closes connection
- on_reconnect callback is called before _create_sub_manager()
- sub_endpoint setter skips sub_manager replacement when URL is unchanged

Odatas commented Apr 6, 2026

I'm aware of #401, which takes a different architectural approach. This PR focuses on minimal, targeted fixes to the existing watchdog, which might be usable in the short term until testing and review of the architecture change are finished.


Odatas commented Apr 6, 2026

Added a follow-up fix to the watchdog loop: when _on_reconnect() fails
due to a transient error (e.g. the Tibber API returning 503/504),
sub_manager is set to None, but the loop guard if self.sub_manager is None:
continue prevented the reconnect block from being reached, causing the
watchdog to spin silently without ever retrying.

The health check block is now wrapped in if self.sub_manager: so the loop
falls through to the reconnect logic when sub_manager is None.
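The restructured control flow can be sketched as a single loop iteration. `watchdog_step`, `Dummy`, and the `events` list are inventions for illustration; only the falls-through-when-None behavior reflects the fix:

```python
def watchdog_step(rt, events):
    # Fixed shape: health check wrapped in `if rt.sub_manager:` so a
    # None sub_manager falls through to the reconnect branch instead
    # of being skipped by an early `continue`.
    if rt.sub_manager:
        events.append("health-check")
        return
    # sub_manager is None (e.g. _on_reconnect() failed with 503/504):
    # run the reconnect logic instead of spinning.
    events.append("reconnect")
    rt.sub_manager = object()

class Dummy:
    sub_manager = None

rt, events = Dummy(), []
watchdog_step(rt, events)  # no manager -> reconnect
watchdog_step(rt, events)  # manager present -> health check
assert events == ["reconnect", "health-check"]
```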

This was observed in practice after the related HA Core fix
home-assistant/core#167283 (awaiting set_access_token) was deployed. The token refresh correctly restarted the watchdog after a transient 503, but the watchdog itself would not retry on its own during the outage window. However after the Token Timeout HA Core successfully reconected and healed the watchdog.

Will continue the stability monitoring.

This commit addresses two intertwined issues that caused the Tibber integration to fail and permanently drop the realtime connection during network or authentication errors:

1. Fixed HTTP retries: Added `RetryableHttpExceptionError` to the `except` block in `execute()`. Previously, retriable HTTP errors (like 504 Gateway Timeouts) bypassed the internal `retry=3` mechanism and immediately crashed the request.
2. Safe token update & realtime recovery: Refactored `set_access_token()` to use a state rollback and a `try...except...else` block. If `update_info()` fails completely, the `_access_token` is reverted to its previous state, and the method raises the error without starting the watchdog. This prevents "zombie" watchdogs from looping on bad tokens and ensures downstream clients (like Home Assistant) can safely retry the update later. The realtime connection is now only restarted in the `else` block if the token validation was successful.

Odatas commented Apr 11, 2026

📝 Follow-up:

I observed a critical edge case during long-term testing, specifically when the Tibber API returns transient HTTP errors (like 504 Gateway Timeouts) during a token refresh.

The Problem

Even with the improved watchdog architecture, a failure during set_access_token (e.g., triggered by Home Assistant) led to a broken state:

  • Bypassed Retries: pytibber currently ignores RetryableHttpExceptionError (like 504s) in its internal retry logic, causing immediate failure even for transient issues.
  • State Corruption: The _access_token was overwritten before validation. If update_info() failed, subsequent retry attempts by the caller were blocked by the if access_token == self._access_token: return guard, leaving the integration dead until a manual restart.
  • Zombie Watchdogs: Forcing a reconnect on a failed token update risked starting a watchdog that loops indefinitely with invalid or unverified credentials.

The Solution in this Commit

  • HTTP Retries: Added RetryableHttpExceptionError to the execute() retry logic. Transient errors are now retried 3 times internally before escalating to the caller.
  • Atomic Token Update (Rollback): Implemented a state rollback in set_access_token. We now preserve the old_token and only "commit" the new one if update_info() succeeds. If it fails after all retries, we revert to the previous token and raise the exception.
  • Clean Orchestration: Used a try...except...else block to ensure realtime.reconnect() is only triggered if the token was successfully validated. This prevents the watchdog from starting with bad data.

This commit completes the "self-healing" capability of the connection management by ensuring that network instability during configuration changes does not lead to a permanent loss of the realtime stream.

This whole PR might be obsolete now with the new architecture of #401. I will continue testing until tomorrow and, if no errors come up, advance the PR to review. But if the review of the new architecture is already close to finished, it might be best not to merge this PR after all.

Either way, I will take a look at the new architecture once it is merged and check whether some of these improvements would be worth carrying over.


Odatas commented Apr 12, 2026

Encountered no errors: 25 token renewals overnight, not a single one failed. Testing concluded.

@Odatas Odatas marked this pull request as ready for review April 12, 2026 14:13

Odatas commented Apr 17, 2026

Update (2026-04-17)

Edge case observed in production: after a server-side connection reset
(ConnectionResetError), connect_async() fails on the first attempt.
Subsequent retries then receive 4403 Invalid token, likely because
Tibber invalidated the session server-side (e.g. during a rolling
deployment or server restart) in the window between us fetching the
URL and attempting to connect, a race condition.

Fix: reset sub_manager = None when 4403 is detected in the connect
block, forcing a fresh URL fetch via _on_reconnect() on the next attempt.

We only reset on 4403 specifically — transient network failures don't
invalidate the URL so a full reset there would cause unnecessary API calls.
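The 4403-only reset described above might look roughly like this. The error-string check and `handle_connect_failure` helper are assumptions for illustration; the real code inspects the websocket close reason in the connect block:

```python
def handle_connect_failure(rt, error_message):
    if "4403" in error_message:
        # Token invalidated server-side: drop the transport so the
        # next attempt fetches a fresh websocketSubscriptionUrl via
        # _on_reconnect().
        rt.sub_manager = None
    # Transient network errors keep sub_manager: the URL is still
    # valid, and a full reset would cost an unnecessary API call.

class Dummy:
    sub_manager = object()

rt = Dummy()
handle_connect_failure(rt, "connection reset by peer")
assert rt.sub_manager is not None  # transient error: no reset
handle_connect_failure(rt, "4403: Invalid token")
assert rt.sub_manager is None      # 4403: force fresh URL fetch
```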
