Skip to content

fix(eth): bound sync RPC calls with timeout and recycle wedged web3 client#1175

Open
RezaRahemtola wants to merge 1 commit into
mainfrom
fix/eth-sync-rpc-timeout-recycle
Open

fix(eth): bound sync RPC calls with timeout and recycle wedged web3 client#1175
RezaRahemtola wants to merge 1 commit into
mainfrom
fix/eth-sync-rpc-timeout-recycle

Conversation

@RezaRahemtola

Copy link
Copy Markdown
Member

Symptom

On a running CCN, ETH sync periodically stalls: the reported ETH height stops advancing and the node falls further and further behind the chain head ("ETH height keeps increasing" in monitoring = the remaining/lag metric growing). The only known recovery is docker compose down && docker compose up -d. It comes back, then recurs later.

Investigation

Reproduced/diagnosed against a live affected node:

  • pyaleph_status_chain_eth_last_committed_height was frozen, while …_reference_total (chain head) kept climbing → …_height_remaining_total grows. So sync is stalled, not runaway.
  • The worker process was at ~0% CPU — blocked on an await, not spinning in a loop.
  • A fresh web3 client pointed at the same RPC endpoint returned block_number in ~0.2–0.4s → the RPC is healthy; only the node's in-process client is stuck.

Root cause

The EthereumConnector builds a single long-lived AsyncWeb3 client once (in .new()) and reuses it for the process lifetime. When its underlying TCP connection goes stale, the awaited RPC calls in the sync loop hang indefinitely:

  • self.web3_client.eth.block_number (in _get_all_logs_in_batches)
  • self.web3_client.eth.get_logs({...}) (in _get_logs_in_block_range)

The provider's request_kwargs timeout does not reliably abort this wedge. And fetch_sync_events_task only has except Exception — a hung await raises nothing, so the retry loop never fires. The committed height never advances, the node falls behind, and only a process restart (which builds a fresh client) recovers it.

The fix

  1. Bound the RPC awaits with asyncio.wait_for(..., timeout=client_timeout) around both get_logs and block_number. A wedge now raises asyncio.TimeoutError instead of hanging forever.
  2. Recycle the client on timeout_reset_web3_client() best-effort disconnects the stale provider (itself timeout-bounded) and rebuilds web3_client + contract atomically, so the next attempt uses a brand-new connection (the same thing the manual restart was doing).
  3. The TimeoutError is re-raised and propagates to fetch_sync_events_task's existing except Exception, which logs, sleeps poll_interval, and retries automatically with the fresh client — no manual restart needed.

__init__ gains optional, defaulted params (rpc_url/chain_id/client_timeout/contract_address) so the connector can rebuild itself; .new() wires them from config. Happy-path behavior is unchanged (a successful call just gains an upper time bound).

Testing

  • New tests/chains/test_ethereum_timeout.py: a hung get_logs and a hung block_number each surface as TimeoutError (not misclassified as TooManyLogsInRange), the client is recycled, and the error is catchable by the retry loop. Every test is self-guarded so the suite can't hang.
  • TDD: tests fail on unmodified source, pass with the fix. linting:all green (ruff/black/isort/mypy).

Out of scope / follow-ups

  • DB commit() and the RabbitMQ publish in the same sync path are still un-timed and could hang in the same way someday — worth a follow-up.
  • client_timeout default stays 60s; operators wanting faster self-heal can lower it.

Copilot AI review requested due to automatic review settings June 1, 2026 11:22
foxpatch-aleph
foxpatch-aleph previously approved these changes Jun 1, 2026

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctly addresses the root cause of ETH sync stalls (stale TCP connection on long-lived AsyncWeb3) by bounding RPC awaits with asyncio.wait_for and recycling the web3 client on timeout. The implementation is clean, the test coverage is thorough, and all changes are well-documented inline. No security issues or logic errors found.

src/aleph/chains/ethereum.py (line 197): Minor: int(self.client_timeout) truncates fractional values (e.g. 0.05 → 0). In practice defaults are ~60s so this is harmless, but round() or changing make_web3_client's timeout param to float would be cleaner.

src/aleph/chains/ethereum.py (line 128): Pre-existing: __aexit__ calls disconnect() without a timeout. If the client is wedged during shutdown, this could hang cleanup. Not introduced by this PR, but a follow-up could add asyncio.wait_for here too.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses ETH sync stalls caused by wedged/hung in-process AsyncWeb3 RPC calls by bounding key RPC awaits with timeouts and resetting the web3 client on timeout, plus adding regression tests to ensure timeouts surface and the reset path runs.

Changes:

  • Add asyncio.wait_for(..., timeout=client_timeout) around eth.get_logs and eth.block_number, and recycle the web3 client on asyncio.TimeoutError.
  • Persist connector rebuild parameters (rpc_url, chain_id, client_timeout, contract_address) so _reset_web3_client() can recreate the client/contract.
  • Add tests covering hung get_logs / block_number calls, timeout propagation, and client/contract rebuild behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

File Description
src/aleph/chains/ethereum.py Adds timeout-bounded RPC calls and a reset pathway that can rebuild the wedged AsyncWeb3 client and contract.
tests/chains/test_ethereum_timeout.py Adds regression tests to ensure hung RPC calls raise TimeoutError and exercise the client reset/rebuild behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread tests/chains/test_ethereum_timeout.py Outdated
Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread tests/chains/test_ethereum_timeout.py Outdated
Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread src/aleph/chains/ethereum.py Outdated
Comment thread tests/chains/test_ethereum_timeout.py Outdated
foxpatch-aleph
foxpatch-aleph previously approved these changes Jun 1, 2026

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean fix for ETH sync stalling on stale RPC connections. Wraps blocking RPC calls (get_logs, block_number) with asyncio.wait_for and recycles the wedged web3 client on timeout. The approach is sound: TimeoutError is caught before the TooManyLogsInRange handler, so it's not misclassified; the client rebuild is atomic (swap both web3_client and contract together); and the timeout propagates cleanly to the existing retry loop. Tests cover all four variants (hung get_logs, hung block_number, Exception propagation, full rebuild path) with safety guards to prevent suite hangs.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread tests/chains/test_ethereum_timeout.py Outdated
Comment thread tests/chains/test_ethereum_timeout.py Outdated

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean fix for ETH sync stalling due to wedged web3 client connections. The change bounds RPC calls with asyncio.wait_for timeout and recycles the client on timeout, letting the existing retry loop recover automatically. Tests cover both the get_logs and block_number hang paths, verify reset is triggered, and confirm the full client rebuild path. No backward-compat issues — all new init params have safe defaults.

…lient

The single long-lived AsyncWeb3 client can wedge on a stale TCP connection,
making eth.get_logs / eth.block_number hang forever and freezing the
Ethereum sync loop. Bound those awaits with asyncio.wait_for and, on
timeout, recycle the web3 client so the next retry uses a fresh connection.

Harden the recycle path:
- Bound the provider disconnect with asyncio.wait_for so a wedged
  provider's own disconnect() cannot hang the recovery.
- Rebuild the web3 client and contract atomically (build into locals,
  then swap both together) so a failure in get_contract cannot leave a new
  web3 client paired with the stale contract.
- Annotate client_timeout as float to match its use in asyncio.wait_for.
- Add a pure-mock test exercising the rebuild path (disconnect awaited,
  both web3_client and contract replaced).
@odesenfans odesenfans force-pushed the fix/eth-sync-rpc-timeout-recycle branch from bb9cd64 to f1ccc05 Compare June 1, 2026 12:20

@foxpatch-aleph foxpatch-aleph left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-crafted fix for a real production issue. The root cause (stale TCP connection causing hung awaits) is correctly diagnosed, and the solution (asyncio.wait_for around RPC calls + client recycling on timeout) is sound. The code is clean, the comments are informative, and the tests cover the key scenarios (timeout on get_logs, timeout on block_number, client rebuild). The atomic client/contract swap prevents partial state. Minor nits: shutdown could be delayed up to client_timeout seconds if the provider is wedged, and there's no test for the disconnect-failure-during-reset path, but neither is a blocker.

src/aleph/chains/ethereum.py (line 130): Using self.client_timeout (default 60s) for the shutdown disconnect means a SIGTERM could take up to 60s if the provider is wedged. Consider a smaller hardcoded timeout (e.g. 10s) for the shutdown path to make process termination more responsive.

tests/chains/test_ethereum_timeout.py (line 173): Consider adding a test that verifies _reset_web3_client handles the case where disconnect() itself times out: the except Exception on the disconnect call (line 190 of ethereum.py) is untested.

@odesenfans

Copy link
Copy Markdown
Collaborator

Took a close look at this, mostly around the question of whether web3 already gives us timeouts. Short version: it does, and I think that changes the framing of the fix a bit.

web3's built-in timeout works, it's just multiplied by retries

make_web3_client already passes request_kwargs={"timeout": ...}, and on the installed stack (web3 7.15.0 / aiohttp 3.13.5) that becomes ClientTimeout(total=...), a total timeout covering connect + send + read body. Default is 60s via ethereum.client_timeout.

I tested it against a socket that accepts the connection but never replies:

request_kwargs timeout = 2s  ->  raised builtins.TimeoutError after 11.89s

So it does not hang forever, it raised after roughly 6x the timeout. The reason is that AsyncHTTPProvider ships a default ExceptionRetryConfiguration that retries on TimeoutError (5 retries, backoff 0.125), and both eth_getLogs and eth_blockNumber are in the retry allowlist. So a wedged call gets retried about 5 times, turning a 60s timeout into a worst case of roughly 5 minutes before it finally raises and the existing except Exception loop recovers.

That means the real benefit of the asyncio.wait_for(..., timeout=client_timeout) wrapper is collapsing that worst case from ~5 min back down to 60s, plus the client recycle. That's a genuine improvement, but it's not because the provider timeout "does not reliably abort the wedge", it's because the provider retries the timeout 5x. Could we update the description to reflect that? It also opens up a simpler lever: lowering client_timeout and/or tuning exception_retry_configuration (fewer retries) would let the built-in timeout self-heal in seconds without the outer wrapper. The double-timeout layering (outer wait_for at 60s wrapping inner retries each at 60s) is worth a short comment too, since the outer one will essentially always fire first.

The "stale long-lived TCP connection" mechanism doesn't quite fit

web3 builds its session with TCPConnector(force_close=True, ...), so there's no keep-alive: every RPC call opens a fresh TCP connection. So there isn't a long-lived connection to go stale. Recycling the client does rebuild the cached ClientSession (and its connector/DNS state), so it's not useless, but the stated root cause about a single long-lived connection wedging is probably not what was happening.

Are we sure the hang was in the RPC calls?

The genuinely un-timed awaits (the session.commit() and the RabbitMQ publish in fetch_ethereum_sync_events) are deferred to "out of scope". But since the RPC calls already had a retry-bounded timeout, if a node really sat at 0% CPU with no recovery for a long time, those un-timed awaits are at least as plausible a suspect. Worth confirming the incident was actually in get_logs / block_number before treating those as a follow-up, otherwise this hardens a path that was already bounded.

What I like

  • except asyncio.TimeoutError correctly catches both the outer wait_for and web3's own post-retry TimeoutError (in 3.11+ they're the same class), so it propagates to the retry loop as intended.
  • Atomic swap of web3_client + contract in _reset_web3_client (build locals, then assign) avoids a half-rebuilt state.
  • Disconnect is itself bounded, and __aexit__ got the same treatment.
  • Tests are focused and self-guarded with an outer wait_for(timeout=5), they pass fast, and the existing test_ethereum.py still collects with the new optional params. One caveat: the tests mock get_logs / block_number directly, so they exercise the wait_for wrapper but not the real provider + retry interaction.

Minor

  • No synchronization on the swap: broadcast_messages and fetch_sync_events_task both touch self.web3_client in the same process. Low risk (a coroutine keeps its captured reference), but worth a note.

Overall I'm in favor of merging, the change is safe and a real robustness improvement. Main asks are: fix the description to say "retry multiplication" rather than "timeout doesn't work", consider whether tuning client_timeout / retries is simpler than the wrapper, and double check the hang was really in the RPC path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants