fix(eth): bound sync RPC calls with timeout and recycle wedged web3 client #1175
RezaRahemtola wants to merge 1 commit into
foxpatch-aleph
left a comment
Correctly addresses the root cause of ETH sync stalls (stale TCP connection on long-lived AsyncWeb3) by bounding RPC awaits with asyncio.wait_for and recycling the web3 client on timeout. The implementation is clean, the test coverage is thorough, and all changes are well-documented inline. No security issues or logic errors found.
src/aleph/chains/ethereum.py (line 197): Minor: int(self.client_timeout) truncates fractional values (e.g. 0.05 → 0). In practice defaults are ~60s so this is harmless, but round() or changing make_web3_client's timeout param to float would be cleaner.
src/aleph/chains/ethereum.py (line 128): Pre-existing: __aexit__ calls disconnect() without a timeout. If the client is wedged during shutdown, this could hang cleanup. Not introduced by this PR, but a follow-up could add asyncio.wait_for here too.
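The follow-up suggested for `__aexit__` could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the project's code: `Connector`, `WedgedProvider`, and the `shutdown_timeout` parameter are hypothetical names, and only `asyncio.wait_for` is a real API.

```python
import asyncio
import logging

LOGGER = logging.getLogger(__name__)


class WedgedProvider:
    """Hypothetical provider whose disconnect() never returns."""

    async def disconnect(self) -> None:
        await asyncio.Event().wait()  # never set, so this blocks until cancelled


class Connector:
    """Illustrative stand-in, not the real EthereumConnector."""

    def __init__(self, provider, shutdown_timeout: float) -> None:
        self.provider = provider
        self.shutdown_timeout = shutdown_timeout  # hypothetical parameter
        self.closed_cleanly = False

    async def __aenter__(self) -> "Connector":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        try:
            # Bound the disconnect so a wedged provider cannot block cleanup.
            await asyncio.wait_for(
                self.provider.disconnect(), timeout=self.shutdown_timeout
            )
            self.closed_cleanly = True
        except asyncio.TimeoutError:
            LOGGER.warning(
                "disconnect() exceeded %.2fs; giving up on it", self.shutdown_timeout
            )


async def demo() -> bool:
    async with Connector(WedgedProvider(), shutdown_timeout=0.05) as connector:
        pass
    return connector.closed_cleanly
```

Here `asyncio.run(demo())` returns once the bounded disconnect gives up, rather than blocking on the wedged provider.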
Pull request overview
This PR addresses ETH sync stalls caused by wedged/hung in-process AsyncWeb3 RPC calls by bounding key RPC awaits with timeouts and resetting the web3 client on timeout, plus adding regression tests to ensure timeouts surface and the reset path runs.
Changes:
- Add `asyncio.wait_for(..., timeout=client_timeout)` around `eth.get_logs` and `eth.block_number`, and recycle the web3 client on `asyncio.TimeoutError`.
- Persist connector rebuild parameters (`rpc_url`, `chain_id`, `client_timeout`, `contract_address`) so `_reset_web3_client()` can recreate the client/contract.
- Add tests covering hung `get_logs`/`block_number` calls, timeout propagation, and client/contract rebuild behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/aleph/chains/ethereum.py` | Adds timeout-bounded RPC calls and a reset pathway that can rebuild the wedged AsyncWeb3 client and contract. |
| `tests/chains/test_ethereum_timeout.py` | Adds regression tests to ensure hung RPC calls raise TimeoutError and exercise the client reset/rebuild behavior. |
cd50bdf to fb69301
foxpatch-aleph
left a comment
Clean fix for ETH sync stalling on stale RPC connections. Wraps blocking RPC calls (get_logs, block_number) with asyncio.wait_for and recycles the wedged web3 client on timeout. The approach is sound: TimeoutError is caught before the TooManyLogsInRange handler, so it's not misclassified; the client rebuild is atomic (swap both web3_client and contract together); and the timeout propagates cleanly to the existing retry loop. Tests cover all four variants (hung get_logs, hung block_number, Exception propagation, full rebuild path) with safety guards to prevent suite hangs.
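The handler-ordering point can be illustrated with a small sketch. It assumes, purely for illustration, that the existing handler is broad enough to catch a timeout; `TooManyLogsInRange` here is a local stand-in class, and `get_logs_bounded`, `hung_fetch`, `failing_fetch`, and `classify` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


class TooManyLogsInRange(Exception):
    """Local stand-in for the project's exception of the same name."""


async def get_logs_bounded(
    fetch: Callable[[], Awaitable[List[dict]]], timeout: float
) -> List[dict]:
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout)
    except asyncio.TimeoutError:
        # Listed first, so a hang surfaces as a timeout and is not reclassified.
        raise
    except Exception as error:
        # Assumed broad handler that maps provider errors to TooManyLogsInRange.
        raise TooManyLogsInRange(str(error)) from error


async def hung_fetch() -> List[dict]:
    await asyncio.Event().wait()  # never completes
    return []


async def failing_fetch() -> List[dict]:
    raise ValueError("query returned too many results")


async def classify(fetch: Callable[[], Awaitable[List[dict]]]) -> str:
    try:
        await get_logs_bounded(fetch, timeout=0.05)
    except asyncio.TimeoutError:
        return "timeout"
    except TooManyLogsInRange:
        return "too_many"
    return "ok"
```

If the two `except` clauses were swapped, the broad handler would match the timeout first and the caller would see `TooManyLogsInRange` instead.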
fb69301 to bb9cd64
foxpatch-aleph
left a comment
Clean fix for ETH sync stalling due to wedged web3 client connections. The change bounds RPC calls with asyncio.wait_for timeout and recycles the client on timeout, letting the existing retry loop recover automatically. Tests cover both the get_logs and block_number hang paths, verify reset is triggered, and confirm the full client rebuild path. No backward-compat issues — all new init params have safe defaults.
…lient
The single long-lived AsyncWeb3 client can wedge on a stale TCP connection, making eth.get_logs / eth.block_number hang forever and freezing the Ethereum sync loop. Bound those awaits with asyncio.wait_for and, on timeout, recycle the web3 client so the next retry uses a fresh connection.
Harden the recycle path:
- Bound the provider disconnect with asyncio.wait_for so a wedged provider's own disconnect() cannot hang the recovery.
- Rebuild the web3 client and contract atomically (build into locals, then swap both together) so a failure in get_contract cannot leave a new web3 client paired with the stale contract.
- Annotate client_timeout as float to match its use in asyncio.wait_for.
- Add a pure-mock test exercising the rebuild path (disconnect awaited, both web3_client and contract replaced).
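A minimal sketch of the "build into locals, then swap both together" idea from the commit message follows. `Connector`, `FakeClient`, `make_client`, and `make_contract` are hypothetical stand-ins for the real connector and factories; nothing here calls web3.

```python
import asyncio
from typing import Any, Callable


class FakeClient:
    """Hypothetical client whose disconnect() returns immediately."""

    async def disconnect(self) -> None:
        return None


class Connector:
    """Illustrative stand-in, not the real EthereumConnector."""

    def __init__(
        self,
        make_client: Callable[[], Any],
        make_contract: Callable[[Any], Any],
        client_timeout: float = 60.0,
    ) -> None:
        self.make_client = make_client
        self.make_contract = make_contract
        self.client_timeout = client_timeout
        self.web3_client = make_client()
        self.contract = make_contract(self.web3_client)

    async def reset(self) -> None:
        try:
            # Best-effort, time-bounded disconnect of the stale client.
            await asyncio.wait_for(
                self.web3_client.disconnect(), timeout=self.client_timeout
            )
        except Exception:
            pass  # a failed or hung disconnect must not block the rebuild
        # Build both objects into locals first; either call may raise.
        new_client = self.make_client()
        new_contract = self.make_contract(new_client)
        # Swap only after both succeeded, so the pair always stays consistent.
        self.web3_client, self.contract = new_client, new_contract
```

If `make_contract` raises, `reset()` propagates the error and the connector keeps its previous client and contract, which is the property the commit message describes.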
bb9cd64 to f1ccc05
foxpatch-aleph
left a comment
This is a well-crafted fix for a real production issue. The root cause (stale TCP connection causing hung awaits) is correctly diagnosed, and the solution (asyncio.wait_for around RPC calls + client recycling on timeout) is sound. The code is clean, the comments are informative, and the tests cover the key scenarios (timeout on get_logs, timeout on block_number, client rebuild). The atomic client/contract swap prevents partial state. Minor nits: shutdown could be delayed up to client_timeout seconds if the provider is wedged, and there's no test for the disconnect-failure-during-reset path, but neither is a blocker.
src/aleph/chains/ethereum.py (line 130): Using self.client_timeout (default 60s) for the shutdown disconnect means a SIGTERM could take up to 60s if the provider is wedged. Consider a smaller hardcoded timeout (e.g. 10s) for the shutdown path to make process termination more responsive.
tests/chains/test_ethereum_timeout.py (line 173): Consider adding a test that verifies _reset_web3_client handles the case where disconnect() itself times out: the except Exception on the disconnect call (line 190 of ethereum.py) is untested.
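A test for the disconnect-timeout path could be shaped like the sketch below. `reset_web3_client` is a hypothetical stand-in for `_reset_web3_client`, and the `provider.disconnect()` attribute path is an assumption made for the mock; only `asyncio` and `unittest.mock` APIs are real.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock, MagicMock


async def reset_web3_client(connector, build_client, timeout: float) -> None:
    """Hypothetical stand-in: bounded best-effort disconnect, then rebuild."""
    try:
        await asyncio.wait_for(
            connector.web3_client.provider.disconnect(), timeout=timeout
        )
    except Exception:
        pass  # the path under test: a hung disconnect is swallowed
    connector.web3_client = build_client()


async def _hang() -> None:
    await asyncio.Event().wait()  # never completes on its own


async def check_reset_survives_hung_disconnect() -> bool:
    stale = MagicMock()
    stale.provider.disconnect = AsyncMock(side_effect=_hang)
    fresh = MagicMock()
    connector = SimpleNamespace(web3_client=stale)
    # Outer guard so a regression fails fast instead of hanging the suite.
    await asyncio.wait_for(
        reset_web3_client(connector, lambda: fresh, timeout=0.05), timeout=2.0
    )
    stale.provider.disconnect.assert_awaited_once()
    return connector.web3_client is fresh
```

The outer `wait_for` mirrors the self-guarding approach the PR's tests are described as using.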
Took a close look at this, mostly around the question of whether web3 already gives us timeouts. Short version: it does, and I think that changes the framing of the fix a bit.
web3's built-in timeout works, it's just multiplied by retries
I tested it against a socket that accepts the connection but never replies: So it does not hang forever, it raised after roughly 6x the timeout. The reason is that That means the real benefit of the
The "stale long-lived TCP connection" mechanism doesn't quite fit
web3 builds its session with
Are we sure the hang was in the RPC calls?
The genuinely un-timed awaits (the
What I like
Minor
Overall I'm in favor of merging, the change is safe and a real robustness improvement. Main asks are: fix the description to say "retry multiplication" rather than "timeout doesn't work", consider whether tuning |
Symptom
On a running CCN, ETH sync periodically stalls: the reported ETH height stops advancing and the node falls further and further behind the chain head ("ETH height keeps increasing" in monitoring = the remaining/lag metric growing). The only known recovery is `docker compose down && docker compose up -d`. It comes back, then recurs later.
Investigation
Reproduced/diagnosed against a live affected node:
- `pyaleph_status_chain_eth_last_committed_height` was frozen, while `…_reference_total` (chain head) kept climbing → `…_height_remaining_total` grows. So sync is stalled, not runaway.
- `await`, not spinning in a loop.
- `block_number` in ~0.2–0.4s → the RPC is healthy; only the node's in-process client is stuck.

Root cause
The `EthereumConnector` builds a single long-lived `AsyncWeb3` client once (in `.new()`) and reuses it for the process lifetime. When its underlying TCP connection goes stale, the awaited RPC calls in the sync loop hang indefinitely:
- `self.web3_client.eth.block_number` (in `_get_all_logs_in_batches`)
- `self.web3_client.eth.get_logs({...})` (in `_get_logs_in_block_range`)

The provider's `request_kwargs` timeout does not reliably abort this wedge. And `fetch_sync_events_task` only has `except Exception`: a hung `await` raises nothing, so the retry loop never fires. The committed height never advances, the node falls behind, and only a process restart (which builds a fresh client) recovers it.
The fix
- `asyncio.wait_for(..., timeout=client_timeout)` around both `get_logs` and `block_number`. A wedge now raises `asyncio.TimeoutError` instead of hanging forever.
- `_reset_web3_client()` best-effort disconnects the stale provider (itself timeout-bounded) and rebuilds `web3_client` + `contract` atomically, so the next attempt uses a brand-new connection (the same thing the manual restart was doing).
- `TimeoutError` is re-raised and propagates to `fetch_sync_events_task`'s existing `except Exception`, which logs, sleeps `poll_interval`, and retries automatically with the fresh client, so no manual restart is needed.
- `__init__` gains optional, defaulted params (`rpc_url`/`chain_id`/`client_timeout`/`contract_address`) so the connector can rebuild itself; `.new()` wires them from config. Happy-path behavior is unchanged (a successful call just gains an upper time bound).

Testing
- `tests/chains/test_ethereum_timeout.py`: a hung `get_logs` and a hung `block_number` each surface as `TimeoutError` (not misclassified as `TooManyLogsInRange`), the client is recycled, and the error is catchable by the retry loop. Every test is self-guarded so the suite can't hang.
- `linting:all` green (ruff/black/isort/mypy).

Out of scope / follow-ups
- `commit()` and the RabbitMQ publish in the same sync path are still un-timed and could hang in the same way someday; worth a follow-up.
- `client_timeout` default stays 60s; operators wanting faster self-heal can lower it.
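
The flow described under "The fix" (bounded await, recycle on timeout, re-raise into the retry loop) can be sketched end to end as follows. This is a toy model under stated assumptions: `FakeEth`, `Connector`, and `sync_with_retry` are hypothetical names that only mimic the shape of the connector and of `fetch_sync_events_task`.

```python
import asyncio


class FakeEth:
    """Hypothetical RPC endpoint that either hangs forever or answers."""

    def __init__(self, hang: bool) -> None:
        self.hang = hang

    async def get_block_number(self) -> int:
        if self.hang:
            await asyncio.Event().wait()  # simulates the wedged connection
        return 123


class Connector:
    """Illustrative stand-in, not the real EthereumConnector."""

    def __init__(self, client_timeout: float) -> None:
        self.client_timeout = client_timeout
        self.eth = FakeEth(hang=True)  # start in the wedged state
        self.resets = 0

    async def _reset_client(self) -> None:
        # Stand-in for rebuilding the client: swap in a healthy endpoint.
        self.eth = FakeEth(hang=False)
        self.resets += 1

    async def block_number(self) -> int:
        try:
            return await asyncio.wait_for(
                self.eth.get_block_number(), timeout=self.client_timeout
            )
        except asyncio.TimeoutError:
            await self._reset_client()
            raise  # let the caller's retry loop see the failure


async def sync_with_retry(
    connector: Connector, poll_interval: float, max_attempts: int = 3
) -> int:
    for _ in range(max_attempts):
        try:
            return await connector.block_number()
        except Exception:
            # Mirrors the described loop: sleep, then retry with the new client.
            await asyncio.sleep(poll_interval)
    raise RuntimeError("sync did not recover")
```

In this model the first attempt times out and triggers one reset, and the second attempt succeeds against the replacement endpoint.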