-
Notifications
You must be signed in to change notification settings - Fork 591
Deterministic state syncs #2177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from 30 commits
Commits
Show all changes
34 commits
Select commit
Hold shift + click to select a range
7e59d71
deterministic state sync
marcello33 fdb0082
temp bump heimdall to committed version for testing purposes
marcello33 3ae74d8
fix parsing
marcello33 405e5e6
fix unmarshalling of RecordListVisibleAtHeightResponse
marcello33 1893557
change heimdall dep for testing
marcello33 f48218f
Merge branch 'develop' into mardizzone/POS-3441_deterministic-ss
marcello33 f3446c4
add DeterministicStateSyncBlock to GatherForks
marcello33 2874ed1
Merge branch 'develop' into mardizzone/POS-3441_deterministic-ss
marcello33 2934d3e
Merge branch 'develop' into mardizzone/POS-3441_deterministic-ss
marcello33 47f48bc
better comment on go.mod
marcello33 13a51b8
fix linter
marcello33 96d8797
address comments
marcello33 c6f0d55
address comments
marcello33 a94f1a9
remove omitempty
marcello33 46e1750
update banner
marcello33 653d8b6
timeout for StateSyncEventsAtHeight
marcello33 bdb52b0
address comments
marcello33 cd46ca1
address comments
marcello33 d508db4
added tests
marcello33 360e533
address minor err shadowing and fix lint
marcello33 c1c1e3d
test single endpoint
marcello33 9697c13
update heimdall-v2 dependency to DSS-test branch
marcello33 925c5e5
address comments
marcello33 5dfe949
solve lint issue
marcello33 8d06151
address comments
marcello33 3a98421
address comments
marcello33 a1fa6b9
address comments
marcello33 dce6483
address comments
marcello33 ca9ee85
unify endpoints
marcello33 85521b2
address comments
marcello33 4326272
remove dead code
marcello33 5c6214f
Merge branch 'develop' into mardizzone/POS-3441_deterministic-ss
marcello33 085fdd5
internal/ethapi: add MaxUsedGas field to eth_simulateV1 response (#32…
Rhovian e6bc0e6
Merge branch 'develop' into mardizzone/POS-3441_deterministic-ss
marcello33 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Post-fork deterministic mode currently logs and returns an empty state-sync set when
StateSyncEventsByTimeerrors. That outcome depends on each node’s local Heimdall connectivity; if some validators succeed and others hit a transient error, they will apply different state updates and compute different state roots for the same block. For consensus-critical finalization, this should fail block processing (return the error) or use a deterministic fallback that guarantees all nodes derive the same result (e.g., only treat specific, globally-observable errors as skippable).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pre-fork path has the same behavior (log + return empty). I'll probably keep this comment open and discuss with the team.
Anyway, the risk is mitigated by multiple factors:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot's comment does makes sense in the worst case scenario - if a BP tries to fetch state sync when there is state sync but fail, its block would be rejected, causing span rotation. @claude could you take a look and see all the possible errors that could be returned from
StateSyncEventsByTimeandStateSyncEvents?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Claude finished @cffls's task in 1m 36s —— View job
Error Analysis:
StateSyncEventsByTimevsStateSyncEventsStateSyncEventsByTime(HTTP, gRPC, app, failover)StateSyncEvents(HTTP, gRPC, app, failover)CommitStatesBoth the pre-fork (
StateSyncEvents) and post-fork (StateSyncEventsByTime) code paths handle errors identically inCommitStates— log the error and return an empty state sync set withnilerror (bor.go:1776-1778andbor.go:1782-1787). So the concern Jerry raises (BP builds a block with 0 state syncs, other validators reject it due to state root mismatch, causing span rotation) applies equally to both paths. Here are all the possible errors:Errors from
StateSyncEventsByTime(post-fork)HTTP transport (
client.go:275-324) — usesFetchWithRetryper page, with a 1-minute global timeout:url.ParsefailurestateSyncsByTimeURL— malformedurlStringcontext.DeadlineExceededcontext.DeadlineExceededinternalFetchWithTimeout(only surfaces if retries also time out)FetchWithRetry)context.Canceledc.ctxcancelled (node shutdown)ErrShutdownDetectedcloseChclosed during retry loopErrServiceUnavailableHTTPStatusError(wrapsErrNotSuccessfulResponse)FetchWithRetry)ErrNoResponseFetchreturnsErrNoResponsenet.Error(connection refused, DNS, TCP timeout)json.Unmarshal/cdc.UnmarshalJSONerrorhttp.NewRequestWithContexterrorio.ReadAllerror fromMaxBytesReadergRPC transport (
heimdallgrpc/state_sync.go:91-152) — per-page timeout + 1-minute global timeout:context.DeadlineExceededstateSyncTotalTimeoutexpirescontext.DeadlineExceededdefaultTimeout(5s) expirescontext.CanceledIn-process app (
heimdallapp/state_sync.go:43-74) — direct keeper calls, no timeouts:queryServer.GetStateSyncsByTimefails (store corruption, missing data)Failover wrapper (
failover_client.go:113-121) — adds 1-minute timeout beforecallWithFailover:context.DeadlineExceededattemptTimeoutper endpoint incallWithFailoverFailover-eligible errors (triggers cascade):
net.Error,ErrNoResponse,HTTPStatusErrorwith status ≥ 500,context.DeadlineExceeded,context.Canceled(when parent ctx is still alive).Not failover-eligible (returned immediately):
ErrServiceUnavailable(503),ErrShutdownDetected, parentctxcancelled.Errors from
StateSyncEvents(pre-fork)HTTP transport (
client.go:102-153) — uses bareFetchper page (no retry), no global timeout:url.ParsefailurestateSyncURLcontext.DeadlineExceededinternalFetchWithTimeoutFetch, no retry)context.CanceledErrServiceUnavailableHTTPStatusErrorErrNoResponsenet.Errorio.ReadAllerrorgRPC transport (
heimdallgrpc/state_sync.go:22-87) — per-page timeout + 1-minute global:Same error set as
StateSyncEventsByTimegRPC (identical structure).In-process app (
heimdallapp/state_sync.go:15-39):ClerkKeeper.GetEventRecordfailureClerkKeeper.GetEventRecordListWithTimefailureFailover wrapper (
failover_client.go:107-111) — no pre-set timeout (unlikeStateSyncEventsByTime):Same cascade logic as above, but the parent context has no 1-minute cap, so the 30s
attemptTimeoutis the only bound per endpoint.Key difference:
StateSyncEventsByTimeis more resilient thanStateSyncEventsStateSyncEvents(pre-fork)StateSyncEventsByTime(post-fork)FetchFetchWithRetry— retries every 5s until timeoutcontext.WithTimeoutset beforecallWithFailoverSo transient errors (brief 5xx, connection reset) are more likely to surface from
StateSyncEventsthan fromStateSyncEventsByTime.The concern Jerry raises
The worry: if a BP's Heimdall call fails, it builds a block with 0 state syncs. If other validators' Heimdall calls succeed, they derive a different state root and reject the block → the BP misses a slot → potential span rotation.
This is a real scenario that applies to both pre-fork and post-fork paths, since both use the same
log.Error+return empty, nilpattern. The code comments atbor.go:1768-1775document this as an intentional liveness-over-safety tradeoff: the proposer misses a slot rather than crashing entirely, andFetchWithRetry+MultiHeimdallClientfailover make persistent failures unlikely.The risk is mitigated by:
waitUntilHeimdallIsSyncedgates entry — Heimdall must be responsiveFetchWithRetryretries aggressively (new in post-fork path)MultiHeimdallClientcascades across multiple endpointsCommitStatesonly runs at sprint boundaries (every 16 blocks)LastStateId)If the team wants to eliminate the slot-miss risk entirely, the alternative would be to return the error from
CommitStates(which would cause the block production attempt to fail and be retried), but that risks an infinite retry loop if Heimdall is persistently down. The current design accepts a missed slot as the lesser evil.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT @cffls
I left this open intentionally - I'm down to patch this, just wondering if we have to - as it would be drifting from the current pre-fork behavior.
Thanks