
feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead #3262

Draft
julienrbrt wants to merge 11 commits into main from julien/edge-case-p2p-stall

Conversation


@julienrbrt julienrbrt commented Apr 16, 2026

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP, I do not like what the LLM has produced.


coderabbitai Bot commented Apr 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6aa42221-0ce9-4a5e-8052-f53edc0e736b




github-actions Bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed   May 13, 2026, 1:22 PM

@julienrbrt julienrbrt changed the title from "feat: add automatic DA retriever walkback when P2P stalls and DA blocks" to "feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead" Apr 16, 2026

claude Bot commented Apr 16, 2026

Claude finished @julienrbrt's task in 3m 22s.


Code Review

  • Gather context and read changed files
  • Review code quality & architecture
  • Review error handling & resilience
  • Review concurrency patterns
  • Review testing
  • Post final review

Overview

This PR adds an automatic DA retriever walkback mechanism triggered when the P2P layer stalls and DA-sourced blocks are too far ahead of the node's current height. The main idea: if P2P is stalled and DA events reference block heights far above what the node has applied, the Subscriber's localDAHeight cursor is rewound so that DA retrieval re-covers those heights and the node can fall back to DA-sourced syncing.

The author's self-note ("do not like what the LLM has produced") is fair — there are genuine design and correctness concerns below.


Core Design Concern: One-height-at-a-time walkback is O(gap × 2s)

block/internal/syncing/syncer.go:574-587

if needsWalkback {
    s.walkbackActive.Store(true)
    const walkbackCooldown = 2 * time.Second
    s.walkbackCooldownUntil.Store(time.Now().Add(walkbackCooldown).UnixNano())
    // ...
    return daHeight - 1   // rewinds by exactly 1
}

The walkback steps back one DA height per 2-second cooldown window. If the DA subscriber has advanced 500 heights ahead of the node (which is realistic after a sustained P2P outage), recovery takes at least 1000 seconds before enough earlier heights are re-fetched.

A better approach would be to jump back directly to the DA height that corresponds to nodeHeight + 1 rather than descending one-by-one. For example:

// Instead of daHeight - 1, jump to the approximate DA height matching nodeHeight
targetDAHeight := max(s.daRetrieverHeight.Load(), estimatedDAHeightForNode)
return targetDAHeight

Even without estimatedDAHeightForNode, returning daHeight / 2 or nodeHeight + 1 (if tracked) would be dramatically faster; one stride-based sketch follows.
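
A sketch of the stride idea, assuming a new atomic walkbackStride field on the Syncer (hypothetical, not in this PR). Growing the step on each consecutive rewind closes a gap of N heights in O(log N) cooldown windows instead of O(N):

// Hypothetical walkbackStride (atomic.Uint64) added to Syncer; reset it to 0
// wherever walkbackActive is cleared, i.e. once events become contiguous.
stride := s.walkbackStride.Load()
if stride == 0 {
    stride = 1
}
if stride >= daHeight {
    stride = daHeight - 1 // never rewind below DA height 1
}
s.walkbackStride.Store(stride * 2) // double the step for the next cooldown tick
return daHeight - stride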


walkbackCheck silently uses s.ctx instead of a context parameter

block/internal/syncing/syncer.go:554

nodeHeight, err := s.store.Height(s.ctx)

The callback signature is func(daHeight uint64, events []common.DAHeightEvent) uint64 — it has no context. Internally it calls the store with the syncer's root context. This works while the syncer is running, but it's an unusual pattern: a callback closure that calls I/O without propagating the caller's context. If the caller's context is cancelled (e.g. during shutdown, where catchupLoop is cancelled), s.ctx may still be live and the store call succeeds, potentially performing unnecessary work. Either pass a context via the WalkbackChecker signature, or document this explicitly.
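
A minimal sketch of the context-threading option. Only the signature and the store call change; the gap-detection body is stubbed out here:

// WalkbackChecker now receives the caller's context from HandleCatchup.
type WalkbackChecker func(ctx context.Context, daHeight uint64, events []common.DAHeightEvent) uint64

func (s *Syncer) walkbackCheck(ctx context.Context, daHeight uint64, events []common.DAHeightEvent) uint64 {
    nodeHeight, err := s.store.Height(ctx) // was s.store.Height(s.ctx)
    if err != nil {
        return 0 // a cancelled caller aborts the walkback instead of doing stale I/O
    }
    _ = nodeHeight
    // ... gap detection unchanged from the current implementation ...
    return 0
}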


Missing cooldown test case

block/internal/syncing/syncer_test.go:1648-1765

TestSyncer_walkbackCheck tests the gap detection, contiguity stop, and P2P recovery cases, but there's no test that verifies the cooldown actually blocks a second walkback step. A test like this is needed:

t.Run("cooldown_blocks_consecutive_walkback", func(t *testing.T) {
    // first call returns rewind target and sets cooldown
    got1 := s.walkbackCheck(100, makeEvents(50))
    assert.Equal(t, uint64(99), got1)
    // immediate second call should return 0 (cooldown active)
    got2 := s.walkbackCheck(99, makeEvents(50))
    assert.Equal(t, uint64(0), got2)
})

Without this, a regression that removes the cooldown wouldn't be caught.


Walkback can loop through empty DA heights unnecessarily

block/internal/syncing/syncer.go:559-576

needsWalkback := s.walkbackActive.Load()
if len(events) > 0 {
    // only clears walkbackActive if events are contiguous
}
// if events is empty and walkbackActive is true → rewinds again

When walkbackActive is true and HandleCatchup returns an empty events slice (DA height had no blobs — ErrBlobNotFound is swallowed in fetchAndPipeHeight), walkbackCheck is still called with events == nil. Since walkbackActive is already set, needsWalkback = true, and it rewinds again. This means the walkback keeps descending through DA heights that have no blocks at all, consuming the 2-second cooldown each time without making any progress toward closing the gap.

Consider skipping the rewind when events is empty: if there were no blobs at this DA height, rewinding past it cannot help.
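
As a sketch, the guard could sit ahead of the rewind decision (field names as in the excerpt above):

// An empty DA height cannot be the source of the gap; skip the rewind and
// keep the cooldown budget for heights that actually carried events.
if s.walkbackActive.Load() && len(events) == 0 {
    return 0
}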


p2pStalled may not reset after context cancellation

block/internal/syncing/syncer.go:476-531

p2pStalled is set to false only when ProcessHeight returns nil. If p2pWorkerLoop exits because ctx.Done() fires while p2pStalled == true, the field stays true. On the next syncer start (Start creates a fresh Syncer via NewSyncer), this won't matter since atomic.Bool zero-values to false. But if a syncer were ever restarted in place, it could incorrectly trigger walkback. This is likely not an issue in practice (syncer is not re-used) but worth a comment.
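
A possible wording for that comment, at the p2pWorkerLoop exit path (placement and phrasing are a suggestion):

// p2pStalled is intentionally not reset on ctx.Done(): Start constructs a
// fresh Syncer, so the atomic.Bool zero value (false) applies after restart.
// If syncers are ever reused in place, reset p2pStalled here first.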


walkbackCheck: min-height scan is confined to the stalled path (non-issue)

block/internal/syncing/syncer.go:560-570

minHeight := events[0].Header.Height()
for _, e := range events[1:] {
    if e.Header.Height() < minHeight {
        minHeight = e.Header.Height()
    }
}

At first glance this looks like per-call overhead on every HandleCatchup, but p2pStalled is checked at the top of walkbackCheck and returns 0 early, so the scan only runs while P2P is actually stalled. No change needed.


Subscriber RewindTo signals catchup redundantly

block/internal/da/subscriber.go:181-192

func (s *Subscriber) RewindTo(daHeight uint64) {
    // ...
    if s.localDAHeight.CompareAndSwap(cur, daHeight) {
        s.headReached.Store(false)
        s.signalCatchup()   // <-- redundant
        return
    }
}

RewindTo is called from inside runCatchup's inner loop, which does not exit after a rewind (the next iteration picks up from the rewound height). The signalCatchup() sends to the buffered channel, which will trigger an extra runCatchup call after the current one eventually terminates. This is harmless but creates a spurious extra execution. Minor nit.
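
If the nit is worth fixing, one shape would be to make RewindTo purely move the cursor and leave waking the loop to any caller outside runCatchup (sketch, reusing the CAS pattern from the excerpt):

func (s *Subscriber) RewindTo(daHeight uint64) {
    for {
        cur := s.localDAHeight.Load()
        if daHeight >= cur {
            return // only ever rewind, never advance
        }
        if s.localDAHeight.CompareAndSwap(cur, daHeight) {
            s.headReached.Store(false)
            return // no signalCatchup: runCatchup's next iteration picks this up
        }
    }
}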


MockSubscriberHandler.HandleCatchup panics on nil return

block/internal/da/subscriber_test.go:28-31

func (m *MockSubscriberHandler) HandleCatchup(ctx context.Context, height uint64) ([]common.DAHeightEvent, error) {
    args := m.Called(ctx, height)
    return args.Get(0).([]common.DAHeightEvent), args.Error(1)
}

args.Get(0).([]common.DAHeightEvent) will panic if the mock returns a nil interface (as opposed to a typed nil []common.DAHeightEvent). The workaround is to call Return([]common.DAHeightEvent(nil), ...), which the tests currently do correctly, but this is fragile; a nil-tolerant variant follows.
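
A sketch of the nil-tolerant mock; behavior is otherwise identical, but a misconfigured test fails on assertions instead of panicking in the type assertion:

func (m *MockSubscriberHandler) HandleCatchup(ctx context.Context, height uint64) ([]common.DAHeightEvent, error) {
    args := m.Called(ctx, height)
    var events []common.DAHeightEvent
    if v := args.Get(0); v != nil {
        events = v.([]common.DAHeightEvent)
    }
    return events, args.Error(1)
}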


Architectural observation: WalkbackChecker couples DA subscriber to application logic

WalkbackChecker is a generic callback on SubscriberConfig, but it's currently only needed by the DAFollower (not asyncBlockRetriever). The doc comment acknowledges this. Consider whether this belongs on DAFollowerConfig alone (and daFollower calls it after HandleCatchup internally) rather than threading it through the Subscriber primitive. This would keep the subscriber simpler and the coupling more explicit.
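
Concretely, the coupling could live in daFollower itself; a sketch assuming hypothetical inner and subscriber fields on DAFollower and a walkbackCheck method moved alongside them:

// DAFollower wraps its handler so the walkback check runs after each catchup
// fetch, leaving Subscriber as a plain retrieval loop.
func (f *DAFollower) HandleCatchup(ctx context.Context, daHeight uint64) ([]common.DAHeightEvent, error) {
    events, err := f.inner.HandleCatchup(ctx, daHeight)
    if err != nil {
        return nil, err
    }
    if target := f.walkbackCheck(daHeight, events); target != 0 {
        f.subscriber.RewindTo(target)
    }
    return events, nil
}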


Summary

Area           Finding
Design         One-at-a-time walkback is too slow for large gaps
Design         Walkback loops through empty DA heights on each cooldown tick
Correctness    walkbackCheck does not take a context, uses s.ctx directly
Testing        Missing test for cooldown-blocking behavior
Testing        MockSubscriberHandler.HandleCatchup is fragile (potential panic on nil)
Architecture   WalkbackChecker on Subscriber may be over-generalised

The P2P stall detection logic (p2pStalled flag in p2pWorkerLoop) is clean and correctly distinguishes genuine stalls from DA-event-triggered cancellations. The RewindTo/signalCatchup plumbing in Subscriber is correct. The main practical concern is recovery speed for large gaps.


codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.91%. Comparing base (27eeb48) to head (fd34ba7).

Files with missing lines                     Patch %   Lines
block/internal/syncing/syncer.go             57.77%    16 Missing and 3 partials ⚠️
block/internal/syncing/da_follower.go        77.27%    4 Missing and 1 partial ⚠️
block/internal/da/subscriber.go              78.57%    2 Missing and 1 partial ⚠️
block/internal/da/async_block_retriever.go   66.66%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3262      +/-   ##
==========================================
+ Coverage   60.85%   60.91%   +0.06%     
==========================================
  Files         127      127              
  Lines       13762    13815      +53     
==========================================
+ Hits         8375     8416      +41     
- Misses       4476     4484       +8     
- Partials      911      915       +4     
Flag       Coverage Δ
combined   60.91% <66.66%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.
