
feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead #3262

Draft
julienrbrt wants to merge 11 commits into main from julien/edge-case-p2p-stall

Conversation


@julienrbrt julienrbrt commented Apr 16, 2026

Overview

Add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead

Early WIP, I do not like what the LLM has produced.


coderabbitai Bot commented Apr 16, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6aa42221-0ce9-4a5e-8052-f53edc0e736b




github-actions Bot commented Apr 16, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed   May 13, 2026, 1:22 PM

@julienrbrt julienrbrt changed the title from "feat: add automatic DA retriever walkback when P2P stalls and DA blocks" to "feat: add automatic DA retriever walkback when P2P stalls and DA blocks too far ahead" Apr 16, 2026

claude Bot commented Apr 16, 2026

Claude finished @julienrbrt's task in 3m 22s.


Code Review

  • Gather context and read changed files
  • Review code quality & architecture
  • Review error handling & resilience
  • Review concurrency patterns
  • Review testing
  • Post final review

Overview

This PR adds an automatic DA retriever walkback mechanism triggered when the P2P layer stalls and DA-sourced blocks are too far ahead of the node's current height. The main idea: if P2P is stalled and DA events reference block heights far above what the node has applied, the Subscriber's localDAHeight cursor is rewound so that DA retrieval re-covers those heights and the node can fall back to DA-sourced syncing.

The author's self-note ("do not like what the LLM has produced") is fair — there are genuine design and correctness concerns below.


Core Design Concern: One-height-at-a-time walkback is O(gap × 2s)

block/internal/syncing/syncer.go:574-587

if needsWalkback {
    s.walkbackActive.Store(true)
    const walkbackCooldown = 2 * time.Second
    s.walkbackCooldownUntil.Store(time.Now().Add(walkbackCooldown).UnixNano())
    // ...
    return daHeight - 1   // rewinds by exactly 1
}

The walkback steps back one DA height per 2-second cooldown window. If the DA subscriber has advanced 500 heights ahead of the node (which is realistic after a sustained P2P outage), recovery takes at least 1000 seconds before enough earlier heights are re-fetched.

A better approach would be to jump back directly to the DA height that corresponds to nodeHeight + 1 rather than descending one-by-one. For example:

// Instead of daHeight - 1, jump to the approximate DA height matching nodeHeight
targetDAHeight := max(s.daRetrieverHeight.Load(), estimatedDAHeightForNode)
return targetDAHeight

Even without estimatedDAHeightForNode, returning daHeight / 2 or nodeHeight + 1 (if tracked) would be dramatically faster; one stride-based sketch follows.
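
A sketch of the stride idea, assuming a new atomic walkbackStride field on the Syncer (hypothetical, not in this PR). Growing the step on each consecutive rewind closes a gap of N heights in O(log N) cooldown windows instead of O(N):

// Hypothetical walkbackStride (atomic.Uint64) added to Syncer; reset it to 0
// wherever walkbackActive is cleared, i.e. once events become contiguous.
stride := s.walkbackStride.Load()
if stride == 0 {
    stride = 1
}
if stride >= daHeight {
    stride = daHeight - 1 // never rewind below DA height 1
}
s.walkbackStride.Store(stride * 2) // double the step for the next cooldown tick
return daHeight - stride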


walkbackCheck silently uses s.ctx instead of a context parameter

block/internal/syncing/syncer.go:554

nodeHeight, err := s.store.Height(s.ctx)

The callback signature is func(daHeight uint64, events []common.DAHeightEvent) uint64 — it has no context. Internally it calls the store with the syncer's root context. This works while the syncer is running, but it's an unusual pattern: a callback closure that calls I/O without propagating the caller's context. If the caller's context is cancelled (e.g. during shutdown, where catchupLoop is cancelled), s.ctx may still be live and the store call succeeds, potentially performing unnecessary work. Either pass a context via the WalkbackChecker signature, or document this explicitly.
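
A minimal sketch of the context-threading option. Only the signature and the store call change; the gap-detection body is stubbed out here:

// WalkbackChecker now receives the caller's context from HandleCatchup.
type WalkbackChecker func(ctx context.Context, daHeight uint64, events []common.DAHeightEvent) uint64

func (s *Syncer) walkbackCheck(ctx context.Context, daHeight uint64, events []common.DAHeightEvent) uint64 {
    nodeHeight, err := s.store.Height(ctx) // was s.store.Height(s.ctx)
    if err != nil {
        return 0 // a cancelled caller aborts the walkback instead of doing stale I/O
    }
    _ = nodeHeight
    // ... gap detection unchanged from the current implementation ...
    return 0
}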


Missing cooldown test case

block/internal/syncing/syncer_test.go:1648-1765

TestSyncer_walkbackCheck tests the gap detection, contiguity stop, and P2P recovery cases, but there's no test that verifies the cooldown actually blocks a second walkback step. A test like this is needed:

t.Run("cooldown_blocks_consecutive_walkback", func(t *testing.T) {
    // first call returns rewind target and sets cooldown
    got1 := s.walkbackCheck(100, makeEvents(50))
    assert.Equal(t, uint64(99), got1)
    // immediate second call should return 0 (cooldown active)
    got2 := s.walkbackCheck(99, makeEvents(50))
    assert.Equal(t, uint64(0), got2)
})

Without this, a regression that removes the cooldown wouldn't be caught.


Walkback can loop through empty DA heights unnecessarily

block/internal/syncing/syncer.go:559-576

needsWalkback := s.walkbackActive.Load()
if len(events) > 0 {
    // only clears walkbackActive if events are contiguous
}
// if events is empty and walkbackActive is true → rewinds again

When walkbackActive is true and HandleCatchup returns an empty events slice (DA height had no blobs — ErrBlobNotFound is swallowed in fetchAndPipeHeight), walkbackCheck is still called with events == nil. Since walkbackActive is already set, needsWalkback = true, and it rewinds again. This means the walkback keeps descending through DA heights that have no blocks at all, consuming the 2-second cooldown each time without making any progress toward closing the gap.

Consider skipping the rewind when events is empty: if there were no blobs at this DA height, rewinding past it cannot help.
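
As a sketch, the guard could sit ahead of the rewind decision (field names as in the excerpt above):

// An empty DA height cannot be the source of the gap; skip the rewind and
// keep the cooldown budget for heights that actually carried events.
if s.walkbackActive.Load() && len(events) == 0 {
    return 0
}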


p2pStalled may not reset after context cancellation

block/internal/syncing/syncer.go:476-531

p2pStalled is set to false only when ProcessHeight returns nil. If p2pWorkerLoop exits because ctx.Done() fires while p2pStalled == true, the field stays true. On the next syncer start (Start creates a fresh Syncer via NewSyncer), this won't matter since atomic.Bool zero-values to false. But if a syncer were ever restarted in place, it could incorrectly trigger walkback. This is likely not an issue in practice (syncer is not re-used) but worth a comment.
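
A possible wording for that comment, at the p2pWorkerLoop exit path (placement and phrasing are a suggestion):

// p2pStalled is intentionally not reset on ctx.Done(): Start constructs a
// fresh Syncer, so the atomic.Bool zero value (false) applies after restart.
// If syncers are ever reused in place, reset p2pStalled here first.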


walkbackCheck: min-height scan is confined to the stalled path (non-issue)

block/internal/syncing/syncer.go:560-570

minHeight := events[0].Header.Height()
for _, e := range events[1:] {
    if e.Header.Height() < minHeight {
        minHeight = e.Header.Height()
    }
}

At first glance this looks like per-call overhead on every HandleCatchup, but p2pStalled is checked at the top of walkbackCheck and returns 0 early, so the scan only runs while P2P is actually stalled. No change needed.


Subscriber RewindTo signals catchup redundantly

block/internal/da/subscriber.go:181-192

func (s *Subscriber) RewindTo(daHeight uint64) {
    // ...
    if s.localDAHeight.CompareAndSwap(cur, daHeight) {
        s.headReached.Store(false)
        s.signalCatchup()   // <-- redundant
        return
    }
}

RewindTo is called from inside runCatchup's inner loop, which does not exit after a rewind (the next iteration picks up from the rewound height). The signalCatchup() sends to the buffered channel, which will trigger an extra runCatchup call after the current one eventually terminates. This is harmless but creates a spurious extra execution. Minor nit.
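
If the nit is worth fixing, one shape would be to make RewindTo purely move the cursor and leave waking the loop to any caller outside runCatchup (sketch, reusing the CAS pattern from the excerpt):

func (s *Subscriber) RewindTo(daHeight uint64) {
    for {
        cur := s.localDAHeight.Load()
        if daHeight >= cur {
            return // only ever rewind, never advance
        }
        if s.localDAHeight.CompareAndSwap(cur, daHeight) {
            s.headReached.Store(false)
            return // no signalCatchup: runCatchup's next iteration picks this up
        }
    }
}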


MockSubscriberHandler.HandleCatchup panics on nil return

block/internal/da/subscriber_test.go:28-31

func (m *MockSubscriberHandler) HandleCatchup(ctx context.Context, height uint64) ([]common.DAHeightEvent, error) {
    args := m.Called(ctx, height)
    return args.Get(0).([]common.DAHeightEvent), args.Error(1)
}

args.Get(0).([]common.DAHeightEvent) will panic if the mock returns a nil interface (as opposed to a typed nil []common.DAHeightEvent). The workaround is to call Return([]common.DAHeightEvent(nil), ...), which the tests currently do correctly, but this is fragile; a nil-tolerant variant follows.
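
A sketch of the nil-tolerant mock; behavior is otherwise identical, but a misconfigured test fails on assertions instead of panicking in the type assertion:

func (m *MockSubscriberHandler) HandleCatchup(ctx context.Context, height uint64) ([]common.DAHeightEvent, error) {
    args := m.Called(ctx, height)
    var events []common.DAHeightEvent
    if v := args.Get(0); v != nil {
        events = v.([]common.DAHeightEvent)
    }
    return events, args.Error(1)
}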


Architectural observation: WalkbackChecker couples DA subscriber to application logic

WalkbackChecker is a generic callback on SubscriberConfig, but it's currently only needed by the DAFollower (not asyncBlockRetriever). The doc comment acknowledges this. Consider whether this belongs on DAFollowerConfig alone (and daFollower calls it after HandleCatchup internally) rather than threading it through the Subscriber primitive. This would keep the subscriber simpler and the coupling more explicit.
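
Concretely, the coupling could live in daFollower itself; a sketch assuming hypothetical inner and subscriber fields on DAFollower and a walkbackCheck method moved alongside them:

// DAFollower wraps its handler so the walkback check runs after each catchup
// fetch, leaving Subscriber as a plain retrieval loop.
func (f *DAFollower) HandleCatchup(ctx context.Context, daHeight uint64) ([]common.DAHeightEvent, error) {
    events, err := f.inner.HandleCatchup(ctx, daHeight)
    if err != nil {
        return nil, err
    }
    if target := f.walkbackCheck(daHeight, events); target != 0 {
        f.subscriber.RewindTo(target)
    }
    return events, nil
}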


Summary

Area           Finding
Design         One-at-a-time walkback is too slow for large gaps
Design         Walkback loops through empty DA heights on each cooldown tick
Correctness    walkbackCheck does not take a context, uses s.ctx directly
Testing        Missing test for cooldown-blocking behavior
Testing        MockSubscriberHandler.HandleCatchup is fragile (potential panic on nil)
Architecture   WalkbackChecker on Subscriber may be over-generalised

The P2P stall detection logic (p2pStalled flag in p2pWorkerLoop) is clean and correctly distinguishes genuine stalls from DA-event-triggered cancellations. The RewindTo/signalCatchup plumbing in Subscriber is correct. The main practical concern is recovery speed for large gaps.


codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.91%. Comparing base (27eeb48) to head (fd34ba7).

Files with missing lines                     Patch %   Lines
block/internal/syncing/syncer.go             57.77%    16 Missing and 3 partials ⚠️
block/internal/syncing/da_follower.go        77.27%    4 Missing and 1 partial ⚠️
block/internal/da/subscriber.go              78.57%    2 Missing and 1 partial ⚠️
block/internal/da/async_block_retriever.go   66.66%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3262      +/-   ##
==========================================
+ Coverage   60.85%   60.91%   +0.06%     
==========================================
  Files         127      127              
  Lines       13762    13815      +53     
==========================================
+ Hits         8375     8416      +41     
- Misses       4476     4484       +8     
- Partials      911      915       +4     
Flag       Coverage Δ
combined   60.91% <66.66%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.
