chancloser: don't set ChanStatusCoopBroadcasted before close tx exists#10782
chancloser: don't set ChanStatusCoopBroadcasted before close tx exists#10782jtobin wants to merge 4 commits into
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where channels could become stranded in a limbo state during cooperative close negotiation if a peer disconnected before the closing transaction was finalized. By removing the premature setting of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
🔴 PR Severity: CRITICAL
🔴 Critical (2 files)
🟢 Low (1 file + excluded)
AnalysisThis PR modifies Severity bump check: 3 non-excluded files (< 20 threshold), 31 non-excluded lines changed (< 500 threshold), and only one distinct critical package touched — no bump applied. This PR warrants expert review due to the wallet/channel-close logic changes, even though the diff is small. The cooperative close path and RBF transitions directly affect fund safety. To override, add a |
There was a problem hiding this comment.
Code Review
This pull request addresses a bug where ChanStatusCoopBroadcasted was set before a cooperative close transaction actually existed, leading to issues during node restarts. The changes remove the premature marking of the channel as broadcasted in chancloser.go and associated RBF transition logic, ensuring the channel remains in a default state until a transaction is ready. New integration and unit tests were added to verify that channels with ShutdownInfo now correctly resume negotiation after a restart. One review comment regarding an unnecessary variable declaration was identified.
ziggie1984
left a comment
There was a problem hiding this comment.
Nice writeup — the analysis of why MarkCoopBroadcasted(nil, …) is redundant matches what I see tracing readers of ChanStatusCoopBroadcasted through loadActiveChannels, restartCoopClose,
and the chain arbitrator. The fix is correct: ShutdownInfo (persisted by MarkShutdownSent in initChanShutdown) is the right durable signal, and the existing peer/brontide.go:1367-1487
reconnect path already consumes it. A few things before this lands:
The new tests don't actually defend the fix. I verified empirically by cherry-picking the itest commit onto the parent of the fix.
The held HODL HTLC keeps the channel from flushing, so neither BeginNegotiation nor the RBF ChannelFlushing → ClosingNegotiation transition runs before the suspend — the call sites this
PR removes are simply unreached. Same issue with the two new unit tests: TestShutdownInfoChannelStaysActive is tautological (MarkShutdownSent only writes to shutdownInfoKey, can't
mutate chanStatus or the in-memory map by construction), and TestResendChanSyncFailsForCoopBroadcastedLimbo manually creates the limbo via MarkCoopBroadcasted(nil, …) rather than
relying on production code to produce it — so it tests the symptom, not that the fix prevents the state.
Real regression coverage exists on the RBF side, not the legacy side. Removing expectChanPendingClose() from TestRbfChannelFlushingTransitions implicitly turns into a "no unexpected
MarkCoopBroadcasted call" assertion via testify/mock's Called() + AssertExpectations. ✓ But the legacy BeginNegotiation path is unguarded: mockChannel.MarkCoopBroadcasted
(chancloser_test.go:164) is a hand-rolled stub with no call tracking. A regression that re-adds the nil call there would slip through.
Suggestion: drop the itest and the two new unit tests, and add call tracking to the legacy mockChannel — either a markCoopBroadcastedCalls slice with a require.NotContains(t, …, nilTx)
assertion in the existing BeginNegotiation test, or convert it to embed mock.Mock like the RBF side. Few lines, milliseconds to run, actually defends both paths.
Two smaller items:
- ChannelFlushed.FreshFlush is now dead — only producer (peer/brontide.go:4075) sets it unconditionally to true, and the test loops at rbf_coop_test.go:1785, 1822 no longer
differentiate between the two values. Worth removing in this PR or a follow-up. - The PendingChannels/ListChannels visibility change is acknowledged in the PR description but missing from the release note. Operators with monitoring keyed on those buckets will see
channels move differently — channels stay in ListChannels (inactive) until the real close tx is broadcast, and WaitingCloseChannel.ClosingTx is no longer ever empty. Worth a sentence.
Hmm not sure but I think before this change the channel would immediatly move to the pendingClose channel and now with the new change it would only move to the pending close if the close-transaction was actually created which I think is still the right behavior now. A channel can still remain in the |
|
(CI errors from the last run look to be unrelated flakes, from what I can tell.) |
|
what about ?
|
Was removed in 5150469. 👍 👍 |
gijswijs
left a comment
There was a problem hiding this comment.
A few things:
The commit message of abf09c6 starts with a paragraph that is unclear:
Replace the peer-level unit tests and the integration test (dropped in previous rebase) with call tracking on the legacy mockChannel.MarkCoopBroadcasted stub. The old tests either never reached the removed call sites or were tautological.
In Dutch we would say: I can't make chocolate from that.
- I can't make inline comments for code that's untouched, so I'll put it here:
peer/brontide.go:3629-3635
The leading comment in restartCoopClose reads:
If this channel has status
ChanStatusCoopBroadcastedand does not
have a closing transaction, then the cooperative close process was
started but never finished. We'll re-create the chanCloser state
machine and resend Shutdown.
Post-PR, fresh installations cannot reach the "status set + no close tx" state — that was the limbo state being fixed. The branch that handles that case is now exclusively a backwards-compat recovery path for nodes upgrading with pre-existing limbo channels (plus the always-RBF restart path). Without a comment update, the next engineer to touch this code will likely conclude the branch is unreachable and either delete it (breaking upgrade recovery) or be confused by it.
Suggested rewrite:
// If this channel has status ChanStatusCoopBroadcasted and a closing
// transaction was recorded, we just need to rebroadcast (handled by
// the chain arbitrator) and exit. If the status is set but no closing
// tx exists, the channel was left in the pre-PR-#10782 limbo state:
// fall through and re-drive the close negotiation via ShutdownInfo
// or LocalUpfrontShutdownScript.- The interface should explicitly say the tx must be non-nil:
lnwallet/chancloser/interface.go
// MarkCoopBroadcasted persistently marks that the channel close
// transaction has been broadcast. The transaction MUST be non-nil;
// callers must not invoke this until a concrete close tx has been
// constructed. Passing a nil tx historically left channels stranded
// in ChanStatusCoopBroadcasted with no recorded close tx (PR #10782).
MarkCoopBroadcasted(*wire.MsgTx, lntypes.ChannelParty) errorchanneldb.markBroadcastedaccepts nil silentlychanneldb/channel.go:2195-2202,markBroadcastedat 2208-2238
The lowest-level helper accepts closeTx == nil and quietly skips the tx-bucket write (line 2217: if closeTx != nil) while still setting the status flag. We could make this an error rather than a silent skip.
- Worth exploring is the following: Incoming-HTLC acceptance window on coop-close reconnect (Sorry for the giant copy-pasta)
Locations:
- Load gate:
peer/brontide.go:1253-1300(non-default-status branch skipsaddLink) - Default-status fall-through to
addLink:peer/brontide.go:1367-1487 - Outgoing-only disable on resume:
htlcswitch/link.go:3944-3957—PreviouslySentShutdown.WhenSomecalls onlyDisableAdds(Outgoing)at:3946, thenmarkReestablished()at:3961. - Incoming gate:
htlcswitch/link.go:4017—processRemoteUpdateAddHTLCrejects only whenIsFlushing(Incoming). - Where Incoming actually gets disabled: only on receipt of the peer's
Shutdown—peer/brontide.go:5306(legacy) andlnwallet/chancloser/rbf_coop_transitions.go:330, :444viaChanObserver.DisableIncomingAdds→peer/chan_observer.go:90-104→link.DisableAdds(Incoming).
Mechanism. Pre-PR (legacy path), once BeginNegotiation ran, the channel had ChanStatusCoopBroadcasted set. On reconnect, loadActiveChannels took the non-default-status branch at brontide.go:1253, called restartCoopClose, and continued at :1299 — the link was never registered, so no update_add_htlc could reach it. Post-PR, the channel stays ChanStatusDefault throughout shutdown negotiation, so loadActiveChannels falls through to addLink at :1444. resumeLink synchronously calls DisableAdds(Outgoing) before markReestablished(), but never calls DisableAdds(Incoming) — that only runs when we process the peer's resent Shutdown.
Pre-existing for RBF. The PR-removed call in RBF (ChannelFlushing → MarkCoopBroadcasted(nil, true)) was gated on FreshFlush. For RBF closes of already-clean channels (FreshFlush=false), the status was never set early, so this window already existed pre-PR. This PR extends the same window to legacy and to the FreshFlush=true RBF case.
Impact. A spec-violating peer can push an update_add_htlc between our reconnect-time link resume and our receipt of their resent Shutdown. We accept it. No fund loss — the HTLC either succeeds (we get paid) or is cancelled back. The plausible griefing angle is delaying coop close completion: legacy closing_signed exchange requires a clean channel, so an HTLC the peer pushes during the window must resolve first; combined with repeated disconnect/reconnect or a stuck HTLC this could push the close toward force-close. A spec-following peer never sends update_add_htlc after sending Shutdown, so practical attack surface is small.
PoC (security-auditor original, with one correction: update_add_htlc and commit_sig are separate messages, not nested):
Alice + malicious Bob, legacy coop close.
1. Bob → Alice: Shutdown. Alice replies with Shutdown; MarkShutdownSent.
2. Bob disconnects before sending ClosingSigned.
3. Alice loads channel as ChanStatusDefault (post-PR), registers the link.
4. Alice's resumeLink resends Shutdown via PreviouslySentShutdown and
calls DisableAdds(Outgoing). Incoming is NOT disabled. Link is then
markReestablished().
5. Bob reconnects and sends update_add_htlc *before* his own Shutdown.
Alice's processRemoteUpdateAddHTLC sees IsFlushing(Incoming)==false
and accepts the HTLC.
6. Bob then sends Shutdown; DisableIncomingAdds fires — too late.
Remediation options:
- In
loadActiveChannels(post-addLink) whenShutdownInfois present, also calllink.DisableAdds(Incoming)— mirror the outgoing disable already inresumeLink. Concretely: extendPreviouslySentShutdown.WhenSomeatlink.go:3944to disable both directions, or invoke it from the caller afteraddLinkreturns. - Alternatively, disconnect on receiving
update_add_htlcwhile localShutdownInfois set, regardless of whether the peer's Shutdown has arrived. - Add a regression itest:
closechannel → disconnect → reconnect → malicious peer sends UpdateAddHTLC before its own Shutdown → assert HTLC rejected.
Not sure if it's worth the effort tho.
Apart from that only small nits inline.
|
@gijswijs I'm investigating the potential griefing attack now; looks very minor, but if the mitigation is simple and its correctness/righteousness is easy to ascertain, I can address it here. (EDIT: yeah, looks like an easy fix. Need to patch up a couple of comments too; will do all that tomorrow.) |
|
@gijswijs Re: the griefing attack: it's obviously very minor (spec-violating peer can delay coop close; force close remains available), and it seems the "easy" mitigation could cause more severe problems. E.g. if we just disable incoming HTLC adds after persisting Shutdown, and then crash before sending the Shutdown message, then we could reject spec-observant HTLC adds upon restart, since the peer didn't get the message. Given the problem is minor, and the easy solution to it is arguably worse, I'm inclined to ignore it here. Rest of your feedback has been addressed! |
Remove the two call sites that set ChanStatusCoopBroadcasted before a cooperative close transaction exists: - BeginNegotiation in the legacy close path (chancloser.go) - ChannelFlushed handling in the RBF close path (rbf_coop_transitions.go) Both calls passed nil as the close tx, creating a "limbo" state where ChanStatusCoopBroadcasted is set but no close transaction is stored. This is unnecessary because ShutdownInfo — persisted earlier by MarkShutdownSent in initChanShutdown / the RBF ShutdownPending transition — already serves as the durable signal that the shutdown flow was entered. ChanStatusCoopBroadcasted should only be set when a real close transaction exists, which this change preserves.
Add call tracking to the legacy mockChannel so that every MarkCoopBroadcasted invocation is recorded. TestTaprootFastClose now asserts that at least one call was made and that every call carried a non-nil tx, guarding against the limbo state described in lightninglabs/taproot-assets#2108.
FreshFlush is never read in any transition handler. The only
producer (peer/brontide.go) sets it unconditionally to true,
and after the previous commit removed expectChanPendingClose,
the test loops that iterated over {true, false} no longer
differentiate between the two values.
Remove the field, the unconditional assignment, and collapse
the test loops into single sub-tests.
Fixes lightninglabs/taproot-assets#2108.
Lightly-edited summary, ctsy Opus, of the issue and its fix:
Of note: the OP's logs in the linked issue indicate that he landed in the ChanStatusCoopBroadcasted state without any close transaction, which, I believe, could only have occurred if MarkCoopBroadcasted was called with nil. There appear to be no other code paths that set ChanStatusCoopBroadcasted without storing a close tx.
I was pretty curious why those MarkCoopBroadcasted(nil) calls existed, but after investigating them thoroughly, I couldn't really find any good justification for them. Opus's summary of this analysis is also fruitful to include here: