diff --git a/docs/testing-guides/v0.21.0/README.md b/docs/testing-guides/v0.21.0/README.md new file mode 100644 index 0000000000..61d627cca1 --- /dev/null +++ b/docs/testing-guides/v0.21.0/README.md @@ -0,0 +1,76 @@ +# LND v0.21.0 — Release Candidate Testing Guide + +This directory contains structured testing guides for the v0.21.0 +release candidate. Each guide targets one feature or one high-risk +regression surface, and follows the same template so both human RC +testers and automated agents can work through them predictably. + +The RC announcement is [discussion +#10766](https://github.com/lightningnetwork/lnd/discussions/10766). +The full release notes are at +[`docs/release-notes/release-notes-0.21.0.md`](../../release-notes/release-notes-0.21.0.md). + +## How to use this directory + +Each guide is self-contained and follows the layout in +[`_template.md`](./_template.md): + +1. **Prerequisites** — what to build, which backend, which network, + which peers and config flags. +2. **Setup** — copy-pasteable commands to reach the starting state. +3. **Scenarios** — numbered cases, each with a deterministic + pass/fail signal (an exact RPC field value, log line, or exit + code — not "should succeed"). +4. **Failure investigation** — logs and RPCs to query when a scenario + fails. + +**Humans:** pick guides matching the surface you care about and run +through the scenarios. Report results on +[discussion #10766](https://github.com/lightningnetwork/lnd/discussions/10766) +or open an issue if you find a regression. + +**Agents:** the fixed section order is the contract. The Pass/Fail +signal line in each scenario is the verification target. + +## Guides + +Ordered by risk to RC testers. Start at the top. + +### Headline features + +| # | Guide | Summary | +|---|---|---| +| 1 | [Production simple taproot channels](./production-taproot-channels.md) | Feature bits 80/81, optimized scripts, map-based nonce encoding. 
| +| 2 | [RBF cooperative close for taproot channels](./rbf-taproot-coop-close.md) | MuSig2 JIT nonces, nonce-reuse prevention, `--protocol.rbf-coop-close`. | +| 3 | [Payment store KV→SQL migration](./payment-sql-migration.md) | Automatic migration for `--db.use-native-sql` nodes; bbolt users must `lndinit` first. | +| 4 | [Onion messaging + rate limiting](./onion-messaging.md) | Basic onion message forwarding, pathfinding, per-peer/global rate limiters, channel-presence gate. | + +### High-risk regressions and breaking changes + +| # | Guide | Why it's risky | +|---|---|---| +| 5 | [Closed-channel tombstone (sqlite/postgres downgrade trap)](./closed-channel-tombstone.md) | One-way upgrade on KV-over-SQL backends; downgrading after closes resurrects channels as open. | +| 6 | [Reorg-safe channel closes + MinCLTVDelta change](./reorg-safe-closes.md) | Closes now require 3–6 confs scaled to capacity; `MinCLTVDelta` raised 18→24 (breaking for custom-CLTV invoices). | +| 7 | [`chain_params` network-mismatch DB guard](./chain-params-guard.md) | Native-SQL nodes refuse to start if the DB was previously used on a different network. | +| 8 | [`GetDebugInfo` log opt-in breaking change](./getdebuginfo-log-optin.md) | Clients relying on the `log` field break unless they pass `include_log=true`. | + +### New RPCs / operator features + +| # | Guide | Summary | +|---|---|---| +| 9 | [New payment-adjacent RPCs](./payment-rpcs.md) | `DeleteForwardingHistory`, MuSig2 coordinator nonces, `EstimateFee` inputs, HTLC event invoice failures, `SubscribeChannelEvents` updates. | +| 10 | [Multiple read-only middleware interceptors](./middleware-multiple-readonly.md) | More than one read-only RPC middleware interceptor can register at once. | + +## Reporting results + +- Working as expected: a 👍 reaction on the RC discussion is fine. +- Regression or unexpected behavior: open an issue with the guide + name, scenario number, and the captured output. 
Link to the issue + in the discussion thread. + +## Authoring new guides + +Copy [`_template.md`](./_template.md) to `.md`, fill in +every section, and add an entry to the table above. Keep the section +order intact. If a section genuinely doesn't apply, write `n/a` — +don't delete the heading. diff --git a/docs/testing-guides/v0.21.0/_template.md b/docs/testing-guides/v0.21.0/_template.md new file mode 100644 index 0000000000..2cc8d7c697 --- /dev/null +++ b/docs/testing-guides/v0.21.0/_template.md @@ -0,0 +1,140 @@ + + +# — v0.21.0 RC Testing Guide + +**PRs:** #XXXX, #YYYY +**Risk:** headline | high-regression | new-rpc | operator-feature +**Audience:** node operators | RPC clients | LSPs | wallet integrators +**Backends affected:** bbolt | sqlite | postgres | all +**Networks:** regtest | signet | testnet | mainnet + +## What this feature does + +One to three sentences in plain English. No marketing language. State +what changed in observable behavior, not internal refactors. + +## Why it matters / what could break + +Concrete failure modes a tester should look for. Examples: +- "If X is wrong, channel force-closes." +- "If Y is wrong, payments stall in `IN_FLIGHT`." +- "If Z is wrong, the node refuses to start after upgrade." + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer, built with ``. +- **Backend:** `bitcoind` / `btcd` / `neutrino`. +- **Network:** regtest unless noted. +- **Peers:** N nodes (Alice, Bob, Carol). State counts, channels, + balances at the start of the scenarios. +- **Config flags:** + ``` + protocol.option-name=value + db.option-name=value + ``` +- **Tools:** `lncli`, `bitcoin-cli`, `jq`, ... + +Define every shell variable used in the Setup and Scenarios blocks: +``` +ALICE_RPC=localhost:10001 +BOB_PUBKEY=03... +``` + +## Setup + +Numbered, copy-pasteable steps to get from "fresh nodes" to the +starting state for the scenarios. End with a single command whose +output proves setup succeeded. + +```bash +# 1. Start nodes +... 
+ +# 2. Fund Alice +... + +# 3. Open Alice→Bob channel +... + +# Setup verification: +lncli --rpcserver=$ALICE_RPC listchannels | jq '.channels | length' +# Expected: 1 +``` + +## Scenarios + +### S1: + +**Goal:** What this scenario proves. + +**Steps:** +```bash +# 1. ... +# 2. ... +``` + +**Expected:** +- Concrete observable 1 (RPC field = value, log contains line, etc.) +- Concrete observable 2. + +**Pass/Fail signal:** +- **PASS** if `lncli ... | jq '.field'` returns `"expected_value"`. +- **FAIL** if the command errors, returns a different value, or the + log shows ``. + +--- + +### S2: + +**Goal:** ... + +**Steps:** ... + +**Expected:** ... + +**Pass/Fail signal:** ... + +--- + +(Add 2–5 scenarios. Cover at least one happy path, one edge case, and +one negative-path / misconfiguration scenario.) + +## Failure investigation + +When a scenario fails, here's where to look first: + +- **Logs:** + - `grep -i "" ~/.lnd/logs/bitcoin/mainnet/lnd.log` + - Subsystems to enable at `debug`: ``, ``. +- **RPCs to query for state:** + - `lncli ` — what to look at and what value indicates the bug. +- **Common bugs / prior regressions:** brief pointers, ideally with + PR / issue numbers. + +## Related itests + +Point to itest cases that exercise this code path. They're not a +substitute for manual scenarios but are useful executable references: +- `itest/lnd__test.go::Test` + +## Out of scope + +What this guide does not test (to prevent scope creep and to direct +testers to the right guide). 
diff --git a/docs/testing-guides/v0.21.0/chain-params-guard.md b/docs/testing-guides/v0.21.0/chain-params-guard.md new file mode 100644 index 0000000000..38f6eccc90 --- /dev/null +++ b/docs/testing-guides/v0.21.0/chain-params-guard.md @@ -0,0 +1,188 @@ +# `chain_params` Network-Mismatch DB Guard — v0.21.0 RC Testing Guide + +**PR:** #10684 +**Risk:** high-regression +**Audience:** node operators running native-SQL backends, anyone with a multi-network setup +**Backends affected:** sqlite, postgres (with `--db.use-native-sql`) +**Networks:** all + +## What this feature does + +On first startup against v0.21.0, the daemon writes the active +Bitcoin network (mainnet, testnet, signet, regtest) into a new +`chain_params` row in the SQL database. On every subsequent startup, +the daemon compares the stored value against the configured network +and refuses to start if they differ, printing a clear error and +remediation steps. + +This closes a silent-data-corruption hole: previously, accidentally +pointing the same DB at a different network would proceed and start +writing mismatched chain state. + +The guard applies to PostgreSQL and SQLite native-SQL backends when +running with `--db.use-native-sql=true`. bbolt is unaffected. + +## Why it matters / what could break + +- The guard must fire on **every** network change, not just + mainnet/testnet. Regtest ↔ signet ↔ testnet swaps need to be + caught too. +- The guard must fire **before** any chain-touching subsystem + initializes; otherwise the data corruption it's meant to prevent + has already started. +- The error message must be actionable — operators need to know + how to recover (reset the DB? change the config back?). +- The guard must not interfere with the first-ever startup + (network is unset, so any value is allowed and persisted). + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Backend:** sqlite (easiest) with `--db.use-native-sql=true`. +- **Two networks reachable** from the same machine, e.g. 
regtest and + signet, or regtest with two different `--regtest` instances using + distinct genesis hashes. (The latter requires + `chainparams.go`-level tweaks; using `regtest` and `signet` is + more practical.) + +## Setup + +```bash +# 1. Fresh data directory. +ALICE_DIR=/tmp/alice-chainparams-test +rm -rf $ALICE_DIR +mkdir -p $ALICE_DIR + +# 2. lnd.conf: +# bitcoin.regtest=1 +# db.use-native-sql=true +# db.backend=sqlite + +# 3. Start lnd, wait for it to come up, then `lncli stop`. +``` + +## Scenarios + +### S1: First startup persists the active network + +**Goal:** A fresh DB plus a configured network results in a +populated `chain_params` row. + +**Steps:** +```bash +# Start lnd on regtest. Wait until lncli getinfo returns OK. +# Stop cleanly. +$LNCLI_A stop + +# Inspect the chain_params table. +sqlite3 $DB "SELECT * FROM chain_params;" +``` + +**Pass/Fail signal:** +- **PASS** if `chain_params` contains exactly one row identifying + regtest (by name or genesis hash, depending on the schema). +- **FAIL** if the table is empty, missing, or has more than one row. + +--- + +### S2: Same network restart proceeds without warning + +**Steps:** Restart lnd against the same DB with the same +configuration. + +**Pass/Fail signal:** +- **PASS** if lnd starts normally, `getinfo` returns OK, and the + startup log contains no chain-params warnings. +- **FAIL** if startup logs a warning or error about chain params + even though nothing changed. + +--- + +### S3: Different network startup is refused with a clear error + +**Goal:** Swap the configured network to `signet` while pointing at +the same SQL DB. lnd must refuse to start. + +**Steps:** +```bash +# Edit lnd.conf: replace bitcoin.regtest=1 with bitcoin.signet=1. +# Start lnd and capture the exit status + stderr. +lnd --lnddir=$ALICE_DIR ...; echo "exit=$?" 
+``` + +**Pass/Fail signal:** +- **PASS** if all of: + - exit code is non-zero, + - the error message names both the stored network (regtest) and + the configured network (signet), + - the error suggests a remediation (e.g. "use a fresh data + directory, or change your configured network back"), + - no chain-touching subsystems were initialized (search the log + for evidence of chain RPC calls — none should have happened). +- **FAIL** if lnd starts despite the mismatch, exits without a + clear message, or partially initializes before exiting. + +--- + +### S4: Reverting the network restores normal startup + +**Goal:** After hitting the guard, switching the config back to the +stored network must let the node start again — i.e. the guard is +non-destructive. + +**Steps:** Revert `lnd.conf` back to `bitcoin.regtest=1` and start +lnd. + +**Pass/Fail signal:** +- **PASS** if lnd starts normally and `getinfo` returns OK. +- **FAIL** if lnd still refuses (the failed attempt left state + behind that should not have). + +--- + +### S5: bbolt-backed node is unaffected + +**Goal:** Confirm the guard only applies to native-SQL backends. +bbolt operators get the existing behavior (no guard, no false alarm). + +**Steps:** +- Spin up Bob on bbolt (no `db.use-native-sql=true`). +- Repeat S1–S3 against him. + +**Pass/Fail signal:** +- **PASS** if all three startups succeed on bbolt, including the + network swap. (Operators on bbolt still need to know swapping + networks corrupts data — but the guard is not their tool.) +- **FAIL** if the guard fires on bbolt despite the feature being + scoped to native-SQL. + +## Failure investigation + +- **Subsystems:** `LNDB`, `CONF`, `RPCS` at `debug`. +- **Useful log lines:** grep for `chain_params`, `network mismatch`, + `configured network`. 
+- **Direct SQL:** + ```sql + SELECT * FROM chain_params; + ``` +- **Remediation if a real operator hits this:** the documented + guidance should be "you almost certainly want to revert the + config change; if you genuinely meant to switch networks, point + lnd at a fresh data directory". Verify the error message says + this or something equivalent. + +## Related itests + +- A startup-failure itest for chain-params mismatch should exist + in v0.21.0 — verify it does. If not, this guide highlights a + coverage gap worth filling. + +## Out of scope + +- Re-using a bbolt database across networks (not covered by this + guard). +- Postgres-specific schema differences — assume the guard behaves + identically across sqlite and postgres; spot-check the postgres + side if available. +- Recovery tooling for an operator who already corrupted their DB + before v0.21.0 — not something this guide can fix. diff --git a/docs/testing-guides/v0.21.0/closed-channel-tombstone.md b/docs/testing-guides/v0.21.0/closed-channel-tombstone.md new file mode 100644 index 0000000000..8665cc5d41 --- /dev/null +++ b/docs/testing-guides/v0.21.0/closed-channel-tombstone.md @@ -0,0 +1,259 @@ +# Closed-Channel Tombstone on KV-over-SQL Backends — v0.21.0 RC Testing Guide + +**PR:** #10780 +**Risk:** high-regression (one-way upgrade) +**Audience:** node operators on `sqlite` or `postgres` backends with channels they may close +**Backends affected:** sqlite, postgres (kvdb-on-SQL schema) +**Networks:** all + +> ⚠️ **Downgrade warning.** On sqlite and postgres, once a channel +> is closed under v0.21.0+, the underlying `chanBucket` (revocation +> log, forwarding-package state) **remains on disk**. The close is +> signalled by an `outpointClosed` flip in the outpoint index. 
+> **Pre-0.21 binaries do not consult that flip when iterating the +> open-channel bucket.** Downgrading to a pre-0.21 binary after +> closing channels on these backends will resurrect those channels +> as "open" in `listchannels`, `pendingchannels`, and the chain-watch +> path. Treat the v0.21.0 upgrade as one-way on sqlite/postgres if +> you close any channels on it. +> +> bbolt and etcd users are unaffected — the close path on those +> backends still deletes the `chanBucket` synchronously. + +## What this feature does + +Before v0.21.0, `CloseChannel` issued a single `DeleteNestedBucket` +for the channel inside the close transaction. On the kvdb-on-SQL +schema (sqlite, postgres) that delete fans out into a row-by-row +`ON DELETE CASCADE` over the channel's revocation log and +forwarding-package bucket. On channels with millions of states this +held the database write-lock for many seconds — long enough to stall +HTLC forwarding, time out `htlcswitch` retries, and trigger +force-close cycles on adjacent channels. + +v0.21.0 changes `CloseChannel` on the kvdb-on-SQL backends to **skip +the cascading delete**. The bulk historical state stays on disk for +the lifetime of the database; the authoritative closed-channel +marker is the existing outpoint-index flip from `outpointOpen` to +`outpointClosed`. Every reader of the open-channel bucket has been +updated to consult the outpoint index before treating a channel as +open. bbolt and etcd retain the synchronous one-shot close path (the +cascade is cheap there). + +The bulk historical state is reclaimed wholesale by the upcoming +native-SQL channel-state migration in a future release. + +## Why it matters / what could break + +- **The downgrade trap above.** This is the one to make sure + operators see in the release notes. +- A reader that forgot to consult the outpoint index → a closed + channel reappears as open in `listchannels`, `pendingchannels`, or + the chain-watch filter. Find it now, not in the wild. 
+- A new code path (post-v0.21.0) that creates a channel whose + outpoint is in the closed-index but whose chanBucket still has the + old state → potential state confusion on funding-output collision. +- bbolt/etcd path **must** keep the synchronous delete; if a future + refactor moves them onto the tombstone path, those operators lose + the data-reclaim property silently. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Two pairs of test environments:** one on `sqlite` (or `postgres`), + one on `bbolt`. The cross-check between them is the point. +- **Tools:** `lncli`, `bitcoin-cli`, `sqlite3` (or `psql`), `jq`. +- **A pre-0.21 lnd binary** in reach for the downgrade scenario + (S5). Only run S5 against a throwaway copy of the database — the + whole point is that it corrupts the state. + +## Setup + +```bash +# 1. Start Alice on sqlite (or postgres). Start Bob on bbolt +# (so we can cross-check against the other backend). +# 2. Open and confirm a channel between Alice and Bob (any commitment +# type). Push a handful of small payments to populate the +# revocation log. +# 3. Record the channel point. +CP=$($LNCLI_A listchannels | jq -r '.channels[0].channel_point') + +# 4. Note the sqlite DB path on Alice. +DB_A=$ALICE_DIR/data/chain/bitcoin/regtest/channel.db # or your sqlite filename +``` + +## Scenarios + +### S1: Coop close on sqlite/postgres completes quickly (no stall) + +**Goal:** Confirm the regression fix. Closing a channel with a +non-trivial revocation log no longer holds the write-lock for +seconds. + +**Steps:** +```bash +# Time the close. 
+time $LNCLI_A closechannel \ + --funding_txid=${CP%:*} --output_index=${CP##*:} --block + +# --block waits for the close to confirm, so run the mining command +# from a second terminal while closechannel is still waiting. +bitcoin-cli -regtest generatetoaddress 6 $ADDR +``` + +**Pass/Fail signal:** +- **PASS** if the `closechannel --block` returns in under 1 second + after the close-tx is broadcast (the `--block` wait for 1 conf + dominates the wall-clock; what we care about is no DB-side stall + during close-state-transition). +- **FAIL** if the daemon log shows a `DB write took NNs` warning, or + the close noticeably lags adjacent HTLC traffic. + +--- + +### S2: Closed channel no longer appears in `listchannels` / `pendingchannels` + +**Goal:** Despite the tombstone leaving rows on disk, every reader +must treat the channel as closed. + +**Steps:** +```bash +$LNCLI_A listchannels | jq '.channels | length' +$LNCLI_A pendingchannels | jq '.waiting_close_channels | length' +$LNCLI_A pendingchannels | jq '.pending_force_closing_channels | length' +$LNCLI_A closedchannels | jq '.channels | length' +``` + +**Pass/Fail signal:** +- **PASS** if all of `listchannels`, `waiting_close_channels`, and + `pending_force_closing_channels` return 0, and `closedchannels` + contains the closed channel. +- **FAIL** if `listchannels` still shows the closed channel — the + outpoint-index gate regressed. + +--- + +### S3: The `chanBucket` remains on disk after close (sqlite) + +**Goal:** Verify the tombstone actually skipped the delete. This is +the property that creates the downgrade trap; we want to *see* it, +not assume it. + +**Steps:** +```bash +sqlite3 $DB_A "SELECT COUNT(*) FROM kvstore WHERE key LIKE '%openChannelBucket%';" +sqlite3 $DB_A "SELECT COUNT(*) FROM kvstore WHERE key LIKE '%revocationLog%';" +``` + +(The exact bucket prefix and table layout depend on the kvdb-on-SQL +schema; consult `channeldb` / `kvdb/sqlbase` for the right +predicate. The point is: rows exist for the closed channel.) 
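Once both row counts and the result of the outpoint-index check are in hand, the verdict for this scenario can be produced mechanically. The following is a minimal sketch, not part of lnd or lncli: `tombstone_verdict` is a hypothetical helper, and its third argument assumes the tester has already run their own schema-specific query to decide whether the outpoint is marked closed.

```shell
# Hypothetical helper: turns the S3 observations into a verdict line.
#   $1  row count from the openChannelBucket query
#   $2  row count from the revocationLog query
#   $3  1 if the tester's own query found the outpoint marked closed,
#       0 otherwise (that query depends on the schema and is not shown)
tombstone_verdict() {
  chan_rows=$1
  revlog_rows=$2
  outpoint_closed=$3

  if [ "$outpoint_closed" -ne 1 ]; then
    # Rows may or may not exist, but the close marker is missing.
    echo "FAIL: outpoint flip missing"
  elif [ "$chan_rows" -gt 0 ] && [ "$revlog_rows" -gt 0 ]; then
    # Tombstone behavior: historical rows kept, outpoint marked closed.
    echo "PASS"
  else
    echo "FAIL: expected rows missing (chan=$chan_rows, revlog=$revlog_rows)"
  fi
}
```

For example, `tombstone_verdict 12 340 1` reports a pass, while a zero in either count or a `0` flag reports a failure with the reason.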
+ +**Pass/Fail signal:** +- **PASS** if both counts are > 0 *and* the corresponding outpoint + appears in the `outpointClosed` index (verify with a second + query). +- **FAIL** if the chanBucket rows are gone (the tombstone behavior + didn't apply on this backend), or if the outpoint flip is missing + (the channel is in limbo). + +--- + +### S4: bbolt control — chanBucket IS deleted + +**Goal:** Cross-check that bbolt/etcd retain the synchronous delete. + +**Steps:** +- Repeat S1+S3 on Bob (bbolt). Verify with the equivalent bbolt + introspection (e.g. `bbolt` CLI or `lndinit dump` against the + channel.db). + +**Pass/Fail signal:** +- **PASS** if on bbolt the chanBucket for the closed channel is + gone (delete still happened) and `closedchannels` still records + the close. Both backends should expose the same operator-facing + state via RPC; only the on-disk representation differs. +- **FAIL** if bbolt left chanBucket rows behind (the change leaked + to the wrong backend) or removed the closed-channel summary. + +--- + +### S5: Downgrade resurrects closed channels (DESTRUCTIVE — copy first) + +**Goal:** Demonstrate the documented downgrade trap on a throwaway +DB copy, so we can confirm the warning is accurate and the surface +is exactly as advertised. + +> Only run this against a copy of the sqlite/postgres database. The +> downgrade is a one-way corruption. + +**Steps:** +```bash +$LNCLI_A stop +cp -r $ALICE_DIR ${ALICE_DIR}.tombstone-test +# Swap the binary back to a pre-0.21 release. +lnd-pre-021 --lnddir=${ALICE_DIR}.tombstone-test ... +``` + +**Pass/Fail signal:** +- **PASS** if `listchannels` against the downgraded daemon shows + the previously-closed channel as open (confirming the warning is + real), and there are no surprises beyond the documented + resurrection (no panics, no force-closes attempted on the + resurrected channel). 
+- **FAIL** if the downgraded daemon panics, force-closes on + startup, or behaves differently than the release-notes warning + predicts. The warning needs updating. + +After this scenario, **discard the copy**. + +--- + +### S6: HTLC forwarding does not stall during a close + +**Goal:** The original motivation for #10780 — the write-lock held +during the cascade used to stall HTLC forwarding. Confirm it +doesn't anymore. + +**Steps:** +- Open a second channel Alice ↔ Carol so Alice has two channels. +- Start a steady stream of payments through Alice→Bob→Carol (or any + two-channel routing path). +- Mid-stream, close Alice's other channel (the one not on the + payment path). + +**Pass/Fail signal:** +- **PASS** if no payment fails with a routing-layer timeout during + the close, and Alice's log shows no `htlcswitch retry timed out` + or `db write took` warnings. +- **FAIL** if any in-flight payment fails or retries during the + close window. + +## Failure investigation + +- **Subsystems:** `LNDB`, `CRTR`, `HSWC`, `RPCS` at `debug`. +- **Key log lines to watch for:** + - `tombstoning channel` (or whatever the v0.21.0 code emits when + taking the new path) + - `db write took NNms` warnings + - `outpointClosed` index transitions +- **DB introspection:** `sqlite3` direct queries for the channel's + outpoint index entries and chanBucket presence. +- **Cross-reference:** if S2 reports a closed channel still listed + as open, search the readers of `openChannelBucket` for any path + that doesn't consult `outpointIndex` — that's where the bug is. + +## Related itests + +- `itest/lnd_channel_force_close_test.go` and + `itest/lnd_channel_open_test.go` — exercise close paths but may + not specifically cover the tombstone behavior. Worth adding an + itest if one doesn't exist. +- `channeldb` package tests for outpoint-index transitions. + +## Out of scope + +- The native-SQL channel-state migration that will reclaim the + tombstoned data — not in v0.21.0. 
+- Performance benchmarks of the close path on multi-million-state + channels — qualitative confirmation (no stall) suffices for the + RC. +- bbolt-to-sqlite conversion (use `lndinit`). diff --git a/docs/testing-guides/v0.21.0/getdebuginfo-log-optin.md b/docs/testing-guides/v0.21.0/getdebuginfo-log-optin.md new file mode 100644 index 0000000000..dd2ecadb57 --- /dev/null +++ b/docs/testing-guides/v0.21.0/getdebuginfo-log-optin.md @@ -0,0 +1,140 @@ +# `GetDebugInfo` Log Opt-In Breaking Change — v0.21.0 RC Testing Guide + +**PR:** #10613 +**Risk:** high-regression (breaking) +**Audience:** RPC clients calling `GetDebugInfo`, anyone scripting `lncli getdebuginfo` or `lncli encryptdebugpackage` +**Backends affected:** all +**Networks:** all + +## What this feature does + +`GetDebugInfo` previously returned both the daemon's configuration +map and the contents of the log file by default. v0.21.0 makes the +log content **opt-in**: + +- Default response: configuration only. The `log` field is empty/omitted. +- With `include_log=true` on the gRPC request (or `--include_log` on + `lncli getdebuginfo` / `lncli encryptdebugpackage`): the log file + is included as before. + +This is a real breaking change for any client that consumed the `log` +field without setting the new flag. + +## Why it matters / what could break + +- Monitoring tools or support-package generators that called + `GetDebugInfo` and uploaded the response now upload an empty log + silently. They will not error — they will look healthy while + shipping useless debug bundles. +- Scripts that parsed `lncli getdebuginfo` output for log lines + will start matching nothing. +- The `--include_log` flag must propagate cleanly into the encrypted + debug package; otherwise support-flow encrypted bundles will be + log-free. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **A running lnd node** with some log activity (any regtest setup + with a few RPCs called against it works). 
+- **Tools:** `lncli`, `jq`, `grpcurl` (to confirm raw gRPC behavior). + +## Scenarios + +### S1: Default `GetDebugInfo` omits the log + +**Goal:** A plain `GetDebugInfo` call returns config only, no log. + +**Steps:** +```bash +# Via lncli: +$LNCLI_A getdebuginfo | jq '.log | length' + +# Via raw gRPC: +grpcurl ... -d '{}' lnrpc.Lightning/GetDebugInfo | jq '.log | length' +``` + +**Pass/Fail signal:** +- **PASS** if both queries return `0` or `null` for the `log` + field, and `config` is populated. +- **FAIL** if `log` is non-empty (the breaking change didn't land), + or if `config` is empty (regression). + +--- + +### S2: `--include_log` opts the log content back in + +**Steps:** +```bash +$LNCLI_A getdebuginfo --include_log | jq '.log | length' +grpcurl ... -d '{"include_log": true}' lnrpc.Lightning/GetDebugInfo | jq '.log | length' +``` + +**Pass/Fail signal:** +- **PASS** if both return a length > 0 and the content matches the + daemon's actual log file (compare against the file on disk). +- **FAIL** if `log` is empty despite `include_log=true`, or if the + content is truncated unexpectedly. + +--- + +### S3: `encryptdebugpackage --include_log` includes the log + +**Goal:** The opt-in flag propagates through the encrypt path. + +**Steps:** +```bash +# Without the flag. +$LNCLI_A encryptdebugpackage --pubkey > /tmp/pkg-nolog.bin + +# With the flag. +$LNCLI_A encryptdebugpackage --pubkey --include_log > /tmp/pkg-withlog.bin + +# Compare sizes. +ls -la /tmp/pkg-nolog.bin /tmp/pkg-withlog.bin +``` + +**Pass/Fail signal:** +- **PASS** if `pkg-withlog.bin` is noticeably larger than + `pkg-nolog.bin` (the log is in there), and decrypting both with + the corresponding private key shows the log section present / + absent respectively. +- **FAIL** if they're the same size (the flag is being ignored), or + if the no-log package still contains log content. 
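The size comparison in this scenario can be turned into a scriptable signal. The sketch below is an illustration only: `pkg_size_check` is a hypothetical helper, not an lncli command, and it treats "noticeably larger" as simply "strictly larger in bytes", which is an assumption. It does not cover the decryption half of the check.

```shell
# Hypothetical helper: compares the byte sizes of the two packages
# written by the steps above.
#   $1  path to the package created without --include_log
#   $2  path to the package created with --include_log
pkg_size_check() {
  nolog=$1
  withlog=$2

  # Reading from stdin makes wc print only the count; tr strips the
  # padding some wc implementations add.
  nolog_size=$(wc -c < "$nolog" | tr -d ' ')
  withlog_size=$(wc -c < "$withlog" | tr -d ' ')

  if [ "$withlog_size" -gt "$nolog_size" ]; then
    echo "PASS: with-log package is larger ($withlog_size > $nolog_size bytes)"
  else
    echo "FAIL: with-log package is not larger ($withlog_size <= $nolog_size bytes)"
    return 1
  fi
}
```

After running the steps, `pkg_size_check /tmp/pkg-nolog.bin /tmp/pkg-withlog.bin` prints one verdict line and returns a non-zero status on failure, so it can gate a larger test script.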
+ +--- + +### S4: Existing clients that don't set the flag get no log silently + +**Goal:** Confirm the breaking-change behavior matches the +documented contract — no spurious errors, just a quiet omission. + +**Steps:** Call `GetDebugInfo` from a client written against the +v0.20 proto / SDK (i.e. one that doesn't know about `include_log`). + +**Pass/Fail signal:** +- **PASS** if the call succeeds, returns `config` populated, and + `log` empty. No `unknown field` errors, no panics. +- **FAIL** if the call errors out due to the new field, or + surprisingly returns the log anyway. + +## Failure investigation + +- **Subsystems:** `RPCS` at `debug`. +- **What to check if `--include_log` returns empty log:** + - The daemon's `logfile` config — is the log being written to the + expected path? + - File permissions on the log file from the daemon process. + - The proto generation — `git diff` against the proto regenerate + pipeline to confirm `include_log` is wired both ways. + +## Related itests + +- `itest/lnd_macaroons_test.go` or a dedicated debug-info itest if + one exists. Worth adding a unit/integration test if missing. + +## Out of scope + +- Encryption details of `encryptdebugpackage` — see existing docs + for the format. This guide tests the log-inclusion flag only. +- Migrating clients to the new flag — that's a downstream task. 
diff --git a/docs/testing-guides/v0.21.0/middleware-multiple-readonly.md b/docs/testing-guides/v0.21.0/middleware-multiple-readonly.md new file mode 100644 index 0000000000..0d5b6cc2fa --- /dev/null +++ b/docs/testing-guides/v0.21.0/middleware-multiple-readonly.md @@ -0,0 +1,175 @@ +# Multiple Read-Only RPC Middleware Interceptors — v0.21.0 RC Testing Guide + +**PR:** #10611 +**Risk:** operator-feature +**Audience:** integrators running RPC middleware (audit logging, metrics, policy) +**Backends affected:** all +**Networks:** all + +## What this feature does + +Pre-v0.21.0, only a single read-only RPC middleware interceptor could +register at a time. v0.21.0 lifts that restriction: multiple +clients can register simultaneously with `read_only_mode=true` on +`MiddlewareRegistration`. Each registered middleware receives every +intercepted request/response, none can alter responses. + +The custom-macaroon-caveat middleware mode is unchanged — there can +still be at most one middleware per distinct caveat name, and those +remain mutually exclusive with read-only mode for the same client. + +## Why it matters / what could break + +- Two read-only middlewares connect → both receive intercepts. + Regression: only the first registers, the second errors with the + pre-v0.21 "already registered" message. +- Order-of-delivery to multiple middlewares should be deterministic + (per the middleware-pipeline contract, not arbitrary). +- A read-only middleware disconnecting mid-stream must not break + the pipeline for the others. +- A custom-caveat middleware registered alongside read-only ones + should still work; verify the caveat-vs-read-only mutex is + per-client and not global. +- Registration cleanup on disconnect — if a middleware drops without + unregistering, its slot must be freed so the next attempt can + register. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer, with macaroons enabled + (default). 
+- **A middleware client** that opens the bidi stream on + `lnrpc.Lightning.RegisterRPCMiddleware`, sends a + `MiddlewareRegistration` with `read_only_mode=true`, and logs + every intercept it receives. The example in + [`docs/macaroons.md`](../../macaroons.md) or the + `lnrpc/lightning.proto` `RegisterRPCMiddleware` description is + the reference. For testing it's enough to write a 50-line + grpcurl wrapper or a small Go program. +- **Tools:** `lncli`, `grpcurl`, `jq`. + +## Scenarios + +### S1: Two read-only middlewares register and both observe an RPC + +**Goal:** Confirm the registration limit is gone and both clients +see the same intercepts. + +**Steps:** +1. Start middleware client A — register with + `middleware_name="mw-a"`, `read_only_mode=true`. Log every + intercept it receives. +2. Start middleware client B — register with + `middleware_name="mw-b"`, `read_only_mode=true`. Log every + intercept it receives. +3. From a separate client, run `lncli getinfo`. + +**Pass/Fail signal:** +- **PASS** if both A and B log a `GetInfo` intercept (request and + response), and `lncli getinfo` returns successfully. +- **FAIL** if B's registration is rejected with an + "already-registered" error, or B never receives any intercepts. + +--- + +### S2: A third read-only middleware can register too + +**Goal:** No magic-number-two limit hidden anywhere. + +**Steps:** Add a third middleware client C with the same +configuration. Issue another `lncli getinfo`. + +**Pass/Fail signal:** +- **PASS** if A, B, and C all log the intercept. +- **FAIL** if registration is rejected at any specific count, or + if any of the three stops receiving intercepts. + +--- + +### S3: One middleware disconnects without affecting the others + +**Goal:** Cleanup-on-disconnect works and the pipeline keeps +intercepting for the remaining clients. + +**Steps:** +1. With A and B registered, drop A's stream (Ctrl-C the client). +2. Wait 2–3 seconds. +3. Run another `lncli getinfo`. +4. 
Re-register a new A' under the same name. + +**Pass/Fail signal:** +- **PASS** if (a) B still receives the post-disconnect intercept, + (b) lnd's log records A's cleanup (`middleware ... disconnected` + or similar), and (c) the new A' registration succeeds. +- **FAIL** if B stops receiving intercepts after A drops, or if + A''s re-registration is rejected because the slot wasn't freed. + +--- + +### S4: A read-only middleware cannot alter a response + +**Goal:** Property still holds — read-only is read-only. + +**Steps:** From middleware client A, intercept a `GetInfo` +response and try to mutate it (e.g. change `identity_pubkey`) +before sending the `InterceptFeedback`. The framework must reject +the mutation. + +**Pass/Fail signal:** +- **PASS** if `lncli getinfo` returns the unmodified value, and + the daemon log records a rejection (`middleware attempted to + alter response` or similar). The mutating middleware can + optionally be disconnected by the daemon — verify that matches + the contract. +- **FAIL** if the mutation goes through (broken invariant), or + if a benign read-only intercept is incorrectly flagged as a + mutation. + +--- + +### S5: Read-only + custom-caveat middlewares coexist + +**Goal:** A read-only middleware and an independent +custom-caveat middleware can both register at the same time +without interfering. + +**Steps:** +1. Register middleware A with `read_only_mode=true`. +2. Bake a macaroon with caveat name `my-caveat`. +3. Register middleware B with + `custom_macaroon_caveat_name="my-caveat"` (and + `read_only_mode=false`). +4. Call `lncli getinfo` with the caveat-bearing macaroon. + +**Pass/Fail signal:** +- **PASS** if both A (read-only) and B (caveat) receive the + intercept, and the response is returned to the caller. A call + *without* the caveat macaroon must reach A but not B. 
+- **FAIL** if either registration is rejected with a "mutual + exclusion" error, or if the caveat-targeted middleware + receives intercepts that don't carry the caveat. + +## Failure investigation + +- **Subsystems:** `RPCS`, `RPCSV` (depending on which subsystem + owns the middleware pipeline in v0.21.0). +- **Useful greps:** `middleware`, `register`, `intercept`, + `read_only_mode`. +- **Registration-state check:** if lnd exposes a status RPC for + registered middlewares, query it before/after each scenario. + Otherwise, rely on log lines. + +## Related itests + +- `itest/lnd_macaroons_test.go` and any + `itest/lnd_middleware_test.go` — verify the multi-registration + test exists; add one if not. +- Unit tests in `rpcperms/`. + +## Out of scope + +- Caveat-based macaroons themselves — see + [`docs/macaroons.md`](../../macaroons.md). +- Performance of fan-out to many middlewares — qualitative + confirmation (it works) is sufficient for the RC; sustained + load testing is a separate effort. diff --git a/docs/testing-guides/v0.21.0/onion-messaging.md b/docs/testing-guides/v0.21.0/onion-messaging.md new file mode 100644 index 0000000000..d8d11160d5 --- /dev/null +++ b/docs/testing-guides/v0.21.0/onion-messaging.md @@ -0,0 +1,329 @@ +# Onion Messaging + Rate Limiting — v0.21.0 RC Testing Guide + +**PRs:** #9868, #10089 (basic forwarding), #10612 (pathfinding), #10713 (rate limiting + channel-presence gate), #10754 (loopback drop) +**Risk:** headline +**Audience:** node operators, routing-node operators +**Backends affected:** all +**Networks:** regtest (primary), signet + +For the design and configuration model behind onion-message rate +limiting, read +[`docs/onion_message_rate_limiting.md`](../../onion_message_rate_limiting.md) +first. This guide tests the operational behavior it describes. + +## What this feature does + +v0.21.0 adds basic support for peer-to-peer onion message +**forwarding**. 
lnd does not yet ship a user-facing tool for +**constructing** onion messages from the operator side — the +`SendOnionMessage` RPC exists but takes pre-built `path_key`/`onion` +bytes and has no `lncli` wrapper. End-to-end RC testing of onion +messaging therefore relies on placing an lnd node **between two +non-lnd nodes** (Core Lightning, Eclair) that *do* expose +construct-and-send commands. The lnd node is the system under test; +the non-lnd nodes are drivers. + +Incoming onion messages on lnd pass through three defenses, in +order: + +1. **Loopback drop** (#10754): if the resolved next hop is the same + peer the message arrived from, drop it. +2. **Channel-presence gate** (#10713): drop messages from peers + without at least one fully open channel, unless + `protocol.onion-msg-relay-all=true`. +3. **Token-bucket rate limiters** (#10713): per-peer and global, + byte-denominated, applied in series (per-peer first). + +## Why it matters / what could break + +- **Channel-presence gate** is the Sybil defense. A regression + here lets cheap-identity attackers burn forwarding resources for + free. +- **Rate limiters** cap operator-borne bandwidth cost. A + regression in the byte-denominated accounting or the per-peer / + global ordering reintroduces the asymmetry the feature exists to + prevent. +- **Loopback drop** closes a traffic-amplification vector — a + missed drop means a hostile peer can bounce messages back at us. +- **Startup validation** (`burst < 65535 bytes`, partial-zero + configurations rejected) — silently accepting an invalid config + would leave operators thinking they had protection. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Topology:** + ``` + Eclair (A, sender) ── lnd (SUT) ── Eclair / CLN (C, recipient) + ``` + with optional **Eclair (B, no-channel peer)** connected to lnd + for the channel-presence gate scenarios. 
+- **Primary driver: Eclair**, because `sendonionmessage` is a + single high-level call that takes a route + hex message and + builds the onion internally: + ```bash + eclair-cli sendonionmessage \ + --nodeIds=, \ + --message= + ``` + Reference: . +- **Alternative driver: Core Lightning (v24.11+).** Two-step: + 1. `lightning-cli createonion --hops='[...]' --assocdata=00...0 + --onion_size=` builds a Sphinx packet for the route. + Reference: + . + 2. `lightning-cli injectonionmessage --path_key= + --message=` causes CLN to behave as if it had just + received the onion from a peer, unwrapping and forwarding it + to the next hop (lnd, in our topology). Reference: + . +- **Tools:** `lncli`, `eclair-cli`, `lightning-cli` (optional), + `bitcoin-cli`, `jq`. + +Shell variables: +``` +LND_PUBKEY=$(lncli getinfo | jq -r '.identity_pubkey') +A_PUBKEY=$(eclair-cli getinfo | jq -r '.nodeId') # sender +B_PUBKEY=... # no-channel sender +C_PUBKEY=... # recipient +``` + +## Setup + +```bash +# 1. Start all nodes. lnd with default onion-msg limits. +# 2. Connect peers: +# Eclair A ↔ lnd +# Eclair B ↔ lnd (no channel — for S2) +# lnd ↔ Recipient C +# Use eclair-cli connect / lncli connect. +# 3. Open and confirm channels A↔lnd and lnd↔C. +# Do NOT open B↔lnd. + +# Setup verification on lnd: +lncli listchannels | jq '.channels | length' # ≥ 2 +lncli listpeers | jq '.peers | length' # ≥ 3 (A, B, C) +grep -E "onion-msg|OnionMsg" ~/.lnd/logs/*/lnd.log | head # confirm config loaded +``` + +## Scenarios + +### S1: Forward an onion message through lnd — happy path + +**Goal:** A driver-built onion message from Eclair A travels A → +lnd → C and is received by C with no drops on lnd. + +**Steps:** +```bash +# Eclair builds and sends a 2-hop onion message: A → lnd → C. 
+eclair-cli sendonionmessage \ + --nodeIds=$LND_PUBKEY,$C_PUBKEY \ + --message=$(printf 'hello' | xxd -p) +``` + +**Pass/Fail signal:** +- **PASS** if recipient C logs receipt of an onion message in the + same time window (Eclair logs `Received onion message`; for CLN + recipients, the `onion_message_recv` hook fires). lnd's log + shows no drop or rate-limit events. +- **FAIL** if C receives nothing, or lnd logs a + `channel-presence-gate` drop (A has a channel — the gate must + not trip), or a rate-limit log appears under default + settings for a single small message. + +--- + +### S2: Channel-presence gate drops a no-channel sender + +**Goal:** Eclair B is connected to lnd but has no channel; its +onion message is dropped at lnd's gate. + +**Steps:** +```bash +# From Eclair B (no channel with lnd), attempt a send through lnd. +eclair-cli -a $B_AUTH sendonionmessage \ + --nodeIds=$LND_PUBKEY,$C_PUBKEY \ + --message=$(printf 'should-drop' | xxd -p) +``` + +**Pass/Fail signal:** +- **PASS** if (a) C does not receive the message and (b) lnd's log + records a drop attributable to the channel-presence gate (search + for `no fully open channel` or the equivalent log key — verify + exact wording against the v0.21.0 build). +- **FAIL** if C receives the message (gate broken) or no drop log + appears for B's pubkey (silent drop with no audit trail). + +--- + +### S3: `relay-all` bypasses the channel-presence gate + +**Goal:** With `protocol.onion-msg-relay-all=true`, the no-channel +sender from S2 is no longer gated. Rate limiters still apply. + +**Steps:** +```bash +# Restart lnd with: protocol.onion-msg-relay-all=true +# Re-run the S2 send. +eclair-cli -a $B_AUTH sendonionmessage \ + --nodeIds=$LND_PUBKEY,$C_PUBKEY \ + --message=$(printf 'now-allowed' | xxd -p) +``` + +**Pass/Fail signal:** +- **PASS** if C receives the message and lnd's log no longer + records a channel-presence drop. 
A per-peer rate-limiter + bucket should now exist for B's pubkey (verify by tripping it, + per S4, with B as the sender). +- **FAIL** if lnd still drops at the gate (escape hatch broken) or + if rate limiting is also bypassed when only the gate should be. + +--- + +### S4: Per-peer rate limiter trips + +**Goal:** Eclair A hammering lnd over its per-peer cap should get +dropped once the bucket empties, with a one-shot info log. + +**Steps:** +```bash +# Restart lnd with a tight per-peer cap: +# protocol.onion-msg-peer-kbps=100 +# protocol.onion-msg-peer-burst-bytes=65540 + +# Fire onion messages near the spec maximum as fast as the script +# can drive Eclair. Use a payload that pads close to 32 KiB so each +# message debits the bucket substantially. +PAYLOAD=$(head -c 32000 /dev/urandom | xxd -p | tr -d '\n') +for i in $(seq 1 50); do + eclair-cli sendonionmessage \ + --nodeIds=$LND_PUBKEY,$C_PUBKEY \ + --message=$PAYLOAD & +done +wait +``` + +**Pass/Fail signal:** +- **PASS** if (a) lnd's log contains exactly **one** + `per-peer onion message rate limit engaged` info line (or the + v0.21.0 equivalent — verify the exact wording) for A's pubkey, + (b) subsequent drops are at trace level only, and (c) C's + receive count is bounded by the configured rate over the test + window. +- **FAIL** if no drops occur (limiter disabled) or the info log is + emitted repeatedly (log-flooding regression). + +--- + +### S5: Global rate limiter trips + +**Goal:** Multiple senders, each under their per-peer cap, +collectively trip the global cap. + +**Steps:** +```bash +# Restart lnd with the per-peer limiter loose and the global tight: +# protocol.onion-msg-peer-kbps=1024 +# protocol.onion-msg-peer-burst-bytes=262144 +# protocol.onion-msg-global-kbps=200 +# protocol.onion-msg-global-burst-bytes=131080 + +# Drive sends from A, B (relay-all enabled to allow B), and a third +# peer if available — all simultaneously, each under its per-peer +# budget. 
+``` + +**Pass/Fail signal:** +- **PASS** if lnd's log contains exactly one + `global onion message rate limit engaged` info line (no peer + prefix), and subsequent drops are at trace level. +- **FAIL** if no global drops occur, or the line is emitted with a + peer prefix (mis-attribution). + +--- + +### S6: Startup rejects invalid limiter configs + +**Goal:** Mixed-zero (rate=0 with burst>0, or vice versa) and +undersized-burst configs fail startup, as documented in the +rate-limiting design doc. **No driver required.** + +**Steps:** Start lnd with each of these in turn and capture exit +status: + +| Config | Expected to reject? | +|---|---| +| `peer-kbps=0`, `peer-burst-bytes=262144` | yes (rate 0, burst > 0) | +| `peer-kbps=100`, `peer-burst-bytes=0` | yes (rate > 0, burst 0) | +| `peer-kbps=100`, `peer-burst-bytes=32768`| yes (burst < 65535) | +| `peer-kbps=0`, `peer-burst-bytes=0` | no (cleanly disabled) | + +**Pass/Fail signal:** +- **PASS** if the three reject rows fail startup with a clear + error message naming the misconfigured option **before** the + gRPC endpoint comes up, and the disabled row starts cleanly. +- **FAIL** if any of the three invalid configs starts (silent + misconfiguration), or the disabled config errors out. + +--- + +### S7: Loopback drop + +**Goal:** An onion message whose resolved next hop is the sending +peer is dropped at lnd, not forwarded back. + +**Steps:** +```bash +# Eclair A constructs a route lnd → A — i.e. A is the recipient, +# lnd is the only intermediate hop. lnd will receive from A and +# resolve A as the next hop. +eclair-cli sendonionmessage \ + --nodeIds=$LND_PUBKEY,$A_PUBKEY \ + --message=$(printf 'loopback' | xxd -p) +``` + +**Pass/Fail signal:** +- **PASS** if lnd's log records a loopback drop (search for + `next hop is sending peer` or the v0.21.0 equivalent), and A + does not receive the message back over the inbound connection. 
+- **FAIL** if A receives the message back from lnd (the loopback + drop did not engage) or lnd silently drops with no audit trail. + +## Failure investigation + +- **Subsystems:** `PEER`, `DISC`, `CRTR`, and the onion-message + subsystem (verify exact name in v0.21.0 — probably `ONMSG` or + similar). Set to `debug` for diagnosis. +- **Useful greps:** + - `grep -i "onion message" lnd.log` + - `grep -i "rate limit" lnd.log` + - `grep -i "channel-presence" lnd.log` +- **Driver-side observability:** Eclair logs at `INFO` show + outbound `sendonionmessage` calls and inbound message events. CLN + surfaces inbound messages via the `onion_message_recv` hook. + +## Related itests (not the RC test surface, but useful references) + +- `itest/lnd_onion_message_test.go` — `testOnionMessage` +- `itest/lnd_onion_message_forward_test.go` — + `testOnionMessageForwarding` with `buildForwardNextNodePath`, + `buildForwardSCIDPath`, `buildConcatenatedPath` +- `itest/config_onion_ratelimit_test.go` — limiter config +- `onionmessage/test_utils.go` — `BuildOnionMessage` helper + (`*testing.T`-only) + +These exercise the same behaviors via Go-side construction and are +the maintainers' authoritative test. RC testers shouldn't need to +run them; this guide covers what's observable from a real +deployment with mixed-implementation drivers. + +## Out of scope + +- Pathfinding for onion messages (#10612) — used internally by lnd + but not exposed via a user-facing RPC in v0.21.0, so not + directly testable from outside this release. Confirm via itests. +- BOLT-12 offers / blinded-payment-route construction — separate + feature, not in v0.21.0. +- lnd-as-sender of onion messages — v0.21.0 is forwarding-only + from the operator's perspective. The `SendOnionMessage` RPC is + low-level and has no `lncli` wrapper. 
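The S6 expectations table in this guide can be encoded as a small shell predicate, which is handy when scripting the four startup attempts. This is only a sketch of the rules as the guide states them (mixed-zero rejected, burst below 65535 bytes rejected, both zero cleanly disabled). `expected_limiter_result` is a hypothetical helper name, and it neither calls nor reproduces lnd's actual validation code.

```shell
#!/bin/sh
# Hypothetical helper: predicts the S6 outcome for a (kbps, burst-bytes) pair.
# It encodes only the rules listed in this guide, not lnd's own validation.
expected_limiter_result() {
    kbps=$1
    burst=$2
    min_burst=65535   # minimum burst quoted in this guide

    if [ "$kbps" -eq 0 ] && [ "$burst" -eq 0 ]; then
        echo "accept-disabled"          # both zero: limiter cleanly disabled
    elif [ "$kbps" -eq 0 ] || [ "$burst" -eq 0 ]; then
        echo "reject-mixed-zero"        # exactly one of the two is zero
    elif [ "$burst" -lt "$min_burst" ]; then
        echo "reject-burst-too-small"   # burst below the stated minimum
    else
        echo "accept"
    fi
}

# The four rows from the S6 table, plus the per-peer setting used in S4.
expected_limiter_result 0 262144
expected_limiter_result 100 0
expected_limiter_result 100 32768
expected_limiter_result 0 0
expected_limiter_result 100 65540
```

Comparing each predicted result with lnd's real exit status for the same pair gives a per-row PASS or FAIL without reading the table by eye.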
diff --git a/docs/testing-guides/v0.21.0/payment-rpcs.md b/docs/testing-guides/v0.21.0/payment-rpcs.md new file mode 100644 index 0000000000..38f54fcc13 --- /dev/null +++ b/docs/testing-guides/v0.21.0/payment-rpcs.md @@ -0,0 +1,221 @@ +# New Payment-Adjacent RPCs — v0.21.0 RC Testing Guide + +**PRs:** #10666 (`DeleteForwardingHistory`), #10436 (MuSig2 coordinator nonces), #10296 (`EstimateFee` inputs), #10520 (HTLC event invoice failures), #10543 (`SubscribeChannelEvents` update events) +**Risk:** new-rpc +**Audience:** RPC clients, routing nodes, MuSig2 coordinator integrators, LSPs +**Backends affected:** all +**Networks:** regtest (primary) + +## What this feature does + +v0.21.0 ships five small RPC additions/updates relevant to payment +and channel flows. They are bundled here because each is too narrow +for its own guide, but each has a clear pass/fail signal worth +checking on the RC. + +1. **`router.DeleteForwardingHistory`** (#10666). Operator RPC to + purge old forwarding events. Cutoff timestamp must be at least 1 + hour in the past. +2. **`MuSig2RegisterCombinedNonce` / `MuSig2GetCombinedNonce`** + (#10436). Lets a coordinator pre-aggregate MuSig2 nonces + externally and register the result. MuSig2 v1.0.0rc2 only. +3. **`EstimateFee` with `inputs`** (#10296). Explicit input + selection for fee estimation; new `inputs` field on + `EstimateFeeRequest`, new `--utxos` flag on + `lncli estimatefee`. +4. **HTLC event invoice-level failure detail** (#10520). routerrpc + HTLC event subscribers now receive specific failure causes for + invoice-validation failures instead of `UNKNOWN`. +5. **`SubscribeChannelEvents` update events** (#10543). + `SubscribeChannelEvents` now emits a channel update event for + state changes, not only open/close/active/inactive. + +## Why each matters / what could break + +- **`DeleteForwardingHistory`** is destructive. The 1-hour guard + must hold; off-by-one or misinterpreted timestamps could wipe + recent data. 
+- **MuSig2 combined-nonce RPCs** are signing-protocol territory. + Wrong nonce aggregation produces invalid signatures. +- **`EstimateFee` inputs**: a request that names a non-existent or + spent UTXO must fail clearly, not silently produce a meaningless + estimate. +- **HTLC invoice failure detail**: routing nodes that grep failures + out of subscriber streams will mis-classify if `UNKNOWN` still + surfaces for invoice-validation cases. +- **`SubscribeChannelEvents` update events**: any client iterating + over event-type values may need to handle a new variant. If the + daemon emits a malformed event, subscribers can disconnect or + crash. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Peers:** Alice and Bob (with a channel for the payment-adjacent + scenarios); Carol for forwarding-history scenarios so Alice can + forward Bob → Carol traffic. +- **MuSig2 coordinator client** for S2 — typically the `signrpc` + test client from `signer/musig2_test.go` or your own. +- **Tools:** `lncli`, `grpcurl`, `jq`, `bitcoin-cli`. + +## Scenarios + +### S1: `DeleteForwardingHistory` deletes old events; 1-hour guard rejects recent cutoffs + +**Goal:** Confirm both the success path and the safety guard. + +**Steps:** +```bash +# Drive some forwards through Alice (need a 3-node setup). +# Wait a bit, then query forwarding history. +$LNCLI_A fwdinghistory | jq '.forwarding_events | length' > /tmp/fwd-pre.txt + +# Capture a timestamp at least 1h in the past. +CUTOFF_OK=$(date -d '2 hours ago' +%s) # GNU date +# macOS: CUTOFF_OK=$(date -v-2H +%s) +CUTOFF_BAD=$(date -d '5 minutes ago' +%s) + +# Delete old events (success). +grpcurl ... -d "{\"end_time_ns\": \"$((CUTOFF_OK*1000000000))\"}" \ + routerrpc.Router/DeleteForwardingHistory + +# Try a recent cutoff (should error). +grpcurl ... -d "{\"end_time_ns\": \"$((CUTOFF_BAD*1000000000))\"}" \ + routerrpc.Router/DeleteForwardingHistory +echo $? 
+``` + +(Replace the field name with whatever the proto actually exposes; +verify against `routerrpc.proto` for v0.21.0.) + +**Pass/Fail signal:** +- **PASS** if (a) the first call succeeds and `fwdinghistory` + afterward shows fewer events than before, and (b) the second call + fails with a clear error message about the 1-hour minimum. +- **FAIL** if the recent-cutoff call succeeds (the safety guard is + broken). + +--- + +### S2: `MuSig2RegisterCombinedNonce` / `MuSig2GetCombinedNonce` round-trip + +**Goal:** A coordinator can register a pre-aggregated combined nonce +and later retrieve it for a session. + +**Steps:** Run the coordinator-based MuSig2 flow end-to-end with at +least two signing participants. After the coordinator aggregates +nonces externally, call `MuSig2RegisterCombinedNonce` with the +session ID and combined nonce. From a participant, call +`MuSig2GetCombinedNonce` and verify the returned value. + +**Pass/Fail signal:** +- **PASS** if `MuSig2GetCombinedNonce` returns the same bytes + that were registered, and the subsequent partial-sign / finalize + completes with a valid signature. +- **FAIL** if the combined nonce does not round-trip byte-for-byte, + if finalization produces an invalid signature, or if the call + rejects MuSig2 v1.0.0rc2 sessions. + +--- + +### S3: `EstimateFee` with explicit `inputs` + +**Goal:** A fee estimate that names specific UTXOs uses those UTXOs; +naming a spent or non-existent UTXO errors cleanly. + +**Steps:** +```bash +# Pick two confirmed UTXOs on Alice. +$LNCLI_A listunspent --min_confs=1 | jq '.utxos[] | .outpoint' + +# Estimate fee using --utxos. +$LNCLI_A estimatefee --conf_target=6 \ + --utxos="$UTXO1" --utxos="$UTXO2" \ + --addr_to_amount='{"": 50000}' + +# Try a bogus UTXO. +$LNCLI_A estimatefee --conf_target=6 \ + --utxos="0000000000000000000000000000000000000000000000000000000000000000:0" \ + --addr_to_amount='{"": 50000}' +echo $? 
+``` + +**Pass/Fail signal:** +- **PASS** if the first call returns a fee estimate consistent with + using exactly those two UTXOs, and the second call fails with a + clear "input not found" or "not spendable" message. +- **FAIL** if the second call silently returns an estimate + (ignoring the bad input), or the first uses different UTXOs than + requested. + +--- + +### S4: HTLC event subscribers see invoice-level failure detail (not `UNKNOWN`) + +**Goal:** routerrpc HTLC event stream now provides specific reasons +for invoice-validation failures. + +**Steps:** +- Subscribe to `routerrpc.SubscribeHtlcEvents` on Bob (the + recipient). +- From Alice, attempt to pay a Bob invoice in a way that + invoice-validation rejects (e.g. expired invoice, wrong + preimage attempt, amount mismatch). Repeat for each failure mode + you want to test. +- Capture the failure-detail field from each emitted event. + +**Pass/Fail signal:** +- **PASS** if every invoice-validation failure surfaces a specific + reason (e.g. `INVOICE_EXPIRED`, `INCORRECT_PAYMENT_AMOUNT`, + `INVOICE_ALREADY_CANCELED`), not the legacy `UNKNOWN`. +- **FAIL** if any of these still emit `UNKNOWN`. + +--- + +### S5: `SubscribeChannelEvents` emits update events on state changes + +**Goal:** Confirm the new event variant fires for the +state-change cases it covers. + +**Steps:** +- Subscribe to `lnrpc.SubscribeChannelEvents` on Alice via a + long-running grpcurl session. +- Drive state changes: + - Open a channel → expect existing `pending_open_channel` and + `open_channel` events. + - Push a few payments → expect the new update event(s). + - Coop-close → expect existing close events. +- Inspect every emitted event's `type` field. + +**Pass/Fail signal:** +- **PASS** if at least one event with the new `update` type fires + during the test window, and the payload references the correct + channel. +- **FAIL** if no update event fires, or if a malformed event causes + the subscriber stream to error / disconnect. 
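For S1 (`DeleteForwardingHistory`), the cutoff arithmetic is easy to get wrong by a factor of 10^9 or by one second at the one-hour boundary. The following is a minimal sketch, assuming the guard means "at least 3600 seconds before now"; whether lnd accepts a cutoff of exactly one hour is an assumption here, not something this guide confirms. `to_ns` and `cutoff_allowed` are hypothetical helper names and do not call lnd.

```shell
#!/bin/sh
# Hypothetical helpers for S1; they only do local arithmetic.

# Convert a Unix timestamp in seconds to nanoseconds.
to_ns() {
    echo $(( $1 * 1000000000 ))
}

# Succeeds (exit 0) when the cutoff is at least one hour before "now".
# Args: cutoff_seconds now_seconds
cutoff_allowed() {
    [ $(( $2 - $1 )) -ge 3600 ]
}

now=$(date +%s)
cutoff_ok=$(( now - 7200 ))    # 2 hours ago: the RPC is expected to succeed
cutoff_bad=$(( now - 300 ))    # 5 minutes ago: the RPC is expected to error

for c in "$cutoff_ok" "$cutoff_bad"; do
    if cutoff_allowed "$c" "$now"; then
        echo "$(to_ns "$c") expect-success"
    else
        echo "$(to_ns "$c") expect-error"
    fi
done
```

Deriving both cutoffs from `date +%s` sidesteps the GNU versus macOS `date` flag difference noted in the S1 steps.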
+ +## Failure investigation + +- **Subsystems:** `RPCS`, `ROUTING`, `CRTR`, `SIGN`. +- **Useful greps:** `DeleteForwardingHistory`, + `MuSig2RegisterCombinedNonce`, `EstimateFee`, `HtlcEvent`, + `ChannelEvent`. +- **Proto-level surface:** check + `lnrpc/routerrpc/router.proto` (forwarding history, HTLC events), + `lnrpc/signrpc/signer.proto` (MuSig2 coordinator), and + `lnrpc/lightning.proto` (`EstimateFee`, `SubscribeChannelEvents`) + to confirm field names match the calls above before recording a + FAIL. + +## Related itests + +- Each of these RPCs typically has a corresponding itest. Verify in + `itest/` (e.g. `itest/lnd_forward_test.go`, + `itest/lnd_musig2_test.go`). + +## Out of scope + +- Payment SQL migration — separate guide + ([`payment-sql-migration.md`](./payment-sql-migration.md)). +- BOLT-12 / offers — not in v0.21.0. +- Performance characteristics of `EstimateFee` with many inputs. diff --git a/docs/testing-guides/v0.21.0/payment-sql-migration.md b/docs/testing-guides/v0.21.0/payment-sql-migration.md new file mode 100644 index 0000000000..1ac1fca2a9 --- /dev/null +++ b/docs/testing-guides/v0.21.0/payment-sql-migration.md @@ -0,0 +1,262 @@ +# Payment Store KV → SQL Migration — v0.21.0 RC Testing Guide + +**PRs:** #10153, #9147, #10287, #10291, #10368, #10292, #10307, #10308, #10373, #10485 (migration), #10535, #10627 (mainline promotion) +**Risk:** headline +**Audience:** node operators on `sqlite` or `postgres` backends with `--db.use-native-sql` +**Backends affected:** sqlite, postgres +**Networks:** all + +> ⚠️ **TBD — pending developer confirmation.** The behavior of +> `--db.skip-native-sql-migration=true` for **payments** specifically +> is under verification. The flag's description in `lncfg/db.go` +> (and S6 / the rescue-path bullet in this guide) implies the SQL +> payment tables are used empty after the flag is set; the intent +> may instead be that lnd continues reading payments from the KV +> store. 
Treat the rescue-path scenario as unconfirmed until this +> note is removed. + +## What this feature does + +v0.21.0 finishes the payments store migration from `kvdb` to native +SQL and promotes it to mainline. Nodes running with +`--db.use-native-sql=true` on a `sqlite` or `postgres` backend will, +on their first startup against this build, run a migration that +copies every payment row, attempt, and route hop from the embedded +`kvdb`-on-SQL tables into a normalized SQL schema. All subsequent +`ListPayments`, `QueryPayments`, `FetchPayment`, and the new +`omit_hops` / cursor-paginated query variants run directly against +the SQL schema. + +Nodes still on `bbolt` are unaffected by this migration (they don't +have `--db.use-native-sql`). Operators wanting to move from bbolt to +sqlite or postgres should use +[`lndinit`](https://github.com/lightninglabs/lndinit/blob/main/docs/data-migration.md) +*before* upgrading to v0.21.0, so the SQL payment migration sees a +populated source. + +## Why it matters / what could break + +This is a one-shot migration over potentially very large tables +(some nodes have millions of payment rows). The blast radius: + +- **Migration fails mid-run** → lnd refuses to start. The standard + recovery is to fix the underlying cause and restart; the last + resort is `--db.skip-native-sql-migration=true`, which abandons + partial migration progress and **loses payment history**. +- **Migration completes but data is corrupted** → silent. Surfaces + later as `ListPayments` rows missing, attempts attributed to the + wrong payment, or settled payments that report as failed. +- **Performance regression** → `ListPayments` slower than pre-migration + for users with deep history, or specific filters (date range, by + payment-hash) regressing. +- **bbolt user accidentally enables `--db.use-native-sql`** → empty + payment history because the SQL tables are empty and no KV source + exists to migrate from. 
(Documentation guards against this; verify + the failure mode is clean.) + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Existing v0.20.x node with payment history** on `sqlite` or + `postgres`, with `--db.use-native-sql=true` already enabled in v0.20 + (so invoices and graph are already SQL; payments are the new + addition). + - If you don't have one, build a fresh `sqlite` v0.20.x node and + push a few hundred payments through it before upgrading. +- **A backup of the v0.20 database** (`.bak` of the sqlite file or a + postgres `pg_dump`). Required — this migration is not reversible. +- **Tools:** `lncli`, `sqlite3` (or `psql`), `jq`. + +Shell variables: +``` +ALICE_DIR=~/.alice +LNCLI_A="lncli --lnddir=$ALICE_DIR" +DB=$ALICE_DIR/data/chain/bitcoin//lnd.db # or your sqlite path +``` + +## Setup + +```bash +# 1. On the v0.20.x build, capture a baseline of payment data. +$LNCLI_A listpayments --max_payments=0 --reversed | \ + jq '{count: (.payments | length), total_sat: ([.payments[].value_sat | tonumber] | add)}' > /tmp/payments-pre.json + +$LNCLI_A listpayments --max_payments=5 --reversed | \ + jq '[.payments[] | {payment_hash, status, value_sat, creation_date}]' > /tmp/payments-sample-pre.json + +# 2. Stop lnd cleanly. +$LNCLI_A stop + +# 3. Backup the database. +cp $DB $DB.pre-v0.21.bak # sqlite +# OR: pg_dump ... > /tmp/lnd-pre-v0.21.sql + +# 4. Swap the binary to v0.21.0-rc1 and start lnd back up. +# Leave --db.use-native-sql=true in lnd.conf (or on the CLI). +``` + +## Scenarios + +### S1: Migration runs and lnd starts cleanly + +**Goal:** First startup against v0.21.0 runs the payment migration +to completion and the node becomes operational. + +**Steps:** Start lnd with the existing v0.20 database and watch the +log. + +**Expected:** +- Log lines indicating the payment migration started, e.g. + `Running migration: payments KV -> SQL`. +- Log lines indicating completion (no error). +- `lncli getinfo` returns successfully. 
+ +**Pass/Fail signal:** +```bash +$LNCLI_A getinfo | jq -r '.identity_pubkey' +``` +- **PASS** if a valid pubkey is returned within 60s of startup (or + longer, scaled to your payment history; document the time). +- **FAIL** if lnd exits with a migration error, or if `getinfo` + hangs past 5 minutes (the migration is stuck — capture the log). + +--- + +### S2: Post-migration payment count matches pre-migration + +**Goal:** No rows lost during migration. + +**Steps:** +```bash +$LNCLI_A listpayments --max_payments=0 --reversed | \ + jq '{count: (.payments | length), total_sat: ([.payments[].value_sat | tonumber] | add)}' > /tmp/payments-post.json + +diff /tmp/payments-pre.json /tmp/payments-post.json +``` + +**Pass/Fail signal:** +- **PASS** if `diff` shows no output (`count` and `total_sat` match). +- **FAIL** if either field differs. Capture both files and the + log. + +--- + +### S3: Spot-check individual payments by hash + +**Goal:** Per-row data is preserved (not just aggregate counts). + +**Steps:** +```bash +# Look up each payment from the pre-migration sample by hash. +# The fallback lists all payments (--max_payments=0, as in Setup) so the +# jq filter can find the hash; a limit of 1 would almost never match. +for ph in $(jq -r '.[].payment_hash' /tmp/payments-sample-pre.json); do + $LNCLI_A trackpayment "$ph" 2>/dev/null || \ + $LNCLI_A listpayments --max_payments=0 | \ + jq --arg ph "$ph" '.payments[] | select(.payment_hash==$ph)' +done > /tmp/payments-sample-post.json +``` + +**Pass/Fail signal:** +- **PASS** if every sampled payment matches its pre-migration + record on `status`, `value_sat`, `creation_date`, and the route + hops (if you didn't request `omit_hops`). +- **FAIL** if any row differs or is missing. + +--- + +### S4: `ListPayments` with `omit_hops=true` excludes hop data + +**Goal:** The new `omit_hops` filter (introduced in #10535) works on +the new SQL store. 
+ +**Steps:** +```bash +# Older lncli builds may not expose this flag yet; fall back to gRPC. 
+$LNCLI_A listpayments --max_payments=10 --include_incomplete=false 2>/dev/null | \ + jq '.payments[0].htlcs[0].route.hops | length' > /tmp/with-hops.txt + +# Then call with omit_hops=true (via gRPC, e.g. grpcurl): +grpcurl ... -d '{"max_payments": 10, "omit_hops": true}' \ + lnrpc.Lightning/ListPayments | \ + jq '.payments[0].htlcs[0].route' > /tmp/no-hops.txt +``` + +**Pass/Fail signal:** +- **PASS** if `with-hops.txt` shows a positive hop count and `no-hops.txt` + is `null` or has an empty `hops` list. +- **FAIL** if `omit_hops=true` still returns hops, or if the call errors. + +--- + +### S5: `bbolt + --db.use-native-sql` user is warned cleanly + +**Goal:** Confirm that a user who enables `--db.use-native-sql` on a +bbolt backend either hits a clean refusal at startup, or sees +documented behavior — not silent data loss. + +**Steps:** +- Take a fresh bbolt-backed v0.20 node with some payment history. +- Edit `lnd.conf` to add `db.use-native-sql=true` (without using + lndinit first). +- Start v0.21.0-rc1. + +**Pass/Fail signal:** +- **PASS** if lnd either (a) refuses to start with a clear message + pointing operators at `lndinit`, or (b) starts but logs a clear + warning that bbolt history is not migrated by this flag. +- **FAIL** if lnd starts, `getinfo` succeeds, and `listpayments` + returns an empty list with no warning — that's a silent data-loss + footgun for operators. + +--- + +### S6: `--db.skip-native-sql-migration` rescue path + +**Goal:** The skip-migration flag works as the documented last +resort. Only run this on a copy of the database. + +**Steps:** +```bash +# Restore the backup, then start v0.21 with the skip flag. +cp "$DB.pre-v0.21.bak" "$DB" +# Add: db.skip-native-sql-migration=true to lnd.conf +``` + +**Pass/Fail signal:** +- **PASS** if lnd starts, logs a clear warning that payment history + has been abandoned, and `listpayments` returns an empty list (the + intended behavior of the rescue flag — payments are sacrificed to + keep channels working). 
+- **FAIL** if lnd refuses to start, or if it starts but silently + retains stale KV payment data. + +## Failure investigation + +- **Subsystems:** `LNDB`, `RPCS`, `CRTR`. +- **Migration log lines:** grep for `MigratePaymentsKVToSQL`, + `migration version=`, `payment migration`. +- **Direct SQL inspection (sqlite):** + ```sql + -- Count rows in the new SQL tables. + SELECT COUNT(*) FROM payments; + SELECT COUNT(*) FROM payment_attempts; + ``` +- **Common past issues / regressions:** + - Cross-database timestamp handling — #10535 fixed a class of bugs + where postgres and sqlite stored creation timestamps with + different precisions. Watch `creation_date` mismatches. + - Schema indexes — verify that the indexes added in #10535 exist + after migration (`.indices payments` in sqlite). + +## Related itests + +- `payments/db/migration1/sql_migration_test.go` — migration unit/integration tests. +- `itest/lnd_payment_test.go` — end-to-end payment behavior. + +## Out of scope + +- Channel-state SQL migration (separate effort, not in v0.21.0). +- bbolt → sqlite/postgres backup migration — handled by + [`lndinit`](https://github.com/lightninglabs/lndinit/blob/main/docs/data-migration.md), not this guide. +- Performance benchmarking — call out qualitative regressions + (`listpayments` taking many seconds when it was sub-second on + v0.20), but exhaustive benchmarking is a separate effort. 
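When a migration scenario in the payment-migration guide fails, the log patterns listed under its Failure investigation section can be pulled out in one pass. This is a minimal sketch: `migration_log_lines` is a hypothetical helper name, and the patterns are copied from the guide rather than verified against the lnd source, so adjust them if the actual log wording differs.

```bash
#!/usr/bin/env bash
# Hypothetical helper; the name and the patterns are illustrative.

# migration_log_lines LOGFILE
# Prints every line (with its line number) that matches one of the
# migration-related patterns named in the guide. Exits non-zero when
# nothing matches, which is itself worth noting in a bug report.
migration_log_lines() {
    local logfile=$1
    grep -nE 'MigratePaymentsKVToSQL|migration version=|payment migration' "$logfile"
}
```

A typical call would be `migration_log_lines ~/.alice/logs/bitcoin/regtest/lnd.log`, assuming the log location used elsewhere in these guides.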
diff --git a/docs/testing-guides/v0.21.0/production-taproot-channels.md b/docs/testing-guides/v0.21.0/production-taproot-channels.md new file mode 100644 index 0000000000..1c851de17c --- /dev/null +++ b/docs/testing-guides/v0.21.0/production-taproot-channels.md @@ -0,0 +1,256 @@ +# Production Simple Taproot Channels — v0.21.0 RC Testing Guide + +**PRs:** #9985 (production support), #10763 (acceptor + RBF coop follow-up), #10672 (private-taproot funding script bug fix) +**Risk:** headline +**Audience:** node operators, LSPs, wallet integrators, channel-acceptor clients +**Backends affected:** all +**Networks:** regtest (primary), signet, mainnet + +## What this feature does + +v0.21.0 adds the production (final) variant of simple taproot +channels, negotiated via feature bits 80/81. Production taproot +channels use a more optimized commitment script +(`OP_CHECKSIGVERIFY` instead of `OP_CHECKSIG` + `OP_DROP`) and encode +MuSig2 nonces in `channel_reestablish` and `revoke_and_ack` as a map +keyed by the funding TXID. + +The nonce type used by a channel is auto-detected from the negotiated +channel type, not the peer's advertised feature bits. The RPC +channel acceptor now also reports production taproot opens with the +`SIMPLE_TAPROOT_FINAL` commitment type across every combination of +the `scid-alias` and `zero-conf` modifiers. + +## Why it matters / what could break + +- Misnegotiation between staging-taproot (existing) and final-taproot + (new) peers → channel open fails or, worse, opens with mismatched + script versions and force-closes at first commitment. +- Wrong nonce encoding on `channel_reestablish` after a reconnect → + peers cannot resume the channel; surface as repeated reconnect + loops with `channel_reestablish` errors in the log. +- Channel-acceptor clients seeing `UNKNOWN_COMMITMENT_TYPE` instead + of `SIMPLE_TAPROOT_FINAL` for production taproot opens with + `scid-alias` or `zero-conf` modifiers (the bug #10763 fixed — + guard against regression). 
+- Private taproot channels with a v1 gossip entry whose funding + script gets reconstructed as legacy P2WSH on restart (#10672). + Surfaces as missed-spend detection. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Backend:** `bitcoind` (regtest). +- **Peers:** Alice and Bob, both started with `--protocol.simple-taproot-chans`. +- **Tools:** `lncli`, `bitcoin-cli`, `jq`. + +Shell variables used below: +``` +ALICE_RPC=localhost:10001 +BOB_RPC=localhost:10002 +ALICE_MAC=~/.alice/data/chain/bitcoin/regtest/admin.macaroon +BOB_MAC=~/.bob/data/chain/bitcoin/regtest/admin.macaroon +ALICE_DIR=~/.alice +BOB_DIR=~/.bob +LNCLI_A="lncli --rpcserver=$ALICE_RPC --macaroonpath=$ALICE_MAC --lnddir=$ALICE_DIR" +LNCLI_B="lncli --rpcserver=$BOB_RPC --macaroonpath=$BOB_MAC --lnddir=$BOB_DIR" +``` + +Both nodes must run with at least: +``` +protocol.simple-taproot-chans=1 +``` + +## Setup + +```bash +# 1. Start a clean bitcoind in regtest and mine a few blocks. +bitcoin-cli -regtest createwallet test +ADDR=$(bitcoin-cli -regtest getnewaddress) +bitcoin-cli -regtest generatetoaddress 200 $ADDR + +# 2. Start Alice and Bob with --protocol.simple-taproot-chans. +# (Use your usual two-node regtest setup. The flag matters.) + +# 3. Connect Alice to Bob. +BOB_PUB=$($LNCLI_B getinfo | jq -r '.identity_pubkey') +$LNCLI_A connect $BOB_PUB@127.0.0.1:9736 + +# 4. Fund Alice's on-chain wallet. +ALICE_ADDR=$($LNCLI_A newaddress p2tr | jq -r '.address') +bitcoin-cli -regtest sendtoaddress $ALICE_ADDR 1 +bitcoin-cli -regtest generatetoaddress 6 $ADDR + +# Setup verification: +$LNCLI_A walletbalance | jq '.confirmed_balance' +# Expected: "100000000" (1 BTC in sats) +``` + +## Scenarios + +### S1: Open a production (final) taproot channel — happy path + +**Goal:** Verify that a `taproot-final` channel opens, confirms, and +reports `SIMPLE_TAPROOT_FINAL` as its commitment type. 
+ +**Steps:** +```bash +$LNCLI_A openchannel \ + --node_key=$BOB_PUB \ + --local_amt=5000000 \ + --channel_type=taproot-final + +# Mine to confirm. +bitcoin-cli -regtest generatetoaddress 6 $ADDR +``` + +**Expected:** +- `openchannel` returns a funding-txid; no error. +- After 6 confirmations, the channel appears in `listchannels` on + both sides with `commitment_type == "SIMPLE_TAPROOT_FINAL"`. + +**Pass/Fail signal:** +```bash +$LNCLI_A listchannels | \ + jq '.channels[] | select(.remote_pubkey=="'$BOB_PUB'") | .commitment_type' +``` +- **PASS** if the output is `"SIMPLE_TAPROOT_FINAL"`. +- **FAIL** if `"SIMPLE_TAPROOT"` (staging), `"ANCHORS"`, or any + other value — that means negotiation fell back to the wrong type. + +--- + +### S2: Reconnect a production taproot channel — `channel_reestablish` round-trip + +**Goal:** Confirm the map-based nonce encoding keyed by funding TXID +survives a reconnect. This is the most likely place for production +taproot to regress, because nonce-type detection now flows from the +negotiated channel type instead of peer feature bits. + +**Steps:** +```bash +# With the S1 channel up, disconnect and reconnect Bob. +$LNCLI_A disconnect $BOB_PUB +sleep 2 +$LNCLI_A connect $BOB_PUB@127.0.0.1:9736 +sleep 3 +``` + +**Expected:** +- Reconnection completes. +- `listpeers` shows Bob back as connected. +- The channel from S1 still reports `active: true`. + +**Pass/Fail signal:** +```bash +$LNCLI_A listchannels | \ + jq '.channels[] | select(.remote_pubkey=="'$BOB_PUB'") | .active' +``` +- **PASS** if the output is `true`. +- **FAIL** if `false`, or if Alice's log contains + `unable to handle upstream reestablish msg` or + `received nonce of wrong type` — the nonce-type auto-detection + regressed. + +--- + +### S3: Send a payment over a production taproot channel + +**Goal:** End-to-end HTLC settlement on the new commitment type. 
+ +**Steps:** +```bash +INV=$($LNCLI_B addinvoice --amt=10000 | jq -r '.payment_request') +$LNCLI_A payinvoice --force $INV +``` + +**Pass/Fail signal:** +```bash +$LNCLI_A listpayments | jq '.payments[-1].status' +``` +- **PASS** if the output is `"SUCCEEDED"`. +- **FAIL** otherwise (in particular `"IN_FLIGHT"` for more than ~10s + on regtest indicates a stuck HTLC). + +--- + +### S4: RPC channel acceptor reports `SIMPLE_TAPROOT_FINAL` + +**Goal:** Regression guard for #10763 — the acceptor must report +production taproot opens with the correct commitment type for every +combination of scid-alias and zero-conf modifiers, not +`UNKNOWN_COMMITMENT_TYPE`. + +**Steps:** +- Register an RPC channel acceptor against Bob (any external client + using `lnrpc.Lightning.ChannelAcceptor` bidi stream). Have it log + the `ChannelAcceptRequest.commitment_type` field and accept. +- From Alice, open four channels in turn, each with a different + combination of flags on `openchannel`: + 1. `--channel_type=taproot-final` + 2. `--channel_type=taproot-final --zero_conf` + 3. `--channel_type=taproot-final --scid_alias` + 4. `--channel_type=taproot-final --zero_conf --scid_alias` + + (Zero-conf and SCID-alias also require the relevant `--protocol.*` + flags and `--protocol.option-scid-alias` on both nodes; consult + [`docs/zero_conf_channels.md`](../../zero_conf_channels.md).) + +**Pass/Fail signal:** +- **PASS** if the acceptor logs `commitment_type == SIMPLE_TAPROOT_FINAL` + for all four opens. +- **FAIL** if any open shows `UNKNOWN_COMMITMENT_TYPE`, + `SIMPLE_TAPROOT` (staging), or anything else. + +--- + +### S5: Cooperative close (non-RBF) of a production taproot channel + +**Goal:** Plain coop close still works on `taproot-final`. RBF +coop-close is covered separately in +[`rbf-taproot-coop-close.md`](./rbf-taproot-coop-close.md). 
+ +**Steps:** +```bash +CP=$($LNCLI_A listchannels | \ + jq -r '.channels[] | select(.remote_pubkey=="'$BOB_PUB'") | .channel_point') +$LNCLI_A closechannel --funding_txid=${CP%:*} --output_index=${CP##*:} +bitcoin-cli -regtest generatetoaddress 6 $ADDR +``` + +**Pass/Fail signal:** +- **PASS** if the channel disappears from `listchannels` on both + sides and the close-tx is mined. +- **FAIL** if the close hangs, force-closes instead of cooperating, + or the close transaction fails to relay. + +## Failure investigation + +- **Logs to grep on either side:** + - `grep -iE "taproot|musig|reestablish" ~/.alice/logs/bitcoin/regtest/lnd.log` + - Subsystems to set to `debug`: `PEER`, `CNCT`, `HSWC`, `LNWL`. +- **Channel-state introspection:** + - `lncli listchannels --include_channel_status_flags` — look at + `commitment_type` and `local_chan_reserve_sat`. + - `lncli pendingchannels` — for in-flight opens, check + `commitment_type` matches what was requested. +- **Prior regressions to watch for:** + - #10672 — private taproot channels with v1 gossip rebuilding + their funding script as legacy P2WSH on restart. Surfaces as + failed spend detection during force-close. + - #10763 — acceptor `UNKNOWN_COMMITMENT_TYPE` for production + taproot + scid-alias/zero-conf combinations. + +## Related itests + +- `itest/lnd_open_channel_test.go` — taproot open paths. +- `itest/lnd_taproot_test.go` — taproot-specific HTLC and close flows. +- `itest/lnd_channel_force_close_test.go` — force-close on taproot. + +## Out of scope + +- RBF cooperative close on taproot — see + [`rbf-taproot-coop-close.md`](./rbf-taproot-coop-close.md). +- Splice on taproot channels — not in v0.21.0. +- Taproot overlay channels (`--protocol.simple-taproot-overlay-chans`) + — separate commitment type, not the focus of this guide. 
diff --git a/docs/testing-guides/v0.21.0/rbf-taproot-coop-close.md b/docs/testing-guides/v0.21.0/rbf-taproot-coop-close.md new file mode 100644 index 0000000000..ee1dd64509 --- /dev/null +++ b/docs/testing-guides/v0.21.0/rbf-taproot-coop-close.md @@ -0,0 +1,232 @@ +# RBF Cooperative Close on Taproot Channels — v0.21.0 RC Testing Guide + +**PRs:** #10063 (RBF coop close + taproot/MuSig2), #10763 (overlay narrowing) +**Risk:** headline +**Audience:** node operators, channel-acceptor clients +**Backends affected:** all +**Networks:** regtest (primary), signet + +> ⚠️ **Authoring note:** Please verify the exact CLI mechanism for +> triggering successive RBF iterations on a coop close (re-running +> `closechannel` with a higher fee vs. a dedicated bump RPC) against +> the latest behavior before publishing. The scenarios below assume +> re-running `closechannel` re-enters the state machine and produces +> a new `ClosingComplete`. + +## What this feature does + +v0.21.0 extends the RBF cooperative close protocol +(`--protocol.rbf-coop-close`, introduced earlier) to simple taproot +channels. Each RBF iteration produces a fresh `ClosingComplete` / +`ClosingSig` pair with new MuSig2 partial signatures, using the JIT +(just-in-time) nonce pattern: the closer's nonce is bundled with its +signature in `ClosingComplete`, and the closee rotates its nonce via +`NextCloseeNonce` in `ClosingSig` for every round. The state machine +stores the `MusigPartialSig` and invalidates nonces after each +signing round to prevent reuse. + +A follow-up (#10763) narrows the RBF coop-close auto-enable to +*skip* taproot-overlay channels, since the RBF close state machine +does not yet thread through the `AuxCloser` hook overlay channels +rely on. + +## Why it matters / what could break + +- **Nonce reuse across RBF rounds** is a hard MuSig2 violation — if + it happens, signatures become forgeable. Watch for it. 
+- A regression on the JIT nonce ordering surfaces as repeated + `ClosingComplete` round-trips that never produce a confirming + transaction. +- Taproot-overlay channels accidentally entering the RBF flow will + produce nil-pointer dereferences or aux-close build failures. +- A non-taproot peer that signals `--protocol.rbf-coop-close` should + still complete a normal (non-MuSig2) RBF coop close; regressions + in the channel-type dispatch could break that path too. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer, on both peers. +- **Backend:** `bitcoind` (regtest); a real fee market makes this + easier to test on signet. +- **Peers:** Alice and Bob, both started with: + ``` + protocol.simple-taproot-chans=1 + protocol.rbf-coop-close=1 + ``` +- **Tools:** `lncli`, `bitcoin-cli`, `jq`. + +Shell variables: same as +[`production-taproot-channels.md`](./production-taproot-channels.md). + +## Setup + +```bash +# 1. With both nodes started under the prerequisites, open a +# production taproot channel from Alice to Bob. +$LNCLI_A openchannel \ + --node_key=$BOB_PUB \ + --local_amt=5000000 \ + --channel_type=taproot-final + +bitcoin-cli -regtest generatetoaddress 6 $ADDR + +CP=$($LNCLI_A listchannels | \ + jq -r '.channels[] | select(.remote_pubkey=="'$BOB_PUB'") | .channel_point') +echo "channel_point=$CP" + +# Setup verification: +$LNCLI_A listchannels | \ + jq '.channels[] | select(.remote_pubkey=="'$BOB_PUB'") | .commitment_type' +# Expected: "SIMPLE_TAPROOT_FINAL" +``` + +## Scenarios + +### S1: First-round RBF coop close on a taproot channel + +**Goal:** A single `closechannel` call on a `taproot-final` channel +produces a `ClosingComplete` / `ClosingSig` exchange with MuSig2 +partial signatures and a valid closing tx in the mempool. + +**Steps:** +```bash +# Stop generating blocks; we want the close tx to sit in mempool. 
+$LNCLI_A closechannel \ + --funding_txid=${CP%:*} --output_index=${CP##*:} \ + --sat_per_vbyte=2 \ + --max_fee_rate=200 & + +# Wait briefly for the exchange. +sleep 5 + +# Inspect mempool for the closing tx. +bitcoin-cli -regtest getrawmempool | jq 'length' +``` + +**Pass/Fail signal:** +- **PASS** if `getrawmempool` shows exactly 1 transaction and Alice's + log contains a `ClosingComplete` message sent to Bob. +- **FAIL** if no tx in mempool after 10s, or Alice's log shows + `unable to derive musig partial sig` or + `received nonce of wrong type`. + +--- + +### S2: RBF bump produces a new closing tx with a higher fee + +**Goal:** Trigger a second round. The new closing tx must +(a) double-spend the first, (b) use a higher fee, and (c) be signed +with **different** MuSig2 nonces. + +**Steps:** +```bash +# Capture the first closing tx fee. +TXID1=$(bitcoin-cli -regtest getrawmempool | jq -r '.[0]') +FEE1=$(bitcoin-cli -regtest getmempoolentry $TXID1 | jq -r '.fees.base') + +# Trigger a second round at a higher fee rate. +$LNCLI_A closechannel \ + --funding_txid=${CP%:*} --output_index=${CP##*:} \ + --sat_per_vbyte=10 \ + --max_fee_rate=200 & + +sleep 5 + +TXID2=$(bitcoin-cli -regtest getrawmempool | jq -r '.[0]') +FEE2=$(bitcoin-cli -regtest getmempoolentry $TXID2 | jq -r '.fees.base') +``` + +**Pass/Fail signal:** +- **PASS** if all three hold: + - `$TXID2 != $TXID1` (new tx), + - `$FEE2 > $FEE1` (higher fee), + - Bob's log contains a `NextCloseeNonce` field on the new `ClosingSig` + that differs from the previous round's nonce. +- **FAIL** if the txid is unchanged, fee did not increase, or the + nonce on `ClosingSig` is reused across rounds (this is the + critical bug to catch — search Bob's log for the previous-round + nonce hex and confirm it does *not* reappear). 
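The S2 conditions can be checked mechanically rather than by eye. The sketch below is illustrative only: `rbf_bump_ok` and `nonce_reused` are hypothetical helper names, not part of lnd, and the nonce list is assumed to be hex strings you have already extracted by hand from Bob's log (one per round), since this guide does not pin down the exact log format. Fees are compared with `awk` because `getmempoolentry` reports `fees.base` as a decimal BTC amount, which the shell's integer comparison cannot handle.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for S2; names are illustrative, not part of lnd.

# rbf_bump_ok TXID1 FEE1 TXID2 FEE2
# Succeeds only if both txids are present, the txid changed, and the
# fee strictly increased. Fees are decimal BTC strings such as
# 0.00000382, so the comparison is delegated to awk.
rbf_bump_ok() {
    local txid1=$1 fee1=$2 txid2=$3 fee2=$4
    [ -n "$txid1" ] && [ -n "$txid2" ] || return 1
    [ "$txid1" != "$txid2" ] || return 1
    awk -v a="$fee1" -v b="$fee2" 'BEGIN { exit !(b + 0 > a + 0) }'
}

# nonce_reused
# Reads one nonce hex string per line on stdin (one line per RBF round).
# Exits 0 and prints the repeated values if any nonce appears more than
# once; exits non-zero when every nonce is unique.
nonce_reused() {
    local dups
    dups=$(awk 'NF { print tolower($1) }' | sort | uniq -d)
    [ -n "$dups" ] && printf '%s\n' "$dups"
}
```

With the variables from the steps above, `rbf_bump_ok "$TXID1" "$FEE1" "$TXID2" "$FEE2"` covers the first two PASS conditions, and `nonce_reused < /tmp/nonces.txt` (a file you assemble yourself) flags the critical failure when it prints anything.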
+ +--- + +### S3: Final close confirms + +**Steps:** +```bash +bitcoin-cli -regtest generatetoaddress 6 $ADDR +sleep 2 +# Count the channels still open with Bob; 0 means the close finalized. +$LNCLI_A listchannels | \ + jq '[.channels[] | select(.remote_pubkey=="'$BOB_PUB'")] | length' +``` + +**Pass/Fail signal:** +- **PASS** if the command prints `0` (the channel is no longer in + `listchannels`, on either side) and the second-round tx confirmed on chain. +- **FAIL** if the channel is still listed, or the closing tx in the + mined block does not match `$TXID2` (an older round confirmed, so the + fee-bump replacement failed). + +--- + +### S4: Taproot-overlay channel must NOT auto-enable RBF coop close + +**Goal:** Regression guard for #10763. If a taproot-overlay channel +enters the RBF coop close path, the auxiliary close hook will not +fire and the close will misbehave. + +**Steps:** +- Start Alice and Bob with + `protocol.simple-taproot-overlay-chans=1` (in addition to RBF + coop), and an aux-close client registered. +- Open a taproot-overlay channel. +- Initiate a coop close. + +**Pass/Fail signal:** +- **PASS** if the close completes via the legacy (non-RBF) coop close + path, verified by log line `using legacy coop close` (or similar) + on the closer side, and the `AuxCloser` hook being invoked. +- **FAIL** if Alice's log shows the RBF state machine being entered + for an overlay channel, or the close transaction is built without + the aux-close additions. + +--- + +### S5: Non-taproot peer over RBF coop close (cross-check) + +**Goal:** Confirm RBF coop close still works on a non-taproot channel +when both peers signal `--protocol.rbf-coop-close`. Catches dispatch +regressions in the taproot/MuSig2 vs. legacy code split. + +**Steps:** +- Open an `anchors` channel (default) between Alice and Bob. +- Run S1 + S2 again on this channel. + +**Pass/Fail signal:** +- **PASS** if both rounds complete and the fee-bumped tx replaces + the original in mempool (same as S2, but no MuSig2 logs expected).
+- **FAIL** if the close hangs or errors with a MuSig2-related + message on a non-taproot channel. + +## Failure investigation + +- **Logs (Alice and Bob):** set `PEER`, `LNWL`, `CRTR` to `debug`. + Grep for `closing_complete`, `closing_sig`, `musig`, + `NextCloseeNonce`, `ClosingComplete`. +- **State machine state:** `lncli pendingchannels` — + `pending_force_closing_channels` and `waiting_close_channels` reflect + intermediate states. +- **Nonce reuse detection:** dump each round's nonce hex from logs; + any repeat across rounds for the same channel is a bug. + +## Related itests + +- `itest/lnd_rbf_coop_test.go` (or equivalent — verify the exact file + in v0.21.0). +- `peer/musig_nonce_order_test.go` for nonce-ordering unit coverage. + +## Out of scope + +- Plain (non-RBF) cooperative close on taproot — see + [`production-taproot-channels.md` scenario S5](./production-taproot-channels.md). +- Force close (unilateral) on taproot. +- Splice — not in v0.21.0. diff --git a/docs/testing-guides/v0.21.0/reorg-safe-closes.md b/docs/testing-guides/v0.21.0/reorg-safe-closes.md new file mode 100644 index 0000000000..2afcaa1cbe --- /dev/null +++ b/docs/testing-guides/v0.21.0/reorg-safe-closes.md @@ -0,0 +1,245 @@ +# Reorg-Safe Channel Closes + `MinCLTVDelta` Change — v0.21.0 RC Testing Guide + +**PRs:** #10331 (reorg-safe closes + MinCLTVDelta raise), #10509 (new PendingChannels fields) +**Risk:** high-regression +**Audience:** node operators, integrators with custom CLTV invoice flows, RPC clients tracking close progress +**Backends affected:** all +**Networks:** regtest (primary), signet, mainnet + +## What this feature does + +Two coupled changes in v0.21.0 alter the close lifecycle: + +1. **Reorg protection on channel closes** (#10331). Previously, any + channel close was considered final the moment the spending + transaction was detected. 
v0.21.0 now waits between 3 and 6 + confirmations before resolving a channel as closed, scaled + linearly with channel capacity up to the non-wumbo maximum + (~0.168 BTC). Wumbo channels always require 6 confirmations. + +2. **New `PendingChannels` fields** (#10509). `WaitingCloseChannel` + now exposes `blocks_til_close_confirmed` (countdown) and + `close_height` (block height the close-tx was first confirmed + at), so clients can render progress. + +The MinCLTVDelta bump (also in #10331) is the breaking part: + +3. **`MinCLTVDelta` raised from 18 to 24**, providing more safety + margin above `DefaultFinalCltvRejectDelta` (19 blocks). Custom + CLTV deltas in the 18–23 range on `addinvoice` are now rejected. + The default of 80 is unchanged. Existing invoices created on + prior versions continue to work normally. + +## Why it matters / what could break + +- A client that polls `closedchannels` and expects entries to + appear immediately on spend will now see a multi-block lag. +- A client that uses `WaitingCloseChannel` but doesn't render the + new fields will under-inform users (cosmetic). +- An RPC client or wallet that always passes a custom + `cltv_expiry_delta` (e.g. 20) on `addinvoice` will now get + rejection errors at the daemon. Watch for surprised integrators. +- Wallets that decode incoming invoices created before the upgrade + with `cltv_expiry_delta < 24` must still honor them — verify the + payee/sender side is permissive even though the issuer side is now + strict. +- If the scaling math regresses (wrong conf count chosen for a given + capacity), the close finalizes too early or too late. + +## Prerequisites + +- **lnd build:** v0.21.0-beta.rc1 or newer. +- **Backend:** `bitcoind` regtest. Reorgs on regtest are scripted + via `invalidateblock`/`reconsiderblock`. +- **Peers:** Alice and Bob; a third node Carol for the MinCLTV + test (so the rejection error path is reachable end-to-end). +- **Tools:** `lncli`, `bitcoin-cli`, `jq`. 
+ +## Setup + +```bash +# 1. Open three channels Alice ↔ Bob of different capacities to +# exercise the scaling logic: +# - small: 500_000 sat +# - medium: 5_000_000 sat +# - wumbo: 20_000_000 sat (requires --protocol.wumbo-channels) +$LNCLI_A openchannel --node_key=$BOB_PUB --local_amt=500000 +$LNCLI_A openchannel --node_key=$BOB_PUB --local_amt=5000000 +$LNCLI_A openchannel --node_key=$BOB_PUB --local_amt=20000000 # wumbo + +bitcoin-cli -regtest generatetoaddress 6 $ADDR +``` + +## Scenarios + +### S1: Small channel — close requires ~3 confirmations + +**Goal:** A 500k-sat channel close lingers in `waiting_close_channels` +until the scaled-conf threshold is reached. + +**Steps:** +```bash +# Pick the 500k channel. +CP=$($LNCLI_A listchannels | \ + jq -r '.channels[] | select(.capacity=="500000") | .channel_point') + +$LNCLI_A closechannel \ + --funding_txid=${CP%:*} --output_index=${CP##*:} \ + --sat_per_vbyte=2 + +# Mine 1 block (close-tx confirms). +bitcoin-cli -regtest generatetoaddress 1 $ADDR + +# Inspect waiting-close state. +$LNCLI_A pendingchannels | jq '.waiting_close_channels[]' +``` + +**Pass/Fail signal:** +- **PASS** if (a) the channel is in `waiting_close_channels`, (b) + `close_height` equals the block height the close-tx confirmed at, + (c) `blocks_til_close_confirmed` equals roughly 3 (the lower + scaled-conf bound for a small channel), and (d) after mining 2 + more blocks the channel moves to `closedchannels`. +- **FAIL** if the channel finalizes immediately (no waiting state) + or if `blocks_til_close_confirmed` is missing/zero from the start. + +--- + +### S2: Mid-size channel — close requires more confs than S1 + +**Steps:** Same as S1 against the 5M-sat channel. + +**Pass/Fail signal:** +- **PASS** if `blocks_til_close_confirmed` is strictly greater than + S1's value (the scaling actually scales) and bounded above by 6. +- **FAIL** if it's identical to S1's value (no scaling), or > 6. 
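The S2 criterion (strictly more confirmations than S1, never more than 6) is easy to misjudge if the two values are read at different points in the countdown. The sketch below encodes only that comparison; `check_close_conf_scaling` and `is_uint` are hypothetical helper names, and the two arguments are assumed to be the `blocks_til_close_confirmed` values you recorded for the small and medium channels right after each close-tx received its first confirmation.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for S1/S2; names are illustrative, not part of lnd.

# is_uint VALUE: succeeds if VALUE is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# check_close_conf_scaling SMALL_BLOCKS MEDIUM_BLOCKS
# Prints PASS when the medium channel needs strictly more confirmations
# than the small one and no more than 6; prints a FAIL reason otherwise.
check_close_conf_scaling() {
    local small=$1 medium=$2
    if ! is_uint "$small" || ! is_uint "$medium"; then
        echo "FAIL: missing or non-numeric value"
        return 1
    fi
    if [ "$medium" -le "$small" ]; then
        echo "FAIL: no scaling ($medium <= $small)"
        return 1
    fi
    if [ "$medium" -gt 6 ]; then
        echo "FAIL: above the 6-confirmation cap ($medium)"
        return 1
    fi
    echo "PASS"
}
```

For example, `check_close_conf_scaling 3 4` prints `PASS`, while `check_close_conf_scaling 3 3` reports that no scaling took place.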
+ +--- + +### S3: Wumbo channel — close always requires 6 confirmations + +**Steps:** Same as S1 against the 20M-sat channel. + +**Pass/Fail signal:** +- **PASS** if `blocks_til_close_confirmed` is exactly 6 the moment + the close-tx confirms. +- **FAIL** if anything other than 6. + +--- + +### S4: Force-close also respects the new conf requirement + +**Goal:** Reorg protection applies to unilateral closes too, not +just coop closes. + +**Steps:** +- Open another small channel; have Alice force-close it. +- Inspect `pending_force_closing_channels` after 1 conf. + +**Pass/Fail signal:** +- **PASS** if the force-close also stays pending for the scaled + number of confs. +- **FAIL** if force-close finalizes immediately on first + confirmation (regression: half the fix doesn't help if the other + half is missing). + +--- + +### S5: Reorg before close-conf threshold rewinds the close + +**Goal:** A reorg that unconfirms the close-tx before the +threshold should rewind the channel to "still open" rather than +incorrectly considering it closed. + +**Steps:** +```bash +# Coop-close a channel; mine 1 confirmation only. +$LNCLI_A closechannel --funding_txid=... --output_index=... --sat_per_vbyte=2 +bitcoin-cli -regtest generatetoaddress 1 $ADDR + +# Reorg the block out. +TIP=$(bitcoin-cli -regtest getbestblockhash) +bitcoin-cli -regtest invalidateblock $TIP + +# Mine a competing empty block. invalidateblock returns the close-tx to +# the mempool, so a plain generatetoaddress would confirm it again; +# generateblock with an empty tx list avoids that (needs a bitcoind +# version that provides the generateblock RPC). +OTHER_ADDR=$(bitcoin-cli -regtest getnewaddress) +bitcoin-cli -regtest generateblock "$OTHER_ADDR" '[]' +``` + +**Pass/Fail signal:** +- **PASS** if Alice's daemon notices the disconfirmation (log + contains `reorg detected` or similar) and `pendingchannels` + reports the close as no longer confirmed (e.g. `close_height` + cleared or the channel reverts to active). Mining the close-tx + again then restarts the confirmation countdown. +- **FAIL** if the close stays "almost final" despite the reorg + invalidating it, or if Alice's chain-watch panics.
+ +--- + +### S6: `MinCLTVDelta` rejects new invoices with custom delta 18–23 + +**Goal:** The breaking-change side of #10331. Verify both the +rejection path and the error message. + +**Steps:** +```bash +# Default delta: should succeed. +$LNCLI_A addinvoice --amt=1000 --cltv_expiry_delta=80 | jq -r '.payment_request' > /tmp/inv-ok.txt +echo $? + +# Custom delta in the new-rejected band: should fail. +$LNCLI_A addinvoice --amt=1000 --cltv_expiry_delta=20 +echo $? +``` + +**Pass/Fail signal:** +- **PASS** if delta=80 succeeds and delta=20 fails with a clear + error message naming the new minimum (24). Non-zero exit code on + the 20 call. +- **FAIL** if delta=20 silently accepts (the breaking change didn't + land), or if the error message is opaque (doesn't help operators + fix their config). + +--- + +### S7: Existing pre-upgrade invoices with delta < 24 still pay + +**Goal:** Backwards-compatibility on the receiving side — a node +running v0.21 must still settle an invoice with `cltv_expiry_delta=20` +that was issued by a pre-0.21 node (or by an external invoice +generator). + +**Steps:** +- From a pre-0.21 build of Bob (or from any node still on v0.20), + generate an invoice with `cltv_expiry_delta=20`. +- Pay it from Alice (running v0.21). + +**Pass/Fail signal:** +- **PASS** if the payment succeeds; final hop accepts the HTLC. +- **FAIL** if Alice's payer side rejects, or if Bob's older node + fails to accept the inbound HTLC due to a v0.21-injected delta. + +## Failure investigation + +- **Subsystems:** `CRTR`, `CNCT`, `HSWC`, `BCST`. +- **Useful log lines:** `reorg`, `close confirmation`, `cltv_expiry`, + `blocks_til`. +- **State to query:** `pendingchannels`, `closedchannels`, + `decodepayreq ` to inspect a captured invoice's delta. +- **Scaling math regression:** if S1/S2/S3 don't show monotonically + increasing conf counts, dump the capacity-to-conf mapping from + the daemon's log at close time and cross-check against #10331. 
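S6 probes only deltas 80 and 20; walking the boundary itself (18, 23, 24) catches off-by-one mistakes in the new minimum. The following is a sketch under stated assumptions: `check_cltv_boundary` and `expected_outcome` are hypothetical helper names, the minimum of 24 is taken from this guide, and the invoice command is passed in by the caller so the helper makes no claims about lncli beyond the `--cltv_expiry_delta` flag already used in S6.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for S6; names are illustrative, not part of lnd.

# Minimum taken from this guide (#10331); adjust if the release differs.
MIN_CLTV_DELTA=24

# expected_outcome DELTA: prints "accept" or "reject".
expected_outcome() {
    if [ "$1" -ge "$MIN_CLTV_DELTA" ]; then
        echo accept
    else
        echo reject
    fi
}

# check_cltv_boundary CMD [ARGS...]
# Runs "CMD ARGS... --cltv_expiry_delta=N" for deltas around the
# boundary and compares each exit status with the expected outcome.
# Returns non-zero if any delta behaves unexpectedly.
check_cltv_boundary() {
    local delta want got rc=0
    for delta in 18 23 24 80; do
        want=$(expected_outcome "$delta")
        if "$@" --cltv_expiry_delta="$delta" >/dev/null 2>&1; then
            got=accept
        else
            got=reject
        fi
        if [ "$got" = "$want" ]; then
            echo "delta=$delta: $got (ok)"
        else
            echo "delta=$delta: $got (expected $want)"
            rc=1
        fi
    done
    return "$rc"
}
```

A run against Alice would look like `check_cltv_boundary $LNCLI_A addinvoice --amt=1000`; note that accepted deltas create real invoices. The helper discards the command's output, so rerun a rejected delta by hand to confirm the error message names the new minimum.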
+ +## Related itests + +- `itest/lnd_channel_force_close_test.go` and + `itest/lnd_channel_open_test.go` — close path coverage. +- `itest/lnd_payment_test.go` — invoice-delta validation. + +## Out of scope + +- Force-close fee-bumping behavior — see `bumpforceclosefee` flows, + unrelated to this guide. +- The closed-channel tombstone on sqlite/postgres — separate guide + ([`closed-channel-tombstone.md`](./closed-channel-tombstone.md)). +- The legacy non-scaled close logic (no longer present in v0.21.0).