feat(data-branch): Branch Protect Snapshot to guard LCA history from GC#24313
Open
gouhongshen wants to merge 3 commits intomatrixorigin:mainfrom
Open
feat(data-branch): Branch Protect Snapshot to guard LCA history from GC#24313gouhongshen wants to merge 3 commits intomatrixorigin:mainfrom
gouhongshen wants to merge 3 commits intomatrixorigin:mainfrom
Conversation
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
…rom GC
Fixes the silent UPDATE-to-INSERT downgrade that hit `data branch diff`
after a flush + global checkpoint + disk-cleaner GC cycle on the parent
side of a branch (e.g. repro_stale_read.sql case 4).
When the LCA probe (pkg/frontend/data_branch_hashdiff.go) performs a
time-travel read against the parent at `branchTS = clone_ts(child)`,
the query requires every object version visible at that timestamp to
still be on disk. Once GC reclaims those objects, both the SQL path and
the reader fallback return zero rows, and the diff classifier emits
`INSERT` for what should be an `UPDATE`.
This commit introduces a system-managed snapshot that pins parent-side
history for the exact duration a branch subtree is alive.
Design (docs/design/data_branch_protect_snapshot.md):
* On every successful DATA BRANCH CREATE TABLE/DATABASE, write a
`kind='branch'` row into mo_catalog.mo_snapshots anchored on the
parent's account, with sname='__mo_branch_<child_tid>',
ts=clone_ts(child), obj_id=parent_tid, level='table'.
* Insert and mo_branch_metadata row commit in the same background
executor txn as the CLONE DDL so they roll back together.
* Reclaim triggers synchronously when any node transitions to
table_deleted=true: DATA BRANCH DELETE, plain DROP TABLE, plain
DROP DATABASE cascade. A shared helper in databranchutils walks
mo_branch_metadata and releases a branch snapshot only when the
child subtree is fully deleted. Frontend path uses a
BackgroundExec entry point; ddl.go uses a runSQL closure.
* SHOW SNAPSHOTS filters out kind='branch' rows.
* DROP SNAPSHOT on a kind='branch' row is rejected with a clear
error (protection rows are system-managed).
* Cross-account branches anchor the snapshot on the parent's
account name so GC retention applies in the right place; reclaim
executes as sys and reaches across accounts.
Coverage:
* 9 unit tests in pkg/frontend/data_branch_snapshot_test.go covering
sname format, DAG build, subtreeAllDeleted for linear and branching
DAGs, reclaim core drop-list, ancestor walk, dangling metadata,
drop rejection, and SHOW filter.
* 7 engine tests in pkg/vm/engine/test/branch_protect_snapshot_test.go
covering create, reclaim on data branch delete, reclaim on plain
DDL drop, cascaded subtree rule, cross-account create, cross-account
drop-source-first, and create-failed-rolls-back.
* 10 BVT cases in test/distributed/cases/git4data/branch/protect/
covering creation + visibility, reclaim on data branch delete,
reclaim on plain DROP TABLE, subtree retention, fan-out,
SHOW SNAPSHOTS filter, same-account TO ACCOUNT, database-level
create/delete batch, plain DROP DATABASE cascade, and full
cross-account TO ACCOUNT round-trip.
* diff_9.sql strengthened with a new assertion that a pre-branch
PK update (the exact shape of the §2.2 bug) continues to be
classified as `t1 UPDATE` after GC.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Branch Protect Snapshot feature added a `kind` column to
mo_snapshots and a new `getSnapshotKindByName` probe in doDropSnapshot,
which broke four unit tests that predate this PR:
pkg/sql/plan:
- TestShow
- TestCoverage_buildShowSnapshots
- TestCoverage_buildShowSnapshots_WithWhere
(fail with "column kind does not exist" because the mock
mo_snapshots schema lacked the new column that
buildShowSnapShots now filters on)
pkg/frontend:
- TestDoDropSnapshot (success sub-cases)
(fail with "it is not the type of result set" because the
new getSnapshotKindByName SQL is not registered in the
backgroundExecTest sql2result map)
Fixes:
- pkg/sql/plan/mock.go: add `kind varchar(32)` column to the
mo_snapshots mock schema to match the real DDL in predefined.go.
- pkg/frontend/authenticate_test.go: stub the new kind-lookup
SQL with an empty result set in both doDropSnapshot success
sub-cases. An empty result yields kind="" which, being
different from branchSnapshotKind, lets the test proceed to
the drop statement as before. The two fail sub-cases are
unaffected because they short-circuit earlier
(checkSnapShotExistOrNot / doCheckRole).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ind constant Addresses Must-Fix #1/#2 from the PR review. 1) DAG cycle guard Both ComputeBranchReclaimDropList and SubtreeAllDeleted used to walk parent pointers / descendants without any cycle detection. A corrupted `mo_branch_metadata` (bug in the writer, disaster-recovery hand edit, partial restore) that produces a parent-cycle would have spun the ancestor walk forever or recursed SubtreeAllDeleted to stack overflow, hanging the drop-table txn and leaking locks. Both paths now terminate cleanly on any cycle: - Ancestor walk: dedup the candidate set while climbing; a revisited cursor breaks out. - Subtree check: per-invocation visited set + memoization cache. Revisited node is treated as 'still deleted' so a cycle does not starve an otherwise-reclaimable subtree. Amortised O(N) instead of O(N²) when many candidates share ancestors. Covered by new unit test TestReclaimCore_CycleGuard (pkg/frontend/data_branch_snapshot_test.go). 2) Shared 'branch' kind constant pkg/sql/plan/build_show.go hard-coded the literal 'branch' in the SHOW SNAPSHOTS filter. Now referenced from databranchutils.BranchSnapshotKind via %s so the single source of truth stays intact. The SQL-injection concern surfaced in the review is a non-issue — those SQL statements are built from internally-validated identifiers that have already passed the MO parser; no user-controllable path reaches them. The existing fmt.Sprintf call-sites are kept as-is. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aa0160c to
24ee591
Compare
gouhongshen
commented
May 9, 2026
Contributor
Author
gouhongshen
left a comment
There was a problem hiding this comment.
🔄 Re-review: 🔴 Must Fix 修复状态
Based on commits 3e17b1f (test CI fix) and 24ee591 (cycle-safe DAG walk + shared constant):
| # | 原始问题 | 修复状态 | 说明 |
|---|---|---|---|
| 1 | DAG 祖先遍历无环路防护 → 死循环 | ✅ 已修复 | ComputeBranchReclaimDropList 中增加 if _, seen := candidates[cursor]; seen { break },已走过的节点立即终止回溯。同时消除了共享祖先的重复遍历。 |
| 2 | SubtreeAllDeleted 无界递归 → 栈溢出 |
✅ 已修复 | 重构为 subtreeAllDeletedMemo + visited set(防环)+ memo cache(O(N) 均摊)。循环节点视为 "deleted" 以免阻塞回收。新增 TestReclaimCore_CycleGuard 验证 A↔B 环路不 hang。 |
| 3 | SQL 注入 — createBranchProtectSnapshot 标识符未转义 |
新增注释解释 "values are interpolated via fmt.Sprintf because every user-controllable string here has already passed through the MO parser/catalog path, so it is a legal MySQL identifier and never carries a quote"。 |
对 Must Fix #3 (SQL 注入) 的裁决
作者的论点是:receipt.srcDb / receipt.srcTbl 必然经过 MO 解析器 → catalog 路径解析,因此不可能包含 ' 等注入字符。
评估:
- ✅ 当前安全: MO 解析器确实会在 AST 层面验证/规范化标识符,反引号包裹的标识符解析后不含
' ⚠️ 结构脆弱性: 此安全依赖是隐式的(靠调用链上游保证,非编译器强制)。未来若有新代码路径绕过解析器直接填充 receipt(如 restore/migration 工具),保护将失效- 🟡 降级为 Should Fix: 鉴于当前所有写入路径均经过 parser,实际可利用性低。但建议加一行防御性转义作为 defense-in-depth:
escapeSQLStr := func(s string) string { return strings.ReplaceAll(s, "'", "''") }
额外观察
- ✅
build_show.go现在使用databranchutils.BranchSnapshotKind替代硬编码'branch'(修复了 🟡 Should Fix #7) - ✅
BuildBranchSnapshotDeleteSQL移除了ReplaceAll(s, "'", "''"),因为 sname 是内部合成的__mo_branch_<decimal>格式,永远不含引号。注释已说明原因。这是合理的简化。 - ✅ CI 修复:
authenticate_test.go和plan/mock.go正确补充了kind列 mock - ✅
TestReclaimCore_CycleGuard使用 2s timeout + goroutine 验证不 hang,设计合理
最终裁决
| 严重度 | 数量 | 状态 |
|---|---|---|
| 🔴 Must Fix | 0 | ✅ 全部解决 |
| �� Should Fix (遗留) | 8 | 建议 follow-up 处理 |
| 🟡 Should Fix (新增/降级) | 1 | SQL 转义 defense-in-depth |
结论: 三个阻塞合并的 🔴 问题中,两个已通过代码修复解决,一个已通过合理论证降级为 🟡。PR 已达到可合并状态。剩余 🟡 issues(并发测试、DAG 加载重复、隐式耦合等)建议作为 follow-up 迭代。
Re-review by 7-agent jury system (code-review skill v1.5)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What type of PR is this?
Which issue(s) this PR fixes:
issue #23751
What this PR does / why we need it:
Introduces Branch Protect Snapshot, a system-managed
kind='branch'rowin
mo_catalog.mo_snapshotsthat pins parent-side history for the exactduration a branch subtree is alive.
Motivation: after a flush + global checkpoint + disk-cleaner GC cycle, the
LCA probe in
pkg/frontend/data_branch_hashdiff.go(which time-travels toclone_ts(child)on the parent) can silently downgrade anUPDATEinto anINSERTonce GC reclaims the pre-branch object of the probed row. Thereproduction is captured in
test/distributed/cases/git4data/branch/diff/diff_9.sqlCase 3.
The new snapshot feeds the branch timestamp into the TAE GC retention engine
(
pkg/vm/engine/tae/logtail/snapshot.go), so the backing objects stay ondisk as long as any branch descendant needs them.
Lifecycle
DATA BRANCH CREATE TABLE/DATABASE→updateBranchMetaTablecreateBranchProtectSnapshot(atomic with the CLONE DDL andmo_branch_metadatainsert)DATA BRANCH DELETE TABLE/DATABASEreclaimBranchSnapshotsWithBHDROP TABLE/DROP DATABASEcascade(*Compile).reclaimBranchProtectSnapshotsusing arunSQLclosuremo_branch_metadatadatabranchutils.ReclaimBranchSnapshotsCoreAn edge
(parent, child, clone_ts)is reclaimed iff the entire subtreerooted at
childis deleted — sibling and grand-descendant branches keepthe snapshot alive as long as any of them references it.
User surface
SHOW SNAPSHOTSfilters outkind = 'branch'rows.DROP SNAPSHOT __mo_branch_<tid>is rejected with "managed by data branch".DATA BRANCH CREATE ... TO ACCOUNT <b>anchors the snapshoton the parent's account, so GC retention applies where the parent's
objects actually live; reclaim runs as sys to reach across accounts.
No schema migration
Reuses the existing
mo_snapshots.kindcolumn (defaults to'user'). Nobackfill for pre-existing branches — their LCA history has already been GC'd
and a snapshot at the original
clone_tswould not resurrect it. Documentadvises
DROPand recreate for affected branches.Test coverage
Unit tests (9/9 pass)
pkg/frontend/data_branch_snapshot_test.go— sname format, DAG builder,subtreeAllDeletedon linear / branching DAGs, reclaim core drop-list,ancestor walk, dangling metadata handling,
doDropSnapshotbranch-kindrejection,
buildShowSnapShotsfilter.Engine tests (7/7 pass)
pkg/vm/engine/test/branch_protect_snapshot_test.go— create, reclaim ondata branch delete, reclaim on plainDROP TABLE, cascaded subtree rule,cross-account create, cross-account drop-source-first, create-failed-
rolls-back.
BVT (10 cases, 194 queries, 100% pass in 3 consecutive runs)
test/distributed/cases/git4data/branch/protect/protect_1..10.{sql,result}:DATA BRANCH DELETE TABLEDROP TABLEDATA BRANCH CREATE/DELETE DATABASEbatch create+reclaimDROP DATABASEcascade reclaimTO ACCOUNT <b>round-tripGC → diff regression (3/3 pass, 59 queries)
test/distributed/cases/git4data/branch/diff/diff_9.sqltightened with anew assertion that an update on a pre-branch PK (the exact shape of the
bug) is still classified as
t1 UPDATEafter GC — this is the end-to-endverification that the feature actually fixes the reported bug.
Special notes for your reviewer:
cloneReceiptgrows three fields (srcTableID,dstTableID,srcAccountName) so the snapshot insert can reuse the IDs thatupdateBranchMetaTablealready resolves.(*Compile).reclaimBranchProtectSnapshotsis the single hook pointadded in
pkg/sql/compile/ddl.go; it runs synchronously after theUPDATE mo_branch_metadata SET table_deleted = truestatement.docs/design/data_branch_protect_snapshot.md.