[python] Add table repair fix mode and catalog integration (Part 2/3) #7944
JunRuiLee wants to merge 2 commits
Implement read-only metadata consistency verification for Paimon tables. This verifies the chain LATEST -> snapshot -> manifest list -> manifest files -> data files, and reports any broken links or corrupted files.

Key components:
- `RepairIssue`/`RepairReport`: data classes for structured issue reporting
- `TableRepair.verify()`: walks the metadata chain and detects issues
- Support for branch-qualified tables and partitioned data file paths
- Respects custom `partition.default-name` configuration
- Progress logging every 1000 data files when `check_data_files=True`
- Documented time complexity: O(total_data_files)
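The chain walk described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `verify` function, the issue kind strings, the local `snapshot/` and `manifest/` directory layout, and the `baseManifestList`/`deltaManifestList` field names are all assumptions made for the example. It stops at the manifest list level, because checking manifest files and data files would require decoding the manifest contents.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class RepairIssue:
    kind: str          # hypothetical category, e.g. "MISSING_SNAPSHOT"
    path: str          # file the issue refers to
    detail: str = ""


@dataclass
class RepairReport:
    issues: List[RepairIssue] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.issues


def verify(table_path: str) -> RepairReport:
    """Read-only walk: LATEST -> snapshot -> manifest lists (assumed layout)."""
    report = RepairReport()
    root = Path(table_path)
    latest = root / "snapshot" / "LATEST"
    if not latest.exists():
        report.issues.append(RepairIssue("MISSING_LATEST", str(latest)))
        return report
    try:
        snapshot_id = int(latest.read_text(encoding="utf-8").strip())
    except ValueError:
        report.issues.append(RepairIssue("CORRUPT_LATEST", str(latest)))
        return report

    snapshot = root / "snapshot" / f"snapshot-{snapshot_id}"
    if not snapshot.exists():
        report.issues.append(RepairIssue("MISSING_SNAPSHOT", str(snapshot)))
        return report
    try:
        meta = json.loads(snapshot.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        meta = None
    if not isinstance(meta, dict):
        report.issues.append(RepairIssue("CORRUPT_SNAPSHOT", str(snapshot)))
        return report

    # Assumed snapshot JSON fields that name the manifest list files.
    for key in ("baseManifestList", "deltaManifestList"):
        name = meta.get(key)
        if name and not (root / "manifest" / name).exists():
            report.issues.append(
                RepairIssue("MISSING_MANIFEST_LIST", str(root / "manifest" / name), key)
            )
    return report
```

Each check returns early once a link is broken, since nothing further down the chain can be resolved from that point.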
Add the ability to fix metadata inconsistencies found during verification. Currently supports fixing the LATEST hint file to point to the newest valid snapshot when it references a missing one.

Key additions:
- `TableRepair.repair(dry_run=False)`: applies fixes after verification
- `repair_table`/`repair_database`/`repair_catalog` module-level entry points
- `Catalog.repair_table`/`repair_database`/`repair_catalog` API with type annotations
- `FileSystemCatalog` implementation delegating to repair module
- Fix mode selects newest snapshot with intact manifest chain
- `check_data_files` is respected when choosing which snapshot to fix to
- Per-table error isolation in `repair_database` (continues on failure)
- Idempotent fix operations (safe to re-run after interruption)
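A rough sketch of the fix-mode behaviour described above, under stated assumptions: `newest_valid_snapshot` and `fix_latest` are hypothetical names, the `snapshot/snapshot-<id>` layout is assumed, and the `is_intact` callback stands in for whatever manifest-chain (and optionally data-file) check the real code performs. It is meant to show the selection rule, the dry-run switch, and why a second run is a no-op, not to mirror the PR's code.

```python
from pathlib import Path
from typing import Callable, Optional


def newest_valid_snapshot(
    snapshot_dir: Path, is_intact: Callable[[Path], bool]
) -> Optional[int]:
    """Return the highest snapshot id whose chain passes is_intact, if any."""
    ids = []
    for p in snapshot_dir.glob("snapshot-*"):
        suffix = p.name[len("snapshot-"):]
        if suffix.isdigit():
            ids.append(int(suffix))
    for sid in sorted(ids, reverse=True):
        if is_intact(snapshot_dir / f"snapshot-{sid}"):
            return sid
    return None


def fix_latest(
    table_path: str,
    is_intact: Callable[[Path], bool],
    dry_run: bool = False,
) -> Optional[int]:
    """Return the id LATEST was (or, with dry_run, would be) rewritten to.

    Returns None when LATEST already references an existing snapshot
    or when no valid snapshot is available to point at.
    """
    snapshot_dir = Path(table_path) / "snapshot"
    latest = snapshot_dir / "LATEST"
    current = latest.read_text(encoding="utf-8").strip() if latest.exists() else None
    if (
        current is not None
        and current.isdigit()
        and (snapshot_dir / f"snapshot-{current}").exists()
    ):
        return None  # nothing to fix, which is what makes re-runs a no-op

    target = newest_valid_snapshot(snapshot_dir, is_intact)
    if target is None:
        return None
    if not dry_run:
        # Plain overwrite of the hint file; no delete step beforehand.
        latest.write_text(str(target), encoding="utf-8")
    return target
```

Passing a stricter `is_intact` (one that also checks data files) is how a `check_data_files`-style option could influence which snapshot is chosen.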
JingsongLi left a comment
Good follow-up. The fix mode and catalog integration look reasonable. Some feedback:
- `_fix_latest_file` atomicity: The fallback path (delete + overwrite when `try_to_write_atomic` returns False) has a small window where LATEST doesn't exist. In the Java Paimon implementation, `SnapshotManager` uses an overwrite-in-place strategy for hint files since they're best-effort. Consider just doing `overwrite_file_utf8` directly without the delete step; the atomic write is a nice-to-have for LATEST but not strictly required.
- `repair_database`/`repair_catalog` filtering: The directory listing uses `name.endswith(".db")` to find databases. This could accidentally pick up non-database directories that happen to end with `.db`. The Java catalog uses a metadata-based approach (listing from catalog metadata). Since this is filesystem-only, it's acceptable, but worth a comment noting the limitation.
- `except DatabaseNotExistException: raise` in `filesystem_catalog.py`: this try/except that just re-raises is a no-op. You can simplify to just call `self.get_database(database_name)` without the try block.
- Test duplication across Part 1 and Part 2: The test file in this PR (940 lines) includes all tests from Part 1 plus the new fix-mode tests. Since Part 2 depends on Part 1, the test file should only contain the new `TestTableRepairFixMode` class, with Part 1's tests already merged. Otherwise there will be conflicts when merging.
- Missing `check_data_files` in CLI: The CLI `cmd_table_repair` doesn't expose a `--check-data-files` flag. Should it? Users who want to do a deep check would need to use the Python API directly.
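For the last point, exposing the flag could look something like the sketch below. The parser layout here is hypothetical (the actual structure of `cmd_table_repair` is not shown in this PR excerpt); only the `--check-data-files` flag name comes from the comment above, and the `table` positional and `table-repair` program name are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical standalone parser; the real CLI may register this
    # as a subcommand with different positional arguments.
    parser = argparse.ArgumentParser(prog="table-repair")
    parser.add_argument("table", help="table identifier, e.g. db.table")
    parser.add_argument(
        "--check-data-files",
        action="store_true",
        help="also verify that referenced data files exist (slower deep check)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse maps --check-data-files to args.check_data_files,
    # which could then be forwarded to the Python API.
    print(args.table, args.check_data_files)
```

Defaulting to `False` keeps the shallow check as the normal CLI behaviour, so the deep check stays opt-in.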
Summary
- `TableRepair.repair(dry_run=False)` to fix metadata inconsistencies
- `repair_table`/`repair_database`/`repair_catalog` catalog API with return type annotations
- `FileSystemCatalog` implementation delegating to repair module
- `repair_database` (continues on individual table failures)

Context
Split from #7940 following @JingsongLi's review comment.
Depends on Part 1: #7943
Please merge in order: Part 1 → Part 2 → Part 3.
Tests added
- `test_repair_fix_dangling_latest`: fix mode rewrites LATEST to newest valid snapshot
- `test_dry_run_does_not_modify`: `dry_run=True` does not change any files
- `test_repair_database_level`: repairs all tables in a database
- `test_repair_catalog_level`: repairs all tables across all databases
- `test_fix_latest_respects_check_data_files`: fix skips snapshots with missing data files
- `test_repair_is_idempotent`: running repair twice converges; the second run is a no-op