Add fingerprint pre-check to Recon (Redshift)#1
Conversation
…ickslabs#2465) ## Changes ### What does this PR do? * Standardizes the reconcile sampling query across all dialects * Azure Synapse Dedicated SQL Pool does not accept VALUES as a derived table, so the previous T-SQL path failed at runtime on Synapse connections. ### Implementation Details replaces the previous CTE form (non-T-SQL) and `VALUES` derived-table form (T-SQL) with a single derived-table-join shape that works everywhere. ```sql -- Before (T-SQL path): INNER JOIN (VALUES (...), (...)) AS recon([col1], [col2]) ON ... -- Before (other dialects): WITH recon AS (SELECT ... UNION SELECT ...), src AS (...) SELECT ... FROM src INNER JOIN recon ON ... -- After (all dialects): SELECT ... FROM (...) AS src INNER JOIN (SELECT ... UNION SELECT ...) AS recon ON ... ``` ### Linked issues Resolves databrickslabs#2418 ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` ### Tests - [x] manually tested - [x] added unit tests - [x] added integration tests
…QL scripts (databrickslabs#2482) ## Changes <!-- Summary of your changes that are easy to understand. Add screenshots when necessary, they're helpful to illustrate the before and after state --> ### What does this PR do? * Converts the MSSQL profiler from `type: python` venv steps to `type: sql` / `type: ddl` in-process steps ### Key changes - Replace `activity_extract.py`, `info_extract.py`, and `mssql/common/*`. - Add and modify 13 query/DDL SQL pairs to profile. - Use exisitng connector `MSSQLConnector` instead of duplicate in `get_sqlserver_reader`. - DDL types match documented source types (`SMALLINT`, `INTEGER`, `DOUBLE`, `TIMESTAMP`). - `sys_info.sql` enumerates 37 columns explicitly (no `SELECT *`). - `database` configurator prompt no longer defaults to `master`; user supplies the actual target DB. Docs (`mssql.mdx`, `legacy_synapse.mdx`) aligned with actual prompts. ### Linked issues <!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword --> Resolves databrickslabs#2459 ### Functionality - [ ] modified existing command: `databricks labs lakebridge execute-database-profiler --source-tech mssql` ### Tests <!-- How is this tested? Please see the checklist below and also describe any other relevant tests --> - [X] manually tested - [ ] added unit tests - [ ] added integration tests
…ation schema (databrickslabs#2304) ## Changes This PR adds Redshift as a supported profiler assessment platform. It introduces Redshift resources for serverless, provisioned, and provisioned_multi_az ### What does this PR do? - Adds Redshift as a supported platform for the profiler assessment (alongside Synapse). - Adds Redshift resources for all three variants (serverless, provisioned, provisioned_multi_az): query/SQL files (e.g. query_view, rs_managed_storage_gb, rs_nodes, chart queries), pipeline configs, and validation schema. - Files (2 levels: folder → subfolder, 3rd level: files under each)): - resources/assessments/redshift/ → provisioned/, provisioned_multi_az/, serverless/ with their .sql and pipeline_config.yml - resources/assessments/validation/ → redshift_extract_schema.yml ### Relevant implementation details - Resources: Same query/file layout for serverless, provisioned, and provisioned_multi_az under resources/assessments/redshift/{variant}/ (e.g. 0_query_view.sql … 9_chart_*.sql, pipeline_config.yml); serverless also has 10_cost_incurred.sql. Shared validation: resources/assessments/validation/redshift_extract_schema.yml. ### Caveats/things to watch out for when reviewing: - Multi-AZ Redshift has limited system views (e.g. some STV views unavailable); provisioned_multi_az SQL reflects that where applicable. - Serverless Redshift has one additional query ### Linked issues <!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved. See https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword --> Resolves #.. ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` - [ ] ... +add your own ### Tests 1. Manually Tested all credential flows on all clusters in AWS Sandbox account aws-sandbox-field-eng (332745928618) tests/resources/assessments/pipeline_config_main_redshift.yml: pipeline config that runs that script. - [x] manually tested - [ ] added unit tests - [ ] added integration tests --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: M Abulazm <mohamed.abulazm@databricks.com>
bishwajit-db
left a comment
There was a problem hiding this comment.
Thanks for this — it's a genuinely thorough and well-documented piece of work, and I really like the fail-open design and the dual-channel verification in the solver. I've left a handful of inline notes below, mostly questions and suggestions rather than hard objections.
The two I'd most love your thoughts on before merge are the timestamptz timezone handling and the Stage-2 build-failure fallback; the rest are smaller cleanups, a naming idea, and a couple of test suggestions. Happy to talk any of these through — nothing here that we can't sort out together.
| col_type_lower = (col_type or "").strip().lower() | ||
| spark_col = F.expr(_quote_spark_identifier(col_name)) | ||
| if _is_timestamp_like(col_type_lower): | ||
| cast_col = F.trim(F.date_format(spark_col, "yyyy-MM-dd HH:mm:ss.SSSSSS")) |
There was a problem hiding this comment.
Really nice work overall. One thing I wanted to check on timestamptz: the target side renders via date_format(...), which uses the Spark session timezone, while the Redshift source side pins UTC (TO_CHAR(... AT TIME ZONE 'UTC', ...) in redshift.py). If a cluster's spark.sql.session.timeZone isn't UTC, I think the two sides could render the same instant differently, so timestamptz rows would look mismatched even when they're equal (Stage-1 would just fall open, so still correct — but slower — and the row-hash path could over-report). Could we normalize both sides to UTC explicitly, and maybe add a small test pinning the expected string on each side? Just want the byte-identical guarantee to hold regardless of cluster timezone.
| ) | ||
|
|
||
| try: | ||
| data_reconcile_output = build_mismatch_output( |
There was a problem hiding this comment.
Small thought on the fallback here: the other non-MATCH branches in this method fall back to _run_reconcile_data(...), but if build_mismatch_output raises, this path returns DataReconcileOutput(exception=...) and marks the table failed. Since build_mismatch_output is fairly involved, would it make sense to fall back to the full pipeline here too, so a hiccup in the Stage-2 builder doesn't turn an otherwise-successful recon into a failure? Totally fine if reporting the exception is intentional — just flagging it for consistency with the nice fail-open behavior everywhere else.
| return rendered.sql(dialect="databricks") | ||
|
|
||
|
|
||
| def test_databricks_timestamp_handler_emits_microsecond_precision(): |
There was a problem hiding this comment.
These do a great job pinning the format-string shape on each side. One suggestion: since the MATCH fast-path leans on both engines producing the same bytes, could we add an assertion that pins the expected output string for a sample timestamp on each dialect and checks they're equal (rather than substring-checking two different patterns)? It'd guard against a pattern typo or the timezone case above. Also noticed BOOLEAN isn't rendered here even though we added the Redshift handler — a small case for it would round things out.
| elif col_type_lower == "date": | ||
| cast_expr = f"TO_CHAR({quoted}, 'YYYY-MM-DD')" | ||
| else: | ||
| cast_expr = f"CAST({quoted} AS VARCHAR)" |
There was a problem hiding this comment.
One edge case worth a look: CAST(col AS VARCHAR) in Redshift defaults to VARCHAR(256), so values longer than 256 chars can get truncated, while the Spark target keeps the full string. For long text columns I think that could make Stage-1 see a mismatch on otherwise-equal rows (it'd fall open, so still correct, just slower). Could we cast to an explicit wider length to keep the two sides aligned? A long-string test would be a nice guard too.
| ], | ||
| # Align with Redshift's ``TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US')`` so | ||
| # the per-row SHA2 inputs are byte-identical for Redshift -> Databricks reconciles. | ||
| exp.DataType.Type.TIMESTAMP.value: [ |
There was a problem hiding this comment.
These TIMESTAMP/TIMESTAMPTZ/BOOLEAN handlers look like solid fixes! Since they live in the shared transform mapping, they'll change row hashes for every Redshift→Databricks reconcile regardless of the fingerprint flag. Would it be worth pulling them (with their regression tests) into a small standalone PR, so they can land and be validated on their own and this PR stays focused on the fingerprint feature? No strong opinion if you'd rather keep them together — just a thought on reviewability and blast radius.
| ) | ||
|
|
||
| return ColumnAlignment( | ||
| exclude_columns=sorted(exclude) if exclude else None, |
There was a problem hiding this comment.
Minor: it looks like ColumnAlignment.exclude_columns is computed here, but run_fingerprint_precheck only ever reads column_mapping (the actual select/drop filtering happens in hash_columns_ordered_for_reconcile). If it isn't wired in anywhere, maybe we can drop the field to avoid confusion? (If it's meant to be used, the all_src_cols - selected diff might also need identifier normalization.)
| ) | ||
| elapsed = time.monotonic() - start | ||
|
|
||
| # Wall-clock: serial would be ~0.4s, parallel ~0.2s. Allow up to 0.35s for |
There was a problem hiding this comment.
Just a heads-up on possible CI flakiness: the elapsed < 0.35 wall-clock check after the two 0.2s sleeps could occasionally trip on a loaded runner. The distinct-thread-id assertion is a solid signal on its own — could we relax or drop the timing bound to keep this from flaking?
| layer: str, | ||
| data_source: DataSource, | ||
| ) -> list[str]: | ||
| """Mirror HashQueryBuilder hash column set + sort order (case-insensitive sort_key).""" |
There was a problem hiding this comment.
Following the note here about mirroring HashQueryBuilder (and similarly the serializer in redshift.py/spark_target.py mirroring expression_generator): would it be worth extracting these into shared helpers that both the row-hash and fingerprint paths call? I think it'd remove the drift risk the comment mentions — and it's actually the cleanest fix for the timestamptz timezone difference above, since if both paths share one serializer they can't diverge. I know the comment defers this until the second dialect; just flagging that doing it now would collapse a couple of these notes into one change.
| systemic_mismatch: bool = False | ||
|
|
||
|
|
||
| def detect_and_solve( |
There was a problem hiding this comment.
Tiny naming thought: engine already means a sqlglot Dialect throughout reconcile (we even take source_engine in the orchestrator), and *_engine.py is used for the transpiler engines (TranspileEngine/SqlglotEngine/LSPEngine). Since this module is really the detect/solve logic, would solver.py or detection.py read a bit more clearly? Just a rename, no rush.
| connection_name: str, | ||
| is_target: bool = False, | ||
| ) -> DataSource: | ||
| if isinstance(engine, Databricks): |
There was a problem hiding this comment.
Small one: I think this reorder is effectively a no-op — RemoteQueryReader.__init__ just stores its args (no I/O), and the Databricks branch never used reader anyway. Since it isn't needed for the feature, maybe we can drop it to keep the diff tight (or fold it into the expression_generator cleanup if you split that out)? Not a blocker either way.
c980821 to
9ae2f98
Compare
## Changes ### What does this PR do? Adds an opt-in fingerprint pre-check to Recon. When `fingerprint_precheck=True` and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline. - MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN. - MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing `compare.reconcile_data` flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing pipeline. - Ineligible -> falls through silently. The flag defaults to False; existing behaviour is unchanged. The algorithm is byte-identical to the dataprint sketch-based reconciliation library; this is the first dataprint-into-lakebridge integration. Redshift is the first dialect — adding Snowflake / Oracle / TSQL is one `FingerprintQueryBuilder` subclass plus one registry entry. ### Relevant implementation details - `trigger_recon_service._run_fingerprint_or_reconcile_data` is the single decision point. Static eligibility centralised in `classify_ineligibility`; the schema-dependent `unmapped_target_column_mapping` reason is raised by `align_columns` as a typed exception and routed through `FingerprintRunMetadata.ineligible(...)`. Every reason maps to an `IneligibilityReason` enum value and is recorded on `recon_metrics.fingerprint_metrics.ineligibility_reason`. - Source-side reads use upstream's `RemoteQueryReader` / `remote_query()` TVF unmodified; Stage-1 aggregation pushdown verified empirically on a 1 M-row Redshift fixture (DBR 17.3). - Stage-1 detection is parallelised across source / target via a 2-thread pool; failure semantics match the serial version. - Three new fields on `ReconcileConfig`: `fingerprint_precheck`, `fingerprint_treat_empty_as_null`, `fingerprint_row_count_override`. - Config version bumps 2 -> 3 with a `v2_migrate` that folds two legacy spellings (`redshift_fingerprint_precheck`, `use_fingerprint_precheck`) into the new flag. Existing deployments upgrade automatically. ### Pre-existing fixes that ride along (upstream PR databrickslabs#2339) Two correctness bugs in the upstream Redshift connector MR (databrickslabs#2339) surfaced during the dataprint integration P0 / P1 runs against a real cluster. Both crash the existing row-hash recon path on real customer schemas and are unrelated to dataprint, but they sat in the integration path so they are fixed inline. Both fixes live in `reconcile/query_builder/expression_generator.py` and are pinned by regression tests in `test_expression_generator.py`. - **Databricks block missing TIMESTAMP / TIMESTAMPTZ handler.** Redshift's source-side transform emits `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits), but the Databricks block had no override, so the target side fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))` — Spark emits a variable-length fractional component, omitted entirely for zero-microsecond timestamps. The byte-width drift made per-row SHA2 disagree for every TIMESTAMP / TIMESTAMPTZ row in any Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(ts, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` so source and target are byte-identical. - **Redshift block missing BOOLEAN handler.** The Redshift block defined overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no dialect-level `default`. BOOLEAN columns fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects during output schema resolution with `function pg_catalog.btrim(boolean) does not exist`. Any customer schema containing a single BOOLEAN column crashes row-hash recon end-to-end. Fix: explicit `COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')` so the rendered string matches Spark's `cast(boolean AS string)` byte-for-byte. ### Hardening from the internal review round After the initial internal review on the contributor's fork, three substantive code changes landed before pushing upstream: - **Pin TZ-aware Spark target columns to UTC** before formatting in `fingerprint/spark_target.py`. The Redshift side already pinned UTC via `TO_CHAR(_ AT TIME ZONE 'UTC', _)`; the Spark side was using `DATE_FORMAT(ts, _)` which renders in `spark.sql.session.timeZone`. On a non-UTC cluster the same instant rendered different bytes on the two sides. Fix splits LTZ vs NTZ handling and routes LTZ through `TO_UTC_TIMESTAMP(_, CURRENT_TIMEZONE())`. NTZ behaviour unchanged. - **Stage-2 build failures fall through to the full pipeline** in `trigger_recon_service.py` instead of marking the table failed. Every other non-MATCH branch already does this; the `build_mismatch_output` exception path was the one inconsistency. Metadata still records `fallback_to_full_pipeline=True` for observability. - **Cast Redshift strings to `VARCHAR(65535)`** in `fingerprint/query_builders/redshift.py` instead of bare `VARCHAR`, whose default 256-byte width truncated long text. `VARCHAR(65535)` is Redshift's maximum and matches Spark's unbounded string semantics. Smaller cleanups: dropped unused `ColumnAlignment.exclude_columns`; reverted a no-op reorder in `connectors/source_adapter.py`; replaced a flaky wall-clock assertion in `test_fetch_parallel.py` with a deterministic distinct-thread-id assertion; pinned the exact rendered SQL on each dialect in `test_expression_generator.py` (instead of substring-checking two different patterns) and added a regression test for the Redshift `BOOLEAN` handler. ### Caveats - DBR 17.3+ required for source-side reads via `remote_query()` (inherited from upstream's `RemoteQueryReader` adoption). - MISMATCH-state cost at 1 M scale currently exceeds row-hash-only mode by 16-94 s because Stage-2 still feeds the existing JOIN. MATCH is the headline win (38.7% on 1 M rows); billion-row scale is the production motivation. Stage-1 hash persistence as Stage-2 input is filed as a follow-up. - Pre-existing `success_count` formula in `verify_successful_reconciliation` (upstream PR databrickslabs#2259, commit `e56c79c3d`) is mathematically wrong; sits next to fingerprint code in `trigger_recon_service.py`. Not fixed here to keep scope contained; filed separately. ### Tests - All unit tests on the touched surface pass; `tests/unit/reconcile/` runs 282 tests in <1 s. The 6 `test_cli_analyze.py` failures are pre-existing on main and unrelated. - Regression tests added for every review-round fix (UTC pin, fallback path, VARCHAR(65535)); `test_expression_generator.py` pins the exact rendered SQL on each dialect for the two pre-existing fixes. - Correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across the 20-scenario dual-mode parity matrix: 39/40 cells PASS, 1 scenario shows a known fingerprint-solver fallback edge with verdict agreement on both sides — only the cap-bounded `mismatch` count differs (fingerprint reports the true 10000, normal reports the cap-50 sample). - Linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green. - Integration coverage to follow alongside the recon e2e cluster fixture (databrickslabs#2453).
9ae2f98 to
667494c
Compare
|
@bishwajit-db Thanks for the review. I raised a new MR as discussed in the lakebridge repo - databrickslabs#2490 |
|
As discussed, raised MR in the lakebridge repo - databrickslabs#2490 |
Changes
What does this PR do?
Adds an opt-in fingerprint pre-check to Recon. When
fingerprint_precheck=Trueand the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.compare.reconcile_dataflow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing row-hash pipeline.The flag defaults to
False; existing behaviour is unchanged.The algorithm is byte-identical to the dataprint sketch-based reconciliation library (Databricks-internal); this MR is the first dataprint-into-lakebridge integration. Redshift is the first source dialect — adding Snowflake / Oracle / TSQL is one
FingerprintQueryBuildersubclass plus one registry entry, with no orchestrator changes.Relevant implementation details
trigger_recon_service._run_fingerprint_or_reconcile_datais the single decision point. Static eligibility (flag off, unsupported dialect, report type not data, no join columns, filters / transformations / thresholds configured) is centralised inclassify_ineligibility(fingerprint/orchestrator.py) and runs before any source-side compute. One config-time reason —unmapped_target_column_mapping— is detected later, inside the precheck (because it requires the target schema), and surfaces via the typedUnmappedTargetColumnMappingErrorraised byalign_columns; the trigger catches it before the runtime-failure branch and routes it throughFingerprintRunMetadata.ineligible(...). Every reason is one of the values on theIneligibilityReasonenum and is recorded onrecon_metrics.fingerprint_metrics.ineligibility_reason, so adoption queries can count each cause without needing to grep cluster logs.RemoteQueryReader/remote_query()TVF unmodified — no new connector code. Stage-1 aggregation pushdown was verified empirically against a 1 M-row Redshift fixture on a DBR 17.3 cluster: both direct JDBC (4.28 s median) and the upstreamremote_query()TVF path (5.44 s median) push the GROUP BY to Redshift and return identical aggregated row counts.ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-stage2"). Failure semantics match the serial version; the trigger-layer catch wraps the whole block.ReconcileConfig:fingerprint_precheck: bool = False(the main flag)fingerprint_treat_empty_as_null: bool = False(collapse''to NULL in the per-column hash payload, end-to-end on both sides)fingerprint_row_count_override: int | None = None(pin the adaptive sub-bucket / bucket tier whenDESCRIBE DETAILcannot readnumRecords)ReconcileConfig.__version__bumps from2 -> 3. The newv2_migrateintroducesfingerprint_precheckand folds two legacy spellings (redshift_fingerprint_precheck,use_fingerprint_precheck) into it. Combined with upstream'sv1_migrate, the chainv1 -> v2 -> v3runs automatically viaInstallation.load(ReconcileConfig). Existing deployments upgrade with zero user action.Pre-existing fixes that ride along (upstream PR databrickslabs#2339)
While integrating dataprint end-to-end against Redshift on a real cluster, two pre-existing correctness bugs in the upstream Redshift connector MR (PR databrickslabs#2339) surfaced. Neither is dataprint-specific — both crash the existing row-hash recon path on real customer schemas — but they sat directly in the integration path, so they are fixed inline. Both fixes live in
reconcile/query_builder/expression_generator.pyand are pinned by regression tests.TIMESTAMP/TIMESTAMPTZtransform handlers to the Databricks dialect (target side)COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')(always 6 fractional digits, fixed-width). The Databricks dialect block had no override forTIMESTAMP/TIMESTAMPTZ, so the target side fell through to the universal defaultTRIM(COALESCE(col, '_null_recon_')), where Spark's implicitcast(timestamp AS string)emits a variable-length fractional component (omitted entirely for zero-microsecond timestamps;'2023-10-02 18:08:43'vs source'2023-10-02 18:08:43.000000'). The byte-width drift made the per-row SHA2 disagree for every logically-identicalTIMESTAMP/TIMESTAMPTZrow in any Redshift -> Databricks reconcile. Fix: addCOALESCE(DATE_FORMAT(_, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')for both types so the two sides are byte-identical.BOOLEANtransform handler to the Redshift dialect (source side)SUPER/DATE/TIMESTAMP/TIMESTAMPTZand had no dialect-leveldefault. AnyBOOLEANcolumn fell through to the universal defaultTRIM(COALESCE(col, '_null_recon_')), which Redshift rejects during output schema resolution (before any rows are read) withfunction pg_catalog.btrim(boolean) does not exist. Result: any customer schema containing a singleBOOLEANcolumn crashes row-hash recon end-to-end on the Redshift connector shipped in PR databrickslabs#2339. Fix: explicitCOALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')so the rendered string matches Spark'scast(boolean AS string)byte-for-byte. The existing unit suite did not catch this because the test fixtures hand-build query strings and never exercise the dialect transform-mapping path.Caveats/things to watch out for when reviewing
remote_query()require classic clusters on DBR 17.3+ or SQL warehouses (Pro / Serverless) on version 2025.35+. This is inherited from upstream'sRemoteQueryReaderadoption, not a fingerprint-specific limitation; the runtime requirement is documented indocs/lakebridge/docs/reconcile/index.mdx.compare.reconcile_dataJOIN rather than replacing it, so on MISMATCH it costs an extra 16-94 s vs. row-hash-only mode. The MATCH path is the headline win (38.7% wall-clock improvement on 1 M rows: 55.5 s vs 90.6 s); the production motivation is billion-row scale, where the row-hash pipeline's full-table JOIN is what becomes intractable. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.expression_generatortest file modified istests/unit/test_install.py(six"version": 2 -> 3fixture bumps, a direct consequence of the migration step we own). Not scope creep.success_countformula inverify_successful_reconciliationis mathematically wrong (yields counts greater thantotal). Pre-existing in upstream PR Improve reconciliation result handling and logging databrickslabs/lakebridge#2259 (commite56c79c3d); sits right next to fingerprint code intrigger_recon_service.py. Not fixed in this MR to keep scope contained; filed as a separate issue.Linked issues
Resolves #..
Functionality
docs/lakebridge/docs/reconcile/configuration.mdxFingerprint Pre-check (Experimental) section +docs/lakebridge/docs/reconcile/index.mdxruntime requirements)databricks labs lakebridge ...ReconcileConfig.fingerprint_precheck(defaultFalse)Tests
v1 -> v2 -> v3config migration, recon-capture typed schema, fingerprint dispatch, Stage-1 / Stage-2 contract symmetry)tests/unit/reconcile/alone runs 307 reconcile tests in 1.13 s. The 6 failingtest_cli_analyze.pycases (Informatica analyzer binary) are pre-existing onmain(verified on stash-clean HEAD) and unrelated.test_expression_generator.pypins handler presence, format string, and source / target byte alignment forTIMESTAMP/TIMESTAMPTZ, plus theBOOLEANCASE WHENrendering on Redshift.reconcile/surface plusconfig.py.