Add fingerprint pre-check to Recon (Redshift) by ameersalman33 · Pull Request #2490 · databrickslabs/lakebridge

ameersalman33 · 2026-06-02T15:11:19Z

Changes

What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When fingerprint_precheck=True and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.

- MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
- MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing compare.reconcile_data flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing row-hash pipeline.
- Ineligible -> falls through to the existing row-hash pipeline silently.

The flag defaults to False; existing behaviour is unchanged.
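To make the three outcomes concrete, here is a minimal sketch of a bucketed-aggregate precheck. It is not the dataprint algorithm: the bucket count, the per-bucket aggregates (row count plus a modular sum of row hashes), and the names `row_hash`, `sketch`, and `precheck` are all assumptions chosen for illustration. Only the 15% systemic threshold comes from the description above.

```python
import hashlib
from collections import defaultdict

SYSTEMIC_THRESHOLD = 0.15  # from the PR text: >15% of sub-buckets defers to row-hash
NUM_BUCKETS = 64           # assumption: the real bucket tier is adaptive


def row_hash(row: tuple) -> int:
    """MD5 of the row's rendered columns, as a 128-bit integer."""
    payload = "|".join("_null_" if v is None else str(v) for v in row)
    return int.from_bytes(hashlib.md5(payload.encode()).digest(), "big")


def sketch(rows) -> dict[int, tuple[int, int]]:
    """Per-bucket (row count, sum of row hashes mod 2**128).

    Both aggregates are order-independent, so each side can compute
    them with a plain GROUP BY and no JOIN.
    """
    agg = defaultdict(lambda: [0, 0])
    for row in rows:
        h = row_hash(row)
        bucket = agg[h % NUM_BUCKETS]
        bucket[0] += 1
        bucket[1] = (bucket[1] + h) % (1 << 128)
    return {b: (count, total) for b, (count, total) in agg.items()}


def precheck(source_rows, target_rows) -> str:
    """Classify the comparison as MATCH, MISMATCH, or a systemic fallback."""
    src, tgt = sketch(source_rows), sketch(target_rows)
    differing = [b for b in set(src) | set(tgt) if src.get(b) != tgt.get(b)]
    if not differing:
        return "MATCH"
    if len(differing) / NUM_BUCKETS > SYSTEMIC_THRESHOLD:
        return "FALLBACK_TO_ROW_HASH"
    return "MISMATCH"
```

In this toy version a single changed row touches at most two buckets, so it stays well under the threshold, while rewriting every row disturbs nearly all buckets and triggers the fallback.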

The algorithm is byte-identical to the dataprint sketch-based reconciliation library (Databricks-internal); this MR is the first dataprint-into-lakebridge integration. Redshift is the first source dialect — adding Snowflake / Oracle / TSQL is one FingerprintQueryBuilder subclass plus one registry entry, with no orchestrator changes.
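The "one subclass plus one registry entry" claim can be pictured with a small registry sketch. Only the class name `FingerprintQueryBuilder` appears in this PR; the method name `stage1_sql`, the `BUILDERS` dict, `get_builder`, and the SQL strings are hypothetical and exist only to show the shape of the extension point.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FingerprintQueryBuilder(ABC):
    """Hypothetical base class; the method below is an assumed interface."""

    @abstractmethod
    def stage1_sql(self, table: str, columns: list[str]) -> str:
        """Return dialect-specific SQL that hashes each row."""


class RedshiftFingerprintQueryBuilder(FingerprintQueryBuilder):
    def stage1_sql(self, table: str, columns: list[str]) -> str:
        # Illustrative SQL only; the real builder's output is not shown in the PR.
        cols = " || '|' || ".join(
            f"COALESCE(CAST({c} AS VARCHAR(65535)), '_null_')" for c in columns
        )
        return f"SELECT MD5({cols}) AS row_md5 FROM {table}"


class SnowflakeFingerprintQueryBuilder(FingerprintQueryBuilder):
    """Adding a dialect: this subclass..."""

    def stage1_sql(self, table: str, columns: list[str]) -> str:
        cols = " || '|' || ".join(
            f"COALESCE(CAST({c} AS VARCHAR), '_null_')" for c in columns
        )
        return f"SELECT MD5({cols}) AS row_md5 FROM {table}"


BUILDERS: dict[str, type[FingerprintQueryBuilder]] = {
    "redshift": RedshiftFingerprintQueryBuilder,
    "snowflake": SnowflakeFingerprintQueryBuilder,  # ...plus this registry entry
}


def get_builder(dialect: str) -> Optional[FingerprintQueryBuilder]:
    """Return a builder, or None so the caller can fall through as ineligible."""
    cls = BUILDERS.get(dialect.lower())
    return cls() if cls else None
```

Under this assumed design the orchestrator only ever calls `get_builder`, which is why a new dialect would need no orchestrator changes.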

Relevant implementation details

trigger_recon_service._run_fingerprint_or_reconcile_data is the single decision point. Static eligibility (flag off, unsupported dialect, report type not data, no join columns, filters / transformations / thresholds configured) is centralised in classify_ineligibility (fingerprint/orchestrator.py) and runs before any source-side compute. One config-time reason — unmapped_target_column_mapping — is detected later, inside the precheck (because it requires the target schema), and surfaces via the typed UnmappedTargetColumnMappingError raised by align_columns; the trigger catches it before the runtime-failure branch and routes it through FingerprintRunMetadata.ineligible(...). Every reason is one of the values on the IneligibilityReason enum and is recorded on recon_metrics.fingerprint_metrics.ineligibility_reason, so adoption queries can count each cause without needing to grep cluster logs.
Source-side reads use upstream's RemoteQueryReader / remote_query() TVF unmodified — no new connector code. Stage-1 aggregation pushdown was verified empirically against a 1 M-row Redshift fixture on a DBR 17.3 cluster: both direct JDBC (4.28 s median) and the upstream remote_query() TVF path (5.44 s median) push the GROUP BY to Redshift and return identical aggregated row counts.
Stage-1 detection is parallelised across source / target via ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-stage2"). Failure semantics match the serial version; the trigger-layer catch wraps the whole block.
New config fields on ReconcileConfig:
- fingerprint_precheck: bool = False (the main flag)
- fingerprint_treat_empty_as_null: bool = False (collapse '' to NULL in the per-column hash payload, end-to-end on both sides)
- fingerprint_row_count_override: int | None = None (pin the adaptive sub-bucket / bucket tier when DESCRIBE DETAIL cannot read numRecords)
ReconcileConfig.__version__ bumps from 2 -> 3. The new v2_migrate introduces fingerprint_precheck and folds two legacy spellings (redshift_fingerprint_precheck, use_fingerprint_precheck) into it. Combined with upstream's v1_migrate, the chain v1 -> v2 -> v3 runs automatically via Installation.load(ReconcileConfig). Existing deployments upgrade with zero user action.
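The migration chain can be sketched on plain dictionaries. This is an illustration under assumptions, not the code in this PR: the body of `v1_migrate` is a placeholder for upstream's step, the precedence rule (an explicit `fingerprint_precheck` wins over legacy spellings) is a guess, and the `migrate` driver stands in for whatever `Installation.load` does internally.

```python
LEGACY_FLAGS = ("redshift_fingerprint_precheck", "use_fingerprint_precheck")


def v1_migrate(raw: dict) -> dict:
    """Placeholder for upstream's v1 -> v2 step (its real body is not shown here)."""
    out = dict(raw)
    out["version"] = 2
    return out


def v2_migrate(raw: dict) -> dict:
    """v2 -> v3: introduce fingerprint_precheck and fold in legacy spellings."""
    out = dict(raw)
    # Remove every legacy key that is present and remember its truthiness.
    legacy_values = [bool(out.pop(key)) for key in LEGACY_FLAGS if key in out]
    # Assumption: an explicit new-style flag takes precedence over legacy keys.
    out.setdefault("fingerprint_precheck", any(legacy_values))
    out["version"] = 3
    return out


def migrate(raw: dict) -> dict:
    """Apply migration steps until the config reaches the latest version."""
    steps = {1: v1_migrate, 2: v2_migrate}
    while raw.get("version", 1) in steps:
        raw = steps[raw.get("version", 1)](raw)
    return raw
```

A v1 config with no fingerprint keys ends up at version 3 with the flag off, which matches the stated "existing behaviour is unchanged" default.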

Pre-existing fixes that ride along (upstream PR #2339)

While integrating dataprint end-to-end against Redshift on a real cluster, two pre-existing correctness bugs in the upstream Redshift connector MR (PR #2339) surfaced. Neither is dataprint-specific — both crash the existing row-hash recon path on real customer schemas — but they sat directly in the integration path, so they are fixed inline. Both fixes live in reconcile/query_builder/expression_generator.py and are pinned by regression tests.

| # | Fix | What broke before |
|---|-----|-------------------|
| 1 | Add `TIMESTAMP` / `TIMESTAMPTZ` transform handlers to the Databricks dialect (target side) | The Redshift dialect already emits source-side `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits, fixed-width). The Databricks dialect block had no override for `TIMESTAMP` / `TIMESTAMPTZ`, so the target side fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, where Spark's implicit `cast(timestamp AS string)` emits a variable-length fractional component (omitted entirely for zero-microsecond timestamps; `'2023-10-02 18:08:43'` vs source `'2023-10-02 18:08:43.000000'`). The byte-width drift made the per-row SHA2 disagree for every logically-identical `TIMESTAMP` / `TIMESTAMPTZ` row in any Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(_, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` for both types so the two sides are byte-identical. |
| 2 | Add `BOOLEAN` transform handler to the Redshift dialect (source side) | The Redshift block defined overrides only for `SUPER` / `DATE` / `TIMESTAMP` / `TIMESTAMPTZ` and had no dialect-level `default`. Any `BOOLEAN` column fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects during output schema resolution (before any rows are read) with `function pg_catalog.btrim(boolean) does not exist`. Result: any customer schema containing a single `BOOLEAN` column crashes row-hash recon end-to-end on the Redshift connector shipped in PR #2339. Fix: explicit `COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')` so the rendered string matches Spark's `cast(boolean AS string)` byte-for-byte. The existing unit suite did not catch this because the test fixtures hand-build query strings and never exercise the dialect transform-mapping path. |
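The byte-width drift behind fix 1 is easy to reproduce outside Spark. The sketch below is an analogy in plain Python, not the generated SQL: Python's `str(datetime)` happens to drop a zero fractional part much like the implicit cast described above, and the helper names (`render_default`, `render_fixed`, `render_bool`) are invented for this example.

```python
import hashlib
from datetime import datetime
from typing import Optional

NULL_SENTINEL = "_null_recon_"


def render_default(ts: Optional[datetime]) -> str:
    """Analogue of the implicit cast: fractional seconds vanish when zero."""
    return NULL_SENTINEL if ts is None else str(ts)


def render_fixed(ts: Optional[datetime]) -> str:
    """Analogue of the fix: always six fractional digits, fixed width."""
    return NULL_SENTINEL if ts is None else ts.strftime("%Y-%m-%d %H:%M:%S.%f")


def render_bool(value: Optional[bool]) -> str:
    """Analogue of the CASE WHEN in fix 2: lowercase 'true' / 'false' or sentinel."""
    if value is None:
        return NULL_SENTINEL
    return "true" if value else "false"


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


ts = datetime(2023, 10, 2, 18, 8, 43)      # zero microseconds
source_side = "2023-10-02 18:08:43.000000"  # fixed-width source rendering

# Same instant, different bytes, so the per-row hashes disagree.
drifted = sha256_hex(render_default(ts)) != sha256_hex(source_side)
# After fixing the width, both sides hash identically.
aligned = sha256_hex(render_fixed(ts)) == sha256_hex(source_side)
```

The point is only that a hash compare is sensitive to rendering width, so both dialects have to pin the same format string.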

Hardening from internal review

Before pushing upstream this PR went through one round of internal review on the contributor's fork. Three substantive code changes landed: pinning timezone-aware target columns to UTC in fingerprint/spark_target.py so a non-UTC spark.sql.session.timeZone does not produce Stage-1 false-mismatches; falling through to the full pipeline on a build_mismatch_output exception in trigger_recon_service.py to match the fail-open contract every other branch already follows; and casting Redshift strings to VARCHAR(65535) instead of bare VARCHAR (which truncates at 256). The remaining changes are cleanups (unused ColumnAlignment.exclude_columns removed, no-op reorder in connectors/source_adapter.py reverted, flaky wall-clock assertion in test_fetch_parallel.py replaced with a deterministic distinct-thread-id check, and test_expression_generator.py pins exact rendered SQL on each dialect).
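The UTC-pinning change can be illustrated without a cluster. This is a plain-Python analogy, not the code in fingerprint/spark_target.py: the fixed-offset zone stands in for a non-UTC `spark.sql.session.timeZone`, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S.%f"


def render_in_session_tz(instant: datetime, session_tz: timezone) -> str:
    """Render in the session zone: output depends on cluster configuration."""
    return instant.astimezone(session_tz).strftime(FMT)


def render_pinned_utc(instant: datetime, session_tz: timezone) -> str:
    """Render after converting to UTC: output ignores the session zone."""
    return instant.astimezone(timezone.utc).strftime(FMT)


instant = datetime(2023, 10, 2, 18, 8, 43, tzinfo=timezone.utc)
utc_session = timezone.utc
offset_session = timezone(timedelta(hours=5, minutes=30))  # stand-in non-UTC session

# Session-zone rendering differs between the two "clusters"...
unpinned = {
    render_in_session_tz(instant, utc_session),
    render_in_session_tz(instant, offset_session),
}
# ...while UTC-pinned rendering is identical on both.
pinned = {
    render_pinned_utc(instant, utc_session),
    render_pinned_utc(instant, offset_session),
}
```

With pinning, the rendered bytes no longer depend on where the job runs, which is what removes the Stage-1 false mismatches described above.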

Caveats/things to watch out for when reviewing

DBR 17.3+ requirement: source-side reads via remote_query() require classic clusters on DBR 17.3+ or SQL warehouses (Pro / Serverless) on version 2025.35+. This is inherited from upstream's RemoteQueryReader adoption, not a fingerprint-specific limitation; the runtime requirement is documented in docs/lakebridge/docs/reconcile/index.mdx.
MISMATCH-state cost: at 1 M scale, fingerprint pays ~20 s for Stage-1 detection unconditionally and Stage-2 currently feeds into the same compare.reconcile_data JOIN rather than replacing it, so on MISMATCH it costs an extra 16-94 s vs. row-hash-only mode. The MATCH path is the headline win (38.7% wall-clock improvement on 1 M rows: 55.5 s vs 90.6 s); the production motivation is billion-row scale, where the row-hash pipeline's full-table JOIN is what becomes intractable. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.
Test fixture file: the only non-fingerprint, non-expression_generator test file modified is tests/unit/test_install.py (six "version": 2 -> 3 fixture bumps, a direct consequence of the migration step we own). Not scope creep.
The success_count formula in verify_successful_reconciliation is mathematically wrong (it yields counts greater than the total). It is pre-existing from upstream PR #2259, "Improve reconciliation result handling and logging" (commit e56c79c3d), and sits right next to the fingerprint code in trigger_recon_service.py. Not fixed in this MR to keep scope contained; filed as a separate issue.

Linked issues

Resolves #..

Functionality

added relevant user documentation (docs/lakebridge/docs/reconcile/configuration.mdx Fingerprint Pre-check (Experimental) section + docs/lakebridge/docs/reconcile/index.mdx runtime requirements)
added new CLI command
modified existing command: databricks labs lakebridge ...
new opt-in feature behind ReconcileConfig.fingerprint_precheck (default False)

Tests

added unit tests (+1,400 LOC across 14 new test modules covering: query builders, algebraic solver, eligibility classifier, fetch parallelisation, v1 -> v2 -> v3 config migration, recon-capture typed schema, fingerprint dispatch, Stage-1 / Stage-2 contract symmetry)
All 1383 unit tests pass (2 skipped, 3 xfailed); tests/unit/reconcile/ alone runs 307 reconcile tests in 1.13 s. The 6 failing test_cli_analyze.py cases (Informatica analyzer binary) are pre-existing on main (verified on stash-clean HEAD) and unrelated.
regression tests for the two pre-existing fixes: test_expression_generator.py pins handler presence, format string, and source / target byte alignment for TIMESTAMP / TIMESTAMPTZ, plus the BOOLEAN CASE WHEN rendering on Redshift.
correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across the 20-scenario dual-mode parity matrix: 39 / 40 cells PASS, 1 scenario shows a known fingerprint-solver fallback edge with verdict + missing-rows agreement on both sides — only the cap-bounded mismatch count differs (fingerprint reports the true 10000, normal reports the cap-50 sample).
linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green across the full reconcile/ surface plus config.py.
added integration tests (the recon e2e cluster gate landed in #2453, "create dedicated test cluster for recon e2e tests"; integration coverage will be added in a follow-up MR alongside the cluster fixture)

CLAassistant · 2026-06-02T15:11:41Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

## Changes

### What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When `fingerprint_precheck=True` and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.

- MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
- MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing `compare.reconcile_data` flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing pipeline.
- Ineligible -> falls through silently.

The flag defaults to False; existing behaviour is unchanged.

The algorithm is byte-identical to the dataprint sketch-based reconciliation library; this is the first dataprint-into-lakebridge integration. Redshift is the first dialect; adding Snowflake / Oracle / TSQL is one `FingerprintQueryBuilder` subclass plus one registry entry.

### Relevant implementation details

- `trigger_recon_service._run_fingerprint_or_reconcile_data` is the single decision point. Static eligibility is centralised in `classify_ineligibility`; the schema-dependent `unmapped_target_column_mapping` reason is raised by `align_columns` as a typed exception and routed through `FingerprintRunMetadata.ineligible(...)`. Every reason maps to an `IneligibilityReason` enum value and is recorded on `recon_metrics.fingerprint_metrics.ineligibility_reason`.
- Source-side reads use upstream's `RemoteQueryReader` / `remote_query()` TVF unmodified; Stage-1 aggregation pushdown verified empirically on a 1 M-row Redshift fixture (DBR 17.3).
- Stage-1 detection is parallelised across source / target via a 2-thread pool; failure semantics match the serial version.
- Three new fields on `ReconcileConfig`: `fingerprint_precheck`, `fingerprint_treat_empty_as_null`, `fingerprint_row_count_override`.
- Config version bumps 2 -> 3 with a `v2_migrate` that folds two legacy spellings (`redshift_fingerprint_precheck`, `use_fingerprint_precheck`) into the new flag. Existing deployments upgrade automatically.

### Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

Two correctness bugs in the upstream Redshift connector MR (databrickslabs#2339) surfaced during the dataprint integration P0 / P1 runs against a real cluster. Both crash the existing row-hash recon path on real customer schemas and are unrelated to dataprint, but they sat in the integration path so they are fixed inline. Both fixes live in `reconcile/query_builder/expression_generator.py` and are pinned by regression tests in `test_expression_generator.py`.

- **Databricks block missing TIMESTAMP / TIMESTAMPTZ handler.** Redshift's source-side transform emits `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits), but the Databricks block had no override, so the target side fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, where Spark emits a variable-length fractional component, omitted entirely for zero-microsecond timestamps. The byte-width drift made per-row SHA2 disagree for every TIMESTAMP / TIMESTAMPTZ row in any Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(ts, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` so source and target are byte-identical.
- **Redshift block missing BOOLEAN handler.** The Redshift block defined overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no dialect-level `default`. BOOLEAN columns fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects during output schema resolution with `function pg_catalog.btrim(boolean) does not exist`. Any customer schema containing a single BOOLEAN column crashes row-hash recon end-to-end. Fix: explicit `COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')` so the rendered string matches Spark's `cast(boolean AS string)` byte-for-byte.

### Hardening from the internal review round

After the initial internal review on the contributor's fork, three substantive code changes landed before pushing upstream:

- **Pin TZ-aware Spark target columns to UTC** before formatting in `fingerprint/spark_target.py`. The Redshift side already pinned UTC via `TO_CHAR(_ AT TIME ZONE 'UTC', _)`; the Spark side was using `DATE_FORMAT(ts, _)`, which renders in `spark.sql.session.timeZone`. On a non-UTC cluster the same instant rendered different bytes on the two sides. The fix splits LTZ vs NTZ handling and routes LTZ through `TO_UTC_TIMESTAMP(_, CURRENT_TIMEZONE())`. NTZ behaviour unchanged.
- **Stage-2 build failures fall through to the full pipeline** in `trigger_recon_service.py` instead of marking the table failed. Every other non-MATCH branch already does this; the `build_mismatch_output` exception path was the one inconsistency. Metadata still records `fallback_to_full_pipeline=True` for observability.
- **Cast Redshift strings to `VARCHAR(65535)`** in `fingerprint/query_builders/redshift.py` instead of bare `VARCHAR`, whose default 256-byte width truncated long text. `VARCHAR(65535)` is Redshift's maximum and matches Spark's unbounded string semantics.

Smaller cleanups: dropped unused `ColumnAlignment.exclude_columns`; reverted a no-op reorder in `connectors/source_adapter.py`; replaced a flaky wall-clock assertion in `test_fetch_parallel.py` with a deterministic distinct-thread-id assertion; pinned the exact rendered SQL on each dialect in `test_expression_generator.py` (instead of substring-checking two different patterns) and added a regression test for the Redshift `BOOLEAN` handler.

### Caveats

- DBR 17.3+ required for source-side reads via `remote_query()` (inherited from upstream's `RemoteQueryReader` adoption).
- MISMATCH-state cost at 1 M scale currently exceeds row-hash-only mode by 16-94 s because Stage-2 still feeds the existing JOIN. MATCH is the headline win (38.7% on 1 M rows); billion-row scale is the production motivation. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.
- Pre-existing `success_count` formula in `verify_successful_reconciliation` (upstream PR databrickslabs#2259, commit `e56c79c3d`) is mathematically wrong; sits next to fingerprint code in `trigger_recon_service.py`. Not fixed here to keep scope contained; filed separately.

### Tests

- All unit tests on the touched surface pass; `tests/unit/reconcile/` runs 282 tests in <1 s. The 6 `test_cli_analyze.py` failures are pre-existing on main and unrelated.
- Regression tests added for every review-round fix (UTC pin, fallback path, VARCHAR(65535)); `test_expression_generator.py` pins the exact rendered SQL on each dialect for the two pre-existing fixes.
- Correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across the 20-scenario dual-mode parity matrix: 39/40 cells PASS, 1 scenario shows a known fingerprint-solver fallback edge with verdict agreement on both sides; only the cap-bounded `mismatch` count differs (fingerprint reports the true 10000, normal reports the cap-50 sample).
- Linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green.
- Integration coverage to follow alongside the recon e2e cluster fixture (databrickslabs#2453).

ameersalman33 requested a review from a team as a code owner June 2, 2026 15:11

ameersalman33 mentioned this pull request Jun 2, 2026

Add fingerprint pre-check to Recon (Redshift) ameersalman33/lakebridge#1

Closed

10 tasks

ameersalman33 force-pushed the feature/dataprint-integration branch from 667494c to 282740a Compare June 3, 2026 14:23

sundarshankar89 added the do-not-merge label Jun 8, 2026

ameersalman33 wants to merge 1 commit into databrickslabs:main from ameersalman33:feature/dataprint-integration


4 participants
