Add fingerprint pre-check to Recon (Redshift) #2490

Open
ameersalman33 wants to merge 1 commit into databrickslabs:main from ameersalman33:feature/dataprint-integration
@ameersalman33

Changes

What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When fingerprint_precheck=True and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.

  • MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
  • MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing compare.reconcile_data flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing row-hash pipeline.
  • Ineligible -> falls through to the existing row-hash pipeline silently.

The flag defaults to False; existing behaviour is unchanged.

The algorithm is byte-identical to the dataprint sketch-based reconciliation library (Databricks-internal); this MR is the first dataprint-into-lakebridge integration. Redshift is the first source dialect — adding Snowflake / Oracle / TSQL is one FingerprintQueryBuilder subclass plus one registry entry, with no orchestrator changes.

Relevant implementation details

  • trigger_recon_service._run_fingerprint_or_reconcile_data is the single decision point. Static eligibility (flag off, unsupported dialect, report type not data, no join columns, filters / transformations / thresholds configured) is centralised in classify_ineligibility (fingerprint/orchestrator.py) and runs before any source-side compute. One config-time reason — unmapped_target_column_mapping — is detected later, inside the precheck (because it requires the target schema), and surfaces via the typed UnmappedTargetColumnMappingError raised by align_columns; the trigger catches it before the runtime-failure branch and routes it through FingerprintRunMetadata.ineligible(...). Every reason is one of the values on the IneligibilityReason enum and is recorded on recon_metrics.fingerprint_metrics.ineligibility_reason, so adoption queries can count each cause without needing to grep cluster logs.
  • Source-side reads use upstream's RemoteQueryReader / remote_query() TVF unmodified — no new connector code. Stage-1 aggregation pushdown was verified empirically against a 1 M-row Redshift fixture on a DBR 17.3 cluster: both direct JDBC (4.28 s median) and the upstream remote_query() TVF path (5.44 s median) push the GROUP BY to Redshift and return identical aggregated row counts.
  • Stage-1 detection is parallelised across source / target via ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-stage2"). Failure semantics match the serial version; the trigger-layer catch wraps the whole block.
  • New config fields on ReconcileConfig:
    • fingerprint_precheck: bool = False (the main flag)
    • fingerprint_treat_empty_as_null: bool = False (collapse '' to NULL in the per-column hash payload, end-to-end on both sides)
    • fingerprint_row_count_override: int | None = None (pin the adaptive sub-bucket / bucket tier when DESCRIBE DETAIL cannot read numRecords)
  • ReconcileConfig.__version__ bumps from 2 -> 3. The new v2_migrate introduces fingerprint_precheck and folds two legacy spellings (redshift_fingerprint_precheck, use_fingerprint_precheck) into it. Combined with upstream's v1_migrate, the chain v1 -> v2 -> v3 runs automatically via Installation.load(ReconcileConfig). Existing deployments upgrade with zero user action.
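
The two-worker parallel fetch described in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `run_stage1`, `fetch_source`, `fetch_target`, and the thread-name prefix are hypothetical stand-ins. It shows why failure semantics can match a serial run: `Future.result()` re-raises any worker exception in the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage1(fetch_source, fetch_target):
    """Run the two side-specific fetches concurrently and return both results.

    Both arguments are zero-argument callables (hypothetical stand-ins for
    the source-side and target-side aggregate queries).
    """
    # Two workers: one per side. The prefix only affects thread names in logs.
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-demo") as pool:
        source_future = pool.submit(fetch_source)
        target_future = pool.submit(fetch_target)
        # result() blocks until done and re-raises a worker exception here,
        # so a caller-level try/except still wraps the whole block.
        return source_future.result(), target_future.result()


if __name__ == "__main__":
    print(run_stage1(lambda: "source-aggregates", lambda: "target-aggregates"))
```

Because the exception surfaces at the `result()` call, a single catch around `run_stage1` at the calling layer covers both sides.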

Pre-existing fixes that ride along (upstream PR #2339)

While integrating dataprint end-to-end against Redshift on a real cluster, two pre-existing correctness bugs in the upstream Redshift connector MR (PR #2339) surfaced. Neither is dataprint-specific — both crash the existing row-hash recon path on real customer schemas — but they sat directly in the integration path, so they are fixed inline. Both fixes live in reconcile/query_builder/expression_generator.py and are pinned by regression tests.

| # | Fix | What broke before |
|---|-----|-------------------|
| 1 | Add TIMESTAMP / TIMESTAMPTZ transform handlers to the Databricks dialect (target side) | The Redshift dialect already emits source-side COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_') (always 6 fractional digits, fixed-width). The Databricks dialect block had no override for TIMESTAMP / TIMESTAMPTZ, so the target side fell through to the universal default TRIM(COALESCE(col, '_null_recon_')), where Spark's implicit cast(timestamp AS string) emits a variable-length fractional component (omitted entirely for zero-microsecond timestamps; '2023-10-02 18:08:43' vs source '2023-10-02 18:08:43.000000'). The byte-width drift made the per-row SHA2 disagree for every logically-identical TIMESTAMP / TIMESTAMPTZ row in any Redshift -> Databricks reconcile. Fix: add COALESCE(DATE_FORMAT(_, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_') for both types so the two sides are byte-identical. |
| 2 | Add BOOLEAN transform handler to the Redshift dialect (source side) | The Redshift block defined overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no dialect-level default. Any BOOLEAN column fell through to the universal default TRIM(COALESCE(col, '_null_recon_')), which Redshift rejects during output schema resolution (before any rows are read) with function pg_catalog.btrim(boolean) does not exist. Result: any customer schema containing a single BOOLEAN column crashes row-hash recon end-to-end on the Redshift connector shipped in PR #2339. Fix: explicit COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_') so the rendered string matches Spark's cast(boolean AS string) byte-for-byte. The existing unit suite did not catch this because the test fixtures hand-build query strings and never exercise the dialect transform-mapping path. |
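
The three-valued behaviour of the BOOLEAN handler in fix 2 can be mirrored in plain Python. This is only a sketch of what the SQL expression renders, with `render_boolean` as a hypothetical helper name; it is not code from the PR.

```python
NULL_SENTINEL = "_null_recon_"  # sentinel string quoted in the PR text


def render_boolean(value):
    """Mirror COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false'
    ELSE NULL END, '_null_recon_') for a Python bool or None."""
    if value is None:
        # SQL NULL matches neither WHEN branch, so COALESCE supplies the sentinel.
        return NULL_SENTINEL
    return "true" if value else "false"


if __name__ == "__main__":
    for v in (True, False, None):
        print(repr(v), "->", render_boolean(v))
```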

Hardening from internal review

Before pushing upstream, this PR went through one round of internal review on the contributor's fork. Three substantive code changes landed:

  • pinning timezone-aware target columns to UTC in fingerprint/spark_target.py, so a non-UTC spark.sql.session.timeZone does not produce Stage-1 false mismatches;
  • falling through to the full pipeline on a build_mismatch_output exception in trigger_recon_service.py, to match the fail-open contract every other branch already follows;
  • casting Redshift strings to VARCHAR(65535) instead of bare VARCHAR (which truncates at 256).

The remaining changes are cleanups: unused ColumnAlignment.exclude_columns removed, no-op reorder in connectors/source_adapter.py reverted, flaky wall-clock assertion in test_fetch_parallel.py replaced with a deterministic distinct-thread-id check, and test_expression_generator.py now pins the exact rendered SQL on each dialect.

Caveats/things to watch out for when reviewing

  • DBR 17.3+ requirement: source-side reads via remote_query() require classic clusters on DBR 17.3+ or SQL warehouses (Pro / Serverless) on version 2025.35+. This is inherited from upstream's RemoteQueryReader adoption, not a fingerprint-specific limitation; the runtime requirement is documented in docs/lakebridge/docs/reconcile/index.mdx.
  • MISMATCH-state cost: at 1 M scale, fingerprint pays ~20 s for Stage-1 detection unconditionally and Stage-2 currently feeds into the same compare.reconcile_data JOIN rather than replacing it, so on MISMATCH it costs an extra 16-94 s vs. row-hash-only mode. The MATCH path is the headline win (38.7% wall-clock improvement on 1 M rows: 55.5 s vs 90.6 s); the production motivation is billion-row scale, where the row-hash pipeline's full-table JOIN is what becomes intractable. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.
  • Test fixture file: the only non-fingerprint, non-expression_generator test file modified is tests/unit/test_install.py (six "version": 2 -> 3 fixture bumps, a direct consequence of the migration step we own). Not scope creep.
  • success_count formula in verify_successful_reconciliation is mathematically wrong (yields counts greater than total). Pre-existing in upstream PR #2259 ("Improve reconciliation result handling and logging", commit e56c79c3d); sits right next to fingerprint code in trigger_recon_service.py. Not fixed in this MR to keep scope contained; filed as a separate issue.
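
For reference, the 38.7% MATCH-path figure quoted in the caveats above follows from the two wall-clock numbers given there, taking the row-hash-only time as the baseline. The variable names below are illustrative only.

```python
# Wall-clock times on the 1 M-row fixture, as quoted in the PR text.
match_with_fingerprint_s = 55.5
match_row_hash_only_s = 90.6

# Relative improvement, measured against the row-hash-only baseline.
improvement = (match_row_hash_only_s - match_with_fingerprint_s) / match_row_hash_only_s

print(f"{improvement:.1%}")  # 38.7%
```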

Linked issues

Resolves #..

Functionality

  • added relevant user documentation (docs/lakebridge/docs/reconcile/configuration.mdx Fingerprint Pre-check (Experimental) section + docs/lakebridge/docs/reconcile/index.mdx runtime requirements)
  • added new CLI command
  • modified existing command: databricks labs lakebridge ...
  • new opt-in feature behind ReconcileConfig.fingerprint_precheck (default False)

Tests

  • added unit tests (+1,400 LOC across 14 new test modules covering: query builders, algebraic solver, eligibility classifier, fetch parallelisation, v1 -> v2 -> v3 config migration, recon-capture typed schema, fingerprint dispatch, Stage-1 / Stage-2 contract symmetry)
  • All 1383 unit tests pass (2 skipped, 3 xfailed); tests/unit/reconcile/ alone runs 307 reconcile tests in 1.13 s. The 6 failing test_cli_analyze.py cases (Informatica analyzer binary) are pre-existing on main (verified on stash-clean HEAD) and unrelated.
  • regression tests for the two pre-existing fixes: test_expression_generator.py pins handler presence, format string, and source / target byte alignment for TIMESTAMP / TIMESTAMPTZ, plus the BOOLEAN CASE WHEN rendering on Redshift.
  • correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across the 20-scenario dual-mode parity matrix: 39 / 40 cells PASS, 1 scenario shows a known fingerprint-solver fallback edge with verdict + missing-rows agreement on both sides — only the cap-bounded mismatch count differs (fingerprint reports the true 10000, normal reports the cap-50 sample).
  • linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green across the full reconcile/ surface plus config.py.
  • added integration tests (the recon e2e cluster gate landed in #2453, "create dedicated test cluster for recon e2e tests"; integration coverage will be added in a follow-up MR alongside the cluster fixture)

@ameersalman33 requested a review from a team as a code owner on June 2, 2026 15:11

## Changes

### What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When `fingerprint_precheck=True`
and the source has a registered query builder, Recon runs a sketch-based
detection pass (MD5-sub-bucketed aggregates over both sides) before the
row-hash compare pipeline.
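
To make the idea of a sketch-based detection pass concrete, here is a toy Python version of bucketed aggregates. It is an illustration under stated assumptions, not the dataprint algorithm: the fixed bucket count, the choice of a row count plus a modular sum of row hashes per bucket, and every function name are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 16    # assumption: a fixed count, purely for illustration
MODULUS = 2 ** 64   # assumption: keeps per-bucket sums bounded


def row_hash(row: tuple) -> int:
    """Map a row to a 64-bit integer via MD5 over a delimited payload."""
    payload = "\x1f".join("_null_" if v is None else str(v) for v in row)
    digest = hashlib.md5(payload.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(rows) -> dict:
    """Aggregate (row count, sum of row hashes mod 2**64) per bucket."""
    buckets = defaultdict(lambda: [0, 0])
    for row in rows:
        h = row_hash(row)
        bucket = buckets[h % NUM_BUCKETS]
        bucket[0] += 1
        bucket[1] = (bucket[1] + h) % MODULUS
    return {key: tuple(agg) for key, agg in buckets.items()}


def differing_buckets(source_rows, target_rows) -> set:
    """Return the bucket ids whose aggregates disagree between the two sides."""
    s, t = sketch(source_rows), sketch(target_rows)
    return {key for key in s.keys() | t.keys() if s.get(key) != t.get(key)}


if __name__ == "__main__":
    source = [(1, "a"), (2, "b"), (3, None)]
    print(differing_buckets(source, list(reversed(source))))          # empty: order-independent
    print(differing_buckets(source, [(1, "a"), (2, "B"), (3, None)]))  # non-empty
```

Only the small per-bucket aggregates need to cross the wire, which is why a matching comparison can finish without a row-level JOIN. Addition is commutative, so row order on either side does not affect the result.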

- MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
- MISMATCH -> an algebraic solver returns the differing row hashes; a
  surgical Stage-2 fetch pulls just those rows and feeds them into the
  existing `compare.reconcile_data` flow. If the mismatch is systemic
  (>15% of sub-buckets), the precheck defers to the existing pipeline.
- Ineligible -> falls through silently.
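
The three outcomes above, plus the 15% systemic cutoff, amount to a small routing function. The sketch below is hypothetical: `Route`, `Stage1Result`, and `choose_route` are illustrative names, and only the 15% threshold comes from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

SYSTEMIC_THRESHOLD = 0.15  # from the PR text: >15% of sub-buckets


class Route(Enum):
    SHORT_CIRCUIT = "short_circuit"    # MATCH: nothing more to do
    SURGICAL_FETCH = "surgical_fetch"  # localized MISMATCH: Stage-2 fetch
    FULL_PIPELINE = "full_pipeline"    # ineligible or systemic mismatch


@dataclass(frozen=True)
class Stage1Result:
    """Hypothetical summary of the detection pass."""
    total_sub_buckets: int
    mismatched_sub_buckets: int


def choose_route(eligible: bool, result: Optional[Stage1Result]) -> Route:
    if not eligible or result is None:
        return Route.FULL_PIPELINE
    if result.mismatched_sub_buckets == 0:
        return Route.SHORT_CIRCUIT
    ratio = result.mismatched_sub_buckets / result.total_sub_buckets
    # Strictly greater than the threshold counts as systemic.
    if ratio > SYSTEMIC_THRESHOLD:
        return Route.FULL_PIPELINE
    return Route.SURGICAL_FETCH


if __name__ == "__main__":
    print(choose_route(True, Stage1Result(100, 0)))
    print(choose_route(True, Stage1Result(100, 3)))
    print(choose_route(True, Stage1Result(100, 40)))
```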

The flag defaults to False; existing behaviour is unchanged. The algorithm
is byte-identical to the dataprint sketch-based reconciliation library;
this is the first dataprint-into-lakebridge integration. Redshift is the
first dialect — adding Snowflake / Oracle / TSQL is one
`FingerprintQueryBuilder` subclass plus one registry entry.
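
The "one subclass plus one registry entry" extension pattern might look roughly like the following. This is a sketch under assumptions: the PR does not show the real `FingerprintQueryBuilder` interface, so the single `cast_to_string` method, the `REGISTRY` dict, and `builder_for` are all hypothetical, and the Snowflake class is purely illustrative.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FingerprintQueryBuilder(ABC):
    """Hypothetical stand-in for the base class named in the PR."""

    @abstractmethod
    def cast_to_string(self, column: str) -> str:
        """Render a dialect-specific cast of a column to a string type."""


class RedshiftBuilder(FingerprintQueryBuilder):
    def cast_to_string(self, column: str) -> str:
        # Explicit width, since bare VARCHAR defaults to 256 on Redshift.
        return f"CAST({column} AS VARCHAR(65535))"


class SnowflakeBuilder(FingerprintQueryBuilder):
    # Illustrative only: this is what adding a second dialect could look like.
    def cast_to_string(self, column: str) -> str:
        return f"CAST({column} AS VARCHAR)"


# Adding a dialect is one new class above plus one entry here.
REGISTRY = {
    "redshift": RedshiftBuilder,
    "snowflake": SnowflakeBuilder,
}


def builder_for(dialect: str) -> Optional[FingerprintQueryBuilder]:
    """Return a builder instance, or None when the dialect is unregistered."""
    cls = REGISTRY.get(dialect.lower())
    return cls() if cls else None


if __name__ == "__main__":
    print(builder_for("redshift").cast_to_string("name"))
    print(builder_for("oracle"))  # None: would fall through as ineligible
```

Returning `None` for an unregistered dialect keeps the orchestrator unchanged: it only needs to treat "no builder" as one more ineligibility reason.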

### Relevant implementation details

- `trigger_recon_service._run_fingerprint_or_reconcile_data` is the single
  decision point. Static eligibility is centralised in `classify_ineligibility`;
  the schema-dependent `unmapped_target_column_mapping` reason is raised
  by `align_columns` as a typed exception and routed through
  `FingerprintRunMetadata.ineligible(...)`. Every reason maps to an
  `IneligibilityReason` enum value and is recorded on
  `recon_metrics.fingerprint_metrics.ineligibility_reason`.
- Source-side reads use upstream's `RemoteQueryReader` / `remote_query()`
  TVF unmodified; Stage-1 aggregation pushdown verified empirically on a
  1 M-row Redshift fixture (DBR 17.3).
- Stage-1 detection is parallelised across source / target via a 2-thread
  pool; failure semantics match the serial version.
- Three new fields on `ReconcileConfig`: `fingerprint_precheck`,
  `fingerprint_treat_empty_as_null`, `fingerprint_row_count_override`.
- Config version bumps 2 -> 3 with a `v2_migrate` that folds two legacy
  spellings (`redshift_fingerprint_precheck`, `use_fingerprint_precheck`)
  into the new flag. Existing deployments upgrade automatically.
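
The folding of legacy spellings into the new flag can be sketched as a dictionary transform. This is a toy under assumptions, not the PR's `v2_migrate`: the function name `toy_v2_migrate` is hypothetical, and OR-ing the legacy values together when the new key is absent is a guess at the resolution rule, which the description does not state.

```python
LEGACY_FLAGS = ("redshift_fingerprint_precheck", "use_fingerprint_precheck")


def toy_v2_migrate(raw: dict) -> dict:
    """Return a version-3 copy of a version-2 config dict."""
    migrated = dict(raw)  # never mutate the caller's dict
    # Remove every legacy spelling that is present, remembering its value.
    legacy_values = [migrated.pop(name) for name in LEGACY_FLAGS if name in migrated]
    # Assumption: an explicit new-style value wins; otherwise OR the legacy ones.
    if "fingerprint_precheck" not in migrated:
        migrated["fingerprint_precheck"] = any(bool(v) for v in legacy_values)
    migrated["version"] = 3
    return migrated


if __name__ == "__main__":
    print(toy_v2_migrate({"version": 2, "use_fingerprint_precheck": True}))
    print(toy_v2_migrate({"version": 2}))
```

With no legacy key present the flag lands on `False`, which matches the stated default.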

### Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

Two correctness bugs in the upstream Redshift connector MR (databrickslabs#2339)
surfaced during the dataprint integration P0 / P1 runs against a real
cluster. Both crash the existing row-hash recon path on real customer
schemas and are unrelated to dataprint, but they sat in the integration
path so they are fixed inline. Both fixes live in
`reconcile/query_builder/expression_generator.py` and are pinned by
regression tests in `test_expression_generator.py`.

- **Databricks block missing TIMESTAMP / TIMESTAMPTZ handler.** Redshift's
  source-side transform emits `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD
  HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits), but
  the Databricks block had no override, so the target side fell through
  to the universal default `TRIM(COALESCE(col, '_null_recon_'))` — Spark
  emits a variable-length fractional component, omitted entirely for
  zero-microsecond timestamps. The byte-width drift made per-row SHA2
  disagree for every TIMESTAMP / TIMESTAMPTZ row in any
  Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(ts,
  'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` so source and target
  are byte-identical.
- **Redshift block missing BOOLEAN handler.** The Redshift block defined
  overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no
  dialect-level `default`. BOOLEAN columns fell through to the universal
  default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects
  during output schema resolution with `function pg_catalog.btrim(boolean)
  does not exist`. Any customer schema containing a single BOOLEAN
  column crashes row-hash recon end-to-end. Fix: explicit `COALESCE(CASE
  WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END,
  '_null_recon_')` so the rendered string matches Spark's
  `cast(boolean AS string)` byte-for-byte.
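
The byte-width drift behind the TIMESTAMP bug is easy to reproduce in miniature. The sketch below uses Python's `str(datetime)` as an analogue for a variable-width rendering (it also drops the fraction when microseconds are zero); it is not Spark's cast, only an illustration of why two renderings of the same instant hash differently.

```python
import hashlib
from datetime import datetime

ts = datetime(2023, 10, 2, 18, 8, 43)  # zero microseconds

# Variable width: the fractional part is omitted when it is zero.
variable_width = str(ts)
# Fixed width: always six fractional digits.
fixed_width = ts.strftime("%Y-%m-%d %H:%M:%S.%f")


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(variable_width)  # 2023-10-02 18:08:43
print(fixed_width)     # 2023-10-02 18:08:43.000000
print(sha256_hex(variable_width) == sha256_hex(fixed_width))  # False
```

Any per-row hash over the rendered string inherits this disagreement, so both sides have to commit to the same fixed-width format.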

### Hardening from the internal review round

After the initial internal review on the contributor's fork, three
substantive code changes landed before pushing upstream:

- **Pin TZ-aware Spark target columns to UTC** before formatting in
  `fingerprint/spark_target.py`. The Redshift side already pinned UTC via
  `TO_CHAR(_ AT TIME ZONE 'UTC', _)`; the Spark side was using
  `DATE_FORMAT(ts, _)` which renders in `spark.sql.session.timeZone`. On
  a non-UTC cluster the same instant rendered different bytes on the two
  sides. The fix splits LTZ vs NTZ handling and routes LTZ through
  `TO_UTC_TIMESTAMP(_, CURRENT_TIMEZONE())`. NTZ behaviour is unchanged.
- **Stage-2 build failures fall through to the full pipeline** in
  `trigger_recon_service.py` instead of marking the table failed. Every
  other non-MATCH branch already does this; the `build_mismatch_output`
  exception path was the one inconsistency. Metadata still records
  `fallback_to_full_pipeline=True` for observability.
- **Cast Redshift strings to `VARCHAR(65535)`** in
  `fingerprint/query_builders/redshift.py` instead of bare `VARCHAR`,
  whose default 256-byte width truncated long text. `VARCHAR(65535)` is
  Redshift's maximum and matches Spark's unbounded string semantics.
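
The session-timezone problem from the first item above can be illustrated with the standard library alone. This is a sketch, not the Spark expression: the helper names are hypothetical, and a fixed +05:30 offset stands in for a non-UTC session time zone.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S.%f"


def render_in_session_zone(instant: datetime, session_tz: timezone) -> str:
    """Format the instant as a session in `session_tz` would display it."""
    return instant.astimezone(session_tz).strftime(FMT)


def render_pinned_utc(instant: datetime, session_tz: timezone) -> str:
    """Convert back to UTC before formatting, whatever the session zone is."""
    return instant.astimezone(session_tz).astimezone(timezone.utc).strftime(FMT)


instant = datetime(2023, 10, 2, 18, 8, 43, tzinfo=timezone.utc)
non_utc_session = timezone(timedelta(hours=5, minutes=30))

print(render_in_session_zone(instant, timezone.utc))     # 2023-10-02 18:08:43.000000
print(render_in_session_zone(instant, non_utc_session))  # 2023-10-02 23:38:43.000000
print(render_pinned_utc(instant, non_utc_session))       # 2023-10-02 18:08:43.000000
```

Without the pin, the same instant produces different bytes depending on cluster configuration, which a hash comparison reports as a mismatch.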

Smaller cleanups: dropped unused `ColumnAlignment.exclude_columns`;
reverted a no-op reorder in `connectors/source_adapter.py`; replaced a
flaky wall-clock assertion in `test_fetch_parallel.py` with a
deterministic distinct-thread-id assertion; pinned the exact rendered
SQL on each dialect in `test_expression_generator.py` (instead of
substring-checking two different patterns) and added a regression test
for the Redshift `BOOLEAN` handler.

### Caveats

- DBR 17.3+ required for source-side reads via `remote_query()`
  (inherited from upstream's `RemoteQueryReader` adoption).
- MISMATCH-state cost at 1 M scale currently exceeds row-hash-only mode
  by 16-94 s because Stage-2 still feeds the existing JOIN. MATCH is
  the headline win (38.7% on 1 M rows); billion-row scale is the
  production motivation. Stage-1 hash persistence as Stage-2 input is
  filed as a follow-up.
- Pre-existing `success_count` formula in `verify_successful_reconciliation`
  (upstream PR databrickslabs#2259, commit `e56c79c3d`) is mathematically wrong; sits
  next to fingerprint code in `trigger_recon_service.py`. Not fixed
  here to keep scope contained; filed separately.

### Tests

- All unit tests on the touched surface pass; `tests/unit/reconcile/`
  runs 282 tests in <1 s. The 6 `test_cli_analyze.py` failures are
  pre-existing on main and unrelated.
- Regression tests added for every review-round fix (UTC pin, fallback
  path, VARCHAR(65535)); `test_expression_generator.py` pins the exact
  rendered SQL on each dialect for the two pre-existing fixes.
- Correctness validated end-to-end on a 1 M-row Redshift / Delta fixture
  across the 20-scenario dual-mode parity matrix: 39/40 cells PASS, 1
  scenario shows a known fingerprint-solver fallback edge with verdict
  agreement on both sides — only the cap-bounded `mismatch` count
  differs (fingerprint reports the true 10000, normal reports the
  cap-50 sample).
- Linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green.
- Integration coverage to follow alongside the recon e2e cluster fixture
  (databrickslabs#2453).
@ameersalman33 force-pushed the feature/dataprint-integration branch from 667494c to 282740a on June 3, 2026 14:23