Add fingerprint pre-check to Recon (Redshift) by ameersalman33 · Pull Request #1 · ameersalman33/lakebridge

ameersalman33 · 2026-05-25T14:13:45Z

Changes

What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When fingerprint_precheck=True and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.

MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing compare.reconcile_data flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing row-hash pipeline.
Ineligible -> falls through to the existing row-hash pipeline silently.

The flag defaults to False; existing behaviour is unchanged.

The algorithm is byte-identical to the dataprint sketch-based reconciliation library (Databricks-internal); this MR is the first dataprint-into-lakebridge integration. Redshift is the first source dialect — adding Snowflake / Oracle / TSQL is one FingerprintQueryBuilder subclass plus one registry entry, with no orchestrator changes.

Relevant implementation details

trigger_recon_service._run_fingerprint_or_reconcile_data is the single decision point. Static eligibility (flag off, unsupported dialect, report type not data, no join columns, filters / transformations / thresholds configured) is centralised in classify_ineligibility (fingerprint/orchestrator.py) and runs before any source-side compute. One config-time reason — unmapped_target_column_mapping — is detected later, inside the precheck (because it requires the target schema), and surfaces via the typed UnmappedTargetColumnMappingError raised by align_columns; the trigger catches it before the runtime-failure branch and routes it through FingerprintRunMetadata.ineligible(...). Every reason is one of the values on the IneligibilityReason enum and is recorded on recon_metrics.fingerprint_metrics.ineligibility_reason, so adoption queries can count each cause without needing to grep cluster logs.
Source-side reads use upstream's RemoteQueryReader / remote_query() TVF unmodified — no new connector code. Stage-1 aggregation pushdown was verified empirically against a 1 M-row Redshift fixture on a DBR 17.3 cluster: both direct JDBC (4.28 s median) and the upstream remote_query() TVF path (5.44 s median) push the GROUP BY to Redshift and return identical aggregated row counts.
Stage-1 detection is parallelised across source / target via ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-stage2"). Failure semantics match the serial version; the trigger-layer catch wraps the whole block.
New config fields on ReconcileConfig:
- fingerprint_precheck: bool = False (the main flag)
- fingerprint_treat_empty_as_null: bool = False (collapse '' to NULL in the per-column hash payload, end-to-end on both sides)
- fingerprint_row_count_override: int | None = None (pin the adaptive sub-bucket / bucket tier when DESCRIBE DETAIL cannot read numRecords)
ReconcileConfig.__version__ bumps from 2 -> 3. The new v2_migrate introduces fingerprint_precheck and folds two legacy spellings (redshift_fingerprint_precheck, use_fingerprint_precheck) into it. Combined with upstream's v1_migrate, the chain v1 -> v2 -> v3 runs automatically via Installation.load(ReconcileConfig). Existing deployments upgrade with zero user action.

Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

While integrating dataprint end-to-end against Redshift on a real cluster, two pre-existing correctness bugs in the upstream Redshift connector MR (PR databrickslabs#2339) surfaced. Neither is dataprint-specific — both crash the existing row-hash recon path on real customer schemas — but they sat directly in the integration path, so they are fixed inline. Both fixes live in reconcile/query_builder/expression_generator.py and are pinned by regression tests.

#	Fix	What broke before
1	Add `TIMESTAMP` / `TIMESTAMPTZ` transform handlers to the Databricks dialect (target side)	The Redshift dialect already emits source-side `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits, fixed-width). The Databricks dialect block had no override for `TIMESTAMP` / `TIMESTAMPTZ`, so the target side fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, where Spark's implicit `cast(timestamp AS string)` emits a variable-length fractional component (omitted entirely for zero-microsecond timestamps; `'2023-10-02 18:08:43'` vs source `'2023-10-02 18:08:43.000000'`). The byte-width drift made the per-row SHA2 disagree for every logically-identical `TIMESTAMP` / `TIMESTAMPTZ` row in any Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(_, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` for both types so the two sides are byte-identical.
2	Add `BOOLEAN` transform handler to the Redshift dialect (source side)	The Redshift block defined overrides only for `SUPER` / `DATE` / `TIMESTAMP` / `TIMESTAMPTZ` and had no dialect-level `default`. Any `BOOLEAN` column fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects during output schema resolution (before any rows are read) with `function pg_catalog.btrim(boolean) does not exist`. Result: any customer schema containing a single `BOOLEAN` column crashes row-hash recon end-to-end on the Redshift connector shipped in PR databrickslabs#2339. Fix: explicit `COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')` so the rendered string matches Spark's `cast(boolean AS string)` byte-for-byte. The existing unit suite did not catch this because the test fixtures hand-build query strings and never exercise the dialect transform-mapping path.

Caveats/things to watch out for when reviewing

DBR 17.3+ requirement: source-side reads via remote_query() require classic clusters on DBR 17.3+ or SQL warehouses (Pro / Serverless) on version 2025.35+. This is inherited from upstream's RemoteQueryReader adoption, not a fingerprint-specific limitation; the runtime requirement is documented in docs/lakebridge/docs/reconcile/index.mdx.
MISMATCH-state cost: at 1 M scale, fingerprint pays ~20 s for Stage-1 detection unconditionally and Stage-2 currently feeds into the same compare.reconcile_data JOIN rather than replacing it, so on MISMATCH it costs an extra 16-94 s vs. row-hash-only mode. The MATCH path is the headline win (38.7% wall-clock improvement on 1 M rows: 55.5 s vs 90.6 s); the production motivation is billion-row scale, where the row-hash pipeline's full-table JOIN is what becomes intractable. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.
Test fixture file: the only non-fingerprint, non-expression_generator test file modified is tests/unit/test_install.py (six "version": 2 -> 3 fixture bumps, a direct consequence of the migration step we own). Not scope creep.
success_count formula in verify_successful_reconciliation is mathematically wrong (yields counts greater than total). Pre-existing in upstream PR Improve reconciliation result handling and logging databrickslabs/lakebridge#2259 (commit e56c79c3d); sits right next to fingerprint code in trigger_recon_service.py. Not fixed in this MR to keep scope contained; filed as a separate issue.

Linked issues

Resolves #..

Functionality

added relevant user documentation (docs/lakebridge/docs/reconcile/configuration.mdx Fingerprint Pre-check (Experimental) section + docs/lakebridge/docs/reconcile/index.mdx runtime requirements)
added new CLI command
modified existing command: databricks labs lakebridge ...
new opt-in feature behind ReconcileConfig.fingerprint_precheck (default False)

Tests

added unit tests (+1,400 LOC across 14 new test modules covering: query builders, algebraic solver, eligibility classifier, fetch parallelisation, v1 -> v2 -> v3 config migration, recon-capture typed schema, fingerprint dispatch, Stage-1 / Stage-2 contract symmetry)
All 1383 unit tests pass (2 skipped, 3 xfailed); tests/unit/reconcile/ alone runs 307 reconcile tests in 1.13 s. The 6 failing test_cli_analyze.py cases (Informatica analyzer binary) are pre-existing on main (verified on stash-clean HEAD) and unrelated.
regression tests for the two pre-existing fixes: test_expression_generator.py pins handler presence, format string, and source / target byte alignment for TIMESTAMP / TIMESTAMPTZ, plus the BOOLEAN CASE WHEN rendering on Redshift.
correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across a curated 24-scenario catalog: 47 / 48 Track-1 cells pass; 20 / 20 dual-mode parity OK.
linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green across the full reconcile/ surface plus config.py.
added integration tests (the recon e2e cluster gate landed in create dedicated test cluster for recon e2e tests databrickslabs/lakebridge#2453; integration coverage will be added in a follow-up MR alongside the cluster fixture)

…ickslabs#2465) ## Changes ### What does this PR do? * Standardizes the reconcile sampling query across all dialects * Azure Synapse Dedicated SQL Pool does not accept VALUES as a derived table, so the previous T-SQL path failed at runtime on Synapse connections. ### Implementation Details replaces the previous CTE form (non-T-SQL) and `VALUES` derived-table form (T-SQL) with a single derived-table-join shape that works everywhere. ```sql -- Before (T-SQL path): INNER JOIN (VALUES (...), (...)) AS recon([col1], [col2]) ON ... -- Before (other dialects): WITH recon AS (SELECT ... UNION SELECT ...), src AS (...) SELECT ... FROM src INNER JOIN recon ON ... -- After (all dialects): SELECT ... FROM (...) AS src INNER JOIN (SELECT ... UNION SELECT ...) AS recon ON ... ``` ### Linked issues Resolves databrickslabs#2418 ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` ### Tests - [x] manually tested - [x] added unit tests - [x] added integration tests

…QL scripts (databrickslabs#2482) ## Changes  ### What does this PR do? * Converts the MSSQL profiler from `type: python` venv steps to `type: sql` / `type: ddl` in-process steps ### Key changes - Replace `activity_extract.py`, `info_extract.py`, and `mssql/common/*`. - Add and modify 13 query/DDL SQL pairs to profile. - Use exisitng connector `MSSQLConnector` instead of duplicate in `get_sqlserver_reader`. - DDL types match documented source types (`SMALLINT`, `INTEGER`, `DOUBLE`, `TIMESTAMP`). - `sys_info.sql` enumerates 37 columns explicitly (no `SELECT *`). - `database` configurator prompt no longer defaults to `master`; user supplies the actual target DB. Docs (`mssql.mdx`, `legacy_synapse.mdx`) aligned with actual prompts. ### Linked issues  Resolves databrickslabs#2459 ### Functionality - [ ] modified existing command: `databricks labs lakebridge execute-database-profiler --source-tech mssql` ### Tests  - [X] manually tested - [ ] added unit tests - [ ] added integration tests

…ation schema (databrickslabs#2304) ## Changes This PR adds Redshift as a supported profiler assessment platform. It introduces Redshift resources for serverless, provisioned, and provisioned_multi_az ### What does this PR do? - Adds Redshift as a supported platform for the profiler assessment (alongside Synapse). - Adds Redshift resources for all three variants (serverless, provisioned, provisioned_multi_az): query/SQL files (e.g. query_view, rs_managed_storage_gb, rs_nodes, chart queries), pipeline configs, and validation schema. - Files (2 levels: folder → subfolder, 3rd level: files under each)): - resources/assessments/redshift/ → provisioned/, provisioned_multi_az/, serverless/ with their .sql and pipeline_config.yml - resources/assessments/validation/ → redshift_extract_schema.yml ### Relevant implementation details - Resources: Same query/file layout for serverless, provisioned, and provisioned_multi_az under resources/assessments/redshift/{variant}/ (e.g. 0_query_view.sql … 9_chart_*.sql, pipeline_config.yml); serverless also has 10_cost_incurred.sql. Shared validation: resources/assessments/validation/redshift_extract_schema.yml. ### Caveats/things to watch out for when reviewing: - Multi-AZ Redshift has limited system views (e.g. some STV views unavailable); provisioned_multi_az SQL reflects that where applicable. - Serverless Redshift has one additional query ### Linked issues  Resolves #.. ### Functionality - [ ] added relevant user documentation - [ ] added new CLI command - [ ] modified existing command: `databricks labs lakebridge ...` - [ ] ... +add your own ### Tests 1. Manually Tested all credential flows on all clusters in AWS Sandbox account aws-sandbox-field-eng (332745928618) tests/resources/assessments/pipeline_config_main_redshift.yml: pipeline config that runs that script. - [x] manually tested - [ ] added unit tests - [ ] added integration tests --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: M Abulazm <mohamed.abulazm@databricks.com>

bishwajit-db

Thanks for this — it's a genuinely thorough and well-documented piece of work, and I really like the fail-open design and the dual-channel verification in the solver. I've left a handful of inline notes below, mostly questions and suggestions rather than hard objections.

The two I'd most love your thoughts on before merge are the timestamptz timezone handling and the Stage-2 build-failure fallback; the rest are smaller cleanups, a naming idea, and a couple of test suggestions. Happy to talk any of these through — nothing here that we can't sort out together.

bishwajit-db · 2026-06-02T12:33:58Z

+    col_type_lower = (col_type or "").strip().lower()
+    spark_col = F.expr(_quote_spark_identifier(col_name))
+    if _is_timestamp_like(col_type_lower):
+        cast_col = F.trim(F.date_format(spark_col, "yyyy-MM-dd HH:mm:ss.SSSSSS"))


Really nice work overall. One thing I wanted to check on timestamptz: the target side renders via date_format(...), which uses the Spark session timezone, while the Redshift source side pins UTC (TO_CHAR(... AT TIME ZONE 'UTC', ...) in redshift.py). If a cluster's spark.sql.session.timeZone isn't UTC, I think the two sides could render the same instant differently, so timestamptz rows would look mismatched even when they're equal (Stage-1 would just fall open, so still correct — but slower — and the row-hash path could over-report). Could we normalize both sides to UTC explicitly, and maybe add a small test pinning the expected string on each side? Just want the byte-identical guarantee to hold regardless of cluster timezone.

bishwajit-db · 2026-06-02T12:33:58Z

+            )
+
+        try:
+            data_reconcile_output = build_mismatch_output(


Small thought on the fallback here: the other non-MATCH branches in this method fall back to _run_reconcile_data(...), but if build_mismatch_output raises, this path returns DataReconcileOutput(exception=...) and marks the table failed. Since build_mismatch_output is fairly involved, would it make sense to fall back to the full pipeline here too, so a hiccup in the Stage-2 builder doesn't turn an otherwise-successful recon into a failure? Totally fine if reporting the exception is intentional — just flagging it for consistency with the nice fail-open behavior everywhere else.

bishwajit-db · 2026-06-02T12:33:58Z

+    return rendered.sql(dialect="databricks")
+
+
+def test_databricks_timestamp_handler_emits_microsecond_precision():


These do a great job pinning the format-string shape on each side. One suggestion: since the MATCH fast-path leans on both engines producing the same bytes, could we add an assertion that pins the expected output string for a sample timestamp on each dialect and checks they're equal (rather than substring-checking two different patterns)? It'd guard against a pattern typo or the timezone case above. Also noticed BOOLEAN isn't rendered here even though we added the Redshift handler — a small case for it would round things out.

bishwajit-db · 2026-06-02T12:33:58Z

+        elif col_type_lower == "date":
+            cast_expr = f"TO_CHAR({quoted}, 'YYYY-MM-DD')"
+        else:
+            cast_expr = f"CAST({quoted} AS VARCHAR)"


One edge case worth a look: CAST(col AS VARCHAR) in Redshift defaults to VARCHAR(256), so values longer than 256 chars can get truncated, while the Spark target keeps the full string. For long text columns I think that could make Stage-1 see a mismatch on otherwise-equal rows (it'd fall open, so still correct, just slower). Could we cast to an explicit wider length to keep the two sides aligned? A long-string test would be a nice guard too.

bishwajit-db · 2026-06-02T12:33:58Z

        ],
+        # Align with Redshift's ``TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US')`` so
+        # the per-row SHA2 inputs are byte-identical for Redshift -> Databricks reconciles.
+        exp.DataType.Type.TIMESTAMP.value: [


These TIMESTAMP/TIMESTAMPTZ/BOOLEAN handlers look like solid fixes! Since they live in the shared transform mapping, they'll change row hashes for every Redshift→Databricks reconcile regardless of the fingerprint flag. Would it be worth pulling them (with their regression tests) into a small standalone PR, so they can land and be validated on their own and this PR stays focused on the fingerprint feature? No strong opinion if you'd rather keep them together — just a thought on reviewability and blast radius.

bishwajit-db · 2026-06-02T12:33:58Z

+                )
+
+    return ColumnAlignment(
+        exclude_columns=sorted(exclude) if exclude else None,


Minor: it looks like ColumnAlignment.exclude_columns is computed here, but run_fingerprint_precheck only ever reads column_mapping (the actual select/drop filtering happens in hash_columns_ordered_for_reconcile). If it isn't wired in anywhere, maybe we can drop the field to avoid confusion? (If it's meant to be used, the all_src_cols - selected diff might also need identifier normalization.)

bishwajit-db · 2026-06-02T12:33:58Z

+        )
+        elapsed = time.monotonic() - start
+
+    # Wall-clock: serial would be ~0.4s, parallel ~0.2s. Allow up to 0.35s for


Just a heads-up on possible CI flakiness: the elapsed < 0.35 wall-clock check after the two 0.2s sleeps could occasionally trip on a loaded runner. The distinct-thread-id assertion is a solid signal on its own — could we relax or drop the timing bound to keep this from flaking?

bishwajit-db · 2026-06-02T12:33:58Z

+    layer: str,
+    data_source: DataSource,
+) -> list[str]:
+    """Mirror HashQueryBuilder hash column set + sort order (case-insensitive sort_key)."""


Following the note here about mirroring HashQueryBuilder (and similarly the serializer in redshift.py/spark_target.py mirroring expression_generator): would it be worth extracting these into shared helpers that both the row-hash and fingerprint paths call? I think it'd remove the drift risk the comment mentions — and it's actually the cleanest fix for the timestamptz timezone difference above, since if both paths share one serializer they can't diverge. I know the comment defers this until the second dialect; just flagging that doing it now would collapse a couple of these notes into one change.

bishwajit-db · 2026-06-02T12:33:58Z

+    systemic_mismatch: bool = False
+
+
+def detect_and_solve(


Tiny naming thought: engine already means a sqlglot Dialect throughout reconcile (we even take source_engine in the orchestrator), and *_engine.py is used for the transpiler engines (TranspileEngine/SqlglotEngine/LSPEngine). Since this module is really the detect/solve logic, would solver.py or detection.py read a bit more clearly? Just a rename, no rush.

bishwajit-db · 2026-06-02T12:33:58Z

    connection_name: str,
    is_target: bool = False,
 ) -> DataSource:
+    if isinstance(engine, Databricks):


Small one: I think this reorder is effectively a no-op — RemoteQueryReader.__init__ just stores its args (no I/O), and the Databricks branch never used reader anyway. Since it isn't needed for the feature, maybe we can drop it to keep the diff tight (or fold it into the expression_generator cleanup if you split that out)? Not a blocker either way.

## Changes ### What does this PR do? Adds an opt-in fingerprint pre-check to Recon. When `fingerprint_precheck=True` and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline. - MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN. - MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing `compare.reconcile_data` flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing pipeline. - Ineligible -> falls through silently. The flag defaults to False; existing behaviour is unchanged. The algorithm is byte-identical to the dataprint sketch-based reconciliation library; this is the first dataprint-into-lakebridge integration. Redshift is the first dialect — adding Snowflake / Oracle / TSQL is one `FingerprintQueryBuilder` subclass plus one registry entry. ### Relevant implementation details - `trigger_recon_service._run_fingerprint_or_reconcile_data` is the single decision point. Static eligibility centralised in `classify_ineligibility`; the schema-dependent `unmapped_target_column_mapping` reason is raised by `align_columns` as a typed exception and routed through `FingerprintRunMetadata.ineligible(...)`. Every reason maps to an `IneligibilityReason` enum value and is recorded on `recon_metrics.fingerprint_metrics.ineligibility_reason`. - Source-side reads use upstream's `RemoteQueryReader` / `remote_query()` TVF unmodified; Stage-1 aggregation pushdown verified empirically on a 1 M-row Redshift fixture (DBR 17.3). - Stage-1 detection is parallelised across source / target via a 2-thread pool; failure semantics match the serial version. - Three new fields on `ReconcileConfig`: `fingerprint_precheck`, `fingerprint_treat_empty_as_null`, `fingerprint_row_count_override`. - Config version bumps 2 -> 3 with a `v2_migrate` that folds two legacy spellings (`redshift_fingerprint_precheck`, `use_fingerprint_precheck`) into the new flag. Existing deployments upgrade automatically. ### Pre-existing fixes that ride along (upstream PR databrickslabs#2339) Two correctness bugs in the upstream Redshift connector MR (databrickslabs#2339) surfaced during the dataprint integration P0 / P1 runs against a real cluster. Both crash the existing row-hash recon path on real customer schemas and are unrelated to dataprint, but they sat in the integration path so they are fixed inline. Both fixes live in `reconcile/query_builder/expression_generator.py` and are pinned by regression tests in `test_expression_generator.py`. - **Databricks block missing TIMESTAMP / TIMESTAMPTZ handler.** Redshift's source-side transform emits `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits), but the Databricks block had no override, so the target side fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))` — Spark emits a variable-length fractional component, omitted entirely for zero-microsecond timestamps. The byte-width drift made per-row SHA2 disagree for every TIMESTAMP / TIMESTAMPTZ row in any Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(ts, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` so source and target are byte-identical. - **Redshift block missing BOOLEAN handler.** The Redshift block defined overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no dialect-level `default`. BOOLEAN columns fell through to the universal default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects during output schema resolution with `function pg_catalog.btrim(boolean) does not exist`. Any customer schema containing a single BOOLEAN column crashes row-hash recon end-to-end. Fix: explicit `COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_')` so the rendered string matches Spark's `cast(boolean AS string)` byte-for-byte. ### Hardening from the internal review round After the initial internal review on the contributor's fork, three substantive code changes landed before pushing upstream: - **Pin TZ-aware Spark target columns to UTC** before formatting in `fingerprint/spark_target.py`. The Redshift side already pinned UTC via `TO_CHAR(_ AT TIME ZONE 'UTC', _)`; the Spark side was using `DATE_FORMAT(ts, _)` which renders in `spark.sql.session.timeZone`. On a non-UTC cluster the same instant rendered different bytes on the two sides. Fix splits LTZ vs NTZ handling and routes LTZ through `TO_UTC_TIMESTAMP(_, CURRENT_TIMEZONE())`. NTZ behaviour unchanged. - **Stage-2 build failures fall through to the full pipeline** in `trigger_recon_service.py` instead of marking the table failed. Every other non-MATCH branch already does this; the `build_mismatch_output` exception path was the one inconsistency. Metadata still records `fallback_to_full_pipeline=True` for observability. - **Cast Redshift strings to `VARCHAR(65535)`** in `fingerprint/query_builders/redshift.py` instead of bare `VARCHAR`, whose default 256-byte width truncated long text. `VARCHAR(65535)` is Redshift's maximum and matches Spark's unbounded string semantics. Smaller cleanups: dropped unused `ColumnAlignment.exclude_columns`; reverted a no-op reorder in `connectors/source_adapter.py`; replaced a flaky wall-clock assertion in `test_fetch_parallel.py` with a deterministic distinct-thread-id assertion; pinned the exact rendered SQL on each dialect in `test_expression_generator.py` (instead of substring-checking two different patterns) and added a regression test for the Redshift `BOOLEAN` handler. ### Caveats - DBR 17.3+ required for source-side reads via `remote_query()` (inherited from upstream's `RemoteQueryReader` adoption). - MISMATCH-state cost at 1 M scale currently exceeds row-hash-only mode by 16-94 s because Stage-2 still feeds the existing JOIN. MATCH is the headline win (38.7% on 1 M rows); billion-row scale is the production motivation. Stage-1 hash persistence as Stage-2 input is filed as a follow-up. - Pre-existing `success_count` formula in `verify_successful_reconciliation` (upstream PR databrickslabs#2259, commit `e56c79c3d`) is mathematically wrong; sits next to fingerprint code in `trigger_recon_service.py`. Not fixed here to keep scope contained; filed separately. ### Tests - All unit tests on the touched surface pass; `tests/unit/reconcile/` runs 282 tests in <1 s. The 6 `test_cli_analyze.py` failures are pre-existing on main and unrelated. - Regression tests added for every review-round fix (UTC pin, fallback path, VARCHAR(65535)); `test_expression_generator.py` pins the exact rendered SQL on each dialect for the two pre-existing fixes. - Correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across the 20-scenario dual-mode parity matrix: 39/40 cells PASS, 1 scenario shows a known fingerprint-solver fallback edge with verdict agreement on both sides — only the cap-bounded `mismatch` count differs (fingerprint reports the true 10000, normal reports the cap-50 sample). - Linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green. - Integration coverage to follow alongside the recon e2e cluster fixture (databrickslabs#2453).

ameersalman33 · 2026-06-02T15:12:25Z

@bishwajit-db Thanks for the review. I raised a new MR as discussed in the lakebridge repo - databrickslabs#2490

ameersalman33 · 2026-06-02T15:34:06Z

As discussed, raised MR in the lakebridge repo - databrickslabs#2490

m-abulazm and others added 3 commits May 28, 2026 05:28

bishwajit-db reviewed Jun 2, 2026

View reviewed changes

ameersalman33 force-pushed the feature/dataprint-integration branch from c980821 to 9ae2f98 Compare June 2, 2026 14:07

ameersalman33 force-pushed the feature/dataprint-integration branch from 9ae2f98 to 667494c Compare June 2, 2026 15:10

ameersalman33 closed this Jun 2, 2026

		return rendered.sql(dialect="databricks")


		def test_databricks_timestamp_handler_emits_microsecond_precision():

Conversation

ameersalman33 commented May 25, 2026

Changes

What does this PR do?

Relevant implementation details

Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

Caveats/things to watch out for when reviewing

Linked issues

Functionality

Tests

Uh oh!

bishwajit-db left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ameersalman33 commented Jun 2, 2026

Uh oh!

ameersalman33 commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants