Skip to content

Add fingerprint pre-check to Recon (Redshift)#1

Closed
ameersalman33 wants to merge 4 commits into
mainfrom
feature/dataprint-integration
Closed

Add fingerprint pre-check to Recon (Redshift)#1
ameersalman33 wants to merge 4 commits into
mainfrom
feature/dataprint-integration

Conversation

@ameersalman33

Copy link
Copy Markdown
Owner

Changes

What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When fingerprint_precheck=True and the source has a registered query builder, Recon runs a sketch-based detection pass (MD5-sub-bucketed aggregates over both sides) before the row-hash compare pipeline.

  • MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
  • MISMATCH -> an algebraic solver returns the differing row hashes; a surgical Stage-2 fetch pulls just those rows and feeds them into the existing compare.reconcile_data flow. If the mismatch is systemic (>15% of sub-buckets), the precheck defers to the existing row-hash pipeline.
  • Ineligible -> falls through to the existing row-hash pipeline silently.

The flag defaults to False; existing behaviour is unchanged.

The algorithm is byte-identical to the dataprint sketch-based reconciliation library (Databricks-internal); this MR is the first dataprint-into-lakebridge integration. Redshift is the first source dialect — adding Snowflake / Oracle / TSQL is one FingerprintQueryBuilder subclass plus one registry entry, with no orchestrator changes.

Relevant implementation details

  • trigger_recon_service._run_fingerprint_or_reconcile_data is the single decision point. Static eligibility (flag off, unsupported dialect, report type not data, no join columns, filters / transformations / thresholds configured) is centralised in classify_ineligibility (fingerprint/orchestrator.py) and runs before any source-side compute. One config-time reason — unmapped_target_column_mapping — is detected later, inside the precheck (because it requires the target schema), and surfaces via the typed UnmappedTargetColumnMappingError raised by align_columns; the trigger catches it before the runtime-failure branch and routes it through FingerprintRunMetadata.ineligible(...). Every reason is one of the values on the IneligibilityReason enum and is recorded on recon_metrics.fingerprint_metrics.ineligibility_reason, so adoption queries can count each cause without needing to grep cluster logs.
  • Source-side reads use upstream's RemoteQueryReader / remote_query() TVF unmodified — no new connector code. Stage-1 aggregation pushdown was verified empirically against a 1 M-row Redshift fixture on a DBR 17.3 cluster: both direct JDBC (4.28 s median) and the upstream remote_query() TVF path (5.44 s median) push the GROUP BY to Redshift and return identical aggregated row counts.
  • Stage-1 detection is parallelised across source / target via ThreadPoolExecutor(max_workers=2, thread_name_prefix="fp-stage2"). Failure semantics match the serial version; the trigger-layer catch wraps the whole block.
  • New config fields on ReconcileConfig:
    • fingerprint_precheck: bool = False (the main flag)
    • fingerprint_treat_empty_as_null: bool = False (collapse '' to NULL in the per-column hash payload, end-to-end on both sides)
    • fingerprint_row_count_override: int | None = None (pin the adaptive sub-bucket / bucket tier when DESCRIBE DETAIL cannot read numRecords)
  • ReconcileConfig.__version__ bumps from 2 -> 3. The new v2_migrate introduces fingerprint_precheck and folds two legacy spellings (redshift_fingerprint_precheck, use_fingerprint_precheck) into it. Combined with upstream's v1_migrate, the chain v1 -> v2 -> v3 runs automatically via Installation.load(ReconcileConfig). Existing deployments upgrade with zero user action.

Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

While integrating dataprint end-to-end against Redshift on a real cluster, two pre-existing correctness bugs in the upstream Redshift connector MR (PR databrickslabs#2339) surfaced. Neither is dataprint-specific — both crash the existing row-hash recon path on real customer schemas — but they sat directly in the integration path, so they are fixed inline. Both fixes live in reconcile/query_builder/expression_generator.py and are pinned by regression tests.

# Fix What broke before
1 Add TIMESTAMP / TIMESTAMPTZ transform handlers to the Databricks dialect (target side) The Redshift dialect already emits source-side COALESCE(TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US'), '_null_recon_') (always 6 fractional digits, fixed-width). The Databricks dialect block had no override for TIMESTAMP / TIMESTAMPTZ, so the target side fell through to the universal default TRIM(COALESCE(col, '_null_recon_')), where Spark's implicit cast(timestamp AS string) emits a variable-length fractional component (omitted entirely for zero-microsecond timestamps; '2023-10-02 18:08:43' vs source '2023-10-02 18:08:43.000000'). The byte-width drift made the per-row SHA2 disagree for every logically-identical TIMESTAMP / TIMESTAMPTZ row in any Redshift -> Databricks reconcile. Fix: add COALESCE(DATE_FORMAT(_, 'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_') for both types so the two sides are byte-identical.
2 Add BOOLEAN transform handler to the Redshift dialect (source side) The Redshift block defined overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no dialect-level default. Any BOOLEAN column fell through to the universal default TRIM(COALESCE(col, '_null_recon_')), which Redshift rejects during output schema resolution (before any rows are read) with function pg_catalog.btrim(boolean) does not exist. Result: any customer schema containing a single BOOLEAN column crashes row-hash recon end-to-end on the Redshift connector shipped in PR databrickslabs#2339. Fix: explicit COALESCE(CASE WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END, '_null_recon_') so the rendered string matches Spark's cast(boolean AS string) byte-for-byte. The existing unit suite did not catch this because the test fixtures hand-build query strings and never exercise the dialect transform-mapping path.

Caveats/things to watch out for when reviewing

  • DBR 17.3+ requirement: source-side reads via remote_query() require classic clusters on DBR 17.3+ or SQL warehouses (Pro / Serverless) on version 2025.35+. This is inherited from upstream's RemoteQueryReader adoption, not a fingerprint-specific limitation; the runtime requirement is documented in docs/lakebridge/docs/reconcile/index.mdx.
  • MISMATCH-state cost: at 1 M scale, fingerprint pays ~20 s for Stage-1 detection unconditionally and Stage-2 currently feeds into the same compare.reconcile_data JOIN rather than replacing it, so on MISMATCH it costs an extra 16-94 s vs. row-hash-only mode. The MATCH path is the headline win (38.7% wall-clock improvement on 1 M rows: 55.5 s vs 90.6 s); the production motivation is billion-row scale, where the row-hash pipeline's full-table JOIN is what becomes intractable. Stage-1 hash persistence as Stage-2 input is filed as a follow-up.
  • Test fixture file: the only non-fingerprint, non-expression_generator test file modified is tests/unit/test_install.py (six "version": 2 -> 3 fixture bumps, a direct consequence of the migration step we own). Not scope creep.
  • success_count formula in verify_successful_reconciliation is mathematically wrong (yields counts greater than total). Pre-existing in upstream PR Improve reconciliation result handling and logging databrickslabs/lakebridge#2259 (commit e56c79c3d); sits right next to fingerprint code in trigger_recon_service.py. Not fixed in this MR to keep scope contained; filed as a separate issue.

Linked issues

Resolves #..

Functionality

  • added relevant user documentation (docs/lakebridge/docs/reconcile/configuration.mdx Fingerprint Pre-check (Experimental) section + docs/lakebridge/docs/reconcile/index.mdx runtime requirements)
  • added new CLI command
  • modified existing command: databricks labs lakebridge ...
  • new opt-in feature behind ReconcileConfig.fingerprint_precheck (default False)

Tests

  • added unit tests (+1,400 LOC across 14 new test modules covering: query builders, algebraic solver, eligibility classifier, fetch parallelisation, v1 -> v2 -> v3 config migration, recon-capture typed schema, fingerprint dispatch, Stage-1 / Stage-2 contract symmetry)
  • All 1383 unit tests pass (2 skipped, 3 xfailed); tests/unit/reconcile/ alone runs 307 reconcile tests in 1.13 s. The 6 failing test_cli_analyze.py cases (Informatica analyzer binary) are pre-existing on main (verified on stash-clean HEAD) and unrelated.
  • regression tests for the two pre-existing fixes: test_expression_generator.py pins handler presence, format string, and source / target byte alignment for TIMESTAMP / TIMESTAMPTZ, plus the BOOLEAN CASE WHEN rendering on Redshift.
  • correctness validated end-to-end on a 1 M-row Redshift / Delta fixture across a curated 24-scenario catalog: 47 / 48 Track-1 cells pass; 20 / 20 dual-mode parity OK.
  • linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green across the full reconcile/ surface plus config.py.
  • added integration tests (the recon e2e cluster gate landed in create dedicated test cluster for recon e2e tests databrickslabs/lakebridge#2453; integration coverage will be added in a follow-up MR alongside the cluster fixture)

m-abulazm and others added 3 commits May 28, 2026 05:28
…ickslabs#2465)

## Changes

### What does this PR do?
* Standardizes the reconcile sampling query across all dialects
* Azure Synapse Dedicated SQL Pool does not accept VALUES as a derived
table, so the previous T-SQL path failed at runtime on Synapse
connections.

### Implementation Details
replaces the previous CTE form (non-T-SQL) and `VALUES` derived-table
form (T-SQL) with a single derived-table-join shape that works
everywhere.
                                                                       
```sql
  -- Before (T-SQL path):                                              
INNER JOIN (VALUES (...), (...)) AS recon([col1], [col2]) ON ...
-- Before (other dialects):
WITH recon AS (SELECT ... UNION SELECT ...), src AS (...)
SELECT ... FROM src INNER JOIN recon ON ...
   
-- After (all dialects):
SELECT ... FROM (...) AS src
  INNER JOIN (SELECT ... UNION SELECT ...) AS recon ON ...
```                                                                                                                                                                                                                 

### Linked issues
Resolves databrickslabs#2418

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs lakebridge ...`

### Tests

- [x] manually tested
- [x] added unit tests
- [x] added integration tests
…QL scripts (databrickslabs#2482)

## Changes
<!-- Summary of your changes that are easy to understand. Add
screenshots when necessary, they're helpful to illustrate the before and
after state -->
### What does this PR do?
* Converts the MSSQL profiler from `type: python` venv steps to `type:
sql` / `type: ddl` in-process steps

### Key changes
- Replace `activity_extract.py`, `info_extract.py`, and
`mssql/common/*`.
 - Add and modify 13 query/DDL SQL pairs to profile.
- Use exisitng connector `MSSQLConnector` instead of duplicate in
`get_sqlserver_reader`.
- DDL types match documented source types (`SMALLINT`, `INTEGER`,
`DOUBLE`, `TIMESTAMP`).
  - `sys_info.sql` enumerates 37 columns explicitly (no `SELECT *`).
- `database` configurator prompt no longer defaults to `master`; user
supplies the actual target DB. Docs (`mssql.mdx`, `legacy_synapse.mdx`)
aligned with actual prompts.

### Linked issues
<!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes,
fixed, resolve, resolves, resolved. See
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
-->

Resolves databrickslabs#2459 

### Functionality

- [ ] modified existing command: `databricks labs lakebridge
execute-database-profiler --source-tech mssql`

### Tests
<!-- How is this tested? Please see the checklist below and also
describe any other relevant tests -->

- [X] manually tested
- [ ] added unit tests
- [ ] added integration tests
…ation schema (databrickslabs#2304)

## Changes

This PR adds Redshift as a supported profiler assessment platform. It
introduces Redshift resources for serverless, provisioned, and
provisioned_multi_az

### What does this PR do?

- Adds Redshift as a supported platform for the profiler assessment
(alongside Synapse).
- Adds Redshift resources for all three variants (serverless,
provisioned, provisioned_multi_az): query/SQL files (e.g. query_view,
rs_managed_storage_gb, rs_nodes, chart queries), pipeline configs, and
validation schema.
- Files (2 levels: folder → subfolder, 3rd level: files under each)):
- resources/assessments/redshift/ → provisioned/, provisioned_multi_az/,
serverless/ with their .sql and pipeline_config.yml
  - resources/assessments/validation/ → redshift_extract_schema.yml

### Relevant implementation details

- Resources: Same query/file layout for serverless, provisioned, and
provisioned_multi_az under resources/assessments/redshift/{variant}/
(e.g. 0_query_view.sql … 9_chart_*.sql, pipeline_config.yml); serverless
also has 10_cost_incurred.sql. Shared validation:
resources/assessments/validation/redshift_extract_schema.yml.

### Caveats/things to watch out for when reviewing:

- Multi-AZ Redshift has limited system views (e.g. some STV views
unavailable); provisioned_multi_az SQL reflects that where applicable.
- Serverless Redshift has one additional query

### Linked issues
<!-- DOC: Link issue with a keyword: close, closes, closed, fix, fixes,
fixed, resolve, resolves, resolved. See
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword
-->

Resolves #..

### Functionality

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs lakebridge ...`
- [ ] ... +add your own

### Tests

1. Manually Tested all credential flows on all clusters in AWS Sandbox
account aws-sandbox-field-eng (332745928618)
tests/resources/assessments/pipeline_config_main_redshift.yml: pipeline
config that runs that script.
- [x] manually tested
- [ ] added unit tests
- [ ] added integration tests

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: M Abulazm <mohamed.abulazm@databricks.com>

@bishwajit-db bishwajit-db left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — it's a genuinely thorough and well-documented piece of work, and I really like the fail-open design and the dual-channel verification in the solver. I've left a handful of inline notes below, mostly questions and suggestions rather than hard objections.

The two I'd most love your thoughts on before merge are the timestamptz timezone handling and the Stage-2 build-failure fallback; the rest are smaller cleanups, a naming idea, and a couple of test suggestions. Happy to talk any of these through — nothing here that we can't sort out together.

col_type_lower = (col_type or "").strip().lower()
spark_col = F.expr(_quote_spark_identifier(col_name))
if _is_timestamp_like(col_type_lower):
cast_col = F.trim(F.date_format(spark_col, "yyyy-MM-dd HH:mm:ss.SSSSSS"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work overall. One thing I wanted to check on timestamptz: the target side renders via date_format(...), which uses the Spark session timezone, while the Redshift source side pins UTC (TO_CHAR(... AT TIME ZONE 'UTC', ...) in redshift.py). If a cluster's spark.sql.session.timeZone isn't UTC, I think the two sides could render the same instant differently, so timestamptz rows would look mismatched even when they're equal (Stage-1 would just fall open, so still correct — but slower — and the row-hash path could over-report). Could we normalize both sides to UTC explicitly, and maybe add a small test pinning the expected string on each side? Just want the byte-identical guarantee to hold regardless of cluster timezone.

)

try:
data_reconcile_output = build_mismatch_output(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thought on the fallback here: the other non-MATCH branches in this method fall back to _run_reconcile_data(...), but if build_mismatch_output raises, this path returns DataReconcileOutput(exception=...) and marks the table failed. Since build_mismatch_output is fairly involved, would it make sense to fall back to the full pipeline here too, so a hiccup in the Stage-2 builder doesn't turn an otherwise-successful recon into a failure? Totally fine if reporting the exception is intentional — just flagging it for consistency with the nice fail-open behavior everywhere else.

return rendered.sql(dialect="databricks")


def test_databricks_timestamp_handler_emits_microsecond_precision():

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These do a great job pinning the format-string shape on each side. One suggestion: since the MATCH fast-path leans on both engines producing the same bytes, could we add an assertion that pins the expected output string for a sample timestamp on each dialect and checks they're equal (rather than substring-checking two different patterns)? It'd guard against a pattern typo or the timezone case above. Also noticed BOOLEAN isn't rendered here even though we added the Redshift handler — a small case for it would round things out.

elif col_type_lower == "date":
cast_expr = f"TO_CHAR({quoted}, 'YYYY-MM-DD')"
else:
cast_expr = f"CAST({quoted} AS VARCHAR)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One edge case worth a look: CAST(col AS VARCHAR) in Redshift defaults to VARCHAR(256), so values longer than 256 chars can get truncated, while the Spark target keeps the full string. For long text columns I think that could make Stage-1 see a mismatch on otherwise-equal rows (it'd fall open, so still correct, just slower). Could we cast to an explicit wider length to keep the two sides aligned? A long-string test would be a nice guard too.

],
# Align with Redshift's ``TO_CHAR(ts, 'YYYY-MM-DD HH24:MI:SS.US')`` so
# the per-row SHA2 inputs are byte-identical for Redshift -> Databricks reconciles.
exp.DataType.Type.TIMESTAMP.value: [

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These TIMESTAMP/TIMESTAMPTZ/BOOLEAN handlers look like solid fixes! Since they live in the shared transform mapping, they'll change row hashes for every Redshift→Databricks reconcile regardless of the fingerprint flag. Would it be worth pulling them (with their regression tests) into a small standalone PR, so they can land and be validated on their own and this PR stays focused on the fingerprint feature? No strong opinion if you'd rather keep them together — just a thought on reviewability and blast radius.

)

return ColumnAlignment(
exclude_columns=sorted(exclude) if exclude else None,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: it looks like ColumnAlignment.exclude_columns is computed here, but run_fingerprint_precheck only ever reads column_mapping (the actual select/drop filtering happens in hash_columns_ordered_for_reconcile). If it isn't wired in anywhere, maybe we can drop the field to avoid confusion? (If it's meant to be used, the all_src_cols - selected diff might also need identifier normalization.)

)
elapsed = time.monotonic() - start

# Wall-clock: serial would be ~0.4s, parallel ~0.2s. Allow up to 0.35s for

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads-up on possible CI flakiness: the elapsed < 0.35 wall-clock check after the two 0.2s sleeps could occasionally trip on a loaded runner. The distinct-thread-id assertion is a solid signal on its own — could we relax or drop the timing bound to keep this from flaking?

layer: str,
data_source: DataSource,
) -> list[str]:
"""Mirror HashQueryBuilder hash column set + sort order (case-insensitive sort_key)."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the note here about mirroring HashQueryBuilder (and similarly the serializer in redshift.py/spark_target.py mirroring expression_generator): would it be worth extracting these into shared helpers that both the row-hash and fingerprint paths call? I think it'd remove the drift risk the comment mentions — and it's actually the cleanest fix for the timestamptz timezone difference above, since if both paths share one serializer they can't diverge. I know the comment defers this until the second dialect; just flagging that doing it now would collapse a couple of these notes into one change.

systemic_mismatch: bool = False


def detect_and_solve(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny naming thought: engine already means a sqlglot Dialect throughout reconcile (we even take source_engine in the orchestrator), and *_engine.py is used for the transpiler engines (TranspileEngine/SqlglotEngine/LSPEngine). Since this module is really the detect/solve logic, would solver.py or detection.py read a bit more clearly? Just a rename, no rush.

connection_name: str,
is_target: bool = False,
) -> DataSource:
if isinstance(engine, Databricks):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small one: I think this reorder is effectively a no-op — RemoteQueryReader.__init__ just stores its args (no I/O), and the Databricks branch never used reader anyway. Since it isn't needed for the feature, maybe we can drop it to keep the diff tight (or fold it into the expression_generator cleanup if you split that out)? Not a blocker either way.

@ameersalman33 ameersalman33 force-pushed the feature/dataprint-integration branch from c980821 to 9ae2f98 Compare June 2, 2026 14:07
## Changes

### What does this PR do?

Adds an opt-in fingerprint pre-check to Recon. When `fingerprint_precheck=True`
and the source has a registered query builder, Recon runs a sketch-based
detection pass (MD5-sub-bucketed aggregates over both sides) before the
row-hash compare pipeline.

- MATCH -> Recon short-circuits in seconds; no full table scan, no JOIN.
- MISMATCH -> an algebraic solver returns the differing row hashes; a
  surgical Stage-2 fetch pulls just those rows and feeds them into the
  existing `compare.reconcile_data` flow. If the mismatch is systemic
  (>15% of sub-buckets), the precheck defers to the existing pipeline.
- Ineligible -> falls through silently.

The flag defaults to False; existing behaviour is unchanged. The algorithm
is byte-identical to the dataprint sketch-based reconciliation library;
this is the first dataprint-into-lakebridge integration. Redshift is the
first dialect — adding Snowflake / Oracle / TSQL is one
`FingerprintQueryBuilder` subclass plus one registry entry.

### Relevant implementation details

- `trigger_recon_service._run_fingerprint_or_reconcile_data` is the single
  decision point. Static eligibility centralised in `classify_ineligibility`;
  the schema-dependent `unmapped_target_column_mapping` reason is raised
  by `align_columns` as a typed exception and routed through
  `FingerprintRunMetadata.ineligible(...)`. Every reason maps to an
  `IneligibilityReason` enum value and is recorded on
  `recon_metrics.fingerprint_metrics.ineligibility_reason`.
- Source-side reads use upstream's `RemoteQueryReader` / `remote_query()`
  TVF unmodified; Stage-1 aggregation pushdown verified empirically on a
  1 M-row Redshift fixture (DBR 17.3).
- Stage-1 detection is parallelised across source / target via a 2-thread
  pool; failure semantics match the serial version.
- Three new fields on `ReconcileConfig`: `fingerprint_precheck`,
  `fingerprint_treat_empty_as_null`, `fingerprint_row_count_override`.
- Config version bumps 2 -> 3 with a `v2_migrate` that folds two legacy
  spellings (`redshift_fingerprint_precheck`, `use_fingerprint_precheck`)
  into the new flag. Existing deployments upgrade automatically.

### Pre-existing fixes that ride along (upstream PR databrickslabs#2339)

Two correctness bugs in the upstream Redshift connector MR (databrickslabs#2339)
surfaced during the dataprint integration P0 / P1 runs against a real
cluster. Both crash the existing row-hash recon path on real customer
schemas and are unrelated to dataprint, but they sat in the integration
path so they are fixed inline. Both fixes live in
`reconcile/query_builder/expression_generator.py` and are pinned by
regression tests in `test_expression_generator.py`.

- **Databricks block missing TIMESTAMP / TIMESTAMPTZ handler.** Redshift's
  source-side transform emits `COALESCE(TO_CHAR(ts, 'YYYY-MM-DD
  HH24:MI:SS.US'), '_null_recon_')` (always 6 fractional digits), but
  the Databricks block had no override, so the target side fell through
  to the universal default `TRIM(COALESCE(col, '_null_recon_'))` — Spark
  emits a variable-length fractional component, omitted entirely for
  zero-microsecond timestamps. The byte-width drift made per-row SHA2
  disagree for every TIMESTAMP / TIMESTAMPTZ row in any
  Redshift -> Databricks reconcile. Fix: add `COALESCE(DATE_FORMAT(ts,
  'yyyy-MM-dd HH:mm:ss.SSSSSS'), '_null_recon_')` so source and target
  are byte-identical.
- **Redshift block missing BOOLEAN handler.** The Redshift block defined
  overrides only for SUPER / DATE / TIMESTAMP / TIMESTAMPTZ and had no
  dialect-level `default`. BOOLEAN columns fell through to the universal
  default `TRIM(COALESCE(col, '_null_recon_'))`, which Redshift rejects
  during output schema resolution with `function pg_catalog.btrim(boolean)
  does not exist`. Any customer schema containing a single BOOLEAN
  column crashes row-hash recon end-to-end. Fix: explicit `COALESCE(CASE
  WHEN col THEN 'true' WHEN NOT col THEN 'false' ELSE NULL END,
  '_null_recon_')` so the rendered string matches Spark's
  `cast(boolean AS string)` byte-for-byte.

### Hardening from the internal review round

After the initial internal review on the contributor's fork, three
substantive code changes landed before pushing upstream:

- **Pin TZ-aware Spark target columns to UTC** before formatting in
  `fingerprint/spark_target.py`. The Redshift side already pinned UTC via
  `TO_CHAR(_ AT TIME ZONE 'UTC', _)`; the Spark side was using
  `DATE_FORMAT(ts, _)` which renders in `spark.sql.session.timeZone`. On
  a non-UTC cluster the same instant rendered different bytes on the two
  sides. Fix splits LTZ vs NTZ handling and routes LTZ through
  `TO_UTC_TIMESTAMP(_, CURRENT_TIMEZONE())`. NTZ behaviour unchanged.
- **Stage-2 build failures fall through to the full pipeline** in
  `trigger_recon_service.py` instead of marking the table failed. Every
  other non-MATCH branch already does this; the `build_mismatch_output`
  exception path was the one inconsistency. Metadata still records
  `fallback_to_full_pipeline=True` for observability.
- **Cast Redshift strings to `VARCHAR(65535)`** in
  `fingerprint/query_builders/redshift.py` instead of bare `VARCHAR`,
  whose default 256-byte width truncated long text. `VARCHAR(65535)` is
  Redshift's maximum and matches Spark's unbounded string semantics.

Smaller cleanups: dropped unused `ColumnAlignment.exclude_columns`;
reverted a no-op reorder in `connectors/source_adapter.py`; replaced a
flaky wall-clock assertion in `test_fetch_parallel.py` with a
deterministic distinct-thread-id assertion; pinned the exact rendered
SQL on each dialect in `test_expression_generator.py` (instead of
substring-checking two different patterns) and added a regression test
for the Redshift `BOOLEAN` handler.

### Caveats

- DBR 17.3+ required for source-side reads via `remote_query()`
  (inherited from upstream's `RemoteQueryReader` adoption).
- MISMATCH-state cost at 1 M scale currently exceeds row-hash-only mode
  by 16-94 s because Stage-2 still feeds the existing JOIN. MATCH is
  the headline win (38.7% on 1 M rows); billion-row scale is the
  production motivation. Stage-1 hash persistence as Stage-2 input is
  filed as a follow-up.
- Pre-existing `success_count` formula in `verify_successful_reconciliation`
  (upstream PR databrickslabs#2259, commit `e56c79c3d`) is mathematically wrong; sits
  next to fingerprint code in `trigger_recon_service.py`. Not fixed
  here to keep scope contained; filed separately.

### Tests

- All unit tests on the touched surface pass; `tests/unit/reconcile/`
  runs 282 tests in <1 s. The 6 `test_cli_analyze.py` failures are
  pre-existing on main and unrelated.
- Regression tests added for every review-round fix (UTC pin, fallback
  path, VARCHAR(65535)); `test_expression_generator.py` pins the exact
  rendered SQL on each dialect for the two pre-existing fixes.
- Correctness validated end-to-end on a 1 M-row Redshift / Delta fixture
  across the 20-scenario dual-mode parity matrix: 39/40 cells PASS, 1
  scenario shows a known fingerprint-solver fallback edge with verdict
  agreement on both sides — only the cap-bounded `mismatch` count
  differs (fingerprint reports the true 10000, normal reports the
  cap-50 sample).
- Linter clean: pylint 10.00/10 on touched src; ruff, black, mypy green.
- Integration coverage to follow alongside the recon e2e cluster fixture
  (databrickslabs#2453).
@ameersalman33 ameersalman33 force-pushed the feature/dataprint-integration branch from 9ae2f98 to 667494c Compare June 2, 2026 15:10
@ameersalman33

Copy link
Copy Markdown
Owner Author

@bishwajit-db Thanks for the review. I raised a new MR as discussed in the lakebridge repo - databrickslabs#2490

@ameersalman33

Copy link
Copy Markdown
Owner Author

As discussed, raised MR in the lakebridge repo - databrickslabs#2490

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants