Feature/oracle profiler validation by nnigam19 · Pull Request #2495 · databrickslabs/lakebridge

nnigam19 · 2026-06-05T12:07:05Z

What does this PR do?

Validates the Oracle profiler's DuckDB to Delta ingestion ("Profiler Ingestion Job") and fixes a
timestamp type-fidelity bug found during that validation. It also adds the per-source extract-schema
definition that the ingestion's validation step requires for Oracle.

1. Add resources/assessments/validation/oracle_extract_schema.yml (11 tables).
_validate_profiler_extract loads validation/<source_tech>_extract_schema.yml to know which tables/columns
to validate. This file was missing for Oracle, so the ingestion job raised FileNotFoundError for Oracle
before it ever reached ingestion. The definition mirrors the Oracle extraction DDLs exactly (verified
against both the shipped DDLs and a real extract).
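
As a rough illustration of the lookup described in item 1, the sketch below resolves the per-source definition file and fails early when it is missing. The function name `resolve_extract_schema` and the `resources_root` parameter are hypothetical; only the `validation/<source_tech>_extract_schema.yml` naming pattern comes from this PR.

```python
from pathlib import Path


def resolve_extract_schema(resources_root: Path, source_tech: str) -> Path:
    """Return the path of the per-source extract-schema definition.

    Hypothetical helper: only the file-naming pattern is taken from the PR.
    """
    path = resources_root / "validation" / f"{source_tech}_extract_schema.yml"
    if not path.is_file():
        # Without a definition for this source, validation cannot start.
        raise FileNotFoundError(f"No extract schema for '{source_tech}': {path}")
    return path
```

With this shape, a source that has no definition file fails at lookup time, which matches the FileNotFoundError seen for Oracle before this PR.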

2. Fix data-type fidelity in _ingest_table (assessments/dashboards/execute.py).
Ingestion previously relied on pandas-based schema inference (spark.createDataFrame(pdf)). DuckDB's
timezone-naive TIMESTAMP was inferred as Spark TIMESTAMP (LTZ / instant), which reinterprets the
value in the session time zone and shifts wall-clock values by the session offset.
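
The effect can be reproduced with plain Python datetimes, independent of Spark. This is only an illustration of the general mechanism, not Spark's actual code path; the example value and the UTC+2 session offset are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Wall-clock value as stored in a timezone-naive column (example value).
naive = datetime(2026, 6, 5, 12, 0, 0)

# Assumption for illustration: the naive value is treated as a UTC instant,
# then rendered in a session zone that is 2 hours ahead of UTC.
session_tz = timezone(timedelta(hours=2))
as_instant = naive.replace(tzinfo=timezone.utc)
rendered = as_instant.astimezone(session_tz).replace(tzinfo=None)

# The displayed wall-clock value differs from the stored one by the offset.
shift = rendered - naive
print(naive, "->", rendered, "shift:", shift)
```

A type with no time zone semantics avoids the conversion step entirely, so the stored value and the displayed value stay identical.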

The fix derives an explicit Spark schema from the DuckDB column types (build_spark_schema +
_DUCKDB_TO_SPARK_TYPE) and passes it to createDataFrame, mapping:

| DuckDB type | Spark type |
| --- | --- |
| `VARCHAR` | `StringType` |
| `INTEGER` | `IntegerType` |
| `BIGINT` | `LongType` |
| `DOUBLE` | `DoubleType` |
| `TIMESTAMP` | `TIMESTAMP_NTZ` (the fix: preserves wall-clock) |
| `TIMESTAMP WITH TIME ZONE` | `TimestampType` (LTZ) |

Columns whose DuckDB type isn't mapped (e.g. nested STRUCT/LIST from other profilers) fall back to
schema inference, so Synapse/other sources are unaffected.
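
A minimal sketch of the mapping-with-fallback idea, kept free of a pyspark dependency by emitting a Spark SQL DDL schema string instead of a `StructType`. It is not the PR's `build_spark_schema`; the names `DUCKDB_TO_SPARK_DDL` and `build_ddl_schema` are hypothetical, and the choice to send the whole table to inference when any single column is unmapped is an assumption.

```python
from typing import Optional

# DuckDB type name -> Spark SQL DDL type name, following the table above.
DUCKDB_TO_SPARK_DDL = {
    "VARCHAR": "STRING",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP_NTZ",  # wall-clock value preserved
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMP",  # instant (LTZ)
}


def build_ddl_schema(columns: list[tuple[str, str]]) -> Optional[str]:
    """Build a DDL schema string, or return None to signal inference.

    `columns` is a list of (column_name, duckdb_type) pairs.
    """
    fields = []
    for name, duckdb_type in columns:
        spark_type = DUCKDB_TO_SPARK_DDL.get(duckdb_type.upper())
        if spark_type is None:
            # Assumption: one unmapped column sends the whole table to inference.
            return None
        fields.append(f"`{name}` {spark_type}")
    return ", ".join(fields)
```

A caller could then pass a non-None result as the `schema` argument of `spark.createDataFrame`, which accepts a DDL-formatted string, and call `spark.createDataFrame(pdf)` without a schema otherwise.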

3. Tests.

tests/unit/assessment/test_ingest_schema.py - unit tests for the type mapping, incl. TIMESTAMP to TIMESTAMP_NTZ,
TIMESTAMP WITH TIME ZONE to LTZ, and the unmapped-type (STRUCT) fallback.
tests/integration/assessments/test_oracle_validation.py + build_mock_oracle_extract - builds a mock
Oracle extract from the shipped DDLs and validates it against the shipped oracle_extract_schema.yml,
guarding DDL ↔ schema-definition consistency (no live Oracle required).
tests/integration/assessments/manual_oracle_e2e_check.py - a manual end-to-end fidelity check (not run in
CI; the manual_ prefix keeps pytest from collecting it).
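
To show why the `manual_` prefix is enough, the sketch below applies pytest's default `python_files` patterns (`test_*.py` and `*_test.py`) to the three file names above. It assumes the project does not override `python_files` in its pytest configuration; `is_collected` is a hypothetical helper, not part of pytest.

```python
from fnmatch import fnmatch

# pytest's default `python_files` patterns.
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")


def is_collected(filename: str) -> bool:
    """Return True if the file name matches a default pytest pattern."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


for name in (
    "test_ingest_schema.py",
    "test_oracle_validation.py",
    "manual_oracle_e2e_check.py",
):
    print(name, is_collected(name))
```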

What was verified

All 11 Oracle tables round-trip with exact row counts.
Strings, integers (incl. nullable), bigints, and doubles round-trip value-exact.
Timestamps preserved exactly after the fix (validated end-to-end on Databricks serverless, Spark 4.x).
No regression to other sources: Synapse STRUCT tables fall back to inference, and the existing
validator / deployment / pipeline tests pass.
black / ruff / mypy / pylint clean.

Out of scope

Oracle to DuckDB (extraction-side) precision, e.g. NUMBER to DOUBLE narrowing, is not changed (it
lives in the extraction DDLs from feature/oracle_profiler #2187).

Note for reviewers

This branch is based on the --output-folder work (#2488), so until that merges into main the diff also
includes those changes. I'll rebase onto main once #2488 lands so the diff shows only this change. Opening
as draft for now.

Also flagging (pre-existing, not changed here): the checks in _validate_profiler_extract are created at
WARN severity and the gate only counts FAIL + ERROR, so today validation reports issues but does not
block ingestion. Happy to follow up separately if it should gate.
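
A small sketch of the gating behavior flagged above, under the assumption that the gate simply counts blocking severities. The `Severity` enum and `should_block` function are hypothetical names, not the project's actual types.

```python
from enum import Enum


class Severity(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"
    ERROR = "ERROR"


# Severities that the gate treats as blocking, per the note above.
BLOCKING = {Severity.FAIL, Severity.ERROR}


def should_block(results: list[Severity]) -> bool:
    """Return True if any check result has a blocking severity."""
    return any(severity in BLOCKING for severity in results)


# Checks created at WARN severity are reported but never trip the gate.
print(should_block([Severity.WARN, Severity.WARN]))
```

Under this model, making validation gate ingestion would mean either raising the checks to FAIL or adding WARN to the blocking set.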

Linked issues

Part of the Oracle profiler rollout (#2187). No issue closed by this PR.

Functionality

modified existing behavior: execute-database-profiler ingestion (_ingest_table type fidelity)
added per-source asset: oracle_extract_schema.yml

Tests

manually tested (end-to-end against a real extract on a real runtime)
added unit tests
added integration tests

Surfaces the profiler's DuckDB extract path as a CLI choice, with the default `~/.databricks/labs/lakebridge_profilers/<source>_assessment`, instead of a hardcoded value in per-source pipeline_config.yml files. Mirrors the shape of the transpile command's `--output-folder` flag. Closes databrickslabs#2461.

…pe fidelity Add the missing oracle_extract_schema.yml required by the profiler ingestion validation step (without it the job raised FileNotFoundError for Oracle before reaching ingestion), and fix _ingest_table to derive an explicit Spark schema from the DuckDB column types instead of relying on pandas inference. DuckDB's timezone-naive TIMESTAMP was inferred as Spark TIMESTAMP (LTZ), which shifts wall-clock values by the session offset. It now maps to TIMESTAMP_NTZ, preserving them exactly. Unmapped types (e.g. nested STRUCT from other sources) fall back to inference, so other sources are unaffected. Tests: unit tests for the DuckDB->Spark type mapping (incl. TIMESTAMP->NTZ and STRUCT fallback), an integration test validating a DDL-built mock Oracle extract against the shipped schema YAML, and a manual end-to-end fidelity check (not run in CI).

CLAassistant · 2026-06-05T12:07:15Z

All committers have signed the CLA.

m-abulazm and others added 2 commits June 1, 2026 10:53

nnigam19 wants to merge 2 commits into databrickslabs:main from nnigam19:feature/oracle-profiler-validation

3 participants
