
Feature/oracle profiler validation #2495

Draft
nnigam19 wants to merge 2 commits into databrickslabs:main from nnigam19:feature/oracle-profiler-validation


nnigam19 commented Jun 5, 2026


What does this PR do?

Validates the Oracle profiler's DuckDB to Delta ingestion ("Profiler Ingestion Job") and fixes a
timestamp type-fidelity bug found during that validation. It also adds the per-source extract-schema
definition that the ingestion's validation step requires for Oracle.

1. Add resources/assessments/validation/oracle_extract_schema.yml (11 tables).
_validate_profiler_extract loads validation/<source_tech>_extract_schema.yml to know which tables/columns
to validate. This file was missing for Oracle, so the ingestion job raised FileNotFoundError for Oracle
before it ever reached ingestion. The definition mirrors the Oracle extraction DDLs exactly (verified
against both the shipped DDLs and a real extract).
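
The lookup described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the project's code: `resolve_extract_schema` and the `resources_root` parameter are hypothetical names, and only the `validation/<source_tech>_extract_schema.yml` naming pattern comes from this PR.

```python
from pathlib import Path


def resolve_extract_schema(resources_root: Path, source_tech: str) -> Path:
    """Return the per-source schema definition path, or raise if it is missing."""
    # Naming pattern taken from the PR description; the function itself is hypothetical.
    schema_path = resources_root / "validation" / f"{source_tech}_extract_schema.yml"
    if not schema_path.is_file():
        # Without the file, validation cannot start, so ingestion is never reached.
        raise FileNotFoundError(
            f"No extract schema definition for {source_tech!r}: {schema_path}"
        )
    return schema_path
```

With this shape, a source that ships no definition file fails at lookup time, which matches the Oracle failure described above.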

2. Fix data-type fidelity in _ingest_table (assessments/dashboards/execute.py).
Ingestion previously relied on pandas-based schema inference (spark.createDataFrame(pdf)). DuckDB's
timezone-naive TIMESTAMP was inferred as Spark TIMESTAMP (LTZ / instant), which reinterprets the
value in the session time zone and shifts wall-clock values by the session offset.
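
The shift can be reproduced without Spark. The sketch below uses only the standard library and assumed offsets (a UTC writer session and a UTC-07:00 reader session, both chosen for illustration) to contrast instant-style handling with wall-clock handling.

```python
from datetime import datetime, timedelta, timezone

# Wall-clock value as stored in a timezone-naive column.
naive = datetime(2026, 6, 1, 10, 53, 0)

# Assumed session zones, for illustration only.
writer_tz = timezone.utc
reader_tz = timezone(timedelta(hours=-7))

# LTZ-like handling: the naive value is pinned to an instant in the writer's zone.
as_instant = naive.replace(tzinfo=writer_tz)
# Rendering that instant in another session zone changes the wall-clock reading.
shifted = as_instant.astimezone(reader_tz).replace(tzinfo=None)

# NTZ-like handling: the wall-clock value is carried through unchanged.
preserved = naive

print(shifted)    # 2026-06-01 03:53:00
print(preserved)  # 2026-06-01 10:53:00
```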

The fix derives an explicit Spark schema from the DuckDB column types (build_spark_schema +
_DUCKDB_TO_SPARK_TYPE) and passes it to createDataFrame, mapping:

| DuckDB | Spark |
| --- | --- |
| VARCHAR | StringType |
| INTEGER | IntegerType |
| BIGINT | LongType |
| DOUBLE | DoubleType |
| TIMESTAMP | TIMESTAMP_NTZ (the fix: preserves wall-clock) |
| TIMESTAMP WITH TIME ZONE | TimestampType (LTZ) |

Columns whose DuckDB type isn't mapped (e.g. nested STRUCT/LIST from other profilers) fall back to
schema inference, so Synapse/other sources are unaffected.
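
A rough sketch of the mapping-with-fallback idea, kept free of pyspark so it runs anywhere. It is not the PR's `build_spark_schema`: the names `DUCKDB_TO_SPARK_SQL_TYPE` and `build_schema_fields` are hypothetical, Spark types are written as SQL type names rather than `DataType` objects, and it assumes that a single unmapped column sends the whole table back to inference (the description does not say whether the fallback is per column or per table).

```python
from typing import Optional

# Mapping from the PR description, expressed as Spark SQL type names.
DUCKDB_TO_SPARK_SQL_TYPE = {
    "VARCHAR": "string",
    "INTEGER": "int",
    "BIGINT": "bigint",
    "DOUBLE": "double",
    "TIMESTAMP": "timestamp_ntz",           # wall-clock preserved
    "TIMESTAMP WITH TIME ZONE": "timestamp",  # instant (LTZ)
}


def build_schema_fields(columns: list[tuple[str, str]]) -> Optional[list[tuple[str, str]]]:
    """Map (column, duckdb_type) pairs to (column, spark_type) pairs.

    Returns None when any type is unmapped, signalling that the caller
    should fall back to schema inference (an assumption of this sketch).
    """
    fields = []
    for name, duckdb_type in columns:
        spark_type = DUCKDB_TO_SPARK_SQL_TYPE.get(duckdb_type.strip().upper())
        if spark_type is None:
            return None
        fields.append((name, spark_type))
    return fields
```

For example, `build_schema_fields([("ts", "TIMESTAMP")])` yields `[("ts", "timestamp_ntz")]`, while a column typed `STRUCT(a INTEGER)` yields `None`.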

3. Tests.

  • tests/unit/assessment/test_ingest_schema.py - unit tests for the type mapping, incl. TIMESTAMP to TIMESTAMP_NTZ,
    TIMESTAMP WITH TIME ZONE to LTZ, and the unmapped-type (STRUCT) fallback.
  • tests/integration/assessments/test_oracle_validation.py + build_mock_oracle_extract - builds a mock
    Oracle extract from the shipped DDLs and validates it against the shipped oracle_extract_schema.yml,
    guarding DDL ↔ schema-definition consistency (no live Oracle required).
  • tests/integration/assessments/manual_oracle_e2e_check.py - a manual end-to-end fidelity check (not run in
    CI; the manual_ prefix keeps pytest from collecting it).
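
Why the `manual_` prefix keeps the end-to-end check out of CI can be illustrated with pytest's default `python_files` patterns (`test_*.py` and `*_test.py`). The sketch assumes the project does not override `python_files` and that the file is not passed to pytest explicitly; `is_collected_by_default` is a hypothetical helper, not a pytest API.

```python
from fnmatch import fnmatch

# pytest's default `python_files` patterns.
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")


def is_collected_by_default(filename: str) -> bool:
    """Return True if a file name matches pytest's default discovery patterns."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


print(is_collected_by_default("test_oracle_validation.py"))   # True
print(is_collected_by_default("manual_oracle_e2e_check.py"))  # False
```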

What was verified

  • All 11 Oracle tables round-trip with exact row counts.
  • Strings, integers (incl. nullable), bigints, and doubles round-trip value-exact.
  • Timestamps preserved exactly after the fix (validated end-to-end on Databricks serverless, Spark 4.x).
  • No regression to other sources: Synapse STRUCT tables fall back to inference, existing
    validator / deployment / pipeline tests pass.
  • black / ruff / mypy / pylint clean.

Out of scope

  • Oracle to DuckDB (extraction-side) precision, e.g. NUMBER to DOUBLE narrowing, is not changed (it
    lives in the extraction DDLs from feature/oracle_profiler #2187).
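
For context on the narrowing mentioned above: an Oracle NUMBER can carry up to 38 significant decimal digits, while a 64-bit double has a 53-bit significand. The sketch below is a generic illustration of that limit in plain Python, not a test of the extraction DDLs.

```python
from decimal import Decimal

# 2**53 + 1 is the smallest positive integer a 64-bit double cannot represent.
exact = Decimal(2**53 + 1)

# Narrowing to a double rounds to the nearest representable value.
narrowed = float(exact)

# Converting back is exact, so the difference is the precision that was lost.
round_tripped = Decimal(narrowed)
lost = exact - round_tripped

print(exact)          # 9007199254740993
print(round_tripped)  # 9007199254740992
print(lost)           # 1
```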

Note for reviewers

This branch is based on the --output-folder work (#2488), so until that merges into main the diff also
includes those changes. I'll rebase onto main once #2488 lands so the diff shows only this change. Opening
as draft for now.

Also flagging (pre-existing, not changed here): the checks in _validate_profiler_extract are created at
WARN severity and the gate only counts FAIL + ERROR, so today validation reports issues but does not
block ingestion. Happy to follow up separately if it should gate.
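
The gating behavior flagged above can be sketched as follows. All names here (`Severity`, `BLOCKING`, `should_block`) are hypothetical and stand in for whatever the validator actually uses; only the WARN versus FAIL + ERROR distinction comes from the note.

```python
from enum import Enum


class Severity(Enum):
    WARN = "WARN"
    FAIL = "FAIL"
    ERROR = "ERROR"


# The gate counts only these severities (per the reviewer note).
BLOCKING = {Severity.FAIL, Severity.ERROR}


def should_block(outcomes: list[Severity]) -> bool:
    """Return True when at least one outcome has a blocking severity."""
    return any(outcome in BLOCKING for outcome in outcomes)


# Checks created at WARN never trip the gate, however many of them fire.
print(should_block([Severity.WARN, Severity.WARN]))  # False
print(should_block([Severity.WARN, Severity.FAIL]))  # True
```

Under this model, making validation gate ingestion would mean either creating the checks at FAIL severity or adding WARN to the blocking set.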

Linked issues

Part of the Oracle profiler rollout (#2187). No issue closed by this PR.

Functionality

  • modified existing behavior: execute-database-profiler ingestion (_ingest_table type fidelity)
  • added per-source asset: oracle_extract_schema.yml

Tests

  • manually tested (end-to-end against a real extract on a real runtime)
  • added unit tests
  • added integration tests

m-abulazm and others added 2 commits June 1, 2026 10:53
Surfaces the profiler's DuckDB extract path as a CLI choice instead of a
hardcoded value in per-source pipeline_config.yml files with the default
`~/.databricks/labs/lakebridge_profilers/<source>_assessment`. Mirrors the shape of the transpile command's `--output-folder` flag.

Closes databrickslabs#2461.
…pe fidelity

Add the missing oracle_extract_schema.yml required by the profiler ingestion validation step (without it the job
  raised FileNotFoundError for Oracle before reaching ingestion), and fix _ingest_table to derive an explicit Spark
  schema from the DuckDB column types instead of relying on pandas inference.

DuckDB's timezone-naive TIMESTAMP was inferred as Spark TIMESTAMP (LTZ), which shifts wall-clock values by the
  session offset. It now maps to TIMESTAMP_NTZ, preserving them exactly. Unmapped types (e.g. nested STRUCT from other
  sources) fall back to inference, so other sources are unaffected.

Tests: unit tests for the DuckDB->Spark type mapping (incl. TIMESTAMP->NTZ and STRUCT fallback), an integration
  test validating a DDL-built mock Oracle extract against the shipped schema YAML, and a manual end-to-end fidelity
  check (not run in CI).

CLAassistant commented Jun 5, 2026


CLA assistant check
All committers have signed the CLA.

