Feature/oracle profiler validation #2495
Draft
nnigam19 wants to merge 2 commits into
Surfaces the profiler's DuckDB extract path as a CLI choice instead of a hardcoded value in the per-source pipeline_config.yml files. The default is `~/.databricks/labs/lakebridge_profilers/<source>_assessment`. Mirrors the shape of the transpile command's `--output-folder` flag. Closes databrickslabs#2461.
…pe fidelity Add the missing oracle_extract_schema.yml required by the profiler ingestion validation step (without it the job raised FileNotFoundError for Oracle before reaching ingestion), and fix _ingest_table to derive an explicit Spark schema from the DuckDB column types instead of relying on pandas inference. DuckDB's timezone-naive TIMESTAMP was inferred as Spark TIMESTAMP (LTZ), which shifts wall-clock values by the session offset. It now maps to TIMESTAMP_NTZ, preserving them exactly. Unmapped types (e.g. nested STRUCT from other sources) fall back to inference, so other sources are unaffected. Tests: unit tests for the DuckDB->Spark type mapping (incl. TIMESTAMP->NTZ and STRUCT fallback), an integration test validating a DDL-built mock Oracle extract against the shipped schema YAML, and a manual end-to-end fidelity check (not run in CI).
What does this PR do?
Validates the Oracle profiler's DuckDB to Delta ingestion ("Profiler Ingestion Job") and fixes a
timestamp type-fidelity bug found during that validation. It also adds the per-source extract-schema
definition that the ingestion's validation step requires for Oracle.
1. Add `resources/assessments/validation/oracle_extract_schema.yml` (11 tables).
`_validate_profiler_extract` loads `validation/<source_tech>_extract_schema.yml` to know which tables/columns to validate. This file was missing for Oracle, so the ingestion job raised `FileNotFoundError` for Oracle before it ever reached ingestion. The definition mirrors the Oracle extraction DDLs exactly (verified against both the shipped DDLs and a real extract).
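To illustrate the failure mode described in item 1, here is a minimal sketch of a per-source schema lookup. Only the path convention `validation/<source_tech>_extract_schema.yml` comes from this PR; the helper names (`extract_schema_path`, `read_extract_schema`, `probe`) and the throwaway directory are hypothetical and are not the code in `_validate_profiler_extract`.

```python
import tempfile
from pathlib import Path


def extract_schema_path(resources_dir: Path, source_tech: str) -> Path:
    # Path convention quoted in the PR: validation/<source_tech>_extract_schema.yml
    return resources_dir / "validation" / f"{source_tech}_extract_schema.yml"


def read_extract_schema(resources_dir: Path, source_tech: str) -> str:
    # Path.read_text raises FileNotFoundError when the file does not exist.
    return extract_schema_path(resources_dir, source_tech).read_text(encoding="utf-8")


def probe(source_tech: str, resources_dir: Path) -> str:
    # Hypothetical helper: report whether a definition exists for the source.
    try:
        read_extract_schema(resources_dir, source_tech)
        return "found"
    except FileNotFoundError:
        return "missing"


with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp)
    (resources / "validation").mkdir()
    # Only one source has a definition in this throwaway directory.
    extract_schema_path(resources, "synapse").write_text("# placeholder\n", encoding="utf-8")
    results = {tech: probe(tech, resources) for tech in ("synapse", "oracle")}

print(results)
```

With no Oracle file present, the lookup fails before any table is touched, which matches the behavior this PR reports for the ingestion job.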
2. Fix data-type fidelity in `_ingest_table` (`assessments/dashboards/execute.py`).
Ingestion previously relied on pandas-based schema inference (`spark.createDataFrame(pdf)`). DuckDB's timezone-naive `TIMESTAMP` was inferred as Spark `TIMESTAMP` (LTZ / instant), which reinterprets the value in the session time zone and shifts wall-clock values by the session offset.
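The shift can be modeled in plain Python without Spark. This is only an illustration under stated assumptions: the sample value, the UTC-08:00 writer session, and the UTC reader session are all made up, and the direction of the shift in a real job depends on where the conversion happens.

```python
from datetime import datetime, timedelta, timezone

# Wall-clock value as it would sit in a timezone-naive TIMESTAMP column
# (sample value, chosen for illustration).
wall_clock = datetime(2024, 1, 15, 9, 30, 0)

# Assumption: the writing session runs at UTC-08:00, a later reader at UTC.
writer_session_tz = timezone(timedelta(hours=-8))
reader_session_tz = timezone.utc

# LTZ-style handling: the naive value is pinned to an instant using the
# writer's session zone, then rendered in the reader's session zone.
as_instant = wall_clock.replace(tzinfo=writer_session_tz)
rendered_ltz = as_instant.astimezone(reader_session_tz).replace(tzinfo=None)

# NTZ-style handling: no zone is ever attached, so the value round-trips.
rendered_ntz = wall_clock

shift = rendered_ltz - wall_clock
print(rendered_ltz, rendered_ntz, shift)
```

In this model the LTZ path moves the value by the full session offset, while the NTZ path returns the original wall-clock value unchanged.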
The fix derives an explicit Spark schema from the DuckDB column types (`build_spark_schema` + `_DUCKDB_TO_SPARK_TYPE`) and passes it to `createDataFrame`, mapping:

| DuckDB type | Spark type |
| --- | --- |
| `VARCHAR` | `StringType` |
| `INTEGER` | `IntegerType` |
| `BIGINT` | `LongType` |
| `DOUBLE` | `DoubleType` |
| `TIMESTAMP` | `TIMESTAMP_NTZ` (the fix - preserves wall-clock) |
| `TIMESTAMP WITH TIME ZONE` | `TimestampType` (LTZ) |

Columns whose DuckDB type isn't mapped (e.g. nested `STRUCT`/`LIST` from other profilers) fall back to schema inference, so Synapse/other sources are unaffected.
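A rough sketch of the mapping idea, kept free of a PySpark dependency by emitting a Spark DDL schema string instead of a `StructType`. The names `DUCKDB_TO_SPARK_DDL` and `sketch_spark_ddl` and the sample column names are hypothetical, and returning `None` for the whole table when any column is unmapped is an assumption about how the fallback is scoped; the PR's actual `build_spark_schema` may differ.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical table: DuckDB type name -> Spark SQL DDL type name.
DUCKDB_TO_SPARK_DDL = {
    "VARCHAR": "STRING",
    "INTEGER": "INT",
    "BIGINT": "BIGINT",
    "DOUBLE": "DOUBLE",
    "TIMESTAMP": "TIMESTAMP_NTZ",          # timezone-naive: keep wall-clock
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMP",  # instant (LTZ)
}


def sketch_spark_ddl(columns: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Build a DDL schema string from (column_name, duckdb_type) pairs.

    Returns None if any type is unmapped, signalling the caller to fall
    back to schema inference (assumed table-level fallback).
    Assumes column names are simple identifiers that need no quoting.
    """
    fields = []
    for name, duckdb_type in columns:
        spark_type = DUCKDB_TO_SPARK_DDL.get(duckdb_type.strip().upper())
        if spark_type is None:
            return None
        fields.append(f"{name} {spark_type}")
    return ", ".join(fields)


# Sample column names are invented for the example.
mapped = sketch_spark_ddl([("SNAP_ID", "BIGINT"), ("BEGIN_TIME", "TIMESTAMP")])
unmapped = sketch_spark_ddl([("ID", "INTEGER"), ("DETAILS", "STRUCT(a INTEGER)")])
print(mapped)
print(unmapped)
```

A caller could pass the resulting string as the schema argument when it is not `None`, and call `createDataFrame` without a schema otherwise.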
3. Tests.
- `tests/unit/assessment/test_ingest_schema.py` - unit tests for the type mapping, incl. `TIMESTAMP` to `TIMESTAMP_NTZ`, `TIMESTAMP WITH TIME ZONE` to LTZ, and the unmapped-type (`STRUCT`) fallback.
- `tests/integration/assessments/test_oracle_validation.py` + `build_mock_oracle_extract` - builds a mock Oracle extract from the shipped DDLs and validates it against the shipped `oracle_extract_schema.yml`, guarding DDL ↔ schema-definition consistency (no live Oracle required).
- `tests/integration/assessments/manual_oracle_e2e_check.py` - a manual end-to-end fidelity check (not run in CI; the `manual_` prefix keeps pytest from collecting it).
What was verified
- `STRUCT` tables fall back to inference, existing validator / deployment / pipeline tests pass.
- `black` / `ruff` / `mypy` / `pylint` clean.
Out of scope
- `NUMBER` to `DOUBLE` narrowing, not changed (it lives in the extraction DDLs from feature/oracle_profiler #2187).
Note for reviewers
This branch is based on the `--output-folder` work (#2488), so until that merges into `main` the diff also includes those changes. I'll rebase onto `main` once #2488 lands so the diff shows only this change. Opening as draft for now.
Also flagging (pre-existing, not changed here): the checks in `_validate_profiler_extract` are created at `WARN` severity and the gate only counts `FAIL` + `ERROR`, so today validation reports issues but does not block ingestion. Happy to follow up separately if it should gate.
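The gating behavior flagged above can be sketched as follows. This is a simplified model, not the code in `_validate_profiler_extract`: `BLOCKING_LEVELS` and `should_block` are hypothetical names, and representing each check result as a bare severity string is an assumption.

```python
from collections import Counter
from typing import Iterable

# Levels the gate counts, per the note above; WARN is deliberately absent.
BLOCKING_LEVELS = ("FAIL", "ERROR")


def should_block(check_levels: Iterable[str]) -> bool:
    """Return True when at least one check is at a blocking level."""
    counts = Counter(level.upper() for level in check_levels)
    return sum(counts[level] for level in BLOCKING_LEVELS) > 0


# Every check reported at WARN: issues are surfaced but nothing blocks.
warn_only = should_block(["WARN", "WARN", "WARN"])
# A single FAIL would be enough to stop ingestion in this model.
with_fail = should_block(["WARN", "FAIL"])
print(warn_only, with_fail)
```

Under this model, making validation gate ingestion would mean either raising the severity of the checks or adding `WARN` to the counted levels.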
Linked issues
Part of the Oracle profiler rollout (#2187). No issue closed by this PR.
Functionality
- `execute-database-profiler` ingestion (`_ingest_table` type fidelity)
- `oracle_extract_schema.yml`
Tests