Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2403
moomindani wants to merge 3 commits into
Conversation
The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing the literal star, e.g. Aggregate(agg_columns=["*"], type="count"), produces

    SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is currently no way to ask the reconcile aggregate engine to compute COUNT(*) (a true row count), even though the rules table already accepts and stores agg_column = "*".

Special-case the literal star with type == "count" inside _get_mapping_cols_with_alias and emit a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces

    SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match.

The change is bounded to count + literal star, so non-count aggregates and ordinary column names take the existing path unchanged.

Co-authored-by: Isaac
Cover the AggregateQueryBuilder behavior introduced in this branch:

- Aggregate(agg_columns=["*"], type="count") emits SELECT count(*)
- The same holds when agg_columns has already been ansi-normalized to ["`*`"] by NormalizeReconConfigService
- COUNT(*) coexists with COUNT(<col>) inside a single aggregate query
- Star outside of count (e.g. type="sum") keeps the existing path, so the special case is bounded to the COUNT(*) use case

Without the fix in this branch the first three cases produce SELECT count(`*`) ... and fail at SQL analysis with UNRESOLVED_COLUMN.

Verified locally by reverting src/.../aggregate_query.py to origin/main: tests fail; with the patch applied: tests pass.

Co-authored-by: Isaac
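The special case described in the two commit messages above can be pictured with a small standalone sketch. This is not the Lakebridge code: the real builder emits a sqlglot Star expression and uses DialectUtils.unnormalize_identifier, whereas the helpers below (unquote_identifier, is_count_star, render_aggregate, render_query) are hypothetical names that work on plain strings, and the alias format is assumed from the source_count_* alias quoted above.

```python
def unquote_identifier(name: str) -> str:
    """Strip one pair of surrounding backticks, if present (hypothetical helper)."""
    if len(name) >= 2 and name.startswith("`") and name.endswith("`"):
        return name[1:-1]
    return name


def is_count_star(agg_type: str, column: str) -> bool:
    """True only for count + literal star, in raw ("*") or wrapped ("`*`") form."""
    return agg_type.lower() == "count" and unquote_identifier(column) == "*"


def render_aggregate(agg_type: str, column: str, prefix: str = "source") -> str:
    """Render one select-list entry; the alias shape is an assumption."""
    bare = unquote_identifier(column)
    alias = f"`{prefix}_{agg_type}_{bare}`"
    # Only COUNT(*) gets an unquoted star; everything else stays backtick-quoted.
    operand = "*" if is_count_star(agg_type, column) else f"`{bare}`"
    return f"{agg_type}({operand}) AS {alias}"


def render_query(agg_type: str, columns: list[str]) -> str:
    """Join the select-list entries into a query against the :tbl placeholder."""
    select_list = ", ".join(render_aggregate(agg_type, c) for c in columns)
    return f"SELECT {select_list} FROM :tbl"


print(render_query("count", ["*"]))
print(render_query("count", ["`*`"]))
print(render_query("count", ["*", "s_acctbal"]))
print(render_query("sum", ["*"]))
```

The four calls mirror the four cases listed in the test commit: the raw star, the already-normalized star, the star next to a named column, and the star with a non-count aggregate, which deliberately stays on the quoted path.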
m-abulazm
left a comment
In general, I agree with the idea. Record count is something we need to run all the time. There is src/databricks/labs/lakebridge/reconcile/query_builder/count_query.py; I would use that instead of coupling this with aggregate.
Thanks for the review. I looked at The reason I put
The user-facing surface I wanted was simply "let users put Let me know if you'd still prefer a different shape (e.g. surfacing this through a new dedicated config field instead of |
…count-star-support
Changes
What does this PR do?
Adds support for Aggregate(agg_columns=["*"], type="count") so the aggregate reconcile engine can emit COUNT(*) (a true row count). Today this combination produces invalid SQL.

Relevant implementation details
The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing Aggregate(agg_columns=["*"], type="count") produces

    SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is no way today to ask the engine to compute COUNT(*), even though the aggregate_rules table already accepts and stores agg_column = "*".

This blocks the most common use case of aggregate reconcile: row-count parity between source and target. Workarounds (count(<non-null column>)) only work when such a column exists, which is not always the case (tables without primary keys or NOT NULL constraints, common after staging-table migrations).

The fix special-cases the literal star with type == "count" inside _get_mapping_cols_with_alias and emits a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces

    SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match.

Caveats/things to watch out for when reviewing:
- The change is bounded to type == "count" AND literal star. Non-count aggregates and ordinary column names take the existing path unchanged.
- The alias source_count_* (with * in the identifier) is preserved by the existing alias-rewriting code; it is backtick-quoted in the final SQL and accepted by Spark / Databricks SQL.

Linked issues
None
Functionality
- Aggregate(agg_columns=["*"], type="count")

Tests
Manual end-to-end on Databricks Serverless: source 100 rows, target 1000 rows, Aggregate(agg_columns=["*"], type="count"). Emitted SQL is SELECT count(*) AS `source_count_*` FROM :tbl. The recon run yields aggregate=False, mismatch=1 in recon.aggregate_metrics.

Unit tests added in tests/unit/reconcile/query_builder/test_aggregate_query.py:

- test_count_star_emits_unquoted_star: ["*"] produces count(*) (not count(`*`))
- test_count_star_normalized_input_emits_unquoted_star: same after NormalizeReconConfigService rewrite to ["`*`"]
- test_count_star_alongside_named_column: ["*", "s_acctbal"] produces count(*) and count(`s_acctbal`) in the same query
- test_star_with_non_count_aggregate_is_unchanged: type="sum" keeps the existing path
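
For readers who want to reproduce the shape of the manual check without a Databricks workspace, here is a rough local approximation. It assumes SQLite (via Python's sqlite3 module) as a stand-in engine; the table names src and tgt and the helper row_count are hypothetical, and the alias is double-quoted because that is SQLite's standard identifier quoting. It also shows why the count(<non-null column>) workaround breaks down when every column is nullable.

```python
import sqlite3


def row_count(conn: sqlite3.Connection, table: str) -> int:
    """Run a COUNT(*) query shaped like the one the PR emits (hypothetical helper)."""
    # The table name is interpolated only because it is hard-coded in this sketch.
    cur = conn.execute(f'SELECT count(*) AS "source_count_*" FROM {table}')
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (v INTEGER)")
conn.execute("CREATE TABLE tgt (v INTEGER)")

# Same sizes as the manual run: 100 source rows, 1000 target rows.
# Every value is NULL, so no column can serve as a non-null counting column.
conn.executemany("INSERT INTO src VALUES (?)", [(None,)] * 100)
conn.executemany("INSERT INTO tgt VALUES (?)", [(None,)] * 1000)

source_count = row_count(conn, "src")
target_count = row_count(conn, "tgt")
counts_match = source_count == target_count

# COUNT(column) skips NULLs, so it cannot act as a row count here.
non_null_count = conn.execute("SELECT count(v) FROM src").fetchone()[0]

print(source_count, target_count, counts_match, non_null_count)
```

The comparison at the end is only a simplified picture of the parity check; how the reconcile engine records the result in recon.aggregate_metrics is not modeled here.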