Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2424
Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2424moomindani wants to merge 6 commits into
Conversation
The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing the literal star, e.g. Aggregate(agg_columns=["*"], type="count"), produces SELECT count(`*`) AS `source_count_*` FROM :tbl which fails at analysis with UNRESOLVED_COLUMN. There is currently no way to ask the reconcile aggregate engine to compute COUNT(*) (a true row count), even though the rules table already accepts and stores agg_column = "*". Special-case the literal star with type == "count" inside _get_mapping_cols_with_alias and emit a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces SELECT count(*) AS `source_count_*` FROM :tbl NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match. The change is bounded to count + literal star, so non-count aggregates and ordinary column names take the existing path unchanged. Co-authored-by: Isaac
Cover the AggregateQueryBuilder behavior introduced in this branch: - Aggregate(agg_columns=["*"], type="count") emits SELECT count(*) - The same holds when agg_columns has already been ansi-normalized to ["`*`"] by NormalizeReconConfigService - COUNT(*) coexists with COUNT(<col>) inside a single aggregate query - Star outside of count (e.g. type="sum") keeps the existing path so the special-case is bounded to the COUNT(*) use case Without the fix in this branch the first three cases produce SELECT count(`*`) ... and fail at SQL analysis with UNRESOLVED_COLUMN. Verified locally by reverting src/.../aggregate_query.py to origin/main: tests fail; with the patch applied: tests pass. Co-authored-by: Isaac
…count-star-support
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2424 +/- ##
==========================================
+ Coverage 64.78% 65.39% +0.60%
==========================================
Files 104 104
Lines 9463 9472 +9
Branches 997 1000 +3
==========================================
+ Hits 6131 6194 +63
+ Misses 3156 3098 -58
- Partials 176 180 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
✅ 154/154 passed, 3 skipped, 41m28s total Running from acceptance #4460 |
| if DialectUtils.unnormalize_identifier(col) == "*" and agg_type.lower() == "count": | ||
| cols_with_mapping.append( | ||
| exp.Alias( | ||
| this=exp.Star(), | ||
| alias=exp.Identifier(this=f"{agg_type.lower()}<#>*", quoted=False), | ||
| ) | ||
| ) | ||
| continue |
There was a problem hiding this comment.
I reviewed this in the original PR
m-abulazm
left a comment
There was a problem hiding this comment.
lets move this forward, the right fix would be to fix NormalizeReconConfigService_normalize_agg to handle star correctly instead of on this layer
Per review feedback (#2424 review-4288385460), shift the source of truth for `*` handling from the aggregate-query layer into NormalizeReconConfigService. - `_normalize_agg` now recognises `*` (raw or ansi-normalised `\`*\``) as a non-identifier, leaves it as the raw `"*"` downstream (no backtick wrap), and raises ValueError early when `*` is used with a non-count type. - `aggregate_query._get_mapping_cols_with_alias` keeps a one-line `col == "*"` check to emit the sqlglot `Star` expression (the AST node cannot live in `Aggregate.agg_columns: list[str]`, so the emission has to stay in the query layer — but the detection logic is no longer normalize-aware). - Replaced the test that asserted invalid `sum(\`*\`)` SQL was emitted with one that asserts the early ValueError, matching the new contract. Co-authored-by: Isaac
|
Moved per your suggestion. All four unit tests pass; replaced the old "non-count types keep invalid |
…count-star-support # Conflicts: # uv.lock
pylint enforces a 2-char minimum for argument names; `c` was caught in the new helper added in the previous commit. Co-authored-by: Isaac
Changes
What does this PR do?
Adds support for
Aggregate(agg_columns=["*"], type="count")so the aggregate reconcile engine can emitCOUNT(*)(a true row count). Today this combination produces invalid SQL.Relevant implementation details
The aggregate query builder treats every entry of
agg_columnsas a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. PassingAggregate(agg_columns=["*"], type="count")produceswhich fails at analysis with
UNRESOLVED_COLUMN. There is no way today to ask the engine to computeCOUNT(*), even though theaggregate_rulestable already accepts and storesagg_column = "*".This blocks the most common use case of aggregate reconcile — row-count parity between source and target. Workarounds (
count(<non-null column>)) only work when such a column exists, which is not always the case (tables without primary keys or NOT NULL constraints, common after staging-table migrations).The fix special-cases the literal star with
type == "count"inside_get_mapping_cols_with_aliasand emits a sqlglotStarexpression instead of pushing"*"through the column-name normalizer. The downstream formatter in_agg_query_cols_with_aliasthen producesNormalizeReconConfigServicerewrites entries inagg_columnsto their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw"*"or the wrapped"`*`". The check usesDialectUtils.unnormalize_identifierso both forms match.Caveats/things to watch out for when reviewing:
type == "count"AND literal star. Non-count aggregates and ordinary column names take the existing path unchangedsource_count_*(with*in the identifier) is preserved by the existing alias-rewriting code; it is backtick-quoted in the final SQL and accepted by Spark / Databricks SQLLinked issues
None
Functionality
Aggregate(agg_columns=["*"], type="count"))Tests
Manual end-to-end on Databricks Serverless: source 100 rows, target 1000 rows,
Aggregate(agg_columns=["*"], type="count"). Emitted SQL isSELECT count(*) AS `source_count_*` FROM :tbl. The recon run yieldsaggregate=False,mismatch=1inrecon.aggregate_metrics.Unit tests added in
tests/unit/reconcile/query_builder/test_aggregate_query.py:test_count_star_emits_unquoted_star:["*"]producescount(*)(notcount(`*`))test_count_star_normalized_input_emits_unquoted_star: same afterNormalizeReconConfigServicerewrite to["`*`"]test_count_star_alongside_named_column:["*", "s_acctbal"]producescount(*)andcount(`s_acctbal`)in the same querytest_star_with_non_count_aggregate_is_unchanged:type="sum"keeps the existing pathReopened from #2403 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments and history are preserved on the original PR.