Skip to content

Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2424

Open
moomindani wants to merge 6 commits into
mainfrom
feat/recon-aggregate-count-star-support
Open

Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2424
moomindani wants to merge 6 commits into
mainfrom
feat/recon-aggregate-count-star-support

Conversation

@moomindani

Copy link
Copy Markdown
Contributor

Changes

What does this PR do?

Adds support for Aggregate(agg_columns=["*"], type="count") so the aggregate reconcile engine can emit COUNT(*) (a true row count). Today this combination produces invalid SQL.

Relevant implementation details

The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing Aggregate(agg_columns=["*"], type="count") produces

SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is no way today to ask the engine to compute COUNT(*), even though the aggregate_rules table already accepts and stores agg_column = "*".

This blocks the most common use case of aggregate reconcile — row-count parity between source and target. Workarounds (count(<non-null column>)) only work when such a column exists, which is not always the case (tables without primary keys or NOT NULL constraints, common after staging-table migrations).

The fix special-cases the literal star with type == "count" inside _get_mapping_cols_with_alias and emits a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces

SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match.

Caveats/things to watch out for when reviewing:

  • The fast-path is bounded to type == "count" AND literal star. Non-count aggregates and ordinary column names take the existing path unchanged
  • The alias name source_count_* (with * in the identifier) is preserved by the existing alias-rewriting code; it is backtick-quoted in the final SQL and accepted by Spark / Databricks SQL

Linked issues

None

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: aggregate reconcile (Aggregate(agg_columns=["*"], type="count"))

Tests

  • manually tested
  • added unit tests
  • added integration tests

Manual end-to-end on Databricks Serverless: source 100 rows, target 1000 rows, Aggregate(agg_columns=["*"], type="count"). Emitted SQL is SELECT count(*) AS `source_count_*` FROM :tbl. The recon run yields aggregate=False, mismatch=1 in recon.aggregate_metrics.

Unit tests added in tests/unit/reconcile/query_builder/test_aggregate_query.py:

  • test_count_star_emits_unquoted_star: ["*"] produces count(*) (not count(`*`))
  • test_count_star_normalized_input_emits_unquoted_star: same after NormalizeReconConfigService rewrite to ["`*`"]
  • test_count_star_alongside_named_column: ["*", "s_acctbal"] produces count(*) and count(`s_acctbal`) in the same query
  • test_star_with_non_count_aggregate_is_unchanged: type="sum" keeps the existing path

Reopened from #2403 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments and history are preserved on the original PR.

moomindani added 3 commits May 1, 2026 13:22
The aggregate query builder treats every entry of agg_columns as a
column identifier and pushes it through the dialect normalizer, which
wraps non-identifier characters in backticks. Passing the literal star,
e.g. Aggregate(agg_columns=["*"], type="count"), produces

  SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is currently no
way to ask the reconcile aggregate engine to compute COUNT(*) (a true
row count), even though the rules table already accepts and stores
agg_column = "*".

Special-case the literal star with type == "count" inside
_get_mapping_cols_with_alias and emit a sqlglot Star expression instead
of pushing "*" through the column-name normalizer. The downstream
formatter in _agg_query_cols_with_alias then produces

  SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their
ansi-normalized form before this builder is invoked, so the incoming
value may be either the raw "*" or the wrapped "`*`". The check uses
DialectUtils.unnormalize_identifier so both forms match.

The change is bounded to count + literal star, so non-count aggregates
and ordinary column names take the existing path unchanged.

Co-authored-by: Isaac
Cover the AggregateQueryBuilder behavior introduced in this branch:
- Aggregate(agg_columns=["*"], type="count") emits SELECT count(*)
- The same holds when agg_columns has already been ansi-normalized to
  ["`*`"] by NormalizeReconConfigService
- COUNT(*) coexists with COUNT(<col>) inside a single aggregate query
- Star outside of count (e.g. type="sum") keeps the existing path so
  the special-case is bounded to the COUNT(*) use case

Without the fix in this branch the first three cases produce
SELECT count(`*`) ... and fail at SQL analysis with
UNRESOLVED_COLUMN. Verified locally by reverting
src/.../aggregate_query.py to origin/main: tests fail; with the patch
applied: tests pass.

Co-authored-by: Isaac
@codecov

codecov Bot commented May 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.39%. Comparing base (7a816c6) to head (59107db).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2424      +/-   ##
==========================================
+ Coverage   64.78%   65.39%   +0.60%     
==========================================
  Files         104      104              
  Lines        9463     9472       +9     
  Branches      997     1000       +3     
==========================================
+ Hits         6131     6194      +63     
+ Misses       3156     3098      -58     
- Partials      176      180       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown

✅ 154/154 passed, 3 skipped, 41m28s total

Running from acceptance #4460

Comment on lines +69 to +76
if DialectUtils.unnormalize_identifier(col) == "*" and agg_type.lower() == "count":
cols_with_mapping.append(
exp.Alias(
this=exp.Star(),
alias=exp.Identifier(this=f"{agg_type.lower()}<#>*", quoted=False),
)
)
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this in the original PR

@m-abulazm m-abulazm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets move this forward, the right fix would be to fix NormalizeReconConfigService_normalize_agg to handle star correctly instead of on this layer

@m-abulazm m-abulazm added the bug Something isn't working label May 14, 2026
Per review feedback (#2424 review-4288385460), shift the source of truth for
`*` handling from the aggregate-query layer into NormalizeReconConfigService.

- `_normalize_agg` now recognises `*` (raw or ansi-normalised `\`*\``) as a
  non-identifier, leaves it as the raw `"*"` downstream (no backtick wrap),
  and raises ValueError early when `*` is used with a non-count type.
- `aggregate_query._get_mapping_cols_with_alias` keeps a one-line `col == "*"`
  check to emit the sqlglot `Star` expression (the AST node cannot live in
  `Aggregate.agg_columns: list[str]`, so the emission has to stay in the
  query layer — but the detection logic is no longer normalize-aware).
- Replaced the test that asserted invalid `sum(\`*\`)` SQL was emitted with
  one that asserts the early ValueError, matching the new contract.

Co-authored-by: Isaac
@moomindani

Copy link
Copy Markdown
Contributor Author

Moved per your suggestion. _normalize_agg now owns * handling — it accepts both raw * and the ansi-normalised `*`, keeps them as the raw * downstream (skipping identifier normalization), and raises ValueError early when * is used with a non-count type. The Star expression emission stays in aggregate_query because Aggregate.agg_columns: list[str] can't carry a sqlglot AST node, but the detection there is now a single col == "*" check rather than the previous normalize-aware comparison.

All four unit tests pass; replaced the old "non-count types keep invalid sum(*) SQL" test with one asserting the early ValueError, matching the new contract.

pylint enforces a 2-char minimum for argument names; `c` was caught in
the new helper added in the previous commit.

Co-authored-by: Isaac
@m-abulazm m-abulazm added the feat/recon making sure that remorphed query produces the same results as original label May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working feat/recon making sure that remorphed query produces the same results as original

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants