Skip to content

Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2403

Closed
moomindani wants to merge 3 commits into
databrickslabs:mainfrom
moomindani:feat/recon-aggregate-count-star-support
Closed

Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile#2403
moomindani wants to merge 3 commits into
databrickslabs:mainfrom
moomindani:feat/recon-aggregate-count-star-support

Conversation

@moomindani

Copy link
Copy Markdown
Contributor

Changes

What does this PR do?

Adds support for Aggregate(agg_columns=["*"], type="count") so the aggregate reconcile engine can emit COUNT(*) (a true row count). Today this combination produces invalid SQL.

Relevant implementation details

The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing Aggregate(agg_columns=["*"], type="count") produces

SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is no way today to ask the engine to compute COUNT(*), even though the aggregate_rules table already accepts and stores agg_column = "*".

This blocks the most common use case of aggregate reconcile — row-count parity between source and target. Workarounds (count(<non-null column>)) only work when such a column exists, which is not always the case (tables without primary keys or NOT NULL constraints, common after staging-table migrations).

The fix special-cases the literal star with type == "count" inside _get_mapping_cols_with_alias and emits a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces

SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match.

Caveats/things to watch out for when reviewing:

  • The fast-path is bounded to type == "count" AND literal star. Non-count aggregates and ordinary column names take the existing path unchanged
  • The alias name source_count_* (with * in the identifier) is preserved by the existing alias-rewriting code; it is backtick-quoted in the final SQL and accepted by Spark / Databricks SQL

Linked issues

None

Functionality

  • added relevant user documentation
  • added new CLI command
  • modified existing command: aggregate reconcile (Aggregate(agg_columns=["*"], type="count"))

Tests

  • manually tested
  • added unit tests
  • added integration tests

Manual end-to-end on Databricks Serverless: source 100 rows, target 1000 rows, Aggregate(agg_columns=["*"], type="count"). Emitted SQL is SELECT count(*) AS `source_count_*` FROM :tbl. The recon run yields aggregate=False, mismatch=1 in recon.aggregate_metrics.

Unit tests added in tests/unit/reconcile/query_builder/test_aggregate_query.py:

  • test_count_star_emits_unquoted_star: ["*"] produces count(*) (not count(`*`))
  • test_count_star_normalized_input_emits_unquoted_star: same after NormalizeReconConfigService rewrite to ["`*`"]
  • test_count_star_alongside_named_column: ["*", "s_acctbal"] produces count(*) and count(`s_acctbal`) in the same query
  • test_star_with_non_count_aggregate_is_unchanged: type="sum" keeps the existing path

moomindani added 2 commits May 1, 2026 13:22
The aggregate query builder treats every entry of agg_columns as a
column identifier and pushes it through the dialect normalizer, which
wraps non-identifier characters in backticks. Passing the literal star,
e.g. Aggregate(agg_columns=["*"], type="count"), produces

  SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is currently no
way to ask the reconcile aggregate engine to compute COUNT(*) (a true
row count), even though the rules table already accepts and stores
agg_column = "*".

Special-case the literal star with type == "count" inside
_get_mapping_cols_with_alias and emit a sqlglot Star expression instead
of pushing "*" through the column-name normalizer. The downstream
formatter in _agg_query_cols_with_alias then produces

  SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their
ansi-normalized form before this builder is invoked, so the incoming
value may be either the raw "*" or the wrapped "`*`". The check uses
DialectUtils.unnormalize_identifier so both forms match.

The change is bounded to count + literal star, so non-count aggregates
and ordinary column names take the existing path unchanged.

Co-authored-by: Isaac
Cover the AggregateQueryBuilder behavior introduced in this branch:
- Aggregate(agg_columns=["*"], type="count") emits SELECT count(*)
- The same holds when agg_columns has already been ansi-normalized to
  ["`*`"] by NormalizeReconConfigService
- COUNT(*) coexists with COUNT(<col>) inside a single aggregate query
- Star outside of count (e.g. type="sum") keeps the existing path so
  the special-case is bounded to the COUNT(*) use case

Without the fix in this branch the first three cases produce
SELECT count(`*`) ... and fail at SQL analysis with
UNRESOLVED_COLUMN. Verified locally by reverting
src/.../aggregate_query.py to origin/main: tests fail; with the patch
applied: tests pass.

Co-authored-by: Isaac
@moomindani moomindani requested a review from a team as a code owner May 1, 2026 04:43

@m-abulazm m-abulazm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I agree with the idea. record count is something we need to run all the time. this is src/databricks/labs/lakebridge/reconcile/query_builder/count_query.py I would use that instead of coupling this with aggregate.

@moomindani

Copy link
Copy Markdown
Contributor Author

Thanks for the review. I looked at count_query.py — it's already invoked unconditionally from Reconciliation.get_record_count for every non-schema report (including aggregate), and the result lands in the recon main table as record_count. So the bare "are source/target row counts equal" case is already covered without this PR.

The reason I put COUNT(*) in the aggregate builder rather than rerouting to CountQueryBuilder is that the public API in TriggerReconService.trigger_recon (now properly dispatching to the aggregate service via #2401) lets callers express things CountQueryBuilder can't:

  • Aggregate(agg_columns=["*"], type="count", group_by_columns=["region"]) — per-group COUNT(*). CountQueryBuilder has no GROUP BY support.
  • COUNT(*) alongside SUM(amount), MAX(date), etc. in a single emitted queryTable.aggregates: list[Aggregate] supports this and the builder groups them by group_by_columns_as_str. CountQueryBuilder is single-purpose.
  • Result destination — aggregate rules write to recon.aggregate_metrics with a rule_id per metric (used by store_aggregates_metrics). CountQueryBuilder's output goes to the recon main table as record_count. Routing ["*"] to CountQueryBuilder would mean the same Aggregate(...) config produces results in different tables depending on the column — inconsistent for users.

The user-facing surface I wanted was simply "let users put * in agg_columns for type=count", which keeps the Aggregate(...) schema uniform. Doing that without the special-case in _get_mapping_cols_with_alias is what fails today (COUNT(\*`)UNRESOLVED_COLUMN`), so the change is scoped to just that one branch.

Let me know if you'd still prefer a different shape (e.g. surfacing this through a new dedicated config field instead of agg_columns=["*"]) — happy to iterate.

@moomindani

Copy link
Copy Markdown
Contributor Author

Reopened as #2424 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments here are preserved as history; further discussion will happen on #2424.

@moomindani moomindani closed this May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants