Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile by moomindani · Pull Request #2403 · databrickslabs/lakebridge

moomindani · 2026-05-01T04:43:09Z

Changes

What does this PR do?

Adds support for Aggregate(agg_columns=["*"], type="count") so the aggregate reconcile engine can emit COUNT(*) (a true row count). Today this combination produces invalid SQL.

Relevant implementation details

The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing Aggregate(agg_columns=["*"], type="count") produces

SELECT count(`*`) AS `source_count_*` FROM :tbl

which fails at analysis with UNRESOLVED_COLUMN. There is no way today to ask the engine to compute COUNT(*), even though the aggregate_rules table already accepts and stores agg_column = "*".

This blocks the most common use case of aggregate reconcile — row-count parity between source and target. Workarounds (count(<non-null column>)) only work when such a column exists, which is not always the case (tables without primary keys or NOT NULL constraints, common after staging-table migrations).

The fix special-cases the literal star with type == "count" inside _get_mapping_cols_with_alias and emits a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces

SELECT count(*) AS `source_count_*` FROM :tbl

NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match.

Caveats/things to watch out for when reviewing:

The fast-path is bounded to type == "count" AND literal star. Non-count aggregates and ordinary column names take the existing path unchanged
The alias name source_count_* (with * in the identifier) is preserved by the existing alias-rewriting code; it is backtick-quoted in the final SQL and accepted by Spark / Databricks SQL

Linked issues

None

Functionality

added relevant user documentation
added new CLI command
modified existing command: aggregate reconcile (Aggregate(agg_columns=["*"], type="count"))

Tests

manually tested
added unit tests
added integration tests

Manual end-to-end on Databricks Serverless: source 100 rows, target 1000 rows, Aggregate(agg_columns=["*"], type="count"). Emitted SQL is SELECT count(*) AS `source_count_*` FROM :tbl. The recon run yields aggregate=False, mismatch=1 in recon.aggregate_metrics.

Unit tests added in tests/unit/reconcile/query_builder/test_aggregate_query.py:

test_count_star_emits_unquoted_star: ["*"] produces count(*) (not count(`*`))
test_count_star_normalized_input_emits_unquoted_star: same after NormalizeReconConfigService rewrite to ["`*`"]
test_count_star_alongside_named_column: ["*", "s_acctbal"] produces count(*) and count(`s_acctbal`) in the same query
test_star_with_non_count_aggregate_is_unchanged: type="sum" keeps the existing path

The aggregate query builder treats every entry of agg_columns as a column identifier and pushes it through the dialect normalizer, which wraps non-identifier characters in backticks. Passing the literal star, e.g. Aggregate(agg_columns=["*"], type="count"), produces SELECT count(`*`) AS `source_count_*` FROM :tbl which fails at analysis with UNRESOLVED_COLUMN. There is currently no way to ask the reconcile aggregate engine to compute COUNT(*) (a true row count), even though the rules table already accepts and stores agg_column = "*". Special-case the literal star with type == "count" inside _get_mapping_cols_with_alias and emit a sqlglot Star expression instead of pushing "*" through the column-name normalizer. The downstream formatter in _agg_query_cols_with_alias then produces SELECT count(*) AS `source_count_*` FROM :tbl NormalizeReconConfigService rewrites entries in agg_columns to their ansi-normalized form before this builder is invoked, so the incoming value may be either the raw "*" or the wrapped "`*`". The check uses DialectUtils.unnormalize_identifier so both forms match. The change is bounded to count + literal star, so non-count aggregates and ordinary column names take the existing path unchanged. Co-authored-by: Isaac

Cover the AggregateQueryBuilder behavior introduced in this branch: - Aggregate(agg_columns=["*"], type="count") emits SELECT count(*) - The same holds when agg_columns has already been ansi-normalized to ["`*`"] by NormalizeReconConfigService - COUNT(*) coexists with COUNT(<col>) inside a single aggregate query - Star outside of count (e.g. type="sum") keeps the existing path so the special-case is bounded to the COUNT(*) use case Without the fix in this branch the first three cases produce SELECT count(`*`) ... and fail at SQL analysis with UNRESOLVED_COLUMN. Verified locally by reverting src/.../aggregate_query.py to origin/main: tests fail; with the patch applied: tests pass. Co-authored-by: Isaac

m-abulazm

In general, I agree with the idea. record count is something we need to run all the time. this is src/databricks/labs/lakebridge/reconcile/query_builder/count_query.py I would use that instead of coupling this with aggregate.

moomindani · 2026-05-07T00:40:33Z

Thanks for the review. I looked at count_query.py — it's already invoked unconditionally from Reconciliation.get_record_count for every non-schema report (including aggregate), and the result lands in the recon main table as record_count. So the bare "are source/target row counts equal" case is already covered without this PR.

The reason I put COUNT(*) in the aggregate builder rather than rerouting to CountQueryBuilder is that the public API in TriggerReconService.trigger_recon (now properly dispatching to the aggregate service via #2401) lets callers express things CountQueryBuilder can't:

Aggregate(agg_columns=["*"], type="count", group_by_columns=["region"]) — per-group COUNT(*). CountQueryBuilder has no GROUP BY support.
COUNT(*) alongside SUM(amount), MAX(date), etc. in a single emitted query — Table.aggregates: list[Aggregate] supports this and the builder groups them by group_by_columns_as_str. CountQueryBuilder is single-purpose.
Result destination — aggregate rules write to recon.aggregate_metrics with a rule_id per metric (used by store_aggregates_metrics). CountQueryBuilder's output goes to the recon main table as record_count. Routing ["*"] to CountQueryBuilder would mean the same Aggregate(...) config produces results in different tables depending on the column — inconsistent for users.

The user-facing surface I wanted was simply "let users put * in agg_columns for type=count", which keeps the Aggregate(...) schema uniform. Doing that without the special-case in _get_mapping_cols_with_alias is what fails today (COUNT(\*`)→UNRESOLVED_COLUMN`), so the change is scoped to just that one branch.

Let me know if you'd still prefer a different shape (e.g. surfacing this through a new dedicated config field instead of agg_columns=["*"]) — happy to iterate.

…count-star-support

moomindani · 2026-05-08T15:52:35Z

Reopened as #2424 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments here are preserved as history; further discussion will happen on #2424.

moomindani added 2 commits May 1, 2026 13:22

moomindani requested a review from a team as a code owner May 1, 2026 04:43

moomindani had a problem deploying to tool May 1, 2026 04:43 — with GitHub Actions Failure

m-abulazm requested changes May 1, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into feat/recon-aggregate-…

6d9b335

…count-star-support

moomindani had a problem deploying to tool May 7, 2026 06:18 — with GitHub Actions Failure

moomindani mentioned this pull request May 8, 2026

Feat: support agg_columns=["*"] for COUNT(*) in aggregate reconcile #2424

Open

6 tasks

moomindani temporarily deployed to tool May 8, 2026 15:52 — with GitHub Actions Inactive

moomindani closed this May 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: support agg_columns=[""] for COUNT() in aggregate reconcile#2403

Feat: support agg_columns=[""] for COUNT() in aggregate reconcile#2403
moomindani wants to merge 3 commits into
databrickslabs:mainfrom
moomindani:feat/recon-aggregate-count-star-support

moomindani commented May 1, 2026

Uh oh!

m-abulazm left a comment

Uh oh!

moomindani commented May 7, 2026

Uh oh!

moomindani commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

moomindani commented May 1, 2026

Changes

What does this PR do?

Relevant implementation details

Caveats/things to watch out for when reviewing:

Linked issues

Functionality

Tests

Uh oh!

m-abulazm left a comment

Choose a reason for hiding this comment

Uh oh!

moomindani commented May 7, 2026

Uh oh!

moomindani commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants