Add StatisticsContext parameter to partition_statistics by asolimando · Pull Request #21815 · apache/datafusion

asolimando · 2026-04-23T21:00:44Z

Which issue does this PR close?

Closes #20184

Rationale for this change

ExecutionPlan::partition_statistics forces each operator to re-fetch child statistics internally, causing exponential recomputation in deep plans and making it impossible to inject enriched statistics from external sources (e.g., expression-level analyzers, custom statistics providers).

What changes are included in this PR?

Breaking change: the ExecutionPlan::partition_statistics signature changes from (&self, partition: Option<usize>) to (&self, partition: Option<usize>, ctx: &StatisticsContext). Migration guide added to docs/source/library-user-guide/upgrading/54.0.0.md.

Add a StatisticsContext parameter to partition_statistics that carries pre-computed child statistics, and a compute_statistics() utility that walks the plan tree bottom-up, threading child statistics through the context automatically. StatisticsContext carries one Arc<Statistics> per child node and is designed to be extended with additional context (e.g., expression-level analyzers, custom statistics providers) without further signature changes.

Operator categories

Leaf nodes (EmptyExec, PlaceholderRowExec, WorkTableExec, DataSourceExec): ignore the context, return their own stats. DataSourceExec delegates to the DataSource trait which has a separate partition_statistics that was not changed.
Passthrough operators (BufferExec, CooperativeExec, PartialSortExec, OutputRequirementExec): return ctx.child_stats()[0] directly.
Transform operators (FilterExec, ProjectionExec, AggregateExec, WindowAggExec, CoalesceBatchesExec, GlobalLimitExec, LocalLimitExec): use ctx.child_stats()[0] as input, then apply their transformation (selectivity, column projection, grouping cardinality, fetch limit, etc.).
Partition-merging operators (CoalescePartitionsExec, SortPreservingMergeExec, SortExec with !preserve_partitioning, RepartitionExec): always need overall child stats regardless of which output partition is requested, since they merge/redistribute input partitions. These call compute_statistics(child, None) internally instead of using the context.
Symmetric joins (SortMergeJoinExec): both sides use the same partition, so ctx.child_stats() is correct for both None and Some(i) cases.
Asymmetric joins (HashJoinExec CollectLeft, CrossJoinExec, NestedLoopJoinExec): the left (broadcast) side always needs overall stats, so they call compute_statistics(left, None) for the Some case. The right side is partitioned and uses ctx.child_stats()[1] directly. HashJoinExec Partitioned mode is symmetric (both use context). HashJoinExec Auto mode needs overall stats from both sides.
Union/Interleave: UnionExec uses ctx.child_stats() for the None case (reduces with stats_union). For Some(partition), Union remaps partition indices across children and calls compute_statistics on the specific child with the remapped index. InterleaveExec uses ctx.child_stats() directly (symmetric across all inputs).

Callers

All direct plan.partition_statistics(None) calls in optimizer rules (JoinSelection, AggregateStatistics, EnforceDistribution), display code, StatisticsRegistry, and tests are replaced with compute_statistics(plan, None).

Tests

No new tests added. This is a no-op refactoring confirmed by all existing tests passing unchanged across all affected crates (datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource).

What remains for follow-up

Adding expression-level analyzers and custom statistics providers to StatisticsContext (eliminates the separate StatisticsRegistry tree walk and the ExpressionAnalyzer injection machinery from Add ExpressionAnalyzer for pluggable expression-level statistics estimation #21122)
Extending DataSource::partition_statistics with context if needed
Caching to avoid redundant compute_statistics(child, None) calls: partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently call compute_statistics(child, None) internally when the requested partition is Some, triggering a separate bottom-up walk. A cache on StatisticsContext keyed by (plan node, partition) would let these reuse already-computed results.

Test plan

cargo fmt --all
cargo clippy --all-targets --all-features -- -D warnings (affected crates)
cargo test --profile ci on datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource

Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

Introduce StatisticsContext that carries pre-computed child statistics and external context for statistics computation. Change the ExecutionPlan::partition_statistics signature to accept it, and add compute_statistics() utility for bottom-up computation with automatic child stats threading. Update all ~35 in-tree ExecutionPlan implementations and ~40 call sites. Passthrough operators return ctx.child_stats() directly, transform operators use it instead of re-fetching from children, and operators that always need overall child stats (RepartitionExec, CoalescePartitionsExec, SortPreservingMergeExec, SortExec non-preserving, HashJoinExec CollectLeft/Auto, CrossJoinExec, NestedLoopJoinExec) call compute_statistics with None internally.

asolimando · 2026-04-23T21:10:50Z

Hi @xudong963, I have opened the PR as a prerequisite for #21122, as discussed.

This is a breaking change and I therefore added a section under .../library-user-guide/upgrading/54.0.0.md‎, I have checked around what usually goes there, but I'd appreciate if you could take a deeper look and confirm if I captured what's expected for the update guide.

Looking forward to your feedback!

xudong963 · 2026-04-24T10:36:42Z

@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n

asolimando · 2026-04-28T10:49:09Z

@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n

Gentle reminder @xudong963 :)

xudong963

@asolimando thanks! I'm sorry that I'm busy with others this week.

This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a CoalescePartitionsExec, SortPreservingMergeExec, RepartitionExec, HashJoinExec (CollectLeft/Auto), CrossJoinExec, or NestedLoopJoinExec — which is most non-trivial plans — the operator restarts a fresh bottom-up walk from inside its own partition_statistics IIUC. So the recomputation isn't gone;

Caching sounds good, how about making caching part of StatisticsContext from day one, then we can have some benchmarks to show off the gains which will be easier for the community to accept the PR, wdyt?

xudong963 · 2026-04-30T03:37:02Z

+    ///
+    /// [`StatisticsContext`]: crate::statistics_context::StatisticsContext
+    /// [`compute_statistics`]: crate::statistics_context::compute_statistics
+    fn partition_statistics(


We should keep the old API, and add a new one https://datafusion.apache.org/contributor-guide/api-health.html#what-is-the-public-api-and-what-is-a-breaking-api-change

Noted and I will make sure to keep both APIs in the future! I will address this in the next iteration on the code and will resolve the discussion at that point.

xudong963 · 2026-04-30T03:38:20Z

+    let child_stats = plan
+        .children()
+        .iter()
+        .map(|child| compute_statistics(child.as_ref(), partition))


compute_statistics always recurses with the same partition. For partition-merging operators this is wasted work because they'll discard the context and recompute with None anyway

asolimando · 2026-04-30T10:18:28Z

@asolimando thanks! I'm sorry that I'm busy with others this week.

This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a CoalescePartitionsExec, SortPreservingMergeExec, RepartitionExec, HashJoinExec (CollectLeft/Auto), CrossJoinExec, or NestedLoopJoinExec — which is most non-trivial plans — the operator restarts a fresh bottom-up walk from inside its own partition_statistics IIUC. So the recomputation isn't gone;

Caching sounds good, how about making caching part of StatisticsContext from day one, then we can have some benchmarks to show off the gains which will be easier for the community to accept the PR, wdyt?

Thank you for your input @xudong963, no need to apologies, it's understandable!

You raise a fair point, we fully avoid the recomputation only for linear plans, but operators that call compute_statistics(child, None) internally don't benefit. This is noted in the "What remains for follow-up" section but I agree it might not be enough for the first iteration, and I anyway should have marked "partially closes #20184".

Re. the cache, I identified the need for the StatisticsRegistry already, and we discussed with @kosiew in the related PR (#21483, comment, branch asolimando/statistics-planner-with-statscache-v2). We agreed to defer it to limit scope, but this is the right place to discuss it.

One limitation I identified on the StatsCache (as I called it there), is around the cache key, which should "identify" an ExecutionPlan, which doesn't have any stable id other than its memory pointer ( so the cache key is effectively (Arc::as_ptr, partition)), but I am concerned of nodes being disposed (and re-used).

Cache lifecycle/scope:

single invocation of compute_statistics (as described in Let partition_statistics accept pre-computed children statistics #20184): if we agree on this, then the concern is not valid, as the plan tree is "stable" during the lifetime. When e.g. CoalescePartitionsExec calls compute_statistics(child, None) internally, the cache already has the subtree results, fully eliminating redundant walks.
multiple invocations of compute_statistics (same rule or cross-rules): here we necessarily need a stable node ID and we can't rely on the pointer, since nodes can be dropped/recreated

The scope of #20184 is, in my understanding, 1. (single walk), if you agree with that, I plan to use (Arc::as_ptr, partition) as cache key, and introducing node IDs and expanding the cache lifetime IMO be tackled as a followup (I can create issues for that, if the direction is confirmed), as with this solution we should already see computational benefits.

Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? Also, could I be added to the allowlist to trigger benchmark runs so I can iterate without requiring manual re-runs, in case I need multiple iterations?

WDYT?

xudong963 · 2026-04-30T13:13:05Z

Thanks for the thoughtful response @asolimando — the framing is exactly right, and the prior discussion with @kosiew in #21483 is helpful context.

On scope: agreed, let's land per-call caching in this PR (your Option 1) and treat cross-call caching with stable node IDs as a follow-up. Could you open an issue for Option 2 so we don't lose track?

On the cache key: (Arc::as_ptr, partition) is safe within a single synchronous compute_statistics walk — the Arcs are held by the plan tree and can't be dropped during the call, so pointer reuse isn't a concern. Good call.

On benchmarks: I'd avoid full TPC-DS Q99 — statistics computation is a small fraction of total query time and will get lost in noise. A targeted micro-bench is more informative:

Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain of hash joins + repartitions) and time compute_statistics(plan, None) before/after this PR.
Optionally reuse a reproducer from [EPIC] Improve query planning speed #19795 (planning-speed EPIC) since deep plans are exactly that issue's pain point.

That should cleanly demonstrate the gain.

asolimando · 2026-04-30T18:04:31Z

Thanks for the thoughtful response @asolimando — the framing is exactly right, and the prior discussion with @kosiew in #21483 is helpful context.

On scope: agreed, let's land per-call caching in this PR (your Option 1) and treat cross-call caching with stable node IDs as a follow-up. Could you open an issue for Option 2 so we don't lose track?

On the cache key: (Arc::as_ptr, partition) is safe within a single synchronous compute_statistics walk — the Arcs are held by the plan tree and can't be dropped during the call, so pointer reuse isn't a concern. Good call.

On benchmarks: I'd avoid full TPC-DS Q99 — statistics computation is a small fraction of total query time and will get lost in noise. A targeted micro-bench is more informative:

Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain of hash joins + repartitions) and time compute_statistics(plan, None) before/after this PR.

Optionally reuse a reproducer from [EPIC] Improve query planning speed #19795 (planning-speed EPIC) since deep plans are exactly that issue's pain point.

That should cleanly demonstrate the gain.

Thanks for the confirmation and the clarifications, I will hopefully get to it early next week and I will ping you back as soon as I will have some updates!

asolimando added 2 commits April 23, 2026 22:47

Add upgrade guide for partition_statistics signature change

de9b800

github-actions Bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Apr 23, 2026

asolimando mentioned this pull request Apr 23, 2026

Add ExpressionAnalyzer for pluggable expression-level statistics estimation #21122

Open

Merge branch 'main' into asolimando/partition-statistics-context

e135e8a

xudong963 reviewed Apr 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add StatisticsContext parameter to partition_statistics#21815

Add StatisticsContext parameter to partition_statistics#21815
asolimando wants to merge 3 commits intoapache:mainfrom
asolimando:asolimando/partition-statistics-context

asolimando commented Apr 23, 2026

Uh oh!

asolimando commented Apr 23, 2026

Uh oh!

xudong963 commented Apr 24, 2026

Uh oh!

asolimando commented Apr 28, 2026

Uh oh!

xudong963 left a comment

Uh oh!

xudong963 Apr 30, 2026

Uh oh!

asolimando Apr 30, 2026

Uh oh!

xudong963 Apr 30, 2026

Uh oh!

asolimando commented Apr 30, 2026

Uh oh!

xudong963 commented Apr 30, 2026

Uh oh!

asolimando commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

asolimando commented Apr 23, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Operator categories

Callers

Tests

What remains for follow-up

Test plan

Uh oh!

asolimando commented Apr 23, 2026

Uh oh!

xudong963 commented Apr 24, 2026

Uh oh!

asolimando commented Apr 28, 2026

Uh oh!

xudong963 left a comment

Choose a reason for hiding this comment

Uh oh!

xudong963 Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

xudong963 Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

asolimando commented Apr 30, 2026

Uh oh!

xudong963 commented Apr 30, 2026

Uh oh!

asolimando commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants