Add StatisticsContext parameter to partition_statistics #21815
asolimando wants to merge 3 commits into apache:main
Conversation
Introduce StatisticsContext that carries pre-computed child statistics and external context for statistics computation. Change the ExecutionPlan::partition_statistics signature to accept it, and add compute_statistics() utility for bottom-up computation with automatic child stats threading. Update all ~35 in-tree ExecutionPlan implementations and ~40 call sites. Passthrough operators return ctx.child_stats() directly, transform operators use it instead of re-fetching from children, and operators that always need overall child stats (RepartitionExec, CoalescePartitionsExec, SortPreservingMergeExec, SortExec non-preserving, HashJoinExec CollectLeft/Auto, CrossJoinExec, NestedLoopJoinExec) call compute_statistics with None internally.
Hi @xudong963, I have opened this PR as a prerequisite for #21122, as discussed. Since this is a breaking change, I added a section under .../library-user-guide/upgrading/54.0.0.md. I checked what usually goes there, but I'd appreciate it if you could take a deeper look and confirm that I captured what's expected for the upgrade guide. Looking forward to your feedback!
@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n |
Gentle reminder @xudong963 :)
xudong963
left a comment
@asolimando thanks! I'm sorry, I've been busy with other things this week.
This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a CoalescePartitionsExec, SortPreservingMergeExec, RepartitionExec, HashJoinExec (CollectLeft/Auto), CrossJoinExec, or NestedLoopJoinExec (which covers most non-trivial plans), the operator restarts a fresh bottom-up walk from inside its own `partition_statistics`, IIUC. So the recomputation isn't gone.
Caching sounds good. How about making caching part of StatisticsContext from day one? Then we can add benchmarks to show off the gains, which will make the PR easier for the community to accept. Wdyt?
```rust
///
/// [`StatisticsContext`]: crate::statistics_context::StatisticsContext
/// [`compute_statistics`]: crate::statistics_context::compute_statistics
fn partition_statistics(
```
We should keep the old API, and add a new one https://datafusion.apache.org/contributor-guide/api-health.html#what-is-the-public-api-and-what-is-a-breaking-api-change
Noted and I will make sure to keep both APIs in the future! I will address this in the next iteration on the code and will resolve the discussion at that point.
```rust
let child_stats = plan
    .children()
    .iter()
    .map(|child| compute_statistics(child.as_ref(), partition))
```
`compute_statistics` always recurses with the same `partition`. For partition-merging operators this is wasted work, because they'll discard the context and recompute with `None` anyway.
Thank you for your input @xudong963, no need to apologize, it's understandable! You raise a fair point: we fully avoid the recomputation only for linear plans, while operators that call `compute_statistics(child, None)` internally still restart a fresh bottom-up walk.

Re. the cache, I see two possible lifecycles/scopes for it:

1. per-call: a cache that lives within a single `compute_statistics` walk;
2. cross-call: a cache shared across calls, which would require stable node identifiers.

One limitation I identified is that a key based on `Arc::as_ptr` is only valid within a single walk, since a pointer could be reused once an `Arc` is dropped.

The scope of #20184 is, in my understanding, 1. (single walk). If you agree with that, I plan to use a cache keyed by `(Arc::as_ptr, partition)` carried by `StatisticsContext`.

Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? Also, could I be added to the allowlist to trigger benchmark runs, so I can iterate without requiring manual re-runs in case I need multiple iterations? WDYT?
Thanks for the thoughtful response @asolimando, the framing is exactly right, and the prior discussion with @kosiew in #21483 is helpful context. On scope: agreed, let's land per-call caching in this PR (your Option 1) and treat cross-call caching with stable node IDs as a follow-up. Could you open an issue for Option 2 so we don't lose track? On the cache key: `(Arc::as_ptr, partition)` is safe within a single synchronous `compute_statistics` walk; the Arcs are held by the plan tree and can't be dropped during the call, so pointer reuse isn't a concern. Good call. On benchmarks: I'd avoid full TPC-DS Q99, since statistics computation is a small fraction of total query time and will get lost in noise. A targeted micro-bench is more informative, and should cleanly demonstrate the gain.
Thanks for the confirmation and the clarifications. I will hopefully get to it early next week and will ping you back as soon as I have updates!
Which issue does this PR close?
Closes #20184
Rationale for this change
`ExecutionPlan::partition_statistics` forces each operator to re-fetch child statistics internally, causing exponential recomputation in deep plans and making it impossible to inject enriched statistics from external sources (e.g., expression-level analyzers, custom statistics providers).

What changes are included in this PR?
Breaking change: the `ExecutionPlan::partition_statistics` signature changes from `(&self, partition: Option<usize>)` to `(&self, partition: Option<usize>, ctx: &StatisticsContext)`. Migration guide added to `docs/source/library-user-guide/upgrading/54.0.0.md`.

Add a `StatisticsContext` parameter to `partition_statistics` that carries pre-computed child statistics, and a `compute_statistics()` utility that walks the plan tree bottom-up, threading child statistics through the context automatically. `StatisticsContext` carries one `Arc<Statistics>` per child node and is designed to be extended with additional context (e.g., expression-level analyzers, custom statistics providers) without further signature changes.

Operator categories
- Leaf/source operators: unaffected apart from the `DataSource` trait, which has a separate `partition_statistics` that was not changed.
- Passthrough operators: return `ctx.child_stats()[0]` directly.
- Transform operators: take `ctx.child_stats()[0]` as input, then apply their transformation (selectivity, column projection, grouping cardinality, fetch limit, etc.).
- Partition-merging operators (CoalescePartitionsExec, SortPreservingMergeExec, SortExec with `!preserve_partitioning`, RepartitionExec): always need overall child stats regardless of which output partition is requested, since they merge/redistribute input partitions. These call `compute_statistics(child, None)` internally instead of using the context.
- Partition-preserving operators: `ctx.child_stats()` is correct for both the `None` and `Some(i)` cases.
- Joins: HashJoinExec CollectLeft collects the whole left side, so it calls `compute_statistics(left, None)` for the `Some` case; the right side is partitioned and uses `ctx.child_stats()[1]` directly. HashJoinExec Partitioned mode is symmetric (both sides use the context). HashJoinExec Auto mode needs overall stats from both sides.
- Union/Interleave: UnionExec uses `ctx.child_stats()` for the `None` case (reduces with `stats_union`). For `Some(partition)`, Union remaps partition indices across children and calls `compute_statistics` on the specific child with the remapped index. InterleaveExec uses `ctx.child_stats()` directly (symmetric across all inputs).

Callers
All direct `plan.partition_statistics(None)` calls in optimizer rules (JoinSelection, AggregateStatistics, EnforceDistribution), display code, StatisticsRegistry, and tests are replaced with `compute_statistics(plan, None)`.

Tests
No new tests added. This is a no-op refactoring confirmed by all existing tests passing unchanged across all affected crates (datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource).
What remains for follow-up
- Extend `StatisticsContext` to eliminate the separate `StatisticsRegistry` tree walk and the `ExpressionAnalyzer` injection machinery from #21122 (Add ExpressionAnalyzer for pluggable expression-level statistics estimation)
- Extend `DataSource::partition_statistics` with context if needed
- Cache the `compute_statistics(child, None)` calls: partition-merging operators (CoalescePartitions, SortPreservingMerge, etc.) and asymmetric joins (HashJoin CollectLeft, CrossJoin, NestedLoopJoin) currently call `compute_statistics(child, None)` internally when the requested partition is `Some`, triggering a separate bottom-up walk. A cache on `StatisticsContext` keyed by (plan node, partition) would let these reuse already-computed results.

Test plan
- `cargo fmt --all`
- `cargo clippy --all-targets --all-features -- -D warnings` (affected crates)
- `cargo test --profile ci` on datafusion-physical-plan, datafusion-physical-optimizer, datafusion, datafusion-datasource

Disclaimer: I used AI to assist in the code generation; I have manually reviewed the output and it matches my intention and understanding.