Optimize Dictionary groupings #21765
Conversation
If you're interested in some of the discussion leading up to this PR, please view #21589.

@alamb this PR should resolve the initial regression that was caused (described here) as well as provide a performance boost. I would love to hear your thoughts on this approach.
Seeing as this PR is already large, I think this would be a nice follow-up PR, since as far as I know this work was never finished after #9017 was closed due to inactivity.

Do we have any performance benchmarks for this PR?

(basically it is hard to justify an optimization without benchmark results, even if they need to be run manually at first)
There were also micro benchmarks linked in the previous PR |
Removing any extra allocations like .to_vec() and repeated hash lookups into the map.

normalize_dict_hash() is the path for arrays with values that aren't expected to fit in CPU caches; that gives this implementation a guide, but a loose one.
With that being said, I think these results show a generally large improvement over the current approach to dealing with dictionary-encoded columns in DataFusion. The update to normalize_dict_hash() is minor compared to the PR as a whole. @alamb
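The idea behind hashing dictionary-encoded columns can be illustrated with a std-only sketch (`hash_dictionary` and `hash_value` are hypothetical names for illustration, not the actual `normalize_dict_hash` code): each distinct value is hashed once, and every row reuses that hash through its dictionary key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a single value once (illustrative; DataFusion uses its own hashing).
fn hash_value(v: &str) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Hash a dictionary-encoded column: hash each distinct value once,
/// then resolve every row by its dictionary key (an index into `values`).
fn hash_dictionary(keys: &[usize], values: &[&str]) -> Vec<u64> {
    // One hash per distinct value, not per row.
    let value_hashes: Vec<u64> = values.iter().map(|v| hash_value(v)).collect();
    keys.iter().map(|&k| value_hashes[k]).collect()
}

fn main() {
    let values = ["apple", "banana"];
    let keys = [0, 1, 0, 0, 1]; // 5 rows, 2 distinct values
    let row_hashes = hash_dictionary(&keys, &values);
    // Rows sharing a dictionary key share a hash without re-hashing the string.
    assert_eq!(row_hashes[0], row_hashes[2]);
    assert_eq!(row_hashes[1], row_hashes[4]);
    println!("row hashes computed: {}", row_hashes.len());
}
```

The work done per batch is proportional to the number of distinct values plus one array lookup per row, rather than one full hash per row.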
Pre-allocating the values buffer causes a regression.

@Rich-T-kid are you sure tpch-1 looks at the data at all in your benchmarks? It looks suspiciously fast; I think it might only plan/set up the queries but not look at the data.
Reran the benchmarks after generating data.

@adriangb these benchmarks reflect what my design doc mentions: the more rows / the greater the scale, the better it should perform. These were all run on a MacBook; maybe running it with the
Are these queries even using dictionary data types?

run benchmarks
Benchmark for this request failed.
Yeah, it seems that no benchmarks currently perform a group by using dictionary-encoded columns. I created this PR to address that. @Dandandan could you please take a look? #21860
## Which issue does this PR close?

This PR provides the benchmarks mentioned in apache#7647 & apache#9017.

- Works towards closing apache#7647.

## Rationale for this change

Currently the benchmark suite doesn't have any dictionary-encoded tables with aggregations performed on them. This makes it difficult to prove performance improvements; for example, a separate PR I'm working on (apache#21765) is hard to validate because the existing benchmarks don't exercise this path. This PR attempts to close that gap.

## What changes are included in this PR?

Adds a new dict benchmark to dfbench that measures group-by performance on dictionary-encoded columns across varying cardinality (5/10/25%), null rates (0/15%), and value types (Utf8 and List<Utf8>), covering both single- and multi-column group-by scenarios.

## Are these changes tested?

--

## Are there any user-facing changes?

No.

Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>
Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>
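As a sketch of how such dictionary-encoded benchmark data could be generated (std-only Rust; `make_dict_column` and its parameters are hypothetical names for illustration, not the dfbench API):

```rust
/// Generate a dictionary-encoded column as (keys, values):
/// `keys` index into a small `values` pool whose size sets the cardinality,
/// and nulls are modeled with `Option`. Names here are illustrative only.
fn make_dict_column(
    num_rows: usize,
    cardinality: usize,
    null_every: usize, // 0 disables nulls
) -> (Vec<Option<usize>>, Vec<String>) {
    let values: Vec<String> = (0..cardinality).map(|i| format!("val_{}", i)).collect();
    let keys = (0..num_rows)
        .map(|row| {
            if null_every != 0 && row % null_every == 0 {
                None // null row
            } else {
                Some(row % cardinality) // dictionary key into `values`
            }
        })
        .collect();
    (keys, values)
}

fn main() {
    // 1000 rows at 5% cardinality (50 distinct values), roughly 1-in-7 nulls.
    let (keys, values) = make_dict_column(1000, 50, 7);
    assert_eq!(values.len(), 50);
    assert_eq!(keys.len(), 1000);
    assert!(keys[0].is_none());
    println!("distinct values: {}", values.len());
}
```

Varying `cardinality` and `null_every` corresponds to the cardinality and null-rate axes the benchmark sweeps over.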
## Which issue does this PR close?

Benchmarks for apache#21765. Also related to apache#21860. The goal is to merge this PR and then rebase the branch on apache#21765 to contain these benchmarks, so that they can be run and compared to the original.

## Rationale for this change

Originally this was included in apache#21765, but that PR is already very large, so I decided to move it to its own separate PR.

## What changes are included in this PR?

Adds benchmarks for the dictionary-encoded array path of `new_group_values()`.

## Are these changes tested?

n/a

## Are there any user-facing changes?

No.

Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>





Which issue does this PR close?
This PR makes an effort towards #7000 and works towards closing #7647 + #21466.
It aims to close half of #7647.
A separate follow-up PR aims to close the multi-column + dictionary-column case.
Rationale for this change
Issue #7647 (Materialize Dictionaries in Group Keys) identified that DataFusion was not taking advantage of dictionary encoding during hash aggregation: instead of operating on the compact dictionary representation, it deserialized dictionary arrays into a generic row-based format, throwing away all the encoding benefits. An initial attempt was made to fix this, but it caused regressions and was ultimately rolled back.
This PR
This PR takes a different approach. Rather than materializing the dictionary into a generic representation, it introduces GroupValuesDictionary, a specialized implementation that operates directly on the dictionary's structure. The key insight is that dictionary-encoded columns have inherently low cardinality by design, meaning the same values repeat frequently across rows. Instead of hashing and comparing every row independently, we maintain a mapping of unique value hashes to group IDs so that repeated values are resolved in O(1) after the first encounter. Null values are similarly cached so that subsequent null rows never pay the lookup cost again. For a more detailed explanation of the implementation, see this design doc.
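The caching idea described above can be sketched in std-only Rust. This is a minimal illustration, not the actual DataFusion code: `GroupValuesDict` and `intern` are hypothetical names, and for simplicity the cache is keyed on the dictionary key rather than the value hash (the real implementation must also compare values to handle hash collisions).

```rust
use std::collections::HashMap;

/// Minimal sketch: resolve each row's group ID through a per-dictionary-key
/// cache so repeated values cost O(1) after the first encounter.
struct GroupValuesDict {
    /// dictionary key -> group ID, filled lazily on first encounter
    key_to_group: HashMap<usize, usize>,
    /// cached group ID for null rows
    null_group: Option<usize>,
    num_groups: usize,
}

impl GroupValuesDict {
    fn new() -> Self {
        Self { key_to_group: HashMap::new(), null_group: None, num_groups: 0 }
    }

    /// Map one batch of dictionary keys (None = null row) to group IDs.
    fn intern(&mut self, keys: &[Option<usize>]) -> Vec<usize> {
        let mut out = Vec::with_capacity(keys.len());
        for key in keys {
            let group = match key {
                None => {
                    // First null row allocates a group; later nulls hit the cache.
                    if self.null_group.is_none() {
                        self.null_group = Some(self.num_groups);
                        self.num_groups += 1;
                    }
                    self.null_group.unwrap()
                }
                Some(k) => {
                    if let Some(&g) = self.key_to_group.get(k) {
                        g // repeated value: O(1) cache hit
                    } else {
                        let g = self.num_groups;
                        self.num_groups += 1;
                        self.key_to_group.insert(*k, g);
                        g
                    }
                }
            };
            out.push(group);
        }
        out
    }
}

fn main() {
    let mut gv = GroupValuesDict::new();
    let groups = gv.intern(&[Some(0), Some(1), Some(0), None, None]);
    assert_eq!(groups, vec![0, 1, 0, 2, 2]);
    assert_eq!(gv.num_groups, 3);
    println!("groups: {:?}", groups);
}
```

Because the cache size is bounded by the dictionary's cardinality rather than the row count, the per-row cost stays flat as batches grow.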
What changes are included in this PR?
Updates the match statement in new_group_values to include a custom dictionary-encoding branch that handles single group-by fields of Dictionary type.
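The shape of that dispatch can be sketched as follows (illustrative only; the real `new_group_values` in DataFusion has a different signature and returns trait objects, and the `DataType` enum here is a stand-in for Arrow's):

```rust
/// Stand-in for Arrow's DataType, reduced to what the sketch needs.
enum DataType {
    Dictionary,
    Utf8,
}

/// Illustrative dispatch: a single dictionary-typed group column gets the
/// specialized implementation; everything else keeps the row-based fallback.
fn new_group_values(fields: &[DataType]) -> &'static str {
    match fields {
        // New branch: exactly one field, and it is dictionary encoded.
        [DataType::Dictionary] => "GroupValuesDictionary",
        // Existing fallback: generic row-based group values.
        _ => "GroupValuesRows",
    }
}

fn main() {
    assert_eq!(new_group_values(&[DataType::Dictionary]), "GroupValuesDictionary");
    // Multi-column group-bys still take the existing path in this PR.
    assert_eq!(
        new_group_values(&[DataType::Utf8, DataType::Dictionary]),
        "GroupValuesRows"
    );
    println!("dispatch ok");
}
```

Keeping the specialized branch narrow (single column only) is what lets the multi-column case be deferred to the follow-up PR mentioned above.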
Are these changes tested?
Yes, about half of the code in this PR is about testing edge cases.
Are there any user-facing changes?
No