Optimize Dictionary groupings #21765
Conversation
If you're interested in some of the discussion leading up to this PR, please view #21589.

@alamb this PR should resolve the initial regression that was caused (described here) as well as provide a performance boost. I would love to hear your thoughts on this approach.
Seeing as this PR is already large, I think this would be a nice follow-up PR, since as far as I know this work was never finished after #9017 was closed due to inactivity.

Do we have any performance benchmarks for this PR?

(basically it is hard to justify an optimization without benchmark results, even if they need to be run manually at first)
There were also micro benchmarks linked in the previous PR |
Removing any extra allocations like .to_vec() and repeated hash lookups into the map.

normalize_dict_hash() is the path for arrays with values that aren't expected to fit in CPU caches; that gives this implementation a guide, but a loose one.
With that being said, I think these results show a generally large improvement over the current approach to dealing with dictionary-encoded columns in DataFusion. The update to normalize_dict_hash() is minor compared to the PR as a whole. @alamb
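The idea behind hashing dictionary-encoded columns can be illustrated with a std-only sketch (`hash_dictionary` and `hash_value` are hypothetical names for illustration, not the actual `normalize_dict_hash` code): each distinct value is hashed once, and every row reuses that hash through its dictionary key.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash a single value once (illustrative; DataFusion uses its own hashing).
fn hash_value(v: &str) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Hash a dictionary-encoded column: hash each distinct value once,
/// then resolve every row by its dictionary key (an index into `values`).
fn hash_dictionary(keys: &[usize], values: &[&str]) -> Vec<u64> {
    // One hash per distinct value, not per row.
    let value_hashes: Vec<u64> = values.iter().map(|v| hash_value(v)).collect();
    keys.iter().map(|&k| value_hashes[k]).collect()
}

fn main() {
    let values = ["apple", "banana"];
    let keys = [0, 1, 0, 0, 1]; // 5 rows, 2 distinct values
    let row_hashes = hash_dictionary(&keys, &values);
    // Rows sharing a dictionary key share a hash without re-hashing the string.
    assert_eq!(row_hashes[0], row_hashes[2]);
    assert_eq!(row_hashes[1], row_hashes[4]);
    println!("row hashes computed: {}", row_hashes.len());
}
```

The work done per batch is proportional to the number of distinct values plus one array lookup per row, rather than one full hash per row.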
Pre-allocating the values buffer causes a regression.

@Rich-T-kid are you sure tpch-1 looks at the data at all in your benchmarks? It looks suspiciously fast; I think it might only plan/set up the queries but not look at the data.
Reran the benchmarks after generating data.

@adriangb these benchmarks reflect what my design doc mentions: the more rows / the greater the scale, the better it should perform. These were all run on a MacBook; maybe running it with the
Are these queries even using dictionary data types?

run benchmarks
Benchmark for this request failed.
Yeah, it seems that no benchmarks currently perform a group by using dictionary-encoded columns. I created this PR to address that. @Dandandan could you please take a look? #21860
## Which issue does this PR close?

This PR provides the benchmarks mentioned in apache#7647 & apache#9017.

- Works towards closing apache#7647.

## Rationale for this change

Currently the benchmark suite doesn't have any dictionary-encoded tables with aggregations performed on them. This makes it difficult to prove performance improvements; for example, a separate PR I'm working on (apache#21765) is hard to validate because the existing benchmarks don't exercise this path. This PR attempts to close that gap.

## What changes are included in this PR?

Adds a new dict benchmark to dfbench that measures group-by performance on dictionary-encoded columns across varying cardinality (5/10/25%), null rates (0/15%), and value types (Utf8 and List<Utf8>), covering both single- and multi-column group-by scenarios.

## Are these changes tested?

--

## Are there any user-facing changes?

No.

Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>
Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>
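As a sketch of how such dictionary-encoded benchmark data could be generated (std-only Rust; `make_dict_column` and its parameters are hypothetical names for illustration, not the dfbench API):

```rust
/// Generate a dictionary-encoded column as (keys, values):
/// `keys` index into a small `values` pool whose size sets the cardinality,
/// and nulls are modeled with `Option`. Names here are illustrative only.
fn make_dict_column(
    num_rows: usize,
    cardinality: usize,
    null_every: usize, // 0 disables nulls
) -> (Vec<Option<usize>>, Vec<String>) {
    let values: Vec<String> = (0..cardinality).map(|i| format!("val_{}", i)).collect();
    let keys = (0..num_rows)
        .map(|row| {
            if null_every != 0 && row % null_every == 0 {
                None // null row
            } else {
                Some(row % cardinality) // dictionary key into `values`
            }
        })
        .collect();
    (keys, values)
}

fn main() {
    // 1000 rows at 5% cardinality (50 distinct values), roughly 1-in-7 nulls.
    let (keys, values) = make_dict_column(1000, 50, 7);
    assert_eq!(values.len(), 50);
    assert_eq!(keys.len(), 1000);
    assert!(keys[0].is_none());
    println!("distinct values: {}", values.len());
}
```

Varying `cardinality` and `null_every` corresponds to the cardinality and null-rate axes the benchmark sweeps over.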
## Which issue does this PR close?

Benchmarks for apache#21765. Also related to apache#21860. The goal is to merge this PR and then rebase the branch on apache#21765 to contain these benchmarks, so that they can be run and compared to the original.

## Rationale for this change

Originally this was included in apache#21765, but that PR is already very large, so I decided to move it to its own separate PR.

## What changes are included in this PR?

Adds benchmarks for the dictionary-encoded array path of `new_group_values()`.

## Are these changes tested?

n/a

## Are there any user-facing changes?

No.

Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>





Which issue does this PR close?
This PR makes an effort towards #7000 and works towards closing #7647 + #21466.
It aims to close half of #7647.
A separate follow-up PR aims to close the multi-column + dictionary-column case.
Rationale for this change
Issue #7647 (Materialize Dictionaries in Group Keys) identified that DataFusion was not taking advantage of dictionary encoding during hash aggregation: instead of operating on the compact dictionary representation, it deserialized dictionary arrays into a generic row-based format, throwing away all the encoding benefits. An initial attempt was made to fix this, but it caused regressions and was ultimately rolled back.
This PR
This PR takes a different approach. Rather than materializing the dictionary into a generic representation, it introduces GroupValuesDictionary, a specialized implementation that operates directly on the dictionary's structure. The key insight is that dictionary-encoded columns have inherently low cardinality by design, meaning the same values repeat frequently across rows. Instead of hashing and comparing every row independently, we maintain a mapping of unique value hashes to group IDs so that repeated values are resolved in O(1) after the first encounter. Null values are similarly cached so that subsequent null rows never pay the lookup cost again. For a more detailed explanation of the implementation, see this design doc.
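The caching idea described above can be sketched in std-only Rust. This is a minimal illustration, not the actual DataFusion code: `GroupValuesDict` and `intern` are hypothetical names, and for simplicity the cache is keyed on the dictionary key rather than the value hash (the real implementation must also compare values to handle hash collisions).

```rust
use std::collections::HashMap;

/// Minimal sketch: resolve each row's group ID through a per-dictionary-key
/// cache so repeated values cost O(1) after the first encounter.
struct GroupValuesDict {
    /// dictionary key -> group ID, filled lazily on first encounter
    key_to_group: HashMap<usize, usize>,
    /// cached group ID for null rows
    null_group: Option<usize>,
    num_groups: usize,
}

impl GroupValuesDict {
    fn new() -> Self {
        Self { key_to_group: HashMap::new(), null_group: None, num_groups: 0 }
    }

    /// Map one batch of dictionary keys (None = null row) to group IDs.
    fn intern(&mut self, keys: &[Option<usize>]) -> Vec<usize> {
        let mut out = Vec::with_capacity(keys.len());
        for key in keys {
            let group = match key {
                None => {
                    // First null row allocates a group; later nulls hit the cache.
                    if self.null_group.is_none() {
                        self.null_group = Some(self.num_groups);
                        self.num_groups += 1;
                    }
                    self.null_group.unwrap()
                }
                Some(k) => {
                    if let Some(&g) = self.key_to_group.get(k) {
                        g // repeated value: O(1) cache hit
                    } else {
                        let g = self.num_groups;
                        self.num_groups += 1;
                        self.key_to_group.insert(*k, g);
                        g
                    }
                }
            };
            out.push(group);
        }
        out
    }
}

fn main() {
    let mut gv = GroupValuesDict::new();
    let groups = gv.intern(&[Some(0), Some(1), Some(0), None, None]);
    assert_eq!(groups, vec![0, 1, 0, 2, 2]);
    assert_eq!(gv.num_groups, 3);
    println!("groups: {:?}", groups);
}
```

Because the cache size is bounded by the dictionary's cardinality rather than the row count, the per-row cost stays flat as batches grow.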
What changes are included in this PR?
Updates the match statement in new_group_values to include a custom dictionary-encoding branch that handles single group-by fields of Dictionary type.
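The shape of that dispatch can be sketched as follows (illustrative only; the real `new_group_values` in DataFusion has a different signature and returns trait objects, and the `DataType` enum here is a stand-in for Arrow's):

```rust
/// Stand-in for Arrow's DataType, reduced to what the sketch needs.
enum DataType {
    Dictionary,
    Utf8,
}

/// Illustrative dispatch: a single dictionary-typed group column gets the
/// specialized implementation; everything else keeps the row-based fallback.
fn new_group_values(fields: &[DataType]) -> &'static str {
    match fields {
        // New branch: exactly one field, and it is dictionary encoded.
        [DataType::Dictionary] => "GroupValuesDictionary",
        // Existing fallback: generic row-based group values.
        _ => "GroupValuesRows",
    }
}

fn main() {
    assert_eq!(new_group_values(&[DataType::Dictionary]), "GroupValuesDictionary");
    // Multi-column group-bys still take the existing path in this PR.
    assert_eq!(
        new_group_values(&[DataType::Utf8, DataType::Dictionary]),
        "GroupValuesRows"
    );
    println!("dispatch ok");
}
```

Keeping the specialized branch narrow (single column only) is what lets the multi-column case be deferred to the follow-up PR mentioned above.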
Are these changes tested?
Yes, about half of the code in this PR is about testing edge cases.
Are there any user-facing changes?
No