-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add a memory bound FileStatisticsCache for the Listing Table #20047
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
mkleen
wants to merge
90
commits into
apache:main
Choose a base branch
from
mkleen:file-stats-cache
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
90 commits
Select commit
Hold shift + click to select a range
216d62a
Add a default FileStatisticsCache implementation for the ListingTable
mkleen 81f868a
fixup! Add a default FileStatisticsCache implementation for the Listi…
mkleen a01c299
Adapt memory usage when removing entries
mkleen 9907189
Adapt heapsize for &str
mkleen 28cb372
Fix formatting
mkleen 251e48a
Adapt heapsize for &str and add another scalarvalue
mkleen 255896c
Add better error message
mkleen e073409
Add todo to add heapsize for ordering in CachedFileMetadata
mkleen a91effe
Fix comment/docs on DefaultFileStatisticsCache
mkleen c65b8fe
Simplify test data generation
mkleen d025b27
Remove potential stale entry, if entry is too large
mkleen ad8bc83
Fix typo in sql logic test comment
mkleen 5773e88
Fix comment about default behaviour in cache manager
mkleen 20f5998
Fix variable name in test
mkleen a8b83d8
Fix variable name in test
mkleen c27c2b8
Disable cache for sql logic test
mkleen 99a312b
Include key into memory estimation
mkleen 7694254
Fix fmt
mkleen 774dc5f
Fix clippy
mkleen cb6b4d8
minor
mkleen e371582
Add more key memory accounting
mkleen 50714d0
Fix Formatting
mkleen 6b21bd1
Account path as string and remove dependency to object_store
mkleen 996de78
Improve error handling
mkleen 3af0ce1
Fix fmt
mkleen 3445710
Remove path.clone
mkleen c97adfa
Simplify accounting for statistics
mkleen 7f6340e
Adapt offset buffer
mkleen 0a5d3c8
Fix heap size for Arc
mkleen 673276e
Adapt estimate in test
mkleen 03591ae
Fix sql logic test
mkleen 497371c
Register cache from cachemanager at listing table
mkleen af71c2b
Revert slt
mkleen 71720ca
Add tablescoping for file stats cache
mkleen 7e64ad7
Adapt slt
mkleen 4902bd2
Fix linter
mkleen ac90305
Remove uneeded clone
mkleen 494143f
Rename cache_unit to file_statistics_cache
mkleen 77705ad
Simplify heap size accounting
mkleen 9a95a4b
Adapt comments in test
mkleen 1851d59
Seperate drop table clean-ups
mkleen e1d49a1
fixup! Seperate drop table clean-ups
mkleen f069267
Increase default limit to 10 mb
mkleen 3fca43e
Increase default limit to 20 mb
mkleen 06a00c2
Fix comment
mkleen be61ff6
Fix deregister logic
mkleen 74a8696
Fix slt
mkleen 36c1e22
Add table reference to FileStatisticsCacheEntry
mkleen b72a2fc
fixup! Add table reference to FileStatisticsCacheEntry
mkleen dc74700
Fix comment
mkleen 59c7cfe
Fix runtime_env entry
mkleen cd42155
Add cache for all benchmark runs
mkleen d6a5815
Add cache to listing table creation
mkleen a0b4186
fixup! Add cache to listing table creation
mkleen 47df32a
Adapt limit to 20M in configs.md
mkleen fc05625
fixup! Adapt limit to 20M in configs.md
mkleen 3e589cd
Fix linter
mkleen 4cbb857
Add cache to listing table in _read_type()
mkleen 494ab5e
Add ListView and LargeListView to heapsize
mkleen 1987c73
fixup! Add ListView and LargeListView to heapsize
mkleen e0af041
Remove array.slt
mkleen c1fbe3a
Add table ref to ListingTableUrl
mkleen 041b457
Add heapsize for table-scoped-path
mkleen 758ac8a
Make list_entries table-scoped
mkleen a8bfdb2
fixup! Make list_entries table-scoped
mkleen 3a55cfb
fixup! fixup! Make list_entries table-scoped
mkleen c40fa27
Improve heap size estimation for Arc
mkleen a871d2a
fixup! Improve heap size estimation for Arc
mkleen 8f95d97
Update migration guide
mkleen 49caa3b
fixup! Update migration guide
mkleen f82ba41
Improve heapsize estimation for TableReference
mkleen 81c91f8
Improve memory handling when inserting
mkleen 9b09a91
Fix comments in Cache Manager
mkleen aeda623
Improve upgrade guide
mkleen e767d62
Fix upgrade guide
mkleen 3df09dd
Return stale entries from cache
mkleen e408763
Fix upgrade guide
mkleen 9c0632d
Fix Arc<str> heapsize test
mkleen 48ddc28
Remove const i32 cast from heapsize estimation
mkleen 8239e72
Fix heapsize estimation for Arc<T>
mkleen 7f65cdc
Fix comment in cache_manager
mkleen d0db342
Fix linter + clippy
mkleen f76ae44
Adapt test acording to heapsize estimation changes
mkleen 96845ec
Always add tableref to partioned files
mkleen 87ad810
fixup! Always add tableref to partioned files
mkleen 1f4539f
Add table to statistics_cache output
mkleen 858e168
Adopt test to new output
mkleen e55503d
Adopt configuration with '0' value
mkleen abff629
Update configs.md
mkleen 7c9c830
Merge branch 'main' into file-stats-cache
mkleen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,9 +21,11 @@ use std::mem; | |
| use std::sync::Arc; | ||
|
|
||
| use datafusion_catalog::Session; | ||
| use datafusion_common::{HashMap, Result, ScalarValue, assert_or_internal_err}; | ||
| use datafusion_datasource::ListingTableUrl; | ||
| use datafusion_common::{ | ||
| HashMap, Result, ScalarValue, TableReference, assert_or_internal_err, | ||
| }; | ||
| use datafusion_datasource::PartitionedFile; | ||
| use datafusion_datasource::{FileExtensions, ListingTableUrl}; | ||
| use datafusion_expr::{BinaryExpr, Operator, lit, utils}; | ||
|
|
||
| use arrow::{ | ||
|
|
@@ -382,6 +384,7 @@ fn try_into_partitioned_file( | |
|
|
||
| let mut pf: PartitionedFile = object_meta.into(); | ||
| pf.partition_values = partition_values; | ||
| pf.table_reference.clone_from(table_path.get_table_ref()); | ||
|
|
||
| Ok(Some(pf)) | ||
| } | ||
|
|
@@ -416,8 +419,15 @@ pub async fn pruned_partition_list<'a>( | |
| table_path | ||
| ); | ||
|
|
||
| // if no partition col => simply list all the files | ||
| Ok(objects.map_ok(|object_meta| object_meta.into()).boxed()) | ||
| // if no partition col => list all the files | ||
| Ok(objects | ||
| .try_filter_map(|object_meta| { | ||
| futures::future::ready(object_meta_to_partitioned_file( | ||
| object_meta, | ||
| table_path.get_table_ref(), | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Table-reference needs to be always passed on the |
||
| )) | ||
| }) | ||
| .boxed()) | ||
| } else { | ||
| let df_schema = DFSchema::from_unqualified_fields( | ||
| partition_cols | ||
|
|
@@ -442,6 +452,22 @@ pub async fn pruned_partition_list<'a>( | |
| } | ||
| } | ||
|
|
||
| fn object_meta_to_partitioned_file( | ||
| object_meta: ObjectMeta, | ||
| table_ref: &Option<TableReference>, | ||
| ) -> Result<Option<PartitionedFile>> { | ||
| Ok(Some(PartitionedFile { | ||
| object_meta, | ||
| partition_values: vec![], | ||
| range: None, | ||
| statistics: None, | ||
| ordering: None, | ||
| extensions: FileExtensions::new(), | ||
| metadata_size_hint: None, | ||
| table_reference: table_ref.clone(), | ||
| })) | ||
| } | ||
|
|
||
| /// Extract the partition values for the given `file_path` (in the given `table_path`) | ||
| /// associated to the partitions defined by `table_partition_cols` | ||
| pub fn parse_partitions_for_path<'a, I>( | ||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the file statistics cache is now tablescoped, this should be reflected in the output.