feat(parquet): dictionary fallback heuristic based on compression efficiency #9700
@@ -1691,7 +1691,7 @@ mod tests {
    use crate::data_type::AsBytes;
    use crate::file::metadata::{ColumnChunkMetaData, ParquetMetaData, ParquetMetaDataReader};
    use crate::file::properties::{
-       BloomFilterPosition, EnabledStatistics, ReaderProperties, WriterVersion,
+       BloomFilterPosition, DictionaryFallback, EnabledStatistics, ReaderProperties, WriterVersion,
    };
    use crate::file::serialized_reader::ReadOptionsBuilder;
    use crate::file::{
@@ -2572,6 +2572,74 @@ mod tests {
        );
    }

    #[test]
    fn arrow_writer_dictionary_fallback_on_unfavorable_compression() {
        let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));

        let mut builder = StringBuilder::with_capacity(100, 329 * 10_000);

        // Generate an array of 10 unique 10-character strings.
        // This results in a dictionary encoding larger than the plain-encoded data,
        // which should trigger a fallback to PLAIN encoding.
        for i in 0..10 {
            let value = i
                .to_string()
                .repeat(10)
                .chars()
                .take(10)
                .collect::<String>();

            builder.append_value(value);
        }

        let array = Arc::new(builder.finish());

        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();

        let file = tempfile::tempfile().unwrap();

        // Set dictionary fallback to trigger fallback to PLAIN encoding on unfavorable compression
        let props = WriterProperties::builder()
            .set_dictionary_fallback(DictionaryFallback::OnUnfavorableCompression)
            .set_data_page_size_limit(1)
Contributor comment (suggested change):

There's an issue here due to the fact that, with these settings, the page is flushed before the check for dict fallback is called. Keeping the batch size at 1 but the row count at 2 should allow the check to actually force fallback, resulting in one RLE_DICTIONARY encoded data page and 5 PLAIN encoded.
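A sketch of what the suggested change likely amounts to, assuming set_data_page_row_count_limit is the "row count" knob meant here:

// Hypothetical reconstruction of the suggested writer properties: a data page
// row count limit of 2 keeps each page open long enough for the fallback
// check to run, while the write batch size stays at 1.
let props = WriterProperties::builder()
    .set_dictionary_fallback(DictionaryFallback::OnUnfavorableCompression)
    .set_data_page_row_count_limit(2)
    .set_write_batch_size(1)
    .build();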
            .set_write_batch_size(1)
            .build();

        let mut writer =
            ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), Some(props))
                .expect("Unable to write file");
        writer.write(&batch).unwrap();
        writer.close().unwrap();

        let options = ReadOptionsBuilder::new().with_page_index().build();
        let reader =
            SerializedFileReader::new_with_options(file.try_clone().unwrap(), options).unwrap();

        let column = reader.metadata().row_group(0).columns();

        assert_eq!(column.len(), 1);

        // We should write one row before falling back to PLAIN encoding, so there should
        // still be a dictionary page.
        assert!(
            column[0].dictionary_page_offset().is_some(),
            "Expected a dictionary page"
        );

        assert!(reader.metadata().offset_index().is_some());
Contributor comment:

The following is not testing the encoding, merely counting the number of data pages. Rather than this, you should be examining the page encoding stats:

let options = ReadOptionsBuilder::new()
    .with_encoding_stats_as_mask(false)
    .build();
...
// check page encoding stats, should be one dict page, one dict encoded page, and 9
// plain encoded pages
let stats = column[0].page_encoding_stats().unwrap();
println!("pes: {stats:?}");
assert!(
    stats
        .iter()
        .any(|s| s.page_type == PageType::DICTIONARY_PAGE)
);
let num_dict_encoded: i32 = stats
    .iter()
    .filter(|s| {
        s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::RLE_DICTIONARY
    })
    .map(|s| s.count)
    .sum();
assert_eq!(num_dict_encoded, 1);
let num_plain_encoded: i32 = stats
    .iter()
    .filter(|s| {
        s.page_type == PageType::DATA_PAGE && s.encoding == Encoding::PLAIN
    })
    .map(|s| s.count)
    .sum();
assert_eq!(num_plain_encoded, 9);

Coded this way, the test fails, indicating that all pages are dict encoded and fallback did not occur.
        let offset_indexes = &reader.metadata().offset_index().unwrap()[0];

        let page_locations = offset_indexes[0].page_locations.clone();

        // We should fall back to PLAIN encoding after the first row, and our max page size
        // is 1 byte, so we expect one dictionary-encoded page and then a page per row thereafter.
        assert_eq!(
            page_locations.len(),
            10,
            "Expected 10 pages but got {page_locations:#?}"
        );
    }

    #[test]
    fn arrow_writer_float_nans() {
        let f16_field = Field::new("a", DataType::Float16, false);
@@ -4827,6 +4895,48 @@ mod tests {
        assert_eq!(get_dict_page_size(col1_meta), 1024 * 1024 * 4);
    }

    #[test]
    fn test_dict_page_size_decided_by_compression_fallback() {
Contributor comment:

As a test, I saved the output from this and examined the sizing. Without the heuristic, the encoded size for col0 is 8658384 bytes (the default fallback mechanism kicked in after 7 pages). With the heuristic, col1 is 8391126 bytes, a savings of 3%.

I also modified the test to mod the index with 32767. In that instance, col1 was still 8391126 bytes, but col0 was only 2231581, nearly 4X smaller. I know this is not entirely representative, but it does again point out the pitfalls of too simplistic an approach.

Edit: I did a test of Spark with the latter file (32k cardinality). By default, it opts to fall back for all pages, so the file is even larger. If I modify the global …

Contributor (Author) comment:

I have modified the test in 1b6dd37 to demonstrate a case where even an early fallback decision brings about 12% compression. But I generally agree with your assessment, so more work is needed.

Another quirk is seen in this test: a dictionary page is still flushed to encode the first data page, even though there is no benefit. Parquet-java takes care to hand over the accumulated values to the plain encoder to be re-encoded.
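A sketch of the higher-cardinality variant described above, assuming the modulo was applied to the iterator that generates the test column below:

// Hypothetical modification: cap the column at 32767 distinct values so the
// dictionary stays small and dictionary encoding pays off across the chunk.
let array = Arc::new(Int64Array::from_iter((0..1024 * 1024).map(|i| i % 32767)));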
        let array = Arc::new(Int64Array::from_iter(0..1024 * 1024));
        let schema = Arc::new(Schema::new(vec![
            Field::new("col0", arrow_schema::DataType::Int64, false),
            Field::new("col1", arrow_schema::DataType::Int64, false),
        ]));
        let batch =
            arrow_array::RecordBatch::try_new(schema.clone(), vec![array.clone(), array]).unwrap();

        let props = WriterProperties::builder()
            .set_dictionary_page_size_limit(1024 * 1024)
            .set_column_dictionary_page_size_limit(ColumnPath::from("col1"), 1024 * 1024 * 4)
            .set_column_dictionary_fallback(
                ColumnPath::from("col1"),
                DictionaryFallback::OnUnfavorableCompression,
            )
            .build();
        let mut writer = ArrowWriter::try_new(Vec::new(), schema, Some(props)).unwrap();
        writer.write(&batch).unwrap();
        let data = Bytes::from(writer.into_inner().unwrap());

        let mut metadata = ParquetMetaDataReader::new();
        metadata.try_parse(&data).unwrap();
        let metadata = metadata.finish().unwrap();
        let col0_meta = metadata.row_group(0).column(0);
        let col1_meta = metadata.row_group(0).column(1);

        let get_dict_page_size = move |meta: &ColumnChunkMetaData| {
            let mut reader =
                SerializedPageReader::new(Arc::new(data.clone()), meta, 0, None).unwrap();
            let page = reader.get_next_page().unwrap().unwrap();
            match page {
                Page::DictionaryPage { buf, .. } => buf.len(),
                _ => panic!("expected DictionaryPage"),
            }
        };

        assert_eq!(get_dict_page_size(col0_meta), 1024 * 1024);
        assert_eq!(get_dict_page_size(col1_meta), 8192);
    }
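One way to read the asserted sizes (an inference from the numbers, not something stated in the diff): Int64 dictionary entries take 8 bytes each, so col0's dictionary filled its 1 MiB limit before the default size-based fallback kicked in, while col1's compression heuristic fell back after only 1024 values.

// Hypothetical back-of-envelope check of the asserted dictionary page sizes.
assert_eq!(1024 * 1024 / 8, 131_072); // col0: entries held at the 1 MiB size limit
assert_eq!(8192 / 8, 1024); // col1: entries when the compression heuristic fell back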
    struct WriteBatchesShape {
        num_batches: usize,
        rows_per_batch: usize,
Review discussion:

I'm a little worried about performance here. It would be nice if, after we've collected enough samples and decided on dict vs fallback, we stop gathering these statistics.

Reply:

I had a mind to keep comparing after every encoded data page, for cases when the configured minimal sample is still not indicative of the overall value distribution and the efficiency degrades somewhere farther down the page chunk. But I understand the concern. Since this behavior is tunable per column through the writer API, I think it's OK to cut counting. For consistency, this should also be done in the generic encoder, I assume?

Reply:

Yes, I didn't want to flag it in both places. 😄

Reply:

Fair... but then perhaps the size limit will catch it. In any event, we should stop collecting after we have actually fallen back 😉

Reply:

That's already the case, with the plain_data_size_counter member set to None in both flush_dict_page implementations, and the collecting is also not happening in the put methods in case there is no dictionary. Though if I implement a fix for #9739, this may need to be refactored.
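A minimal sketch of the shut-off pattern described above, using a hypothetical standalone struct modeled on the plain_data_size_counter member (the real field lives inside the dictionary encoders):

// Hypothetical sketch: an Option<usize> doubles as the running plain-encoded
// byte count and the on/off switch for sampling; None means counting stopped.
struct DictEncoderState {
    plain_data_size_counter: Option<usize>,
}

impl DictEncoderState {
    fn record_plain_size(&mut self, value_size: usize, sample_limit: usize) {
        if let Some(counter) = self.plain_data_size_counter.as_mut() {
            *counter += value_size;
            // Once enough bytes have been sampled, the dict-vs-plain decision
            // is made and the bookkeeping shuts down for good.
            if *counter >= sample_limit {
                self.plain_data_size_counter = None;
            }
        }
    }
}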
Reply:

The counting shuts down after reaching the sample size threshold in 3ff12c8.

Reply:

Still not too granular, I guess, if it only happens on page flush and the fallback might be decided on an earlier write batch.

Reply:

Further reworked in 8873bc1.