[Parquet]: GH-563: Make path_in_schema optional#9678
[Parquet]: GH-563: Make path_in_schema optional#9678etseidl wants to merge 4 commits intoapache:mainfrom
path_in_schema optional#9678Conversation
|
less bloat for the win! |
|
A lot of other parquet implementations require this field, due to their generated thrift parser, even if they do not actually use the field for anything. I would be totally in favor of deprecating and skipping the field, but maybe a more compatible alternative would be to write the field as an empty list instead. |
Well, parquet-java actually uses the field to populate its version of |
See also related mailing list discussion: |
alamb
left a comment
There was a problem hiding this comment.
TLDR is I think this is a great idea. I also think it woudl be ok to merge this into arrow-rs even if there is not a broader consensus on the mailing list that we should do it in the format itself
My thinking is that some usecases are basically using parquet with the same read/writer and compatibility with older java based implementations is not important. This is the same thing for some of the newer encodings too
Letting users choose between "better/faster/stronger" and "more compatible" I think is very much a good idea
| /// | ||
| /// The `path_in_schema` field in the Thrift metadata is redundant and wastes a great | ||
| /// deal of space. Parquet file footers can be made much smaller by omitting this field. | ||
| /// Because the field was originally a mandatory field, this property defaults to `true` |
There was a problem hiding this comment.
If we choose to go this way I think it would help to give some more context here on what types of readers would be affected (basically all the parquet-java based readers prior to late 2026)
We could also perhaps provide a link to the discussion: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
| /// Should the writer should emit the `path_in_schema` element of the | ||
| /// `ColumnMetaData` Thrift struct. |
There was a problem hiding this comment.
I think it is worth making it more apparently that this will cause incompatibilities with some older readers:
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. | |
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. WARNING: this will make the parquet | |
| /// file unreadable by some older parquet readers. See LINK HERE for details |
|
I added a list of related PRs to this issue's description: |
|
I wonder if we should just merge this into the Rust implementation of Parquet (with a caveat that this will make metadata smaller / more effiicent at the cost of incompatibility with older / java based readers)? |
|
I'm game. I don't see a reason not to since it's opt in |
| { | ||
| let buf = TrackedWrite::new(&mut buffer); | ||
| let writer = ParquetMetaDataWriter::new_with_tracked(buf, &metadata); | ||
| let writer = ParquetMetaDataWriter::new_with_tracked(buf, &metadata) |
There was a problem hiding this comment.
Should we perhaps make a new benchmark so we can evaluate the performance both in default mode and wihtout this part of the metadata?
There was a problem hiding this comment.
Yes, that's probably worthwhile. I had made this modification for the purposes of the parquet-format submission.
ac323f0 to
1d4eb04
Compare
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing deprecate_path_in_schema (feaae15) to f725bc9 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
I think this PR is ready except for:
- Don't change the existing writer benchmark (otherwise we are not benchmarking with default settings)
- Update the docs a bit more to explain the implications of not writing path_in_schema
Which issue does this PR close?
none
Rationale for this change
This is a proof of concept implementation for apache/parquet-format#563
What changes are included in this PR?
Since version 57.0.0, this crate has been tolerant of a missing
path_in_schema. This PR adds options to cease writing the field as well. The option defaults to continuing to write the field.See related discussion on parquet mailing list: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
Are these changes tested?
Yes
Are there any user-facing changes?
No, this only adds an optional behavior change that defaults to no change
Related PRs
path_in_schemaoptional parquet-format#563ColumnMetaData.path_in_schemaoptional parquet-format#564path_in_schemaoptional parquet-java#3470