fix: evaluate ManifestGroup file filters#664
Conversation
| ICEBERG_UNWRAP_OR_FAIL( | ||
| auto group, | ||
| ManifestGroup::Make(file_io_, schema_, GetSpecsById(), std::move(manifests))); | ||
| group->FilterFiles(Expressions::Equal("partition.data_bucket_16_2", Literal::Int(1))); |
There was a problem hiding this comment.
I'm afraid that partition field name like data_bucket_16_2 is not spec-compliant but implementation-specific.
Per the spec from https://iceberg.apache.org/spec/#partitioning, users are discouraged to directly write partition filters: the partitioning can change and the correct partition filters are always derived from column predicates.
Java ManifestGroup builds the file evaluator with DataFile.getType(EMPTY_STRUCT), so it does not allow binding concrete partition fields under partition. For logical data predicates, callers should use FilterData(...), letting projection derive the appropriate partition/manifest filters for each spec. I'd recommend to be aligned with the Java approach at this moment.
| } | ||
|
|
||
| const auto& data_file = data_file_.get(); | ||
| switch (static_cast<DataFileFieldPosition>(pos)) { |
There was a problem hiding this comment.
Should we expose partition_spec_id field as an optional field? FYI Java puts it at pos 3 (zero-based).
| case DataFileFieldPosition::kUpperBounds: | ||
| return std::make_shared<IntMapLike<std::vector<uint8_t>>>(data_file.upper_bounds); | ||
| case DataFileFieldPosition::kKeyMetadata: | ||
| return ToView(data_file.key_metadata); |
There was a problem hiding this comment.
Do we want to return std::monostate if any std container is empty? The main point is that we do not use std::optional to represent a missing value for these fields. Besides, fields of DataFileStructLike might be missing when ManifestGroup has specified projection.
| size_t size() const override { return values_.get().size(); } | ||
|
|
||
| private: | ||
| std::reference_wrapper<const std::vector<T>> values_; |
There was a problem hiding this comment.
Is it better to use std::span<const T> here?
| if (file_filter_ && file_filter_->op() != Expression::Operation::kTrue && | ||
| !columns.empty() && | ||
| std::ranges::find(columns, Schema::kAllColumns) == columns.end()) { | ||
| auto spec_iter = specs_by_id_.find(manifest.partition_spec_id); |
There was a problem hiding this comment.
As my other comment, partition_spec_id varies among manifest files so let's remove its support for now.
|
Updated after the review feedback:
Local validation passed with pre-commit, the focused |
wgtmac
left a comment
There was a problem hiding this comment.
Thanks for improving this! I have some questions with regards to the test.
|
|
||
| auto result = group->Entries(); | ||
| EXPECT_THAT(result, IsError(ErrorKind::kInvalidExpression)); | ||
| EXPECT_THAT(result, HasErrorMessage("Cannot find field 'partition.data_bucket_16_2'")); |
There was a problem hiding this comment.
Why we don't skip unknown predicates? Usually we just ignore unknown or malformed predicates in practice (e.g. assume they are always evaluated to match), otherwise any malformed predicate can prohibit any table scan to proceed.
There was a problem hiding this comment.
I checked the Java behavior before changing this. Java ManifestGroup builds the file evaluator with DataFile.getType(EMPTY_STRUCT), and Evaluator binds the expression eagerly, so unknown fields fail binding rather than being treated as match-all.
I kept the C++ behavior aligned with that. In this path, silently ignoring malformed or unknown FilterFiles(...) predicates would broaden the scan and hide caller bugs, which felt riskier than failing fast.
For logical predicates on table columns, callers should use FilterData(...), which can then be projected into the appropriate partition/manifest filters per spec.
| ICEBERG_UNWRAP_OR_FAIL( | ||
| auto group, | ||
| ManifestGroup::Make(file_io_, schema_, GetSpecsById(), std::move(manifests))); | ||
| group->FilterFiles(Expressions::GreaterThanOrEqual("record_count", Literal::Long(10))); |
There was a problem hiding this comment.
Does case sensitivity matter here? For example what if users set case-inventive and provide RECORD_COUNT as the column name?
Summary
Fixes #663.
This implements
ManifestGroup::FilterFiles()instead of accepting the filter as a silent no-op. The change adds aDataFileStructLikewrapper, binds a file-level evaluator against file metadata, and applies it to each manifest entry before returning it.The evaluator schema is aligned with Java
ManifestGroup: concrete partition fields are not exposed underpartition, so callers should useFilterData(...)for logical data predicates. Supported file metadata includesspec_id, and unsupported concrete partition filters fail during binding rather than being ignored.The reader projection now includes file-filter referenced metadata columns when callers use
Select(...). Empty container-backed optional metadata is exposed as null so projected or missing metadata does not look present to the evaluator.Tests
uvx pre-commit run --files src/iceberg/manifest/manifest_group.cc src/iceberg/manifest/manifest_entry.h src/iceberg/row/manifest_wrapper.cc src/iceberg/row/manifest_wrapper.h src/iceberg/test/manifest_group_test.cccmake --build build --target manifest_test -j 8./build/src/iceberg/test/manifest_test --gtest_filter='ManifestGroupVersions/ManifestGroupTest.FilterFiles*'./build/src/iceberg/test/manifest_test