Skip to content

fix: evaluate ManifestGroup file filters#664

Open
fallintoplace wants to merge 3 commits into
apache:mainfrom
fallintoplace:fix/manifest-group-filter-files-datafile-evaluator
Open

fix: evaluate ManifestGroup file filters#664
fallintoplace wants to merge 3 commits into
apache:mainfrom
fallintoplace:fix/manifest-group-filter-files-datafile-evaluator

Conversation

@fallintoplace
Copy link
Copy Markdown
Contributor

@fallintoplace fallintoplace commented May 20, 2026

Summary

Fixes #663.

This implements ManifestGroup::FilterFiles() instead of accepting the filter as a silent no-op. The change adds a DataFile StructLike wrapper, binds a file-level evaluator against file metadata, and applies it to each manifest entry before returning it.

The evaluator schema is aligned with Java ManifestGroup: concrete partition fields are not exposed under partition, so callers should use FilterData(...) for logical data predicates. Supported file metadata includes spec_id, and unsupported concrete partition filters fail during binding rather than being ignored.

The reader projection now includes file-filter referenced metadata columns when callers use Select(...). Empty container-backed optional metadata is exposed as null so projected or missing metadata does not look present to the evaluator.

Tests

  • uvx pre-commit run --files src/iceberg/manifest/manifest_group.cc src/iceberg/manifest/manifest_entry.h src/iceberg/row/manifest_wrapper.cc src/iceberg/row/manifest_wrapper.h src/iceberg/test/manifest_group_test.cc
  • cmake --build build --target manifest_test -j 8
  • ./build/src/iceberg/test/manifest_test --gtest_filter='ManifestGroupVersions/ManifestGroupTest.FilterFiles*'
  • ./build/src/iceberg/test/manifest_test

ICEBERG_UNWRAP_OR_FAIL(
auto group,
ManifestGroup::Make(file_io_, schema_, GetSpecsById(), std::move(manifests)));
group->FilterFiles(Expressions::Equal("partition.data_bucket_16_2", Literal::Int(1)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that partition field name like data_bucket_16_2 is not spec-compliant but implementation-specific.

Per the spec from https://iceberg.apache.org/spec/#partitioning, users are discouraged to directly write partition filters: the partitioning can change and the correct partition filters are always derived from column predicates.

Java ManifestGroup builds the file evaluator with DataFile.getType(EMPTY_STRUCT), so it does not allow binding concrete partition fields under partition. For logical data predicates, callers should use FilterData(...), letting projection derive the appropriate partition/manifest filters for each spec. I'd recommend to be aligned with the Java approach at this moment.

}

const auto& data_file = data_file_.get();
switch (static_cast<DataFileFieldPosition>(pos)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose partition_spec_id field as an optional field? FYI Java puts it at pos 3 (zero-based).

Comment thread src/iceberg/row/manifest_wrapper.cc Outdated
case DataFileFieldPosition::kUpperBounds:
return std::make_shared<IntMapLike<std::vector<uint8_t>>>(data_file.upper_bounds);
case DataFileFieldPosition::kKeyMetadata:
return ToView(data_file.key_metadata);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return std::monostate if any std container is empty? The main point is that we do not use std::optional to represent a missing value for these fields. Besides, fields of DataFileStructLike might be missing when ManifestGroup has specified projection.

Comment thread src/iceberg/row/manifest_wrapper.cc Outdated
size_t size() const override { return values_.get().size(); }

private:
std::reference_wrapper<const std::vector<T>> values_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to use std::span<const T> here?

Comment thread src/iceberg/manifest/manifest_group.cc Outdated
if (file_filter_ && file_filter_->op() != Expression::Operation::kTrue &&
!columns.empty() &&
std::ranges::find(columns, Schema::kAllColumns) == columns.end()) {
auto spec_iter = specs_by_id_.find(manifest.partition_spec_id);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As my other comment, partition_spec_id varies among manifest files so let's remove its support for now.

@fallintoplace
Copy link
Copy Markdown
Contributor Author

Updated after the review feedback:

  • bind file filters against DataFile metadata with an empty partition struct, matching Java ManifestGroup
  • make concrete partition.* file filters fail during binding instead of exposing implementation-specific partition field names
  • expose spec_id through DataFileStructLike for file-level filters
  • return null for empty container-backed optional metadata
  • use std::span in the vector-backed array wrapper
  • refresh regression tests for record count, selected metadata columns, spec_id, and rejected partition metadata filters

Local validation passed with pre-commit, the focused FilterFiles* manifest tests, and the full manifest_test suite.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving this! I have some questions with regards to the test.


auto result = group->Entries();
EXPECT_THAT(result, IsError(ErrorKind::kInvalidExpression));
EXPECT_THAT(result, HasErrorMessage("Cannot find field 'partition.data_bucket_16_2'"));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't skip unknown predicates? Usually we just ignore unknown or malformed predicates in practice (e.g. assume they are always evaluated to match), otherwise any malformed predicate can prohibit any table scan to proceed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the Java behavior before changing this. Java ManifestGroup builds the file evaluator with DataFile.getType(EMPTY_STRUCT), and Evaluator binds the expression eagerly, so unknown fields fail binding rather than being treated as match-all.

I kept the C++ behavior aligned with that. In this path, silently ignoring malformed or unknown FilterFiles(...) predicates would broaden the scan and hide caller bugs, which felt riskier than failing fast.

For logical predicates on table columns, callers should use FilterData(...), which can then be projected into the appropriate partition/manifest filters per spec.

ICEBERG_UNWRAP_OR_FAIL(
auto group,
ManifestGroup::Make(file_io_, schema_, GetSpecsById(), std::move(manifests)));
group->FilterFiles(Expressions::GreaterThanOrEqual("record_count", Literal::Long(10)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does case sensitivity matter here? For example what if users set case-inventive and provide RECORD_COUNT as the column name?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ManifestGroup::FilterFiles silently ignores file filters

2 participants