[Spark] Allow USING INVENTORY table identifier to resolve non-Delta sources #7037

Open
awbarbeau wants to merge 2 commits into delta-io:master from awbarbeau:feature/vacuum-inventory-non-delta-source

@awbarbeau awbarbeau commented Jun 16, 2026

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #7036.

VACUUM ... USING INVENTORY <table_identifier> previously called getDeltaTable(p, "VACUUM").toDf(sparkSession) on the resolved identifier plan, requiring the inventory source to be a Delta table. The downstream consumer VacuumCommand.getFilesFromInventory only needs a DataFrame matching INVENTORY_SCHEMA. The subquery path already accepts any analyzable relation, so the identifier-only Delta restriction is unnecessary and inconsistent.

This PR resolves the analyzed plan via Dataset.ofRows(...) so identifier and subquery paths behave the same way. Schema validation in getFilesFromInventory continues to reject malformed sources.
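The point that the source format no longer matters, while the schema still does, can be illustrated with a toy model. This is a sketch only: `Column`, `Source`, and `InventoryCheck` are hypothetical names that do not exist in Delta or Spark, the column list is an assumption about what INVENTORY_SCHEMA contains, and the real check in getFilesFromInventory may differ in detail (for example in how it treats extra columns or nullability).

```scala
// Toy model only: none of these types are Delta or Spark classes.
final case class Column(name: String, dataType: String)
final case class Source(format: String, columns: Seq[Column])

object InventoryCheck {
  // Assumed inventory columns, listed here for illustration only.
  val required: Seq[Column] = Seq(
    Column("path", "string"),
    Column("length", "long"),
    Column("isDir", "boolean"),
    Column("modificationTime", "long")
  )

  // Accepts a source when every required column is present with the expected type.
  // The source format is deliberately never consulted.
  def validate(source: Source): Either[String, Source] = {
    val missing = required.filterNot(source.columns.contains)
    if (missing.isEmpty) Right(source)
    else Left(
      "DELTA_INVALID_INVENTORY_SCHEMA: missing or mistyped columns: " +
        missing.map(_.name).mkString(", ")
    )
  }
}
```

In this model a Parquet-backed source with the full column set is accepted, and a Delta-backed source that lacks columns is rejected, which mirrors the behavior this PR aims for on the identifier path.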

Target-table controls are unchanged; the VACUUM target still goes through Delta-specific safety and protocol checks.

How was this patch tested?

Added the test "vacuum using inventory non-Delta table identifier" to DeltaVacuumSuite. It covers a Parquet inventory source registered as a managed table. Existing inventory tests provide regression coverage for the unchanged paths.

Does this PR introduce any user-facing changes?

Yes (relaxation only). VACUUM ... USING INVENTORY <identifier> no longer requires the inventory source to be a Delta table. Sources with a wrong schema still fail with DELTA_INVALID_INVENTORY_SCHEMA. Non-breaking change.
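For reviewers who want to try the relaxed behavior, here is a minimal usage sketch. It assumes a build that includes this change, a local Spark session with the Delta extensions on the classpath, and that the inventory columns are path, length, isDir, and modificationTime (my reading of INVENTORY_SCHEMA; please check it against your Delta version). The table names `events` and `events_inventory` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object VacuumWithParquetInventory {
  def main(args: Array[String]): Unit = {
    // Local session with the Delta SQL extension and catalog enabled.
    val spark = SparkSession.builder()
      .appName("vacuum-inventory-sketch")
      .master("local[*]")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Hypothetical Delta target table; it still goes through the usual VACUUM checks.
    spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT) USING delta")

    // Hypothetical inventory stored as Parquet rather than Delta.
    // The column list is an assumption about INVENTORY_SCHEMA.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS events_inventory (
        |  path STRING,
        |  length BIGINT,
        |  isDir BOOLEAN,
        |  modificationTime BIGINT
        |) USING parquet""".stripMargin)

    // Identifier form of USING INVENTORY; DRY RUN only lists deletion candidates.
    spark.sql("VACUUM events USING INVENTORY events_inventory DRY RUN")
      .show(truncate = false)

    spark.stop()
  }
}
```

Without this change, the same statement is expected to fail on the identifier path because `events_inventory` is not a Delta table, whereas the equivalent subquery form, `USING INVENTORY (SELECT * FROM events_inventory)`, is already accepted.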

…ources

Currently VACUUM ... USING INVENTORY <table_identifier> requires the
inventory source to be a Delta table. The inventory is ultimately consumed
as a DataFrame and validated against INVENTORY_SCHEMA in
VacuumCommand.getFilesFromInventory, so the Delta-only restriction on the
identifier path is unnecessary and inconsistent with the subquery path,
which already accepts any analyzable relation.

This change resolves the inventoryTable plan via Dataset.ofRows so the
identifier and subquery paths behave the same way. Existing inventory
schema validation still rejects malformed sources.

Scope:
- Only changes how the inventory source is resolved.
- Does not change target-table controls; the VACUUM target still goes
  through Delta-specific safety and protocol checks.

Resolves delta-io#7036

Signed-off-by: Alex Barbeau <30359706+awbarbeau@users.noreply.github.com>
- Update VacuumTableCommand class scaladoc to reflect that the inventory
  source is no longer required to be a Delta table.
- Test improvements:
  * Rename test "non-delta" -> "non-Delta" for consistent capitalization.
  * Rename local val from inventoryTable to inventoryTableName to avoid
    shadowing the case class field name.
  * Drop hidden-directory rows from the inventory data, since hidden-file
    handling is already covered by the adjacent
    "vacuum using inventory delta table and should not touch hidden files"
    test. The non-Delta test now focuses solely on identifier resolution.

Signed-off-by: Alex Barbeau <30359706+awbarbeau@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Feature Request][Spark] Allow USING INVENTORY table identifier to resolve non-Delta sources