
[Spark] Add DSv2 ReplaceData write path #7051

Open
murali-db wants to merge 1 commit into delta-io:master from murali-db:dsv2-dml-nr-pr2-replace-data


@murali-db

Collaborator

What changes were proposed in this pull request?

Add the copy-on-write ReplaceData write path to the Kernel-based DSv2 Spark connector, so row-level operations that rewrite touched files (e.g. UPDATE / DELETE planned as ReplaceData) can commit through the connector.

  • Introduce the ReplaceData write classes:
    • DeltaReplaceDataOperation captures the configured SparkScan and exposes the scan-selected AddFiles that need to be replaced.
    • DeltaReplaceDataWriteBuilder / DeltaReplaceDataWriterFactory / DeltaReplaceDataWriter build and run the executor-side write of the replacement data files.
    • DeltaReplaceDataBatchWrite drives the Kernel transaction on the driver: it snapshots the scan-selected files up front (so commit() and createBatchWriterFactory() agree on the exact same file set), writes the new data files, and commits remove-file actions for the replaced files alongside the new add-file actions.
    • DeltaReplaceDataCommitMessage carries the per-task writer results back to the driver for commit.
  • Route ReplaceData through DeltaV2WriteBuilder, capturing the SparkScan selection before newWriteBuilder() is invoked.
  • Derive remove-file partition values from the scan snapshot via PartitionUtils, reusing the existing partition / _metadata plumbing in SparkScan and DeltaV2Table.
  • Make the engine-info helper on DeltaV2BatchWrite package-private so the new write path can reuse it.
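For reviewers, the snapshot-then-commit idea behind DeltaReplaceDataBatchWrite can be sketched roughly as follows. This is a simplified model, not the code in this PR: `ReplaceDataSketch`, its nested `BatchWrite` class, `filesSeenByWriterFactory`, and the string-encoded "actions" are all hypothetical stand-ins, and no Spark or Kernel APIs are used. It only illustrates why the scan selection is captured once up front, so a later change to the live selection cannot make the writer factory and the commit disagree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ReplaceDataSketch {

    /** Hypothetical stand-in for the driver-side batch write. */
    static final class BatchWrite {
        // Immutable copy of the scan-selected file paths, taken once.
        private final List<String> filesToReplace;

        BatchWrite(Supplier<List<String>> scanSelection) {
            // Snapshot up front: later changes to the live selection are ignored.
            this.filesToReplace = List.copyOf(scanSelection.get());
        }

        /** The file set the writer factory would be built against. */
        List<String> filesSeenByWriterFactory() {
            return filesToReplace;
        }

        /** Builds one remove per replaced file, then one add per new file. */
        List<String> commit(List<List<String>> taskMessages) {
            List<String> actions = new ArrayList<>();
            for (String path : filesToReplace) {
                actions.add("remove:" + path);
            }
            // Each inner list models one task's commit message (new data files).
            for (List<String> message : taskMessages) {
                for (String path : message) {
                    actions.add("add:" + path);
                }
            }
            return actions;
        }
    }

    public static void main(String[] args) {
        List<String> liveSelection =
            new ArrayList<>(List.of("part-0.parquet", "part-1.parquet"));
        BatchWrite write = new BatchWrite(() -> liveSelection);

        // Mutating the live selection after the snapshot has no effect.
        liveSelection.add("part-2.parquet");

        List<String> actions = write.commit(
            List.of(List.of("new-0.parquet"), List.of("new-1.parquet")));
        // prints [remove:part-0.parquet, remove:part-1.parquet, add:new-0.parquet, add:new-1.parquet]
        System.out.println(actions);
    }
}
```

In the sketch, both `filesSeenByWriterFactory()` and `commit()` read the same immutable list, which mirrors the stated goal that commit() and createBatchWriterFactory() agree on the exact same file set.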

How was this patch tested?

  • Added DeltaReplaceDataBatchWriteTest covering the driver-side transaction wiring: scan-selection snapshotting, remove-file derivation from the captured scan, and commit-message aggregation.
  • Added DeltaReplaceDataWriterTest covering the executor-side writer factory and writer, including the data-file commit messages they produce.

Does this PR introduce any user-facing change?

No. The DSv2 Kernel connector lives under io.delta.spark.internal.v2; this change adds an internal write path and does not change any public API or existing behavior.

@murali-db force-pushed the dsv2-dml-nr-pr2-replace-data branch 3 times, most recently from 96073f9 to 3b770fb (June 22, 2026 18:05)
@murali-db force-pushed the dsv2-dml-nr-pr2-replace-data branch from 3b770fb to 7575417 (June 22, 2026 21:04)