[Spark] Clean up the read-time CDF changelog implementation #7075

Open
SanJSp wants to merge 14 commits into delta-io:master from SanJSp:changelog-cleanup


@SanJSp SanJSp commented Jun 23, 2026

Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Follow-up cleanup for the read-time CDF (changelog) reader added in #6794 / #6830 / #6886, addressing reviewer feedback. The changelog V2 reader stays gated behind spark.databricks.delta.changelogV2.enabled and is master-only (unreleased).

Changes, each in its own commit:

  • Terminology: rename "Auto-CDF" to "read-time CDF" across changelog comments, docs and config descriptions (comment/string text only). (comment)
  • Schema check: replace strict schema equality in DeltaChangelogBatch with SchemaUtils.isReadCompatible, so additive, read-compatible changes (new columns, relaxed nullability) are accepted and internal field metadata is ignored. The check is wrapped in a small isReadCompatible helper on the v2 SchemaUtils (forbidTightenNullability = true, matching CDCReaderBase). (comment)
  • Naming: rename the ambiguous dataSchema field to endDataSchema. (comment)
  • Row-tracking detection: a Metadata action fully replaces the prior table configuration, so an absent delta.enableRowTracking key now means the table default (disabled) rather than an inherited value; the key is read via DeltaConfigs.ROW_TRACKING_ENABLED instead of a string literal. This makes the in-range row-tracking-disabled check correct; the boundary case is already caught earlier in DeltaChangelogScanBuilder. (supersedes, config helper, in-range check)
  • Exceptions: replace generic RuntimeException wrapping in the changelog read path with a DELTA_CHANGELOG_READ_FAILED error class. Causes that already carry a Spark error class are rethrown unchanged so their user-facing class is preserved. (comment, comment)
  • Docs: add a class-level comment describing the DeltaChangelogBatch read flow.

How was this patch tested?

Does this PR introduce any user-facing changes?

No

@SanJSp SanJSp marked this pull request as ready for review June 24, 2026 12:20
SanJSp added 14 commits June 24, 2026 12:34
Replaces the "Auto-CDF" wording with "read-time CDF" across changelog
comments, docs and config descriptions. Comment/string text only, no
code identifiers changed.

Reviewer comment:
delta-io#6794 (comment)

The schema field holds the ending-version schema used as the read schema
for the changelog range. Renames it to endDataSchema to disambiguate from
per-commit schemas.

Reviewer comment:
delta-io#6794 (comment)

The changelog read path compared per-commit and start schemas to the end
schema with strict equality. This rejects additive, read-compatible changes
(new columns, relaxed nullability) and trips on internal field metadata.
Replaces both checks with a SchemaUtils.isReadCompatible helper in the v2
schema utils that delegates to Delta's read-compatibility check
(forbidTightenNullability = true, matching CDCReaderBase).
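
To illustrate the accept/reject behavior described above, here is a minimal sketch of a read-compatibility check over flat schemas. It is not Delta's SchemaUtils.isReadCompatible: the ReadCompatSketch and Column names are hypothetical, nested types and field metadata are left out, and treating any type change as incompatible is an assumption made for brevity.

```java
import java.util.Map;

/** Hypothetical toy model of a read-compatibility check; not Delta's SchemaUtils. */
final class ReadCompatSketch {

  /** A flat column description: type name plus nullability (assumed shape). */
  static final class Column {
    final String type;
    final boolean nullable;

    Column(String type, boolean nullable) {
      this.type = type;
      this.nullable = nullable;
    }
  }

  /** Returns true if data written with {@code existing} can be read using {@code read}. */
  static boolean isReadCompatible(Map<String, Column> existing, Map<String, Column> read) {
    for (Map.Entry<String, Column> entry : existing.entrySet()) {
      Column before = entry.getValue();
      Column after = read.get(entry.getKey());
      if (after == null) {
        return false; // column was dropped
      }
      if (!after.type.equals(before.type)) {
        return false; // assumption: any type change is incompatible
      }
      if (before.nullable && !after.nullable) {
        return false; // nullability was tightened
      }
    }
    return true; // columns that exist only in `read` are additive
  }
}
```

In this model the end schema plays the role of `read`, and the start schema and each per-commit schema are passed as `existing`.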

Reviewer comment:
delta-io#6794 (comment)

A Metadata action fully replaces the prior table configuration: all configs
are listed explicitly. The changelog loop treated an absent row-tracking key
as "inherit the prior value" and only failed on an explicit non-"true" value,
so a Metadata commit that drops row tracking could slip through. Treats an
absent key as the table default (disabled), and reads the config key from
DeltaConfigs.ROW_TRACKING_ENABLED instead of a string literal. The in-range
check now correctly rejects row tracking being disabled within the range; the
boundary case is already caught earlier in DeltaChangelogScanBuilder.
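
The difference between the two interpretations can be sketched as follows. This is an illustration over plain string maps, not the actual changelog loop: RowTrackingSketch and its method names are hypothetical, and parsing the value with Boolean.parseBoolean is an assumption standing in for the real config parsing.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the in-range row-tracking check; not the actual Delta code. */
final class RowTrackingSketch {
  static final String ROW_TRACKING_KEY = "delta.enableRowTracking";

  /** Old interpretation: an absent key inherits the prior (enabled) value. */
  static boolean legacyAccepts(Map<String, String> metadataConfig) {
    String raw = metadataConfig.get(ROW_TRACKING_KEY);
    return raw == null || "true".equals(raw);
  }

  /** New interpretation: a Metadata action replaces the whole config, so absent means disabled. */
  static boolean isEnabled(Map<String, String> metadataConfig) {
    return Boolean.parseBoolean(metadataConfig.getOrDefault(ROW_TRACKING_KEY, "false"));
  }

  /** Returns the index of the first in-range Metadata config that disables row tracking, or -1. */
  static int firstDisabledIndex(List<Map<String, String>> inRangeMetadataConfigs) {
    for (int i = 0; i < inRangeMetadataConfigs.size(); i++) {
      if (!isEnabled(inRangeMetadataConfigs.get(i))) {
        return i;
      }
    }
    return -1;
  }
}
```

A Metadata config that omits the key entirely passes legacyAccepts but is reported by firstDisabledIndex, which is the case the commit message describes as slipping through.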

Reviewer comments:
delta-io#6794 (comment)
delta-io#6794 (comment)
delta-io#6794 (comment)

The changelog read path wrapped unexpected failures while processing commit
actions and planning input partitions in bare RuntimeExceptions, discarding
the user-facing error class machinery. Adds a DELTA_CHANGELOG_READ_FAILED
error class and a throwChangelogReadFailed helper: a cause that already
carries a Spark error class is rethrown unchanged (preserving e.g.
DELTA_CHANGELOG_ROW_TRACKING_DISABLED_IN_RANGE), anything else is wrapped in
DELTA_CHANGELOG_READ_FAILED instead of a RuntimeException.
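
The rethrow-or-wrap rule can be sketched as below. The HasErrorClass interface is a hypothetical stand-in for Spark's SparkThrowable, ClassifiedException is an invented exception type, and the helper returns the exception for the caller to throw, which may differ from how throwChangelogReadFailed is actually shaped. Only runtime exceptions are passed through unchanged here, as a simplification.

```java
/** Hypothetical sketch of the rethrow-or-wrap rule; not the actual Delta helper. */
final class ChangelogErrorSketch {

  /** Stand-in for an exception type that carries a user-facing error class. */
  interface HasErrorClass {
    String getErrorClass();
  }

  /** Invented exception type that carries an error class. */
  static final class ClassifiedException extends RuntimeException implements HasErrorClass {
    private final String errorClass;

    ClassifiedException(String errorClass, String message, Throwable cause) {
      super(message, cause);
      this.errorClass = errorClass;
    }

    @Override
    public String getErrorClass() {
      return errorClass;
    }
  }

  /** Returns the cause itself if it is already classified, otherwise a wrapping exception. */
  static RuntimeException changelogReadFailed(Throwable cause) {
    if (cause instanceof HasErrorClass && cause instanceof RuntimeException) {
      return (RuntimeException) cause; // keep the existing user-facing error class
    }
    return new ClassifiedException(
        "DELTA_CHANGELOG_READ_FAILED", "Failed to read the changelog", cause);
  }
}
```

A caller would write `throw ChangelogErrorSketch.changelogReadFailed(e);` inside its catch block, so the compiler still sees the branch as terminating.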

Reviewer comments:
https://github.com/delta-io/delta/pull/6830/files#r3301759577
delta-io#6886 (comment)

Adds a class-level comment describing how the changelog batch turns a commit
range into CDC input partitions and where the CDC tail columns are appended.

Reflows comments to satisfy javafmtCheck (google-java-format) after the
wording and isReadCompatible changes above.

The suite already rejects row tracking disabled within the range; these add
the positive cases: toggling row tracking off (and back on) entirely before
the range, and disabling it after the range, must not fail a read whose range
stays row-tracking-enabled throughout.

Reads the in-range row-tracking flag via TableConfig.ROW_TRACKING_ENABLED
.fromMetadata(md) (kernel Metadata is already in scope), which returns a
Boolean with parsing and the table default baked in, instead of fetching the
raw config string and parsing it by hand. Matches the existing pattern in
ProtocolMetadataAdapterV2 and drops the DeltaConfigs Scala interop. Also
extracts the duplicated start/per-commit schema check into a private
requireReadCompatible helper.

Converting the schema check to isReadCompatible left the rejection branch
(DELTA_CHANGELOG_SCHEMA_CHANGE_IN_RANGE) untested. Adds an integration test
for a non-read-compatible mid-range change (DROP COLUMN) and direct unit tests
for SchemaUtils.isReadCompatible (additive / relaxed-nullability accepted,
dropped column / tightened nullability rejected). Also fixes a doubled
"read-time CDF read path" comment.

Locks the two branches: a SparkThrowable cause is rethrown unchanged (its
user-facing error class preserved), any other cause is wrapped in
DELTA_CHANGELOG_READ_FAILED.
@SanJSp SanJSp force-pushed the changelog-cleanup branch from 39ec98d to 6bfdb4a Compare June 24, 2026 12:34