Document the udt (UserDefinedType) data type disposition #7093
Draft
sanujbasu wants to merge 1 commit into
Spark writes UserDefinedType (`udt`) columns into `schemaString`, but the protocol's schema type system does not define `udt`, so such columns are non-conformant today even though they already exist in tables in the wild.

Add a note (parallel to the existing `void` note) documenting the disposition: a reader that cannot run the engine's deserialization code MUST read the column as its physical `sqlType`, which is the on-disk Parquet representation.

No table feature is introduced: `udt` adds no new physical representation (it is an annotation over an existing physical type), so a UDT-unaware reader that reads the `sqlType` reads correct data.

Co-authored-by: Isaac
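To make the disposition concrete, here is a minimal sketch of how a UDT-unaware reader might resolve `udt` nodes to their physical `sqlType` while parsing `schemaString`. It assumes a `udt` node is a JSON object carrying `type`, `class`, `pyClass`, and `sqlType` keys; the helper name `strip_udt` and the class `com.example.PointUDT` are hypothetical and used only for illustration.

```python
import json


def strip_udt(dtype):
    """Return dtype with every udt node replaced by its physical sqlType."""
    if not isinstance(dtype, dict):
        # Primitive types are plain strings such as "long" or "double".
        return dtype
    kind = dtype.get("type")
    if kind == "udt":
        # Ignore the engine-specific class/pyClass annotation and keep
        # only the physical type. The sqlType may itself nest udt nodes.
        return strip_udt(dtype["sqlType"])
    if kind == "struct":
        fields = [{**f, "type": strip_udt(f["type"])} for f in dtype["fields"]]
        return {**dtype, "fields": fields}
    if kind == "array":
        return {**dtype, "elementType": strip_udt(dtype["elementType"])}
    if kind == "map":
        return {
            **dtype,
            "keyType": strip_udt(dtype["keyType"]),
            "valueType": strip_udt(dtype["valueType"]),
        }
    return dtype


# Hypothetical schemaString with one udt column (assumed field layout).
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "point",
            "type": {
                "type": "udt",
                "class": "com.example.PointUDT",  # hypothetical class name
                "pyClass": None,
                "sqlType": {
                    "type": "array",
                    "elementType": "double",
                    "containsNull": False,
                },
            },
            "nullable": True,
            "metadata": {},
        },
    ],
})

physical = strip_udt(json.loads(schema_string))
```

After this pass, the `point` column is typed as a plain array of doubles, which is the representation the summary above says is stored on disk. A real reader would likely fold this into its schema deserializer rather than run it as a separate pass.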
Description
Spark writes `UserDefinedType` (`udt`) columns into `metaData.schemaString`, but the protocol's schema serialization type system does not define `udt` (only primitive / struct / array / map / variant). Such columns are therefore non-conformant today even though they already exist in tables in the wild, and a reader that rejects the unknown type fails to read the entire table.

This adds a note under Schema Serialization Format > Primitive Types, parallel to the existing `void` note, documenting the disposition rather than gating it behind a table feature.

Why no table feature:
`udt` introduces no new physical representation. It is an engine-specific annotation (`class`/`pyClass` reference JVM/Python deserialization code) over an existing physical type. A UDT-unaware reader that reads the `sqlType` reads correct data, so unlike `timestampNtz`/`variant` there is nothing for a reader to opt into. This mirrors the `void` precedent.

A companion kernel implementation (`DataType::UserDefined` read support in delta-kernel-rs) is proposed separately.

How was this patch tested?
Documentation-only change.
Does this PR introduce any user-facing changes?
No behavioral change. Documents the disposition of an existing, previously-undocumented data type that Spark already writes.
Authored with assistance from Claude Code.