Document the udt (UserDefinedType) data type disposition #7093

Draft
sanujbasu wants to merge 1 commit into delta-io:master from sanujbasu:udt-protocol-note

@sanujbasu

Contributor

Description

Spark writes UserDefinedType (udt) columns into metaData.schemaString, but the protocol's schema serialization type system defines only primitive, struct, array, map, and variant types, not udt. Such columns are therefore non-conformant today even though they already exist in tables in the wild, and a reader that rejects the unknown type fails to read the entire table.
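For illustration, here is a minimal Python sketch of how one might locate udt columns in a schemaString. The helper name `find_udt_columns`, the example class names (`com.example.PointUDT`, `example.PointUDT`), and the example schema are hypothetical; the udt JSON shape (`type`, `class`, `pyClass`, `sqlType`) follows the fields named in this description and is an assumption about what Spark emits, not normative protocol text.

```python
import json

# Hypothetical schemaString with one udt column. The class names are made up.
EXAMPLE_SCHEMA_STRING = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "location",
            "type": {
                "type": "udt",
                "class": "com.example.PointUDT",   # hypothetical JVM class
                "pyClass": "example.PointUDT",     # hypothetical Python class
                "sqlType": {
                    "type": "array",
                    "elementType": "double",
                    "containsNull": False,
                },
            },
            "nullable": True,
            "metadata": {},
        },
    ],
})


def find_udt_columns(schema_string):
    """Return dotted paths of every field whose type is a udt, at any nesting level."""
    found = []

    def walk(dtype, path):
        if not isinstance(dtype, dict):
            return  # primitive types are serialized as plain strings, e.g. "long"
        kind = dtype.get("type")
        if kind == "udt":
            found.append(path)
        elif kind == "struct":
            for field in dtype["fields"]:
                child = f"{path}.{field['name']}" if path else field["name"]
                walk(field["type"], child)
        elif kind == "array":
            walk(dtype["elementType"], f"{path}.element")
        elif kind == "map":
            walk(dtype["keyType"], f"{path}.key")
            walk(dtype["valueType"], f"{path}.value")

    walk(json.loads(schema_string), "")
    return found


print(find_udt_columns(EXAMPLE_SCHEMA_STRING))  # prints ['location']
```

A reader that treats `udt` as an unknown type would fail on the `location` field here, which is the whole-table read failure described above.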

This adds a note under Schema Serialization Format > Primitive Types, parallel to the existing void note, documenting the disposition rather than gating it behind a table feature:

Existing tables may contain columns of Spark's udt (UserDefinedType) complex type... A reader that does not implement that engine code MUST interpret the column as its physical sqlType; the sqlType is the on-disk Parquet representation.

Why no table feature: udt introduces no new physical representation. It is an engine-specific annotation over an existing physical type: its class and pyClass fields reference JVM and Python deserialization code. A UDT-unaware reader that reads the sqlType reads correct data, so unlike timestampNtz or variant there is nothing for a reader to opt into. This mirrors the void precedent.
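The fallback for a UDT-unaware reader can be sketched as a recursive rewrite of the parsed schema that replaces each udt node with its sqlType. This is only an illustration in Python, not the proposed delta-kernel-rs implementation; the function name `resolve_udt` and the example schema (including the class names) are hypothetical.

```python
import json


def resolve_udt(dtype):
    """Return a copy of a parsed schema type with every udt replaced by its sqlType."""
    if not isinstance(dtype, dict):
        return dtype  # primitive type name, nothing to resolve
    kind = dtype.get("type")
    if kind == "udt":
        # Drop the engine-specific annotation and keep the physical type.
        return resolve_udt(dtype["sqlType"])
    if kind == "struct":
        fields = [{**f, "type": resolve_udt(f["type"])} for f in dtype["fields"]]
        return {**dtype, "fields": fields}
    if kind == "array":
        return {**dtype, "elementType": resolve_udt(dtype["elementType"])}
    if kind == "map":
        return {
            **dtype,
            "keyType": resolve_udt(dtype["keyType"]),
            "valueType": resolve_udt(dtype["valueType"]),
        }
    return dtype


# Hypothetical schema: a udt whose physical type is a struct of two doubles.
EXAMPLE_SCHEMA = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "location",
            "type": {
                "type": "udt",
                "class": "com.example.PointUDT",   # hypothetical JVM class
                "pyClass": "example.PointUDT",     # hypothetical Python class
                "sqlType": {
                    "type": "struct",
                    "fields": [
                        {"name": "x", "type": "double", "nullable": False, "metadata": {}},
                        {"name": "y", "type": "double", "nullable": False, "metadata": {}},
                    ],
                },
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

resolved = resolve_udt(EXAMPLE_SCHEMA)
print(json.dumps(resolved["fields"][1]["type"], indent=2))
```

After resolution, `location` is an ordinary struct of two doubles, and the input schema is left unmodified. Field-level properties such as `nullable` and `metadata` are carried over unchanged in this sketch.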

A companion kernel implementation (DataType::UserDefined read support in delta-kernel-rs) is proposed separately.

How was this patch tested?

Documentation-only change.

Does this PR introduce any user-facing changes?

No behavioral change. Documents the disposition of an existing, previously-undocumented data type that Spark already writes.

Authored with assistance from Claude Code.

Spark writes UserDefinedType (`udt`) columns into `schemaString`, but the
protocol's schema type system does not define `udt`, so such columns are
non-conformant today even though they already exist in tables in the wild. Add
a note (parallel to the existing `void` note) documenting the disposition: a
reader that cannot run the engine's deserialization code MUST read the column
as its physical `sqlType`, which is the on-disk Parquet representation.

No table feature is introduced: `udt` adds no new physical representation (it is
an annotation over an existing physical type), so a UDT-unaware reader that
reads the `sqlType` reads correct data.

Co-authored-by: Isaac