Document the udt (UserDefinedType) data type disposition #7093

Draft
sanujbasu wants to merge 1 commit into delta-io:master from sanujbasu:udt-protocol-note

@sanujbasu

Contributor

Description

Spark writes UserDefinedType (udt) columns into metaData.schemaString, but the protocol's schema serialization type system does not define udt (only primitive / struct / array / map / variant). Such columns are therefore non-conformant today even though they already exist in tables in the wild, and a reader that rejects the unknown type fails to read the entire table.

This adds a note under Schema Serialization Format > Primitive Types, parallel to the existing void note, documenting the disposition rather than gating it behind a table feature:

Existing tables may contain columns of Spark's udt (UserDefinedType) complex type... A reader that does not implement that engine code MUST interpret the column as its physical sqlType; the sqlType is the on-disk Parquet representation.

Why no table feature: udt introduces no new physical representation. It is an engine-specific annotation (class/pyClass reference JVM/Python deserialization code) over an existing physical type. A UDT-unaware reader that reads the sqlType reads correct data, so unlike timestampNtz/variant there is nothing for a reader to opt into. This mirrors the void precedent.
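To illustrate the fallback described above, here is a minimal Python sketch of how a UDT-unaware reader might resolve a `udt` node to its `sqlType` while walking a parsed `schemaString`. The field names, the `com.example.PointUDT` / `example.PointUDT` class names, and the `physical_type` helper are hypothetical and exist only for this example. The sketch also assumes that `sqlType` uses the same serialization as other Delta schema types, which the note does not yet state explicitly.

```python
import json

# Hypothetical schemaString with one udt column. The class names are made up;
# the key layout ("type", "class", "pyClass", "sqlType") follows the proposed note.
SCHEMA_STRING = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "point",
            "type": {
                "type": "udt",
                "class": "com.example.PointUDT",
                "pyClass": "example.PointUDT",
                "sqlType": {
                    "type": "array",
                    "elementType": "double",
                    "containsNull": False,
                },
            },
            "nullable": True,
            "metadata": {},
        },
    ],
})


def physical_type(dtype):
    """Return a copy of a schema type with every udt replaced by its sqlType.

    Assumption: sqlType is itself a Delta schema type, so it may nest
    structs, arrays, maps, or further udt nodes.
    """
    if isinstance(dtype, str):
        # Primitive type name such as "long" or "double".
        return dtype
    kind = dtype["type"]
    if kind == "udt":
        # Ignore class/pyClass/serializedClass and read the physical type.
        return physical_type(dtype["sqlType"])
    if kind == "struct":
        fields = [{**f, "type": physical_type(f["type"])} for f in dtype["fields"]]
        return {**dtype, "fields": fields}
    if kind == "array":
        return {**dtype, "elementType": physical_type(dtype["elementType"])}
    if kind == "map":
        return {
            **dtype,
            "keyType": physical_type(dtype["keyType"]),
            "valueType": physical_type(dtype["valueType"]),
        }
    # Any other complex type is passed through unchanged.
    return dtype


read_schema = physical_type(json.loads(SCHEMA_STRING))
```

In this sketch `read_schema` is only the reader's view of the table. The input is left unmodified, since a writer would still need the original annotation to retain it in `schemaString`.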

A companion kernel implementation (DataType::UserDefined read support in delta-kernel-rs) is proposed separately.

How was this patch tested?

Documentation-only change.

Does this PR introduce any user-facing changes?

No behavioral change. Documents the disposition of an existing, previously-undocumented data type that Spark already writes.

Authored with assistance from Claude Code.

Spark writes UserDefinedType (`udt`) columns into `schemaString`, but the
protocol's schema type system does not define `udt`, so such columns are
non-conformant today even though they already exist in tables in the wild. Add
a note (parallel to the existing `void` note) documenting the disposition: a
reader that cannot run the engine's deserialization code MUST read the column
as its physical `sqlType`, which is the on-disk Parquet representation.

No table feature is introduced: `udt` adds no new physical representation (it is
an annotation over an existing physical type), so a UDT-unaware reader that
reads the `sqlType` reads correct data.

Co-authored-by: Isaac
Comment thread PROTOCOL.md

Note: Existing tables may have `void` data type columns. Behavior is undefined for `void` data type columns but it is recommended to drop any `void` data type columns on reads (as is implemented by the Spark connector).

Note: Existing tables may contain columns of Spark's `udt` (UserDefinedType) complex type, serialized as `{"type":"udt", "class"/"pyClass"/"serializedClass", "sqlType": <type>}`. The `class`/`pyClass` identify engine-specific (JVM/Python) deserialization code and are not part of this protocol. A reader that does not implement that engine code MUST interpret the column as its physical `sqlType`; the `sqlType` is the on-disk Parquet representation. Writers that preserve a `udt` column MUST store its data physically as `sqlType` and retain the annotation in `schemaString`.

Member

  1. I think we should expand the example to make it clearer what the full structure can look like. This is quite hard to read right now.
  2. I'm not clear what "that engine code" means
  3. Is sqlType a parquet or delta schema? (e.g. could it have a timestampNtz column in it?)
  4. I think we want to say "Writers MUST preserve udt columns", it's not an option to just drop them :)

Comment thread PROTOCOL.md

Collaborator

Are there any caveats about column mapping with these types?

Comment thread PROTOCOL.md

Collaborator

Does Spark write information about UDTs into the Parquet files themselves today? If so, is it required?
