Document the udt (UserDefinedType) data type disposition by sanujbasu · Pull Request #7093 · delta-io/delta

sanujbasu · 2026-06-25T02:58:14Z

Description

Spark writes UserDefinedType (udt) columns into metaData.schemaString, but the protocol's schema serialization type system does not define udt (only primitive / struct / array / map / variant). Such columns are therefore non-conformant today even though they already exist in tables in the wild, and a reader that rejects the unknown type fails to read the entire table.
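To make the failure mode concrete, here is a minimal sketch of a strict reader that accepts only the type kinds the protocol lists. It is not taken from any real Delta reader. The `udt` object follows the shape quoted in this PR's note; the class name `com.example.PointUDT`, the column names, the abbreviated `PRIMITIVES` set, and the helpers `check_type` and `try_read` are all hypothetical.

```python
import json

# Abbreviated, illustrative set of primitive names (not the full protocol list).
PRIMITIVES = {"string", "long", "integer", "short", "byte", "float", "double",
              "boolean", "binary", "date", "timestamp", "variant"}

def check_type(t):
    """Raise ValueError on any type this strict reader does not recognize."""
    if isinstance(t, str):
        if t not in PRIMITIVES and not t.startswith("decimal"):
            raise ValueError(f"unknown primitive type: {t}")
        return
    kind = t.get("type")
    if kind == "struct":
        for field in t["fields"]:
            check_type(field["type"])
    elif kind == "array":
        check_type(t["elementType"])
    elif kind == "map":
        check_type(t["keyType"])
        check_type(t["valueType"])
    else:
        # "udt" lands here: one unknown column rejects the whole schema.
        raise ValueError(f"unknown complex type: {kind}")

schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "location", "nullable": True, "metadata": {},
         "type": {"type": "udt",
                  "class": "com.example.PointUDT",  # hypothetical JVM class
                  "sqlType": {"type": "array", "elementType": "double",
                              "containsNull": False}}},
    ],
})

def try_read(s):
    """Return "ok" if the schema validates, else the error message."""
    try:
        check_type(json.loads(s))
        return "ok"
    except ValueError as e:
        return str(e)

result = try_read(schema_string)
```

In this sketch the `id` column is perfectly readable, yet the single `udt` field makes validation of the entire schema fail, which is the whole-table failure described above.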

This adds a note under Schema Serialization Format > Primitive Types, parallel to the existing void note, documenting the disposition rather than gating it behind a table feature:

Existing tables may contain columns of Spark's udt (UserDefinedType) complex type... A reader that does not implement that engine code MUST interpret the column as its physical sqlType; the sqlType is the on-disk Parquet representation.

Why no table feature: udt introduces no new physical representation. It is an engine-specific annotation (class/pyClass reference JVM/Python deserialization code) over an existing physical type. A UDT-unaware reader that reads the sqlType reads correct data, so unlike timestampNtz/variant there is nothing for a reader to opt into. This mirrors the void precedent.
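The reasoning above can be sketched as a schema rewrite: a UDT-unaware reader replaces each `udt` node with its `sqlType` and ignores the class references. This is an illustrative sketch under stated assumptions, not the companion kernel implementation. The helper name `strip_udt` and the class names are hypothetical, and it assumes `sqlType` is expressed in the same schema serialization format as the rest of `schemaString`, which is one of the questions the review raises.

```python
def strip_udt(t):
    """Return a copy of a schema type with every udt replaced by its sqlType."""
    if isinstance(t, str):
        return t  # primitive type name, nothing to resolve
    kind = t.get("type")
    if kind == "udt":
        # Ignore class / pyClass / serializedClass. Recurse because this
        # sketch assumes a sqlType may itself contain nested udt nodes.
        return strip_udt(t["sqlType"])
    if kind == "struct":
        return {**t, "fields": [{**f, "type": strip_udt(f["type"])}
                                for f in t["fields"]]}
    if kind == "array":
        return {**t, "elementType": strip_udt(t["elementType"])}
    if kind == "map":
        return {**t, "keyType": strip_udt(t["keyType"]),
                "valueType": strip_udt(t["valueType"])}
    return t  # unknown kinds are passed through unchanged

point_udt = {
    "type": "udt",
    "class": "com.example.PointUDT",  # hypothetical JVM class
    "pyClass": "example.PointUDT",    # hypothetical Python class
    "sqlType": {"type": "array", "elementType": "double", "containsNull": False},
}
schema = {"type": "struct", "fields": [
    {"name": "location", "type": point_udt, "nullable": True, "metadata": {}},
]}

resolved = strip_udt(schema)
```

After the rewrite, `location` is an ordinary `array<double>` column, so no reader-side opt-in is needed to read it. The input schema is left untouched, so a writer built on the same helper could still retain the original annotation in `schemaString`.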

A companion kernel implementation (DataType::UserDefined read support in delta-kernel-rs) is proposed separately.

How was this patch tested?

Documentation-only change.

Does this PR introduce any user-facing changes?

No behavioral change. Documents the disposition of an existing, previously-undocumented data type that Spark already writes.

Authored with assistance from Claude Code.

Spark writes UserDefinedType (`udt`) columns into `schemaString`, but the protocol's schema type system does not define `udt`, so such columns are non-conformant today even though they already exist in tables in the wild. Add a note (parallel to the existing `void` note) documenting the disposition: a reader that cannot run the engine's deserialization code MUST read the column as its physical `sqlType`, which is the on-disk Parquet representation. No table feature is introduced: `udt` adds no new physical representation (it is an annotation over an existing physical type), so a UDT-unaware reader that reads the `sqlType` reads correct data.

Co-authored-by: Isaac

nicklan · 2026-06-25T20:56:32Z


 Note: Existing tables may have `void` data type columns. Behavior is undefined for `void` data type columns but it is recommended to drop any `void` data type columns on reads (as is implemented by the Spark connector).

+Note: Existing tables may contain columns of Spark's `udt` (UserDefinedType) complex type, serialized as `{"type":"udt", "class"/"pyClass"/"serializedClass", "sqlType": <type>}`. The `class`/`pyClass` identify engine-specific (JVM/Python) deserialization code and are not part of this protocol. A reader that does not implement that engine code MUST interpret the column as its physical `sqlType`; the `sqlType` is the on-disk Parquet representation. Writers that preserve a `udt` column MUST store its data physically as `sqlType` and retain the annotation in `schemaString`.


Let's explode out the example to make it clearer what the full structure can look like, I think. This is quite hard to read right now.

I'm not clear what "that engine code" means

Is sqlType a parquet or delta schema? (e.g. could it have a timestampNtz column in it?)

I think we want to say "Writers MUST preserve udt columns", it's not an option to just drop them :)

emkornfield · 2026-06-25T20:57:45Z




are there any caveats about column mapping with these types?

emkornfield · 2026-06-25T21:03:01Z




does spark write information today about UDTs into the parquet files themselves? If so is it required?

sanujbasu wants to merge 1 commit into delta-io:master from sanujbasu:udt-protocol-note
