2 changes: 2 additions & 0 deletions PROTOCOL.md
@@ -2748,6 +2748,8 @@ See Parquet [timestamp type](https://github.com/apache/parquet-format/blob/maste

Note: Existing tables may have `void` data type columns. Behavior is undefined for `void` data type columns but it is recommended to drop any `void` data type columns on reads (as is implemented by the Spark connector).

Note: Existing tables may contain columns of Spark's `udt` (UserDefinedType) complex type, serialized as `{"type":"udt", "class"/"pyClass"/"serializedClass", "sqlType": <type>}`. The `class`/`pyClass` identify engine-specific (JVM/Python) deserialization code and are not part of this protocol. A reader that does not implement that engine code MUST interpret the column as its physical `sqlType`; the `sqlType` is the on-disk Parquet representation. Writers that preserve a `udt` column MUST store its data physically as `sqlType` and retain the annotation in `schemaString`.
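As an illustration of the reader-side rule in the note above, here is a minimal Python sketch that rewrites a parsed schema so that every `udt` annotation is replaced by its `sqlType`. It assumes that `sqlType` uses the same JSON encoding as the other schema types (struct, array, map, primitives), which the note does not state explicitly. The function name `physical_type`, the column names, and the class name `com.example.PointUDT` are hypothetical and exist only for this example.

```python
import json


def physical_type(t):
    """Return a copy of schema type `t` with each udt replaced by its sqlType.

    Assumption: sqlType is encoded like any other schema type, so the same
    recursion applies to it. The input is not modified.
    """
    if not isinstance(t, dict):
        return t  # primitive type name such as "long" or "double"
    kind = t.get("type")
    if kind == "udt":
        # Ignore class / pyClass / serializedClass and read the physical type.
        return physical_type(t["sqlType"])
    if kind == "struct":
        fields = [{**f, "type": physical_type(f["type"])} for f in t["fields"]]
        return {**t, "fields": fields}
    if kind == "array":
        return {**t, "elementType": physical_type(t["elementType"])}
    if kind == "map":
        return {
            **t,
            "keyType": physical_type(t["keyType"]),
            "valueType": physical_type(t["valueType"]),
        }
    return t


# Hypothetical schemaString with one plain column and one udt column.
point_struct = {
    "type": "struct",
    "fields": [
        {"name": "x", "type": "double", "nullable": False, "metadata": {}},
        {"name": "y", "type": "double", "nullable": False, "metadata": {}},
    ],
}
schema_string = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "location",
            "type": {
                "type": "udt",
                "class": "com.example.PointUDT",  # hypothetical JVM class
                "sqlType": point_struct,
            },
            "nullable": True,
            "metadata": {},
        },
    ],
})

schema = json.loads(schema_string)
resolved = physical_type(schema)
print(json.dumps(resolved["fields"][1]["type"]))
```

The sketch returns a new structure and leaves the parsed `schemaString` untouched, so a connector could use the resolved form for reading while still writing back the original annotation.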

Member

  1. Let's expand the example to make the full structure clearer. This is quite hard to read right now.
  2. I'm not clear on what "that engine code" means.
  3. Is sqlType a Parquet schema or a Delta schema? (E.g., could it contain a timestampNtz column?)
  4. I think we want to say "Writers MUST preserve udt columns"; dropping them is not an option :)

Collaborator

Are there any caveats about column mapping with these types?

Collaborator

Does Spark currently write information about UDTs into the Parquet files themselves? If so, is it required?


### Struct Type

A struct is used to represent both the top-level schema of the table as well as struct columns that contain nested columns. A struct is encoded as a JSON object with the following fields: