3 changes: 2 additions & 1 deletion protocol_rfcs/README.md
@@ -22,7 +22,8 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,
| 2024-04-30 | [collated-string-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/collated-string-type.md) | https://github.com/delta-io/delta/issues/2894 | Collated String Type |
| 2025-03-13 | [checkpoint-protection.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/checkpoint-protection.md) | https://github.com/delta-io/delta/issues/4152 | Checkpoint Protection |
| 2025-03-18 | [iceberg-writer-compat-v1.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/iceberg-writer-compat-v1.md) | https://github.com/delta-io/delta/issues/4284 | IcebergWriterCompatV1 |
| 2025-11-20 | [materialize-partition-columns.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/materialize-partition-columns.md) | https://github.com/delta-io/delta/issues/5555 | Materialize Partition Columns |
| 2025-11-20 | [materialize-partition-columns.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/materialize-partition-columns.md) | https://github.com/delta-io/delta/issues/5555 | Materialize Partition Columns |
| 2026-06-22 | [void-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/void-type.md) | https://github.com/delta-io/delta/issues/7072 | Void Type |

### Accepted RFCs

74 changes: 74 additions & 0 deletions protocol_rfcs/void-type.md
@@ -0,0 +1,74 @@
# Void Type
**Associated GitHub issue for discussions: https://github.com/delta-io/delta/issues/7072**

The `voidType` reader/writer table feature adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema.

`void` is a data type with a single possible value: `NULL`. A column ends up with this type when the writer has no information about its actual type, typically because every value observed so far has been `NULL` (for example, `CREATE TABLE t AS SELECT NULL AS a`, or schema evolution that adds a column containing only `NULL`s).

Today, `void` columns are represented by omitting them from data files and reconstructing them as all-`NULL` columns on read (the missing columns mechanism). That representation cannot encode four schema shapes - a table whose columns are all `void`, a `struct` whose fields are all `void`, a `void` nested in an `array`, and a `void` nested in a `map` - because in each case omitting the `void` column(s) would leave the enclosing `struct`, `array`, or `map` (or the table itself) with nothing written to a data file, and therefore nowhere to record whether the enclosing value is `NULL`, empty, or how long it is. Writers must reject writing data in those cases.

Contributor

That's true only for some engines like Spark. I think it's officially undefined since it is not supported.

Contributor Author

It's becoming official in #6966. I built this RFC assuming that protocol clarification makes it in.

Before that change, the protocol basically said that tables can have void and that the behavior is undefined, but it recommended dropping void columns on read.


The `voidType` table feature lifts these restrictions by storing the `void` columns those shapes require - the **structural** `void` columns - using the Parquet [`UNKNOWN` logical type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null): a column whose values are always `NULL` but which is physically present in the data file, so it can carry the information about its enclosing complex value. Older clients cannot read `UNKNOWN` columns, so the representation is gated behind this reader/writer feature.

--------

> ***New section after the [Variant Shredding](/PROTOCOL.md#variant-shredding) section***

# Void Type

`void` is a primitive data type (see [Primitive Types](/PROTOCOL.md#primitive-types)) with a single possible value, `NULL`. A `void` column can be represented in a data file in one of two ways:

- **Omitted** - the column is not written to the data file and is reconstructed on read as an all-`NULL` column, following the [rule](/PROTOCOL.md#consistency-between-table-metadata-and-data-files) that a column present in the table schema but absent from a data file is read as `NULL`.
- **Stored as `UNKNOWN`** - the column is written using the Parquet [`UNKNOWN` logical type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null). Its values are always `NULL`, but unlike an omitted column it is physically present in the data file. This representation requires the `voidType` table feature.

A `void` column may be changed to any other data type through supported schema-evolution operations; this does not require the [Type Widening](/PROTOCOL.md#type-widening) table feature, even when the `void` column is stored as `UNKNOWN`.

## Void columns without the table feature

Contributor

I'm not sure we can enforce anything on the case without a table feature. A legacy writer/reader does not have knowledge of this new proposal and cannot retroactively ban certain operations.

It seems to me, if someone wants the void type to behave as expected, they must have the table feature and from there it's up to the engine whether they want to omit or materialize the void type column.

Contributor Author

The situation is not ideal. void never officially made it into the protocol, but it was accidentally introduced to tables by the Spark connector when Spark got NullType support. Then there were various revisions to the Protocol to make sure void is mentioned there, but it was all vague because even the Spark connector did not handle it properly, causing query failures. After #6966, the behavior will be defined, and both the Spark connector and kernel-rs follow that version of the Protocol.

I understand external clients may now become non-compliant with the protocol, but if they somehow managed to read what Spark (which I think is the reference implementation) wrote previously, and if they wrote something that Spark could read before, then they should still be protocol-compliant. In any case, this comment is more for #6966 than this PR.


When the `voidType` feature is not supported, `void` columns can only be **omitted**. Because a `void` column is never written to a data file, writers must reject **writing data** to a table whose schema contains any of the following shapes, in which omitting the `void` column(s) would leave nowhere to record the nullability or length of an enclosing value:
- a `void` type directly inside an `array` or `map` at any nesting level;

Contributor

For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?

Also, by 'directly,' we mean arrays like ARRAY<VOID>. But with this feature, are we also going to unblock void indirectly inside an array or map, right? Like ARRAY<STRUCT<INT, VOID>>?

Contributor Author

> For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?

That is a Spark limitation, and I don't think it belongs in the Delta Protocol. FWIW, the Protocol does not say anything about nullability of map keys in general.

> Also, by 'directly,' we mean arrays like ARRAY<VOID>. But with this feature, are we also going to unblock void indirectly inside an array or map, right? Like ARRAY<STRUCT<INT, VOID>>?

Indirect voids are already unblocked; it is just an implementation detail that the Spark connector blocks them, and otherwise there is no reason for blocking. This feature does not affect those voids.

- a `struct` (at any nesting level) whose fields are all `void`; or
- a table whose columns are all `void`.

These restrictions are stated in terms of the **table schema**, not the schema of any individual data file. A table with such a schema can still be created, altered through metadata-only operations, and read. It can be made writable by evolving its schema - for example, by changing a `void` column to another type - or by enabling the `voidType` feature.

Contributor

If the column is omitted, then the schemas of the individual data files are all the same, right?

Contributor Author

There can be differences due to schema evolution/type widening. Voids will always be missing, but other columns can differ.


## Void Type table feature

The `voidType` table feature lifts the restrictions above by allowing those schema shapes: the `void` columns that cannot be omitted - the **structural** `void` columns - are instead stored using the Parquet `UNKNOWN` logical type.

To support this feature:
- The table must be on Reader Version 3 and Writer Version 7.
- The feature `voidType` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`.
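As an illustration only, the two conditions above could be checked against a parsed `protocol` action roughly as follows. This is a minimal sketch: the helper name `supports_void_type` is hypothetical, and the dictionary is assumed to mirror the JSON fields of the `protocol` action (`minReaderVersion`, `minWriterVersion`, `readerFeatures`, `writerFeatures`).

```python
def supports_void_type(protocol):
    """Return True if a parsed `protocol` action lists voidType as a reader/writer feature.

    `protocol` is assumed to be a dict mirroring the JSON `protocol` action.
    The helper name and this representation are illustrative, not part of the RFC.
    """
    # Feature lists may be absent on tables that predate table features.
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        protocol.get("minReaderVersion") == 3
        and protocol.get("minWriterVersion") == 7
        and "voidType" in reader_features
        and "voidType" in writer_features
    )


example_protocol = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["voidType"],
    "writerFeatures": ["voidType"],
}
```

Calling `supports_void_type(example_protocol)` returns `True`; a protocol that lists the feature only in `writerFeatures` would not qualify, since `voidType` is a reader/writer feature.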

The feature has a dual purpose:

1. A client that supports `voidType` is guaranteed to correctly read and write `void` columns that rely on the missing columns mechanism. Enabling the feature for a table that only uses such columns is **optional**; a user may choose to enable it so that only clients capable of handling `void` columns correctly interact with the table.
2. A client that supports `voidType` is also guaranteed to correctly read and write structural `void` columns (those stored as `UNKNOWN`). Enabling the feature is **required** to write data for a schema that needs the `UNKNOWN` representation, because clients that do not support the feature cannot read `UNKNOWN` columns.

### Structural void columns

A `void` column is **structural** when it cannot be omitted and is therefore stored as `UNKNOWN`, because omitting it would leave an enclosing `struct`, `array`, or `map` (or the table) with nothing written to the data file. This arises only in the schema shapes that the missing columns mechanism cannot represent (the shapes listed in [Void columns without the table feature](#void-columns-without-the-table-feature)):
- a `void` directly inside an `array` or `map` (at any nesting level): the `void` element or value is structural and must be stored as `UNKNOWN`.
- a `struct` whose fields are all `void`, or a table whose columns are all `void`: the writer chooses which `void` column is structural and must store it as `UNKNOWN`.

A `void` column in any other position is never structural: it can be omitted, and does not require the feature. A schema that contains one of the shapes above is said to **require** the `voidType` feature.
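The shapes above can be detected with a recursive walk over the table schema. The sketch below is not taken from any implementation. It assumes the schema is a parsed form of Delta's JSON schema serialization (`struct` with `fields`, `array` with `elementType`, `map` with `keyType` and `valueType`), that `void` is serialized as the string `"void"`, and that a `void` map key counts as "directly inside a map". The function name `requires_void_type` is hypothetical.

```python
def requires_void_type(data_type):
    """Return True if `data_type` contains a shape that cannot be represented by omission.

    Assumptions (not confirmed by the RFC text): primitive types are plain
    strings, void is the string "void", and complex types are dicts in the
    style of Delta's JSON schema serialization.
    """
    if isinstance(data_type, str):
        # A primitive on its own, including a lone void field, can be handled by omission.
        return False
    kind = data_type["type"]
    if kind == "struct":
        field_types = [field["type"] for field in data_type["fields"]]
        # A struct (or the table itself) whose fields are all void.
        # Empty structs are not covered by the RFC and are ignored here.
        if field_types and all(t == "void" for t in field_types):
            return True
        return any(requires_void_type(t) for t in field_types)
    if kind == "array":
        element = data_type["elementType"]
        return element == "void" or requires_void_type(element)
    if kind == "map":
        key, value = data_type["keyType"], data_type["valueType"]
        return (
            key == "void"
            or value == "void"
            or requires_void_type(key)
            or requires_void_type(value)
        )
    return False


def struct_of(*fields):
    """Build a struct type from (name, type) pairs; a helper for the examples only."""
    return {"type": "struct", "fields": [{"name": n, "type": t} for n, t in fields]}


all_void_table = struct_of(("a", "void"), ("b", "void"))
mixed_table = struct_of(("id", "integer"), ("a", "void"))
array_of_void = struct_of(("id", "integer"), ("xs", {"type": "array", "elementType": "void"}))
indirect_void = struct_of(
    ("xs", {"type": "array", "elementType": struct_of(("i", "integer"), ("v", "void"))})
)
```

With these inputs, `all_void_table` and `array_of_void` require the feature, while `mixed_table` and `indirect_void` (a `void` field next to a non-`void` field inside an array element) do not.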

### Writer Requirements for Void Type

Contributor

Shall we also explain how writers should handle statistics for VOID columns?

Contributor Author

I don't think we need any special handling for stats, so the default rules from the protocol should be sufficient.


When Void Type is supported (when the `writerFeatures` field of a table's `protocol` action contains `voidType`), writers:
- must write the table's structural `void` columns to data files (see [Structural void columns](#structural-void-columns)).
- should omit any non-structural `void` column.
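For a single `struct` (or the top-level table), the writer rules above might be sketched as below. This is illustrative only: the function name `plan_struct_columns` is hypothetical, fields are given as `(name, type)` pairs with `void` written as the string `"void"`, and picking the first `void` field as the structural one is an arbitrary choice, since the RFC leaves that choice to the writer. Nested types are not descended into.

```python
def plan_struct_columns(fields):
    """Split one struct's fields into (written, omitted) column names.

    `fields` is a list of (name, type) pairs for a single struct level.
    Choosing the first void field as the structural one is an arbitrary
    illustration; the RFC only says the writer chooses.
    """
    void_names = [name for name, data_type in fields if data_type == "void"]
    written = [name for name, data_type in fields if data_type != "void"]
    if fields and not written:
        # Every field is void: one of them must be stored as UNKNOWN so the
        # enclosing struct (or table) has something physically written.
        return [void_names[0]], void_names[1:]
    # Otherwise every void field is non-structural and should be omitted.
    return written, void_names
```

For example, `plan_struct_columns([("id", "integer"), ("a", "void")])` writes `id` and omits `a`, while `plan_struct_columns([("x", "void"), ("y", "void")])` writes `x` as the structural column and omits `y`.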

### Reader Requirements for Void Type

When Void Type is supported (when the `readerFeatures` field of a table's `protocol` action contains `voidType`), readers:
- must allow a `void` data type anywhere in a Delta table schema.
- must return only `NULL` values for a `void` column regardless of how it is represented.
- must, within a single scan, correctly combine data files that represent the same column differently - omitted, written as an all-`NULL` column, or (after a type change) written with a concrete type - into the requested read schema.
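The last requirement can be pictured with a toy scan over in-memory "files". The sketch below is an assumption-laden illustration, not a reader implementation: each data file is modelled as a dict with a row count and a mapping from column name to a list of values, an omitted column is simply absent from that mapping, and the function name `scan_column` is hypothetical.

```python
def scan_column(files, column, table_type):
    """Concatenate one column across data files that represent it differently.

    Each file is a dict {"num_rows": int, "columns": {name: [values]}}; this
    in-memory model is purely illustrative.
    """
    result = []
    for data_file in files:
        values = data_file["columns"].get(column)
        if table_type == "void" or values is None:
            # A void column, or a column omitted from this file, reads as all NULL.
            result.extend([None] * data_file["num_rows"])
        else:
            # Written as an all-NULL column, or with a concrete type after a type change.
            result.extend(values)
    return result


files = [
    {"num_rows": 2, "columns": {"id": [1, 2]}},                      # column "a" omitted
    {"num_rows": 1, "columns": {"id": [3], "a": [None]}},            # "a" written as all NULL
    {"num_rows": 2, "columns": {"id": [4, 5], "a": ["x", "y"]}},     # "a" written after a type change
]
```

Here `scan_column(files, "a", "string")` yields `[None, None, None, "x", "y"]`, and scanning any column whose table type is still `void` yields only `None` values regardless of what a file contains.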

### Removing the Void Type feature

Because clients that do not support the feature cannot read `UNKNOWN` columns, removing `voidType` requires that the table no longer depend on the `UNKNOWN` representation. In the version that removes `voidType` from the `writerFeatures` and `readerFeatures` fields of the table's `protocol` action, writers:
- must ensure that the table schema does not require the feature - it must contain none of the shapes in [Structural void columns](#structural-void-columns).
- must ensure that no data file reachable by the table (including via time travel within the retained history) contains a column stored as `UNKNOWN`. This may require rewriting existing data files so that every `void` column is represented by omission.

After the feature is removed, the table reverts to representing `void` columns only by omission, and the shapes that require the feature are again rejected when writing data.