
[PROTOCOL RFC] Full Void Type Support #7073

Open
ZiyaZa wants to merge 3 commits into delta-io:master from ZiyaZa:void-table-feature-rfc

ZiyaZa commented Jun 23, 2026

Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (Protocol RFC)

Description

Associated GitHub issue for discussions: #7072

This PR adds the proposed protocol change for full VOID support everywhere in a Delta table schema. The current protocol is being clarified in #6966 to specify how VOID must be handled today. This RFC goes further and defines a new table feature that allows tables to persist VOID columns as the UNKNOWN type in Parquet, which lifts the schema limitations we have today.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

Creates a new Protocol RFC.

## Void columns without the table feature

When the `voidType` feature is not supported, `void` columns can only be **omitted**. Because a `void` column is never written to a data file, writers must reject **writing data** to a table whose schema contains any of the following shapes, in which omitting the `void` column(s) would leave nowhere to record the nullability or length of an enclosing value:
- a `void` type directly inside an `array` or `map` at any nesting level;

Contributor

For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?

Also, by 'directly,' we mean arrays like ARRAY<VOID>. But with this feature, are we also going to unblock void indirectly inside an array or map, right? Like ARRAY<STRUCT<INT, VOID>>?

Contributor Author

> For map, shall we specify that it is allowed only for the value? I recall that we don't allow VOID keys anyway, right?

That is a Spark limitation, and I don't think it belongs in the Delta Protocol. FWIW, the Protocol does not say anything about the nullability of map keys in general.

> Also, by 'directly,' we mean arrays like ARRAY<VOID>. But with this feature, are we also going to unblock void indirectly inside an array or map, right? Like ARRAY<STRUCT<INT, VOID>>?

Indirect voids are already unblocked. It is only an implementation detail that the Spark connector blocks them; otherwise there is no reason for blocking. This feature does not affect those voids.

Comment thread protocol_rfcs/void-type.md Outdated

When Void Type is supported (when the `writerFeatures` field of a table's `protocol` action contains `voidType`), writers:
- must store the table's structural `void` columns as `UNKNOWN` (see [Structural void columns](#structural-void-columns)).
- may store any non-structural `void` column either by omission or as an `UNKNOWN` column.

Contributor

Why don't we force writers to always write non-structural void columns by omission?

Contributor Author

Because that would make the protocol more difficult to implement and is not strictly necessary. Having fewer UNKNOWN columns only makes dropping the feature easier.

But I'll update it to "should" to show it's the preferred behavior.
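
As a rough illustration of the writer rules discussed in this thread, the sketch below picks a storage representation for each `void` column. The helper names (`writer_supports_void_type`, `plan_void_storage`) and the `void_columns` mapping are hypothetical and are not part of the RFC or of any Delta library; only the `writerFeatures` field and the `voidType` feature name come from the text above. It assumes the preferred behavior of omitting non-structural `void` columns.

```python
from typing import Dict

FEATURE_NAME = "voidType"  # feature name taken from the RFC text


def writer_supports_void_type(protocol: dict) -> bool:
    """True if the protocol action lists voidType in writerFeatures."""
    return FEATURE_NAME in (protocol.get("writerFeatures") or [])


def plan_void_storage(protocol: dict, void_columns: Dict[str, bool]) -> Dict[str, str]:
    """Choose a representation for every void column.

    void_columns maps a column path to True when the column is structural
    (hypothetical input; how structural columns are detected is out of scope).
    """
    supported = writer_supports_void_type(protocol)
    plan: Dict[str, str] = {}
    for path, structural in void_columns.items():
        if structural:
            if not supported:
                # Without the feature, omission cannot encode this shape,
                # so the write has to be rejected.
                raise ValueError(f"cannot write structural void column {path!r} without {FEATURE_NAME}")
            plan[path] = "UNKNOWN"
        else:
            # Preferred ("should") behavior: omit non-structural void columns.
            plan[path] = "omit"
    return plan
```

Always choosing omission for non-structural columns keeps the number of `UNKNOWN` columns low, which is the property the reply above associates with easier feature removal.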

Comment thread protocol_rfcs/void-type.md Outdated

A `void` column in any other position is never structural: it can be omitted, and does not require the feature. A schema that contains one of the shapes above is said to **require** the `voidType` feature.

### Writer Requirements for Void Type

Contributor

Shall we also explain how writers should handle statistics for VOID columns?

Contributor Author

I don't think we need any special handling for stats, so the default rules from the protocol should be sufficient.

c27kwan left a comment

Contributor

Nice proposal!

Comment thread protocol_rfcs/void-type.md Outdated
# Void Type
**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/7072**

This protocol change adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema, via a new reader/writer table feature, `voidType`.

Contributor

Suggested change:

Before: This protocol change adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema, via a new reader/writer table feature, `voidType`.

After: The `voidType` reader/writer table feature adds support for using the `void` data type (also known as `NullType` in Spark, `UnknownType` in Iceberg, and `UNKNOWN` in Parquet) anywhere in a Delta table schema.


`void` is a data type with a single possible value: `NULL`. A column ends up with this type when the writer has no information about its actual type, typically because every value observed so far has been `NULL` (for example, `CREATE TABLE t AS SELECT NULL AS a`, or schema evolution that adds a column containing only `NULL`s).

Today, `void` columns are represented by omitting them from data files and reconstructing them as all-`NULL` columns on read (the missing columns mechanism). That representation cannot encode four schema shapes - a table whose columns are all `void`, a `struct` whose fields are all `void`, a `void` nested in an `array`, and a `void` nested in a `map` - because in each case omitting the `void` column(s) would leave the enclosing `struct`, `array`, or `map` (or the table itself) with nothing written to a data file, and therefore nowhere to record whether the enclosing value is `NULL`, empty, or how long it is. Writers must reject writing data in those cases.

Contributor

That's true only for some engines like Spark. I think it's officially undefined since it is not supported.

Contributor Author

It's becoming official in #6966. I built this RFC assuming that protocol clarification makes it in.

Before that change, the protocol basically said that tables can have void and that the behavior is undefined, with a recommendation to drop it upon reads.
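
To make the four shapes from the quoted RFC text concrete, here is a minimal sketch that walks a table schema and reports each shape that omission cannot encode. It assumes the schema is given as a Python dict in the JSON layout of Delta's schema serialization (`struct` with `fields`, `array` with `elementType`, `map` with `keyType` and `valueType`) and that the type is spelled `"void"`. The function name is hypothetical, and the "all fields are `void`" check is applied literally, without looking through nested structs.

```python
from typing import Any, List

VOID = "void"  # assumed spelling of the type in the serialized schema


def unencodable_void_shapes(schema: dict) -> List[str]:
    """List every shape in the schema that omitting void columns cannot encode."""
    found: List[str] = []

    def visit(dtype: Any, path: str) -> None:
        if not isinstance(dtype, dict):
            return  # primitive type name such as "integer" or "void"
        kind = dtype.get("type")
        if kind == "struct":
            fields = dtype.get("fields", [])
            if fields and all(f["type"] == VOID for f in fields):
                if path == "":
                    found.append("table whose columns are all void")
                else:
                    found.append(f"struct with only void fields at {path}")
            for f in fields:
                child = f"{path}.{f['name']}" if path else f["name"]
                visit(f["type"], child)
        elif kind == "array":
            if dtype["elementType"] == VOID:
                found.append(f"void nested in array at {path}")
            visit(dtype["elementType"], f"{path}.element")
        elif kind == "map":
            for part, label in (("keyType", "key"), ("valueType", "value")):
                if dtype[part] == VOID:
                    found.append(f"void {label} nested in map at {path}")
                visit(dtype[part], f"{path}.{label}")

    visit(schema, "")
    return found


def field(name: str, dtype: Any) -> dict:
    """Small helper to build a nullable field entry for the example."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


example_schema = {
    "type": "struct",
    "fields": [
        field("id", "integer"),
        field("a", VOID),  # top-level void next to a real column: can be omitted
        field("tags", {"type": "array", "elementType": VOID, "containsNull": True}),
        field("s", {"type": "struct", "fields": [field("x", VOID)]}),
    ],
}
```

For `example_schema`, the function reports the `tags` array and the `s` struct but not the top-level column `a`, which matches the rule that a `void` column in any other position can simply be omitted.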

- a `struct` (at any nesting level) whose fields are all `void`; or
- a table whose columns are all `void`.

These restrictions are stated in terms of the **table schema**, not the schema of any individual data file. A table with such a schema can still be created, altered through metadata-only operations, and read. It can be made writable by evolving its schema - for example, by changing a `void` column to another type - or by enabling the `voidType` feature.

Contributor

If the column is omitted, then the schemas of the individual data files are all the same, right?

Contributor Author

There can be differences due to schema evolution/type widening. Voids will always be missing, but other columns can differ.


A `void` column may be changed to any other data type through supported schema-evolution operations; this does not require the [Type Widening](/PROTOCOL.md#type-widening) table feature, even when the `void` column is stored as `UNKNOWN`.

## Void columns without the table feature

Contributor

I'm not sure we can enforce anything on the case without a table feature. A legacy writer/reader does not have knowledge of this new proposal and cannot retroactively ban certain operations.

It seems to me, if someone wants the void type to behave as expected, they must have the table feature and from there it's up to the engine whether they want to omit or materialize the void type column.

Contributor Author

The situation is not ideal. `void` never officially made it into the protocol, but was accidentally introduced to tables by the Spark connector when Spark got NullType support. There were then various revisions to the Protocol to make sure void is mentioned there, but it all stayed vague because even the Spark connector did not handle it properly, causing query failures. After #6966, the behavior will be defined, and both the Spark connector and kernel-rs follow that version of the Protocol.

I understand external clients may now become non-compliant with the protocol. But if they previously managed to read what Spark wrote (which I think is the reference implementation), and if they wrote something that Spark could read, then they should still be protocol-compliant. In any case, this comment is more for #6966 than this PR.

Comment thread protocol_rfcs/void-type.md Outdated
### Reader Requirements for Void Type

When Void Type is supported (when the `readerFeatures` field of a table's `protocol` action contains `voidType`), readers:
- must recognize and tolerate a `void` data type anywhere in a Delta table schema.

Contributor

This phrasing is a bit weird. Maybe "must allow"?

Comment thread protocol_rfcs/void-type.md Outdated

When Void Type is supported (when the `readerFeatures` field of a table's `protocol` action contains `voidType`), readers:
- must recognize and tolerate a `void` data type anywhere in a Delta table schema.
- must read a `void` column stored as `UNKNOWN` as an all-`NULL` column.

Contributor

Specify Parquet here, since this seems targeted at it.

Although maybe a more neutral way to frame this is "must return only null values for columns defined as void in the table schema". Whether it's omitted or materialized, the actual behaviour is that. It's less about the underlying data files' schema than it is about the actual Delta table's schema.

Contributor Author

Reworded it to not mention any type and just say return all null independent of representation.
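
The representation-independent reading of this requirement can be sketched as follows. This is only an illustration under simplifying assumptions: rows are plain Python dicts, only top-level columns are handled, and `read_rows` and `table_columns` are hypothetical names rather than anything from the RFC or a Delta library.

```python
from typing import Any, Dict, Iterable, List


def read_rows(file_rows: Iterable[Dict[str, Any]], table_columns: Dict[str, str]) -> List[Dict[str, Any]]:
    """Project data-file rows onto the table schema.

    table_columns maps a top-level column name to its type name. A column
    typed "void" is returned as None for every row, whether the data file
    omitted it or materialized it.
    """
    result = []
    for row in file_rows:
        result.append(
            {
                name: None if dtype == "void" else row.get(name)
                for name, dtype in table_columns.items()
            }
        )
    return result


table_columns = {"id": "integer", "a": "void"}
omitted = [{"id": 1}, {"id": 2}]                        # void column left out of the file
materialized = [{"id": 1, "a": None}, {"id": 2, "a": None}]  # void column stored in the file
```

Both `omitted` and `materialized` produce the same rows when passed through `read_rows`, which is the point of stating the rule in terms of the table schema instead of the data file schema.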


3 participants