[WIP] Interval Types RFC #7078
Open: rajayush143 wants to merge 5 commits into delta-io:master from rajayush143:interval_type_rfc
+123 −0
# Interval Types
**Associated GitHub issue for discussions: https://github.com/delta-io/delta/issues/7077**

This protocol change adds support for interval types. It consists of two changes to the protocol:

- One new reader/writer table feature
- Two new primitive types (year-month and day-second)

--------

> ***Add a new section in front of the [Primitive Types](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types) section.***

# Interval Types Table Feature

This table feature (`intervalTypes`) adds the year-month and day-second interval types from ANSI SQL:

1. **interval year to month**: A signed number of months, e.g. `INTERVAL '1-6' YEAR TO MONTH` represents 18 (1 year + 6 months = 18 months).
2. **interval day to second**: A signed number of microseconds, e.g. `INTERVAL '1 00:00:01.000000' DAY TO SECOND` represents 86,401,000,000 (1 day + 1 second = 86,400,000,000 + 1,000,000 μs).
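
The arithmetic in the two examples above can be sketched in Python as follows. This is a minimal illustration, not part of the proposed protocol text: the function names are hypothetical, and the handling of a leading sign and of an optional fractional part are assumptions.

```python
import re

MICROS_PER_SECOND = 1_000_000
MICROS_PER_DAY = 24 * 60 * 60 * MICROS_PER_SECOND


def year_month_to_months(body: str) -> int:
    """Convert a 'years-months' body such as '1-6' to a signed month count."""
    match = re.fullmatch(r"([+-]?)(\d+)-(\d+)", body.strip())
    if match is None:
        raise ValueError(f"not a year-month interval body: {body!r}")
    sign, years, months = match.groups()
    total = int(years) * 12 + int(months)
    return -total if sign == "-" else total


def day_second_to_micros(body: str) -> int:
    """Convert a 'days hh:mm:ss[.ffffff]' body to a signed microsecond count."""
    match = re.fullmatch(
        r"([+-]?)(\d+) (\d+):(\d+):(\d+)(?:\.(\d{1,6}))?", body.strip()
    )
    if match is None:
        raise ValueError(f"not a day-second interval body: {body!r}")
    sign, days, hours, minutes, seconds, fraction = match.groups()
    # Right-pad the fraction to six digits so '.5' means 500000 microseconds.
    micros = int((fraction or "").ljust(6, "0"))
    total = (
        int(days) * MICROS_PER_DAY
        + (int(hours) * 3600 + int(minutes) * 60 + int(seconds)) * MICROS_PER_SECOND
        + micros
    )
    return -total if sign == "-" else total


print(year_month_to_months("1-6"))                # 18
print(day_second_to_micros("1 00:00:01.000000"))  # 86401000000
```

Both results match the values quoted in the list: 18 months and 86,401,000,000 microseconds.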

To support this feature:
- The table must be on **Reader Version 3** and **Writer Version 7**.
- The feature `intervalTypes` must be listed in the table `protocol`'s `readerFeatures` and `writerFeatures`.
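
As a rough sketch, a client could check these two conditions against a parsed `protocol` action like this. The helper name `supports_interval_types` is hypothetical, and treating the versions as minimums (`>=`) rather than exact matches is an assumption.

```python
def supports_interval_types(protocol: dict) -> bool:
    """Return True if a parsed `protocol` action enables `intervalTypes`."""
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        # Assumption: versions 3 and 7 are treated as lower bounds.
        protocol.get("minReaderVersion", 0) >= 3
        and protocol.get("minWriterVersion", 0) >= 7
        and "intervalTypes" in reader_features
        and "intervalTypes" in writer_features
    )


example_protocol = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["intervalTypes"],
    "writerFeatures": ["intervalTypes"],
}
print(supports_interval_types(example_protocol))  # True
```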

## Type Definitions

In the schema, interval types are serialized in `Metadata.schemaString` as

- `interval year to month`
- `interval day to second`

These are the canonical type-name strings. ANSI SQL also permits narrowed spellings that denote the same two types: for year-month, `interval year` and `interval month`; for day-second, `interval day`, `interval hour`, `interval minute`, `interval second`, and any `<start> to <end>` range between those fields (e.g. `interval day to minute`, `interval hour to second`). Mixed-family spellings (e.g. `interval month to day`) are not valid. Tables written by existing engines may use these narrowed spellings.

### Reader Requirements

When this table feature is supported, readers must:

- Interpret `interval year to month` as a signed count of months, and `interval day to second` as a signed count of microseconds.
- Accept the narrowed spellings above and normalize each to its family: any year-month spelling is treated as `interval year to month`, and any day-second spelling is treated as `interval day to second`.
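
One way a reader might implement this normalization is sketched below. The function name is hypothetical, and two details are assumptions rather than requirements stated above: that a `<start> to <end>` range must run from a larger field to a strictly smaller one, and that type names are matched case-sensitively.

```python
YEAR_MONTH_FIELDS = ("year", "month")
DAY_SECOND_FIELDS = ("day", "hour", "minute", "second")
FAMILIES = (
    (YEAR_MONTH_FIELDS, "interval year to month"),
    (DAY_SECOND_FIELDS, "interval day to second"),
)


def normalize_interval_type(type_name: str) -> str:
    """Map a canonical or narrowed interval spelling to its canonical family name."""
    words = type_name.split()
    if len(words) == 2 and words[0] == "interval":
        start = end = words[1]  # single-field spelling, e.g. 'interval hour'
    elif len(words) == 4 and words[0] == "interval" and words[2] == "to":
        start, end = words[1], words[3]
        if start == end:
            raise ValueError(f"invalid interval range: {type_name!r}")
    else:
        raise ValueError(f"not an interval type name: {type_name!r}")

    for fields, canonical in FAMILIES:
        if start in fields and end in fields:
            # Assumption: the start field must not be smaller than the end field.
            if fields.index(start) > fields.index(end):
                raise ValueError(f"invalid interval range: {type_name!r}")
            return canonical
    # Mixed-family (e.g. 'interval month to day') or unknown fields.
    raise ValueError(f"invalid interval type name: {type_name!r}")


print(normalize_interval_type("interval year"))            # interval year to month
print(normalize_interval_type("interval hour to second"))  # interval day to second
```

A mixed-family spelling such as `interval month to day` raises `ValueError` in this sketch.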

### Writer Requirements

When this table feature is supported, writers must:

- Serialize an interval field's type in `Metadata.schemaString` using the canonical `interval year to month` or `interval day to second` form.
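
For illustration, a writer could build a `schemaString` containing interval fields as in the sketch below. It assumes the struct and field layout of the existing Delta schema serialization format (`name`, `type`, `nullable`, `metadata`); the column names `tenure` and `elapsed` and the helper `interval_field` are hypothetical.

```python
import json

CANONICAL_INTERVAL_TYPES = {"interval year to month", "interval day to second"}


def interval_field(name: str, type_name: str, nullable: bool = True) -> dict:
    """Build a schema field entry, rejecting non-canonical interval spellings."""
    if type_name not in CANONICAL_INTERVAL_TYPES:
        raise ValueError(f"writers must use a canonical interval type: {type_name!r}")
    return {"name": name, "type": type_name, "nullable": nullable, "metadata": {}}


schema = {
    "type": "struct",
    "fields": [
        interval_field("tenure", "interval year to month"),
        interval_field("elapsed", "interval day to second"),
    ],
}
schema_string = json.dumps(schema)
print(schema_string)
```

Rejecting narrowed spellings at write time keeps newly written schemas canonical, while readers remain tolerant of older tables.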

## Partition Value Serialization

Interval columns can be partition columns, so partition values for interval types are serialized in the ANSI interval literal form described in the Spark SQL guide [1]. For example:

```
Interval Year Month: "INTERVAL '1-0' YEAR TO MONTH"
Interval Day Second: "INTERVAL '7 12:34:56.123456' DAY TO SECOND"
```

Here `'1-0'` has the form `years-months`, and `'7 12:34:56.123456'` has the form `days hours:minutes:seconds.microseconds`.
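
A minimal sketch of producing these partition value strings from the stored integers follows. The function names are hypothetical, and the placement of the minus sign inside the quotes, the two-digit zero padding, and always writing six fractional digits are assumptions that the text above does not pin down.

```python
def format_year_month(months: int) -> str:
    """Render a signed month count as a YEAR TO MONTH interval literal."""
    sign = "-" if months < 0 else ""
    years, rem_months = divmod(abs(months), 12)
    return f"INTERVAL '{sign}{years}-{rem_months}' YEAR TO MONTH"


def format_day_second(micros: int) -> str:
    """Render a signed microsecond count as a DAY TO SECOND interval literal."""
    sign = "-" if micros < 0 else ""
    days, rest = divmod(abs(micros), 86_400_000_000)
    hours, rest = divmod(rest, 3_600_000_000)
    minutes, rest = divmod(rest, 60_000_000)
    seconds, fraction = divmod(rest, 1_000_000)
    return (
        f"INTERVAL '{sign}{days} {hours:02d}:{minutes:02d}:{seconds:02d}"
        f".{fraction:06d}' DAY TO SECOND"
    )


print(format_year_month(12))                # INTERVAL '1-0' YEAR TO MONTH
print(format_day_second(650_096_123_456))   # INTERVAL '7 12:34:56.123456' DAY TO SECOND
```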

## Per-file Statistics

Interval columns do not support `minValues`/`maxValues` statistics or data skipping. Writers must not record `minValues` or `maxValues` for interval columns, and readers must not perform data skipping over interval columns. The per-column `nullCount` and the per-file `numRecords` statistics are unaffected and are still recorded as normal, since they do not require interpreting interval values. This is consistent with existing tables that contain interval types, which do not record `minValues`/`maxValues` for these columns.
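
The writer-side rule can be sketched as a filter applied when assembling per-file statistics. The helper `build_stats` and the column names are hypothetical; only the statistics keys (`numRecords`, `minValues`, `maxValues`, `nullCount`) come from the text above.

```python
INTERVAL_TYPES = {"interval year to month", "interval day to second"}


def build_stats(num_records, column_types, min_values, max_values, null_counts):
    """Assemble per-file stats, omitting min/max for interval columns."""

    def keeps_min_max(column: str) -> bool:
        return column_types[column] not in INTERVAL_TYPES

    return {
        "numRecords": num_records,
        "minValues": {c: v for c, v in min_values.items() if keeps_min_max(c)},
        "maxValues": {c: v for c, v in max_values.items() if keeps_min_max(c)},
        # nullCount does not depend on interpreting interval values, so it is kept.
        "nullCount": dict(null_counts),
    }


stats = build_stats(
    num_records=3,
    column_types={"id": "long", "elapsed": "interval day to second"},
    min_values={"id": 1, "elapsed": 0},
    max_values={"id": 3, "elapsed": 86_401_000_000},
    null_counts={"id": 0, "elapsed": 1},
)
print(stats["minValues"])  # {'id': 1}
print(stats["nullCount"])  # {'id': 0, 'elapsed': 1}
```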

Review comment (Member, suggested change): I don't think we need to note this.

## Parquet Format

In Parquet, year-month intervals are stored as raw `int32` values and day-second intervals as raw `int64` values. This supports signed intervals and microsecond precision while matching existing interval types.
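
The chosen physical types bound the representable intervals. The sketch below checks a value against those bounds before writing; the function names are hypothetical, and rejecting out-of-range values with an error (rather than some other policy) is an assumption.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def check_year_month(months: int) -> int:
    """Return months unchanged if it fits in a signed 32-bit integer."""
    if not INT32_MIN <= months <= INT32_MAX:
        raise OverflowError(f"year-month interval out of int32 range: {months}")
    return months


def check_day_second(micros: int) -> int:
    """Return micros unchanged if it fits in a signed 64-bit integer."""
    if not INT64_MIN <= micros <= INT64_MAX:
        raise OverflowError(f"day-second interval out of int64 range: {micros}")
    return micros


# Largest whole spans that still fit in each physical type.
max_whole_years = INT32_MAX // 12
max_whole_days = INT64_MAX // (86_400 * 1_000_000)
print(max_whole_years, max_whole_days)
```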

## Feature Interactions

Beyond the partition-value and statistics behavior described above, interval types have no special interactions with other table features.

> ***Add new rows to the [Primitive Types](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types) table.***

| Type Name | Description |
| --- | --- |
| interval year to month | Signed duration with month precision |
| interval day to second | Signed duration with microsecond precision |

# References

[1] https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal

[2] https://github.com/delta-io/delta/issues/7077

Review comment: Maybe link to a definition of interval types. https://docs.databricks.com/aws/en/sql/language-manual/data-types/interval-type would be one option, but maybe there's a more generic "sql reference". I didn't find one in a 20 second search but there should be one somewhere :)