GH-24868: [C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType #37166
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -148,6 +148,109 @@ Fixed shape tensor | |
| by this specification. Instead, this extension type lets one use fixed shape tensors | ||
| as elements in a field of a RecordBatch or a Table. | ||
|
|
||
| .. _variable_shape_tensor_extension: | ||
|
|
||
| Variable shape tensor | ||
| ===================== | ||
|
|
||
| * Extension name: `arrow.variable_shape_tensor`. | ||
|
|
||
| * The storage type of the extension is a ``StructArray`` whose struct | ||
| is composed of **data** and **shape** fields describing a single | ||
| tensor per row: | ||
|
|
||
| * **data** is a ``List`` holding tensor elements of a single tensor. | ||
| Data type of the list elements is uniform across the entire column. | ||
| * **shape** is a ``FixedSizeList<int32>[ndim]`` of the tensor shape where | ||
| the size of the list ``ndim`` is equal to the number of dimensions of the | ||
| tensor. | ||
|
|
||
| * Extension type parameters: | ||
|
|
||
| * **value_type** = the Arrow data type of individual tensor elements. | ||
|
|
||
| Optional parameters describing the logical layout: | ||
|
|
||
| * **dim_names** = explicit names of the tensor dimensions, given | ||
| as an array. Its length should be equal to the length of the | ||
| shape, that is, to the number of dimensions. | ||
|
|
||
| ``dim_names`` can be used if the dimensions have well-known | ||
| names and they map to the physical layout (row-major). | ||
|
|
||
| * **permutation** = indices of the desired ordering of the | ||
| original dimensions, defined as an array. | ||
|
|
||
| The indices contain a permutation of the values [0, 1, .., N-1] where | ||
| N is the number of dimensions. The permutation indicates which | ||
| dimension of the logical layout corresponds to which dimension of the | ||
| physical tensor (the i-th dimension of the logical view corresponds | ||
| to the dimension with number ``permutation[i]`` of the physical tensor). | ||
|
|
||
| Permutation can be useful in case the logical order of | ||
| the tensor is a permutation of the physical order (row-major). | ||
|
|
||
| When logical and physical layout are equal, the permutation will always | ||
| be ([0, 1, .., N-1]) and can therefore be left out. | ||
|
|
||
| * **uniform_shape** = sizes of individual tensor dimensions are | ||
| guaranteed to stay constant in uniform dimensions and can vary in | ||
|
|
||
| non-uniform dimensions. This holds over all tensors in the array. | ||
| Sizes in uniform dimensions are represented with int32 values, while | ||
| sizes of the non-uniform dimensions are not known in advance and are | ||
| represented with 0s. If ``uniform_shape`` is not provided it is assumed | ||
|
Member
Should we rather take "-1" instead of "0"? We have some other places where we use -1 for "unknown" (e.g. null counts)
Member
Or JSON supports
Member
Author
Switched language to |
||
| that all dimensions are non-uniform. | ||
| An array containing a tensor with shape (2, 3, 4) and whose first and | ||
| last dimensions are uniform would have ``uniform_shape`` (2, 0, 4). | ||
| This allows for interpreting the tensor correctly without accounting for | ||
| uniform dimensions while still permitting optional optimizations that | ||
| take advantage of the uniformity. | ||
|
|
||
| * Description of the serialization: | ||
|
|
||
| The metadata must be a valid JSON object that optionally includes | ||
| dimension names with key **"dim_names"** and ordering of dimensions | ||
| with key **"permutation"**. | ||
| Shapes of tensors can be defined in a subset of dimensions by providing | ||
| key **"uniform_shape"**. | ||
| Minimal metadata is an empty JSON object. | ||
|
|
||
| - Example of minimal metadata is: | ||
|
|
||
| ``{}`` | ||
|
Member
Sorry, one more small nitpick: the minimal metadata is actually no metadata, which is typically represented as an empty string (I am actually not fully sure if in this case the metadata key could also just not be present in the field metadata), instead of an empty JSON dict (I don't think we should necessarily recommend using an empty dict)
Member
Author
That's a fair point! Empty string feels like the safer choice here. See my suggested change below.
|
||
|
|
||
| - Example with ``dim_names`` metadata for NCHW ordered data, where N is the row dimension of the array and each tensor is CHW: | ||
|
|
||
|
|
||
| ``{ "dim_names": ["C", "H", "W"] }`` | ||
|
|
||
| - Example with ``uniform_shape`` metadata for a set of color images | ||
| with variable width: | ||
|
|
||
|
|
||
| ``{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, 0, 3] }`` | ||
|
|
||
| - Example of permuted 3-dimensional tensor: | ||
|
|
||
| ``{ "permutation": [2, 0, 1] }`` | ||
|
|
||
| This is the physical layout shape. Given an individual tensor of | ||
| shape [100, 200, 500], the shape of the logical layout would | ||
| be ``[500, 100, 200]``. | ||
|
|
||
|
|
||
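The metadata rules above can be sketched in plain Python. This is a minimal illustration, not the Arrow implementation: the helper names `parse_metadata`, `matches_uniform_shape` and `logical_shape` are hypothetical, and the sketch assumes the convention shown in this revision of the diff, where a `0` in `uniform_shape` marks a non-uniform dimension (the review thread suggests this marker may change).

```python
import json


def parse_metadata(serialized):
    """Parse serialized extension metadata into a dict (hypothetical helper)."""
    metadata = json.loads(serialized)
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    return metadata


def matches_uniform_shape(shape, uniform_shape=None):
    """Check one tensor's physical shape against ``uniform_shape``.

    Assumption: 0 marks a non-uniform dimension, as in this revision of the text.
    """
    if uniform_shape is None:
        # Not provided: all dimensions are assumed to be non-uniform.
        return True
    if len(shape) != len(uniform_shape):
        return False
    return all(u == 0 or u == s for s, u in zip(shape, uniform_shape))


def logical_shape(physical_shape, permutation=None):
    """Logical dimension i corresponds to physical dimension permutation[i]."""
    if permutation is None:
        # Logical and physical layouts are equal.
        return list(physical_shape)
    if sorted(permutation) != list(range(len(physical_shape))):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    return [physical_shape[p] for p in permutation]


meta = parse_metadata('{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, 0, 3] }')
print(matches_uniform_shape([400, 640, 3], meta["uniform_shape"]))  # True
print(matches_uniform_shape([300, 640, 3], meta["uniform_shape"]))  # False
print(logical_shape([100, 200, 500], [2, 0, 1]))  # [500, 100, 200]
```

Only the width may differ between images in the second example, so a tensor whose height is not 400 fails the check.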
| .. note:: | ||
|
|
||
| With the exception of ``permutation``, the parameters and storage | ||
| of VariableShapeTensor relate to the *physical* storage of the tensor. | ||
|
|
||
| For example, consider a tensor with: | ||
|
|
||
| shape = [10, 20, 30] | ||
| dim_names = [x, y, z] | ||
| permutation = [2, 0, 1] | ||
|
|
||
| This means the logical tensor has names [z, x, y] and shape [30, 10, 20]. | ||
|
|
||
| Elements in a variable shape tensor extension array are stored | ||
| in row-major/C-contiguous order. | ||
|
|
||
|
|
||
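To make the note concrete, here is a small sketch of how one row of the storage struct could be read. It models a row as a plain dict with `data` and `shape` keys rather than using an Arrow `StructArray`; the helpers `row_major_strides`, `element_at` and `logical_view` are hypothetical names introduced only for illustration.

```python
from math import prod


def row_major_strides(shape):
    """Strides, in elements, of a row-major (C-contiguous) tensor."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def element_at(row, physical_index):
    """Look up one element of a row given an index in physical dimension order."""
    shape, data = row["shape"], row["data"]
    if len(data) != prod(shape):
        raise ValueError("data length does not match shape")
    strides = row_major_strides(shape)
    offset = sum(i * s for i, s in zip(physical_index, strides))
    return data[offset]


def logical_view(shape, dim_names, permutation):
    """Apply the permutation to the physical names and shape."""
    names = [dim_names[p] for p in permutation]
    logical = [shape[p] for p in permutation]
    return names, logical


# Two rows of the same column holding tensors of different shapes.
rows = [
    {"data": list(range(24)), "shape": [2, 3, 4]},
    {"data": list(range(6)), "shape": [1, 2, 3]},
]
print(row_major_strides(rows[0]["shape"]))  # [12, 4, 1]
print(element_at(rows[0], (1, 2, 3)))  # 23
print(logical_view([10, 20, 30], ["x", "y", "z"], [2, 0, 1]))
# (['z', 'x', 'y'], [30, 10, 20])
```

The last call reproduces the example in the note: the stored shape and names stay physical, and only the permutation produces the logical view.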
| ========================= | ||
| Community Extension Types | ||
| ========================= | ||
|
|
||