Support Shredded Lists/Array in variant_get#8354
Support Shredded Lists/Array in variant_get#8354sdf-jkl wants to merge 57 commits intoapache:mainfrom
variant_get#8354Conversation
…ed_list_support
scovich
left a comment
There was a problem hiding this comment.
I'm not sure I understand how these unit tests will translate to variant_get?
Could you elaborate please? I am currently trying to build just the Shredded |
No worries -- the current iteration does look it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish! |
|
My question is: does The reason I am asking is that since we use the output of The only way to work with list arrays I came up with so far, is to build new arrays with
|
…ed_list_support
…ed_list_support
|
Hey @scovich I made it work for a one of the simple tests and it doesn't go through with the second one because Variant to Arrow does not support utf8 yet. Do we have an issue tracking variant_to_arrow types support? If not, I can make one. |
I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...) |
|
Everything looks good, code-wise -- nice and clean. But there's still an open question of whether we intend to follow the jsonpath spec in our path step logic, as e.g. spark does? The jsonpath spec requires In contrast, our current struct handling code currently returns an error if safe casting is disabled and:
|
|
@alamb -- any opinions about supporting jsonpath semantics or not? Or ideas on who we should seek input from? |
liamzwbao
left a comment
There was a problem hiding this comment.
Nice changes, thanks! One small nit on the tests.
klion26
left a comment
There was a problem hiding this comment.
LGMT, thanks for the contribution
…ed_list_support
|
@alamb following up on this. Please check #8354 (comment) |
In my opinion supporing jsonpath sounds like a good idea in general The usecase I know of (and have) for variant_get is to mirror spark's variant_get function: https://docs.databricks.com/aws/en/sql/language-manual/functions/variant_get So having the arrow-rs implementation match sounds good too |
|
Filed #9606 to track the jsonpath semantics -- it goes beyond just this PR. |
| DataType::List(_) => take_list_like_index_as_shredding_state::< | ||
| GenericListArray<i32>, | ||
| >(typed_value.as_ref(), *index)?, |
There was a problem hiding this comment.
Aside: It just makes my eyes hurt when fmt does stuff like this. But I don't know a way to make it better, unless we want to drastically shorten the function name
There was a problem hiding this comment.
take_list_like_index_state?
Last one still won't fit 😢
VariantPathElement::Index { index } => {
let state = match typed_value.data_type() {
DataType::List(_) => take_list_like_index_state::<GenericListArray<i32>>(
typed_value.as_ref(),
*index,
)?,
DataType::LargeList(_) => take_list_like_index_state::<GenericListArray<i64>>(
typed_value.as_ref(),
*index,
)?,
DataType::ListView(_) => take_list_like_index_state::<GenericListViewArray<i32>>(
typed_value.as_ref(),
*index,
)?,
DataType::LargeListView(_) => take_list_like_index_state::<
GenericListViewArray<i64>,
>(typed_value.as_ref(), *index)?,
_ => {There was a problem hiding this comment.
Making it take_list_index_state works:
VariantPathElement::Index { index } => {
let state = match typed_value.data_type() {
DataType::List(_) => {
take_list_index_state::<GenericListArray<i32>>(typed_value.as_ref(), *index)?
}
DataType::LargeList(_) => {
take_list_index_state::<GenericListArray<i64>>(typed_value.as_ref(), *index)?
}
DataType::ListView(_) => take_list_index_state::<GenericListViewArray<i32>>(
typed_value.as_ref(),
*index,
)?,
DataType::LargeListView(_) => take_list_index_state::<GenericListViewArray<i64>>(
typed_value.as_ref(),
*index,
)?,| // Peel away the prefix of path elements that traverses the shredded parts of this variant | ||
| // column. Shredding will traverse the rest of the path on a per-row basis. | ||
| let mut shredding_state = input.shredding_state().borrow(); | ||
| let mut shredding_state = input.shredding_state().clone(); |
There was a problem hiding this comment.
This is ultimately cloning an ArrayRef (cheap) BinaryViewArray (not expensive, but not cheap either -- has to go through ArrayData). Should we change ShreddingState::value to Option<Cow<'a, BinaryViewArray>> to avoid unnecessary cloning for struct-only paths? Then we could get rid of BorrowedShreddingState entirely -- which BTW I think this PR currently leaves as dead code (but no clippy warnings because it's pub)
There was a problem hiding this comment.
BorrowedShreddingState is still used in unshred_variant.rs, but that's it:
I tried using Cow before, but I was Cowing the whole ShreddingState, which was unnecessary.
I switched to .clone because value: Option<BinaryViewArray> is almost zero copy:
pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
// ...
pub struct GenericByteViewArray<T>
where
T: ByteViewType + ?Sized,
{
data_type: DataType, // 24 bytes
views: ScalarBuffer<u128>, // zero copy cloning
buffers: Arc<[Buffer]>, // zero copy cloning
phantom: PhantomData<T>, // zero sized
nulls: Option<NullBuffer>, // mostly zero copy cloning + 3usize
}Cowing just the BinaryViewArray should work.
There was a problem hiding this comment.
I'd defer to @alamb on how important this is -- I don't have a good sense of how expensive these shallow-deep clones are in practice, but I do know he was chasing a whole workstream of PR to avoid unnecessary ArrayData etc, which seems related.
There was a problem hiding this comment.
Well that solves that, then! How should we sequence that and the removal of BorrowedShreddingState?
There was a problem hiding this comment.
We could push that PR forward, file a separate PR for BorrowedShreddingState and once they're done go back here.
There was a problem hiding this comment.
Or push this one as is and fix it later
…th element (#9676) # Which issue does this PR close? Currently this is the only place in `main` that handles `Path` in `variant_get`. Other `variant_get` related PRs already follow the JSONPath sementics. (#9598 and #8354) - Closes #9606. # Rationale for this change Check issue # What changes are included in this PR? - Changed `variant_get` field path handling when can't cast to Struct - Updated the related unit test to check the new logic - Cleaned up some nearby tests # Are these changes tested? Yes, unit tests # Are there any user-facing changes? Yes, behavior change for `variant_get` kernel
Which issue does this PR close?
Rationale for this change
We should be able to
variant_getusing Indices to path throughVariantArraysWhat changes are included in this PR?
Are these changes tested?
Yes, unit tested.
Are there any user-facing changes?