Support Shredded Lists/Array in variant_get#8354

Open
sdf-jkl wants to merge 57 commits into apache:main from sdf-jkl:shredded_list_support

Conversation

@sdf-jkl
Contributor

@sdf-jkl sdf-jkl commented Sep 16, 2025

Which issue does this PR close?

Rationale for this change

We should be able to use variant_get with index path elements to navigate into VariantArrays.

What changes are included in this PR?

Are these changes tested?

Yes, unit tested.

Are there any user-facing changes?

Contributor

@scovich scovich left a comment

Left a couple comments that are hopefully helpful.

Also, we should (eventually) support nesting -- arrays and structs inside arrays.
Let's get simple lists of primitives working first, tho!

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Contributor

@scovich scovich left a comment

I'm not sure I understand how these unit tests will translate to variant_get?

@sdf-jkl
Contributor Author

sdf-jkl commented Sep 19, 2025

I'm not sure I understand how these unit tests will translate to variant_get?

Could you elaborate please?

I am currently trying to build just the Shredded List VariantArray test case, and while doing so I'm learning how we could build them in shred_variant later. Once I have a good way of building a simple Shredded List VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

@scovich
Contributor

scovich commented Sep 19, 2025

I'm not sure I understand how these unit tests will translate to variant_get?

Could you elaborate please?

I am currently trying to build just the Shredded List VariantArray test case, and while doing so I'm learning how we could build them in shred_variant later. Once I have a good way of building a simple Shredded List VariantArray, it will be easy to work on the rest of the unit tests for variant_get.

No worries -- the current iteration does look like it produces a correct shredded variant containing a list, so I should probably just be patient and let you finish!

@sdf-jkl
Contributor Author

sdf-jkl commented Sep 23, 2025

Hey @scovich, I see that in your current implementation of follow_shredded_path_element, when following the shredded path for VariantPathElement::Field succeeds, it returns ShreddedPathStep::Success(field.shredding_state()), which holds a ShreddingState::Typed that in turn holds a reference to the typed_value array (which we later use for the next steps).

My question is: does ShreddedPathStep::Success() have to take the input ShreddingState by reference?

The reason I am asking is that since we use the output of follow_shredded_path_element to get the values from the shredded VariantArray, shouldn't we be free to drop the outer array once we extract the relevant typed_value?

The only way I have come up with so far to work with list arrays is to build new arrays with arrow_select::take, combining the path index and the GenericListArray offsets.
But this method creates new arrays within the scope of the function, so we can't put a reference to the array in ShreddedPathStep::Success.
(I just pushed a commit with a non-working implementation of the idea.)
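
As a rough illustration of the offsets-plus-index idea described above, here is a minimal sketch in plain Rust with no Arrow types. The function name `list_element_take_indices` is made up for this example and is not part of the PR. It assumes Arrow-style list offsets (one more entry than there are rows) and returns, per row, the position of the requested element in the flattened child values, or `None` when the row has too few elements.

```rust
/// For each list row, return the position in the flattened child values
/// of the element at `index`, or `None` when the row is too short.
/// `offsets` has one more entry than there are rows, as in Arrow list arrays.
/// (Hypothetical helper for illustration only; not code from this PR.)
fn list_element_take_indices(offsets: &[i32], index: usize) -> Vec<Option<u32>> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            let len = end - start;
            if index < len {
                // Element exists: its absolute position in the child array.
                Some((start + index) as u32)
            } else {
                // Out of bounds for this row.
                None
            }
        })
        .collect()
}
```

Indices of this shape could presumably be handed to a take kernel, with the `None` rows becoming nulls in the result. Because the result is a newly built array, whatever holds it has to own it rather than borrow it, which is the ownership problem raised above.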

Should we instead look for another way to represent a resulting array consisting of slices instead?

I just saw #8392.

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
@sdf-jkl
Contributor Author

sdf-jkl commented Sep 25, 2025

Hey @scovich, I made it work for one of the simple tests, but the second one fails because Variant to Arrow does not support utf8 yet.

Do we have an issue tracking variant_to_arrow types support? If not, I can make one.

@scovich
Contributor

scovich commented Sep 26, 2025

I made it work for one of the simple tests, but the second one fails because Variant to Arrow does not support utf8 yet.

Do we have an issue tracking variant_to_arrow types support? If not, I can make one.

I'm not sure we have a tracking issue for utf8 support in variant_to_arrow, but I've also noticed that it's an annoying gap for unit testing (we all seem to reach for string values...)

@sdf-jkl sdf-jkl marked this pull request as ready for review February 26, 2026 04:10
Comment thread parquet-variant-compute/src/variant_get.rs Outdated
@scovich
Contributor

scovich commented Mar 2, 2026

Everything looks good, code-wise -- nice and clean.

But there's still an open question of whether we intend to follow the jsonpath spec in our path step logic, as Spark does, for example.
#8354 (comment)

The jsonpath spec requires foo[100] to return NULL if foo is not an array, and also requires returning NULL if foo has fewer than 101 elements. Similarly, foo.bar should return NULL if foo is not a struct and should also return NULL if foo has no field named bar. Safe casting would only influence actual casting decisions, e.g. a variant_get call that specifically requests a string and the requested path points to a struct.

In contrast, our struct handling code currently returns an error if safe casting is disabled and:

  • a Field path step encounters a "wrong" type (L169)
  • an Index path step encounters a "wrong" type (L224)
  • an Index path step is out of bounds (L99)
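
To make the jsonpath-style behavior described above concrete, here is a small sketch using toy types. `Value`, `Step`, and `get_path` are invented for this example and are not the parquet-variant API. Under these assumptions, any path step that cannot be followed (wrong type, missing field, index out of bounds) yields `None`, standing in for NULL, instead of an error.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value; not the parquet-variant `Variant` type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    List(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical path step, loosely modelled on `VariantPathElement`.
enum Step<'a> {
    Field(&'a str),
    Index(usize),
}

/// JSONPath-style lookup: a step that cannot be followed yields `None`
/// (NULL) rather than an error.
fn get_path<'v>(root: &'v Value, path: &[Step<'_>]) -> Option<&'v Value> {
    let mut current = root;
    for step in path {
        current = match (current, step) {
            // Field step on an object: NULL if the field is missing.
            (Value::Object(fields), Step::Field(name)) => fields.get(*name)?,
            // Index step on a list: NULL if the index is out of bounds.
            (Value::List(items), Step::Index(i)) => items.get(*i)?,
            // Any step applied to the "wrong" type: NULL.
            _ => return None,
        };
    }
    Some(current)
}
```

In this model the three bulleted cases all fall through to `None`; safe casting would only matter after the path has been resolved, when the found value is converted to the requested type.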

@scovich
Contributor

scovich commented Mar 2, 2026

@alamb -- any opinions about supporting jsonpath semantics or not? Or ideas on who we should seek input from?

Contributor

@liamzwbao liamzwbao left a comment

Nice changes, thanks! One small nit on the tests.

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Member

@klion26 klion26 left a comment

LGTM, thanks for the contribution

@sdf-jkl
Contributor Author

sdf-jkl commented Mar 19, 2026

@alamb following up on this.

Please check #8354 (comment)

@alamb
Contributor

alamb commented Mar 21, 2026

@alamb -- any opinions about supporting jsonpath semantics or not? Or ideas on who we should seek input from?

In my opinion, supporting jsonpath sounds like a good idea in general.

The use case I know of (and have) for variant_get is to mirror Spark's variant_get function: https://docs.databricks.com/aws/en/sql/language-manual/functions/variant_get

So having the arrow-rs implementation match sounds good too

@scovich
Contributor

scovich commented Mar 23, 2026

Filed #9606 to track the jsonpath semantics -- it goes beyond just this PR.

Comment thread parquet-variant-compute/src/variant_get.rs Outdated
Comment thread parquet-variant/src/path.rs
Comment on lines +179 to +181
DataType::List(_) => take_list_like_index_as_shredding_state::<
GenericListArray<i32>,
>(typed_value.as_ref(), *index)?,
Contributor

Aside: It just makes my eyes hurt when fmt does stuff like this. But I don't know a way to make it better, unless we want to drastically shorten the function name ☹️

Contributor Author

take_list_like_index_state?

Last one still won't fit 😢

        VariantPathElement::Index { index } => {
            let state = match typed_value.data_type() {
                DataType::List(_) => take_list_like_index_state::<GenericListArray<i32>>(
                    typed_value.as_ref(),
                    *index,
                )?,
                DataType::LargeList(_) => take_list_like_index_state::<GenericListArray<i64>>(
                    typed_value.as_ref(),
                    *index,
                )?,
                DataType::ListView(_) => take_list_like_index_state::<GenericListViewArray<i32>>(
                    typed_value.as_ref(),
                    *index,
                )?,
                DataType::LargeListView(_) => take_list_like_index_state::<
                    GenericListViewArray<i64>,
                >(typed_value.as_ref(), *index)?,
                _ => {

Contributor Author

Making it take_list_index_state works:

        VariantPathElement::Index { index } => {
            let state = match typed_value.data_type() {
                DataType::List(_) => {
                    take_list_index_state::<GenericListArray<i32>>(typed_value.as_ref(), *index)?
                }
                DataType::LargeList(_) => {
                    take_list_index_state::<GenericListArray<i64>>(typed_value.as_ref(), *index)?
                }
                DataType::ListView(_) => take_list_index_state::<GenericListViewArray<i32>>(
                    typed_value.as_ref(),
                    *index,
                )?,
                DataType::LargeListView(_) => take_list_index_state::<GenericListViewArray<i64>>(
                    typed_value.as_ref(),
                    *index,
                )?,

// Peel away the prefix of path elements that traverses the shredded parts of this variant
// column. Shredding will traverse the rest of the path on a per-row basis.
- let mut shredding_state = input.shredding_state().borrow();
+ let mut shredding_state = input.shredding_state().clone();
Contributor

This ultimately clones an ArrayRef (cheap) and a BinaryViewArray (not expensive, but not cheap either -- it has to go through ArrayData). Should we change ShreddingState::value to Option<Cow<'a, BinaryViewArray>> to avoid unnecessary cloning for struct-only paths? Then we could get rid of BorrowedShreddingState entirely -- which, BTW, I think this PR currently leaves as dead code (but with no clippy warnings because it's pub).

Contributor Author

BorrowedShreddingState is still used in unshred_variant.rs, but that's it:

fn try_new_opt(shredding_state: BorrowedShreddingState<'a>) -> Result<Option<Self>> {


I tried using Cow before, but I was Cowing the whole ShreddingState, which was unnecessary.

I switched to .clone because value: Option<BinaryViewArray> is almost zero copy:

pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
// ...
pub struct GenericByteViewArray<T>
where
    T: ByteViewType + ?Sized,
{
    data_type: DataType, // 24 bytes
    views: ScalarBuffer<u128>, // zero copy cloning
    buffers: Arc<[Buffer]>, // zero copy cloning
    phantom: PhantomData<T>,  // zero sized
    nulls: Option<NullBuffer>,  // mostly zero copy cloning + 3usize
}

Cowing just the BinaryViewArray should work.
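
For readers unfamiliar with the pattern, here is a minimal sketch of what "Cowing" a single field could look like. It uses toy types only: `Payload`, `State`, `field_step`, and `index_step` are invented for this example and are not the actual ShreddingState code. The assumption is that struct-like steps can borrow from the input, while list-like steps have to build new data and therefore own it.

```rust
use std::borrow::Cow;

/// Toy stand-in for an array payload; not an Arrow type.
#[derive(Debug, Clone, PartialEq)]
struct Payload(Vec<u8>);

/// Hypothetical state whose `value` is borrowed when a path step can reuse
/// the input and owned when the step has to build a new array.
struct State<'a> {
    value: Option<Cow<'a, Payload>>,
}

/// Struct-like step: reuses the input without cloning.
fn field_step(input: &Payload) -> State<'_> {
    State { value: Some(Cow::Borrowed(input)) }
}

/// List-like step: materializes new data, so the state must own it.
fn index_step(input: &Payload, index: usize) -> State<'static> {
    // Pick the element at `index` if it exists; otherwise produce an empty payload.
    let picked = input.0.get(index).copied().into_iter().collect();
    State { value: Some(Cow::Owned(Payload(picked))) }
}
```

With this shape, a struct-only path never clones the payload, and a path containing a list step pays for one owned value at the point where new data is created.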

Contributor

I'd defer to @alamb on how important this is -- I don't have a good sense of how expensive these shallow-deep clones are in practice, but I do know he was chasing a whole workstream of PRs to avoid unnecessary ArrayData etc., which seems related.

Contributor Author

@scovich the changes in #9610 will make the value field an ArrayRef. That way there'd be no need for Cow or BorrowedShreddingState.

https://github.com/apache/arrow-rs/pull/9610/changes#diff-5f5ebb25cba94551493d3a9caa2ab4d94d19b1ce4ce3e3eac7662109a8966794R420

Contributor

Well that solves that, then! How should we sequence that and the removal of BorrowedShreddingState?

Contributor Author

We could push that PR forward, file a separate PR for BorrowedShreddingState and once they're done go back here.

Contributor Author

Or push this one as is and fix it later

scovich pushed a commit that referenced this pull request Apr 10, 2026
…th element (#9676)

# Which issue does this PR close?

Currently this is the only place in `main` that handles `Path` in
`variant_get`. Other `variant_get` related PRs already follow the
JSONPath semantics. (#9598 and #8354)
- Closes #9606.

# Rationale for this change

Check issue

# What changes are included in this PR?

- Changed `variant_get` field path handling when the value can't be cast to Struct
- Updated the related unit test to check the new logic
- Cleaned up some nearby tests

# Are these changes tested?
Yes, unit tests

# Are there any user-facing changes?
Yes, behavior change for `variant_get` kernel

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support VariantPathElement::Index for Variant Arrays for variant_get

5 participants