[Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route (#21554)
Conversation
PR Code Analyzer (bot): AI-powered 'Code-Diff-Analyzer' found issues on commit ae3fdc2.
@ahkcs Thanks Kai, can you pls fix this one: looks like we need to add this dependency's license files.
The `dependencyLicenses` precommit task scans `licenses/` for a `<jar>.sha1`
sibling per bundled dependency. Two deps added in this PR were missing them:
* `commons-text-1.11.0` in analytics-engine — needs sha1 + LICENSE +
NOTICE (no shared `commons-text-*` license files yet in this plugin).
Apache 2.0; LICENSE and NOTICE extracted from the released jar.
* `jackson-datatype-jsr310-2.21.3` in arrow-flight-rpc — sha1 only.
arrow-flight-rpc's `dependencyLicenses` already maps `jackson-.*` to
the shared `jackson-LICENSE` / `jackson-NOTICE` files via
`mapping from: /jackson-.*/, to: 'jackson'`, so no new license/notice
files are needed.
Plus googleJavaFormat reflow on `ArraySliceAdapter` and `DataFusionPlugin`
that spotlessCheck flagged in precommit.
Verified `:plugins:arrow-flight-rpc:precommit`,
`:sandbox:plugins:analytics-engine:precommit`, and
`:sandbox:plugins:analytics-backend-datafusion:precommit` all succeed.
Addresses review feedback on opensearch-project#21554.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Fixed in aa2fb0a. Added the SHA + LICENSE + NOTICE for the missing dependencies.
Force-pushed d2f73b7 to fa534e7.
* Default empty array() return type to ARRAY<VARCHAR>

PPL's `array()` no-arg form delegates to Calcite's `SqlLibraryOperators.ARRAY` return-type inference, which returns ARRAY<NULL> for an empty operand list and ARRAY<UNKNOWN> when all operands are typeless nulls. Both markers are fine for the v2 engine — `ArrayImplementor.internalCast` only consumes the element type when there are elements to cast, so an empty result Object list flows straight through to ExprCollectionValue regardless of declared element type.

The analytics-engine route is stricter. When isthmus walks a RexCall like `mvjoin(array(), '-')`, it reaches its first operand's type and feeds it to `io.substrait.isthmus.TypeConverter.toSubstrait`, which throws `UnsupportedOperationException: Unable to convert the type UNKNOWN`. Substrait has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't serialize the call at all. Two PPL ITs hit this directly:

* `CalciteArrayFunctionIT.testMvjoinWithEmptyArray`
* `CalciteArrayFunctionIT.testMvdedupWithEmptyArray`

Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives the call a substrait-serializable type without affecting any value computation: the result list is empty either way.

# Test plan

* Unit tests: `:core:test --tests "*ArrayFunction*"` — passes locally (no existing tests asserted on the empty-array element type).
* IT: `CalciteArrayFunctionIT` force-routed through the analytics-engine path via opensearch-project/OpenSearch#21554's plugin set — testMvjoinWithEmptyArray and testMvdedupWithEmptyArray now pass (were UNKNOWN type errors); pass-rate moved 26/60 → 28/60.

Companion to opensearch-project/OpenSearch#21554.
Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Add unit tests for empty/UNKNOWN ARRAY → VARCHAR fallback

Cover the four shapes that exercise the return-type inference path introduced in 666dc0e:

* array() — 0 operands, fallback fires → ARRAY<VARCHAR>
* array(NULL) — typeless-null operand, fallback fires → ARRAY<VARCHAR>
* array(1) — INTEGER operand, fallback does NOT fire → ARRAY<INTEGER>
* array('a', 'b') — VARCHAR operands, fallback does NOT fire → ARRAY<VARCHAR>

The third case is the regression guard requested by review — confirms concrete element types pass through unchanged and the fallback is scoped strictly to the {@code NULL}/{@code UNKNOWN} markers.

The harness uses Calcite's {@link ExplicitOperatorBinding} bound to {@link SqlLibraryOperators#ARRAY} so the inference's internal {@code SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(...)} call resolves the same operator the production code delegates to — mocking {@code SqlOperatorBinding} directly hits NPEs deep inside Calcite's least-restrictive-type computation.

Addresses review feedback on #5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Trim inline comment per review feedback

The "why" lives in the PR description; the inline comment now points there instead of duplicating it.

Addresses dai-chen's review note on #5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
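The fallback the commits above describe is small enough to sketch in plain Java. This is a minimal model with hypothetical names — the real code operates on Calcite's `RelDataType`/`SqlTypeName` markers, not this stand-in enum:

```java
// Hypothetical sketch of the return-type fallback: substitute VARCHAR when
// the inferred element type is the NULL/UNKNOWN marker, and pass concrete
// element types through unchanged (the regression-guard case).
public class ArrayReturnTypeFallback {
    // Stand-ins for Calcite's SqlTypeName markers (assumption: the real
    // code checks SqlTypeName.NULL / SqlTypeName.UNKNOWN on the inferred type).
    public enum ElementType { NULL, UNKNOWN, INTEGER, VARCHAR, DOUBLE }

    // array() infers NULL, array(NULL) infers UNKNOWN; default both to
    // VARCHAR so the declared element type is substrait-serializable.
    public static ElementType resolveElementType(ElementType inferred) {
        if (inferred == ElementType.NULL || inferred == ElementType.UNKNOWN) {
            return ElementType.VARCHAR;
        }
        return inferred; // concrete types (e.g. array(1) → INTEGER) untouched
    }
}
```

The value computation is unaffected either way: the result list is empty, so only the declared schema type changes.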
Force-pushed 44d604e to a652406.
Force-pushed a652406 to ae3fdc2.
❌ Gradle check result for ae3fdc2: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
…ength / array_slice / array_distinct / mvjoin

Onboards the PPL `array(a, b, …)` constructor and four array-consuming functions to the analytics-engine route by mapping their Calcite lowering targets through Substrait to DataFusion's native make_array / array_length / array_slice / array_distinct / array_to_string.

Same templated shape as the `replace` PR (opensearch-project#21527), with two extensions:

ScalarFunction enum constants (5)
+ STANDARD_PROJECT_OPS / ARRAY_RETURNING_PROJECT_OPS membership
+ opensearch_array_functions.yaml extension entries
+ ADDITIONAL_SCALAR_SIGS Calcite-op→Substrait-name bridges
+ scalarFunctionAdapters() entries for the 3 functions that need operand normalization
= onboarded to the analytics route.

Capability lookup at OpenSearchProjectRule keys on the call's return type; for array-returning functions (`array(...)`, `array_slice`, `array_distinct`) the return type resolves to `SqlTypeName.ARRAY`, which previously hit `default → null` in `FieldType.fromSqlTypeName` and emptied the viable-backend list before the registration could match.

* `FieldType.ARRAY` added to the analytics SPI enum.
* `SqlTypeName.ARRAY → FieldType.ARRAY` mapping in `fromSqlTypeName`.
* `ARRAY_RETURNING_PROJECT_OPS` registered against `Set.of(FieldType.ARRAY)` only — separate from `STANDARD_PROJECT_OPS` so `FieldType.ARRAY` doesn't pollute filter / aggregate capabilities (no meaningful semantics over array-typed values there).
* `ArrowSchemaFromCalcite.toArrowField` recurses into the component type to build the matching Arrow `List<inner>` field — without this the result schema would have a bare `List` with no element field and the backend's Arrow IPC reader would fail to bind result columns.

Substrait's standard catalog has no array_* entries, so isthmus' `RexExpressionConverter` would fail with "Unable to convert call …" on every array call. New `opensearch_array_functions.yaml` declares:

* `make_array(any1, …)` → `list<any1>` (variadic, min: 0).
* `array_length(list<any1>)` → `i64?`.
* `array_slice(list<any1>, i64, i64)` → `list<any1>` (with i32 fallback).
* `array_distinct(list<any1>)` → `list<any1>`.
* `array_to_string(list<any1>, string)` → `string?` (with varchar fallback).

Loaded via `SimpleExtension.load("/opensearch_array_functions.yaml")` and merged into the plugin's extension collection in `DataFusionPlugin.loadSubstraitExtensions()`.

Substrait's call-conversion path (and DataFusion's signature matcher) is strict about operand types in ways Calcite's PPL lowering doesn't naturally satisfy. Three adapters bridge the gap:

* `MakeArrayAdapter` — implements `ScalarFunctionAdapter` directly (not `AbstractNameMappingAdapter`). PPL's `ArrayFunctionImpl` infers `ARRAY<commonElementType>` for the call's return type but does NOT widen the individual operand types. So `array(1, 1.5)` produces a RexCall whose operands are `(INTEGER, DECIMAL(2,1))` but whose return type is `ARRAY<DOUBLE>`. Substrait's variadic-`any1` consistency validator throws an `AssertionError` in that case (not a recoverable exception — it fatally exits the search-thread JVM). The adapter extracts the call's component type and CASTs each non-matching operand to it before emission.
* `ArrayToStringAdapter` — declares a local `array_to_string` op and name-maps `SqlLibraryOperators.ARRAY_JOIN` → it.
* `ArraySliceAdapter` — passes the `ARRAY_SLICE` call through unchanged but coerces the index operands (positions 1, 2, optional 3) to `BIGINT`. PPL's parser types positive integer literals as `DECIMAL(20,0)`; DataFusion's `array_slice` signature accepts only integer indexes and refuses to coerce decimal arguments.

Two third-party dependencies surfaced as fatal `NoClassDefFoundError` during execution of array-returning calls:

* `commons-text` to analytics-engine — Calcite's `SqlFunctions` class statically references `org.apache.commons.text.similarity.LevenshteinDistance`. Without it, any Calcite RelNode walk that touches `SqlFunctions.<clinit>` poisons the search-thread JVM.
* `jackson-datatype-jsr310` to **arrow-flight-rpc** (the parent plugin that bundles `arrow-vector`). `arrow-vector`'s `JsonStringArrayList` eagerly registers `JavaTimeModule` on its ObjectMapper in `<clinit>`, so any reader of an Arrow `ListVector` (i.e. every array-returning DataFusion call flowing through analytics-engine) hits a fatal NoClassDefFoundError. The dep belongs on arrow-flight-rpc's classpath because that plugin defines arrow-vector's classloader; bundling it in analytics-backend-datafusion (the child plugin) is invisible to arrow-vector. Marked `compileOnly` here to avoid jar-hell with arrow-flight-rpc's `api` dependency.

# Pass-rate (CalciteArrayFunctionIT, force-routed)

* Before: 1/60 (testArrayWithMix only — exercises an error path that fails before the ARRAY capability lookup).
* After: 9/60. Newly passing: testArray, testArrayWithString, testArrayLength, testMvjoinWithStringArray, testMvjoinWithStringifiedNumbers, testMvjoinWithMixedStringValues, testMvjoinWithStringBooleans, testMvjoinWithSpecialDelimiters, testMvjoinWithArrayFromRealFields, testMvjoinWithMultipleRealFields.

The remaining 51 failures fall into three buckets:

* 50 — out-of-scope S1+ functions (`mvfind`, `mvzip`, `reduce`, `transform`, `forall`, `filter`, `exists`, `ITEM`). These are PPL UDFs without direct DataFusion equivalents and need either lambda-substrait wiring or custom UDF registration on the Rust side.
* 5 — `testMvindexRange*` family. PPL's `mvindex(arr, from, to)` lowers to `ARRAY_SLICE(arr, from+1, to+1)` (1-based shift) but the lowering is missing the +1, so DataFusion's 1-based array_slice returns a window shifted by one. Fix belongs in the SQL plugin's PPL→Calcite lowering layer.
* 1 — `testMvindexRangeMixed` JSON formatting mismatch (test code expects bare `[a,b,c]` but the response is `"[\"a\",\"b\",\"c\"]"`).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…th) → 1-based (start, end) for DataFusion

Calcite's `SqlLibraryOperators.ARRAY_SLICE` is the Spark / Hive flavor — 0-based start, third arg is the length-of-elements to take. PPL's `MVIndexFunctionImp.resolveRange` (in the SQL plugin) emits this form, e.g. `mvindex(arr=[1..5], 1, 3)` → `ARRAY_SLICE(arr, 1, 3)` meaning "start at 0-based position 1, take 3 elements" → expected `[2, 3, 4]`.

DataFusion's native `array_slice` is the Postgres / Snowflake flavor — 1-based start, third arg is the inclusive end-index. So the same call `array_slice(arr, 1, 3)` returns elements at 1-based positions 1..3 → `[1, 2, 3]`. Off-by-one across every `mvindex` range query.

Convert the operands in the adapter rather than the SQL plugin's PPL lowering, because the lowering's existing semantics are correct for Calcite's local executor (used by every non-analytics path); the bug is only in the bridge to DataFusion.

start' = start + 1
end'   = start + length  (== start + 1 + (length - 1))

`MVIndexFunctionImp` already normalizes negative indexes to non-negative 0-based positions before invoking ARRAY_SLICE (it uses `arrayLen + idx`), so the arithmetic above applies uniformly.

Empirically: `mvindex(arr=[1..5], 1, 3)` now returns the correct values `[2, 3, 4]` (was `[1, 2, 3]`); negative form `mvindex(arr, -3, -1)` returns `[3, 4, 5]` (was `[2, 3]`); mixed `mvindex(arr, -4, 2)` returns `[2, 3]` matching the PPL spec.

The 5 `testMvindexRange*` tests still don't pass on the IT, but for an unrelated reason — array-typed result values are being returned as JSON-stringified scalars (`"[2,3,4]"`) instead of typed arrays. That's a response-formatting issue affecting every array-returning test (also `testArray`, `testArrayWithString`) and lives in a different code path; it'll be addressed separately.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
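The index conversion above is pure arithmetic, so it can be checked with a small Java sketch. The names are illustrative (the real rewrite happens on RexCall operands inside `ArraySliceAdapter`); the second method is only a reference model of DataFusion's 1-based, inclusive-end slicing:

```java
import java.util.List;

// Sketch of the operand rewrite: convert Calcite's (0-based start, length)
// pair to DataFusion's 1-based, inclusive (start, end) pair.
public class SliceIndexConversion {
    // start' = start + 1; end' = start + length
    public static long[] toDataFusion(long start, long length) {
        return new long[] { start + 1, start + length };
    }

    // Reference model of DataFusion's array_slice: 1-based positions,
    // inclusive end-index.
    public static List<Integer> dataFusionSlice(List<Integer> arr, long start, long end) {
        return arr.subList((int) start - 1, (int) end);
    }
}
```

With `arr = [1..5]`, Calcite's `(1, 3)` becomes DataFusion's `(2, 4)`, which slices out `[2, 3, 4]` — the expected mvindex result.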
…t) → DataFusion array_element
PPL's `mvindex(arr, N)` single-element form lowers (in `MVIndexFunctionImp.resolveSingleElement`)
to Calcite's `SqlStdOperatorTable.ITEM` operator with a 1-based index (already
converted from PPL's 0-based input). DataFusion's native single-element array
accessor is `array_element` (also 1-based), so a name-mapping adapter + yaml
extension entry are sufficient.
Templated shape:
ScalarFunction.ITEM (SqlKind.ITEM)
+ STANDARD_PROJECT_OPS membership (returns the array's element type, which
resolves through the existing FieldType.fromSqlTypeName → SUPPORTED_FIELD_TYPES
capability lookup for non-array element types — array-of-array is rare in
PPL and not exercised by the current test surface)
+ scalarFunctionAdapters() entry → ArrayElementAdapter
↳ rewrites SqlStdOperatorTable.ITEM to a locally-declared SqlFunction
named "array_element"
↳ coerces the index operand to BIGINT (PPL's parser produces DECIMAL
for positive integer literals; DataFusion's array_element rejects
DECIMAL indexes, same as array_slice)
+ ADDITIONAL_SCALAR_SIGS bridge for the locally-declared op
+ opensearch_array_functions.yaml extension entry:
array_element(list<any1>, i64) → any1?
# Pass-rate (CalciteArrayFunctionIT, force-routed)
* Before this commit: 9/60.
* After this commit: 12/60.
Newly passing: testMvindexSingleElementPositive,
testMvindexSingleElementNegative,
testMvindexSingleElementNegativeMiddle.
The other 3 tests that hit the ITEM rejection (testMvfindWith*) are
multi-step queries where ITEM is one node in a tree that also includes
unrelated S1+ functions (mvfind/mvzip/etc.); they remain blocked by
the upstream functions, not by ITEM itself.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
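The adapter's two jobs — coercing the DECIMAL-typed index to BIGINT and mapping to a 1-based accessor — can be sketched in plain Java. These are hypothetical method names, and the out-of-range-returns-null behavior is an assumption about DataFusion's `array_element`, not something the commit message states:

```java
import java.math.BigDecimal;
import java.util.List;

// Illustrative sketch of ArrayElementAdapter's behavior, not the plugin's
// real API.
public class ArrayElementSketch {
    // PPL's parser types positive integer literals as DECIMAL; DataFusion's
    // array_element rejects DECIMAL indexes, so the adapter casts to BIGINT.
    public static long coerceIndex(BigDecimal decimalLiteral) {
        return decimalLiteral.longValueExact();
    }

    // Model of a 1-based single-element accessor (null assumed for
    // out-of-range indexes).
    public static <T> T arrayElement(List<T> arr, long index) {
        return (index >= 1 && index <= arr.size()) ? arr.get((int) index - 1) : null;
    }
}
```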
…ithout JSON-stringifying
The row-oriented fragment-execution wire format (`FragmentExecutionResponse`,
used when arrow-flight streaming is disabled — every single-node test cluster
today) shipped each cell through OpenSearch's `writeGenericValue` /
`readGenericValue`, which preserves `List` values as `ArrayList<Object>`. On
the coordinator side, `RowResponseCodec.decode` then re-materialized the rows
into a `VectorSchemaRoot` for `Iterable<VectorSchemaRoot>`-style consumers.
Two bugs in that re-materialization were eating array values:
1. `inferArrowType` walked rows for the first non-null cell and matched
against {Long, Integer, …, CharSequence, byte[], Number}. {@code List}
wasn't in the chain, so it fell through to {@code break} and the
fallback {@link ArrowType.Utf8} — every array column became a VARCHAR
column.
2. `setVectorValue` for {@link VarCharVector} called {@code value.toString()}.
For a {@code JsonStringArrayList} that returns the JSON form
{@code "[2,3,4]"}, which then got serialized as a JSON string in the
final response. Tests like {@code testMvindexRangePositive} saw their
array result come back as a string `"[2,3,4]"` instead of an array
`[2, 3, 4]`.
Fix:
* Replace {@code inferArrowType} with {@code inferField} that returns a
full {@link Field}. For {@code List} cells, build a list field with the
inner element type inferred from the first non-null element (with a
fallback that scans later rows in case the first list is empty/all-null).
* Add a {@code ListVector} arm to {@code setVectorValue} that delegates to
a new {@code writeListValue}. The writer bypasses {@link UnionListWriter}
entirely — it writes directly to the list's offset / validity buffers and
to the inner data vector via the inner vector's typed `setSafe`. The
writer-based API requires per-element `ArrowBuf` allocations for varchar
elements that are easy to leak or use-after-free; the direct path is
simpler and avoids both classes of bug.
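The buffer shapes the direct-write path fills can be modeled in plain Java: a flat child values array, an offsets array with one more entry than there are rows, and a per-row validity flag. This is only an illustrative model of Arrow's list layout, not the plugin's actual `writeListValue` code (which writes real `ArrowBuf`s):

```java
import java.util.ArrayList;
import java.util.List;

// Model of the three pieces a ListVector row-write touches: child values,
// offsets (length = rows + 1), and validity.
public class ListLayoutSketch {
    public final List<String> values = new ArrayList<>();                 // child data vector
    public final List<Integer> offsets = new ArrayList<>(List.of(0));     // row i spans offsets[i]..offsets[i+1]
    public final List<Boolean> validity = new ArrayList<>();

    public void writeRow(List<String> row) {
        if (row == null) {
            validity.add(false);
            offsets.add(values.size()); // null row: zero-length span
            return;
        }
        validity.add(true);
        values.addAll(row);
        offsets.add(values.size());
    }

    // Read row i back the way ListVector.getObject slices it.
    public List<String> readRow(int i) {
        return validity.get(i) ? values.subList(offsets.get(i), offsets.get(i + 1)) : null;
    }
}
```

Writing directly to these structures sidesteps the per-element buffer allocations the writer-based API requires, which is the leak/use-after-free hazard the commit message calls out.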
Plus a separate Arrow gotcha that surfaced once arrays started flowing
through correctly:
* {@code ListVector.getObject} for a {@code VarCharVector} child returns a
{@code JsonStringArrayList} whose elements are Arrow's {@link Text} class,
not Java {@link String}. {@code ExprValueUtils.fromObjectValue} doesn't
recognize {@code Text} and threw "unsupported object class
org.apache.arrow.vector.util.Text". {@code ArrowValues.toJavaValue} now
mirrors its top-level VarChar branch for list cells: when a list value
comes back from a {@code ListVector}, normalize each {@code Text} element
to a {@link String} before handing the list upward.
* Before: 12/60 (mvindex range tests still showed expected-vs-actual
diff because `[2,3,4]` came back as a JSON string, not an array).
* After: 26/60.
Newly passing:
testMvindexRangePositive, testMvindexRangeNegative, testMvindexRangeMixed,
testMvindexRangeFirstThree, testMvindexRangeLastThree,
testMvindexRangeSingleElement,
testMvdedupWithDuplicates, testMvdedupWithAllDuplicates,
testMvdedupWithNoDuplicates, testMvdedupWithStrings,
testArrayWithString,
testSplitWithSemicolonDelimiter, testSplitWithMultiCharDelimiter,
testSplitWithEmptyDelimiter.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
… inference
{@code RowResponseCodec.scalarArrowType} ordered its instanceof checks
{Long, Integer, Short, Byte, Double, Float, Boolean, CharSequence, byte[],
Number(fallback) → Int(64)}. BigDecimal extends {@link Number} but isn't any
of the typed scalar arms, so it fell through to the {@code Number} fallback
and got encoded as a 64-bit integer column — silently truncating fractional
digits.
This bites PPL flows whose common element type is {@code DECIMAL} (e.g.
{@code array(1, -1.5, 2, 1.0)} — the v2-side {@code ArrayImplementor.internalCast}
explicitly maps the DECIMAL target to BigDecimal cells). The element values
{@code -1.5} and {@code 1.0} round to {@code -1} and {@code 1} when forced
through Int(64), so the array reads back as {@code [1, -1, 2, 1]} instead of
{@code [1, -1.5, 2, 1.0]}.
Promote BigDecimal cells to FloatingPoint(DOUBLE) — same precision the v2
engine uses for decimal-typed PPL results, so behavior matches across both
execution paths. The list writer's {@code Float8Vector} arm already uses
{@code ((Number) element).doubleValue()}, which correctly extracts the
fractional value from a BigDecimal.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
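The truncation the commit describes is visible with two stdlib calls. A minimal sketch (hypothetical method names; the real arms live in `RowResponseCodec.scalarArrowType` and the list writer):

```java
import java.math.BigDecimal;

// Old vs. fixed cell encoding for a BigDecimal that only matched the
// generic Number fallback.
public class DecimalPromotion {
    // Old behavior: Number fallback → Int(64) cell, truncating fractions.
    public static long asInt64Cell(Number n) {
        return n.longValue();
    }

    // Fixed behavior: a BigDecimal arm ahead of the fallback →
    // FloatingPoint(DOUBLE) cell, preserving the fractional value.
    public static double asFloat8Cell(Number n) {
        return n.doubleValue();
    }
}
```

`new BigDecimal("-1.5")` becomes `-1` through the old path and `-1.5` through the new one — exactly the `[1, -1, 2, 1]` vs. `[1, -1.5, 2, 1.0]` difference described above.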
PPL `mvzip(left, right [, sep])` element-wise zips two arrays into a list of
strings, joined per pair by a separator (default `,`). DataFusion has no
stdlib equivalent — `array_concat` is end-to-end concatenation, and Substrait's
lambda support is too thin for a transform/zip rewrite — so this onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context and wires the Java side to route to it.
Templated shape (extends the existing pattern from convert_tz):
Rust side:
udf::mvzip::MvzipUdf — Signature::user_defined; coerce_types pins the
first two args to ListArray and the optional 3rd to Utf8; invoke_with_args
iterates per row, takes min(len(left), len(right)) elements, stringifies
each (matching `Objects.toString(elem, "")` for null elements), and
builds a List<Utf8>. Defensive Null-element-type arm handles the empty
array case before the SQL-plugin VARCHAR-default kicks in.
Registered on each session context via udf::register_all alongside
convert_tz. 7 unit tests cover the basic / custom-sep / truncation /
null-element / null-array / empty-array / numeric-array shapes.
Java side:
ScalarFunction.MVZIP enum entry (SqlKind.OTHER_FUNCTION; resolves through
identifier-name valueOf("MVZIP") since PPL's MVZipFunctionImpl registers
under the function name "mvzip").
MvzipAdapter — locally-declared SqlFunction("mvzip") + ADDITIONAL_SCALAR_SIGS
bridge so isthmus emits a Substrait scalar function call with the exact
name the Rust UDF is registered under.
DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
(returns ARRAY<VARCHAR>, registered against FieldType.ARRAY); adapter
registration in scalarFunctionAdapters().
opensearch_array_functions.yaml: two impls for arity-2 and arity-3.
* Before: 28/60.
* After: 34/60.
Newly passing — all 5 testMvzip* variants:
testMvzipBasic, testMvzipWithCustomDelimiter, testMvzipNested,
testMvzipWithEmptyArray, testMvzipWithBothEmptyArrays.
(Test count delta is +6 because the test class also exercises mvzip in 1
other test under a different name, picked up by the same fix.)
This PR's run also picks up the SQL-plugin companion #5421 which defaults
empty `array()` to ARRAY<VARCHAR>. Without that companion the testMvzipWith*EmptyArray
variants would still fail — substrait would reject the input ARRAY<NULL>
type before reaching the UDF. The Rust UDF's Null-element arm exists as a
defensive backstop in case the call ever reaches it with a null-typed list.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
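The mvzip semantics the Rust UDF implements can be expressed as a short Java reference model — pairwise join up to `min(len(left), len(right))`, with null elements rendered as `""` via the `Objects.toString(elem, "")` behavior the message cites. This is a model of the semantics, not the UDF's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class MvzipModel {
    // Element-wise zip of two arrays into a list of separator-joined strings.
    public static List<String> mvzip(List<?> left, List<?> right, String sep) {
        if (left == null || right == null) return null; // null array → null result
        int n = Math.min(left.size(), right.size());    // truncate to shorter side
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(Objects.toString(left.get(i), "") + sep + Objects.toString(right.get(i), ""));
        }
        return out;
    }
}
```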
PPL `mvfind(arr, regex)` finds the 0-based index of the first array element
matching a regex pattern (Java `Matcher.find` substring-match semantics), or
NULL if no match. DataFusion has no stdlib equivalent, and rewriting in terms
of array_position requires per-element regex evaluation that's only
expressible with substrait lambda support — out of scope here. Onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context, mirroring the mvzip/convert_tz pattern.
Templated shape:
Rust side:
udf::mvfind::MvfindUdf — Signature::user_defined; coerce_types pins arg 0
to a list type and arg 1 to Utf8; invoke_with_args walks each row and
finds the first non-null element whose stringified form matches the
regex via Rust's `regex` crate (`Regex::is_match` is unanchored, same
as Java's `Matcher.find`). Scalar pattern operands compile once up
front and surface invalid-regex errors at plan time (mirrors the SQL
plugin's plan-time `tryCompileLiteralPattern`); column-valued patterns
compile per row and yield NULL for invalid patterns. Supports list
element types Utf8 / Int{8,16,32,64} / UInt{8,16,32,64} / Float{32,64}
/ Boolean / Null. 7 unit tests cover the basic-match / no-match /
null-array / empty-array / null-element / numeric-array / unanchored
shapes.
Registered on each session context via udf::register_all alongside
convert_tz and mvzip.
Java side:
ScalarFunction.MVFIND enum entry (SqlKind.OTHER_FUNCTION; resolves
through identifier-name valueOf("MVFIND") since PPL's
MVFindFunctionImpl registers under the function name "mvfind").
MvfindAdapter — locally-declared SqlFunction("mvfind") +
ADDITIONAL_SCALAR_SIGS bridge so isthmus emits a Substrait scalar
function call with the exact name the Rust UDF is registered under.
DataFusionAnalyticsBackendPlugin: STANDARD_PROJECT_OPS membership
(returns INTEGER, registered against the existing scalar
SUPPORTED_FIELD_TYPES); adapter registration in
scalarFunctionAdapters().
opensearch_array_functions.yaml: arity-2 impl returning `i32?`.
* Before: 34/60.
* After: 42/60.
Newly passing — 8 of 9 testMvfind* variants:
testMvfindWithMatch, testMvfindWithFirstMatch, testMvfindWithMultipleMatches,
testMvfindWithNoMatch, testMvfindWithEmptyArray, testMvfindWithNumericArray,
testMvfindWithCaseInsensitive, testMvfindWithComplexRegex.
Remaining mvfind failure:
testMvfindWithDynamicRegex — fails with "Unable to convert call
CONCAT(string, string)" because the test computes the pattern via
`concat('ban', '.*')` and substrait can't bind the CONCAT call. This is a
separate analytics-engine CONCAT type-conversion issue, not mvfind-specific.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
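The mvfind contract — 0-based index of the first non-null element whose stringified form contains a match (unanchored `Matcher.find` semantics), NULL when nothing matches — maps directly onto Java's regex API. A reference model of those semantics, not the Rust UDF itself:

```java
import java.util.List;
import java.util.Objects;
import java.util.regex.Pattern;

public class MvfindModel {
    // Returns the 0-based index of the first matching element, or null.
    public static Integer mvfind(List<?> arr, String regex) {
        if (arr == null) return null;
        Pattern p = Pattern.compile(regex); // literal patterns compile once up front
        for (int i = 0; i < arr.size(); i++) {
            Object elem = arr.get(i);
            // find() is unanchored substring matching, like Java's Matcher.find
            if (elem != null && p.matcher(Objects.toString(elem)).find()) return i;
        }
        return null;
    }
}
```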
… UDF
PPL `mvappend(arg1, arg2, …)` flattens a mixed list of array and scalar
arguments into one array, dropping null arguments and null elements within
array arguments. DataFusion's `array_concat` is the closest stdlib match but
only accepts arrays (not mixed array+scalar) and preserves nulls — different
semantics. Onboards as a custom Rust ScalarUDF on the analytics-backend-datafusion
plugin's session context, mirroring the mvzip / mvfind pattern.
Templated shape:
Rust side:
udf::mvappend::MvappendUdf — Signature::user_defined; per-row walk over
operands, skipping NULL args and NULL elements inside array args, with
explicit Arrow type arms for {Int8/16/32/64, UInt8/16/32/64,
Float32/64, Boolean, Utf8/LargeUtf8/Utf8View}. The string arms output
List<Utf8> or List<Utf8View> depending on the inferred element type so
the result schema matches what `return_type` declared (DataFusion's
execution-time schema check rejects mismatches). Defensive Null
element-type arm covers the empty-array shape. 6 unit tests.
Registered on each session context via udf::register_all.
Java side:
ScalarFunction.MVAPPEND enum entry (SqlKind.OTHER_FUNCTION; resolves
through identifier-name valueOf("MVAPPEND")).
MvappendAdapter — locally-declared SqlFunction("mvappend") +
ADDITIONAL_SCALAR_SIGS bridge. Casts every scalar operand to the
call's array component type and every array operand to
ARRAY<componentType> before substrait emission, so the UDF sees a
single uniform element type across all positions.
DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
(returns ARRAY<commonType>); adapter registration in
scalarFunctionAdapters().
opensearch_array_functions.yaml: variadic min:1 entry with `list<any1?>`
return type.
* Before: 0/15.
* After: 6/15.
Newly passing:
testMvappendWithMultipleElements, testMvappendWithSingleElement,
testMvappendWithArrayFlattening, testMvappendWithStringValues,
testMvappendWithNestedArrays, testMvappendWithRealFields.
* 8 tests fail with "Unable to convert the type ANY". Root cause is
PPL's MVAppendFunctionImpl.updateMostGeneralType using strict
Object.equals on each pair of operand types, returning Calcite's
ANY type when any two don't match — including when they only differ
in nullability tag (a literal 3 is INTEGER NOT NULL but the
component type of `array(1, 2)` is INTEGER NULLABLE). Substrait
can't serialize ANY. The fix belongs in the SQL plugin's
MVAppendFunctionImpl (use typeFactory.leastRestrictive instead of
Object.equals) and isn't addressed here.
* testMvappendInWhereClause — uses `where array_length(combined) = 2`
which the analytics-engine planner rejects with "No backend can
evaluate filter predicate [EQUALS] on fields [combined:ARRAY]".
Filter-side capability gap unrelated to mvappend.
* testMvappendWithComplexExpression — fails substrait conversion on
a nested mvappend call ("Unable to convert call mvappend(list, …)"),
likely the same nullability widening pattern flowing through nested
calls. Same upstream fix applies.
CalciteArrayFunctionIT unchanged at 43/60 — mvappend isn't exercised there.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
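The mvappend semantics that distinguish it from `array_concat` — mixed array and scalar arguments flattened into one array, with null arguments and null elements dropped — fit in a few lines of Java. A reference model of the semantics only:

```java
import java.util.ArrayList;
import java.util.List;

public class MvappendModel {
    // Flatten a mix of arrays and scalars; drop null args and null elements.
    public static List<Object> mvappend(Object... args) {
        List<Object> out = new ArrayList<>();
        for (Object arg : args) {
            if (arg == null) continue;                  // null argument dropped
            if (arg instanceof List<?> list) {
                for (Object elem : list) {
                    if (elem != null) out.add(elem);    // null element dropped
                }
            } else {
                out.add(arg);                           // scalar appended as one element
            }
        }
        return out;
    }
}
```

Note how this differs from end-to-end concatenation: scalars are accepted in any position and nulls never survive into the result.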
… lists; add Decimal128 element support Two follow-ons to the initial mvappend onboarding (40b2161), both surfaced once the SQL companion opensearch-project#5424 (`MVAppendFunctionImpl.leastRestrictive`) let homogeneous-type calls reach substrait conversion. # Uniform-list operand reshape Substrait's variadic-`any1` argument shape requires every operand at the same variadic position to share a type. PPL's `mvappend(arg, …)` accepts a mix of bare scalars and arrays, which substrait's signature matcher rejected with `Unable to convert call mvappend(list<i32?>, i32?, i32?)`. `MvappendAdapter` now wraps each scalar operand in a singleton `make_array(scalar)` call (using the locally-declared `MakeArrayAdapter.LOCAL_MAKE_ARRAY_OP`) so by the time the substrait converter sees the operands they're uniformly `list<componentType>`. The yaml impl was correspondingly tightened from `args: [{ value: any1 }] variadic` to `args: [{ value: list<any1?> }] variadic`. Rust UDF (`udf::mvappend`) keeps its scalar-handling branch intact as a defensive fallback, but in practice every operand it sees is a list now. # Decimal128 element type Calcite's leastRestrictive widening on INT + DECIMAL produces DECIMAL(p, s) which substrait converts to Decimal128(p, s); the Java adapter casts every operand's element type to that. The Rust UDF needed an explicit `DataType::Decimal128(p, s)` branch — Decimal128Builder requires `.with_precision_and_scale(p, s)` configuration before use, and Decimal128Array elements are read via the `i128`-valued `value(i)` accessor (not via the generic `build!` macro). # Pass-rate (CalciteMVAppendFunctionIT, force-routed, with companion opensearch-project#5424 applied) * Before this commit: 6/15 (initial mvappend onboarding). * After this commit: 10/15. 
Newly passing: testMvappendWithMixedArrayAndScalar (uniform-list reshape),
testMvappendWithComplexExpression (uniform-list reshape),
testMvappendWithIntAndDouble (Decimal128 element),
testMvappendWithNumericArrays (Decimal128 element).

Remaining 5 failures:

* testMvappendWith{MixedTypes, FieldsAndLiterals, EmptyArray, Null} — the
  call legitimately widens to ARRAY<ANY> because the operands contain type
  pairs with no common widened type (INT + VARCHAR). The Calcite engine
  handles ANY via Object generic dispatch; substrait can't encode it. Out of
  scope without changing PPL UDF semantics.
* testMvappendInWhereClause — uses `where array_length(combined) = 2`, which
  the analytics-engine planner rejects with "No backend can evaluate filter
  predicate [EQUALS] on fields [combined:ARRAY]". Filter-side capability gap
  unrelated to mvappend.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
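As an aside for reviewers, the reshape-then-flatten behavior described above is easy to model outside the engine. The following is a minimal Java sketch of the semantics only; the class and method names are hypothetical, not the actual `MvappendAdapter` or `udf::mvappend` code, and the null-filtering detail is taken from this PR's description of mvappend:

```java
import java.util.ArrayList;
import java.util.List;

public class MvappendSketch {

    // Reshape: wrap a bare scalar into a singleton list (the make_array(scalar)
    // trick), so every variadic operand is uniformly a list.
    static List<Object> asList(Object operand) {
        if (operand instanceof List) {
            return new ArrayList<>((List<?>) operand);
        }
        List<Object> single = new ArrayList<>();
        single.add(operand);
        return single;
    }

    // Flatten all (now uniform) list operands, dropping nulls, mirroring the
    // "mixed array+scalar flatten with null filtering" contract.
    static List<Object> mvappend(Object... operands) {
        List<Object> out = new ArrayList<>();
        for (Object operand : operands) {
            if (operand == null) continue;
            for (Object element : asList(operand)) {
                if (element != null) out.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // mvappend(array(1, 2), 3, 4) → [1, 2, 3, 4]
        System.out.println(mvappend(List.of(1, 2), 3, 4));
    }
}
```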
… contexts

create_session_context (the Rust-side builder behind
df_create_session_context) built a fresh DataFusion SessionContext but never
called udf::register_all on it. Every fragment query routed through
df_execute_with_context reused that handle's ctx via
query_executor::execute_with_context, so substrait function references to
mvappend / mvfind / mvzip / convert_tz failed planning with "This feature is
not implemented: Unsupported function name". The matching register_all call
exists in execute_query / local_executor / indexed_executor — this just
brings the FFM session-context path to parity.

Verified: CalciteMVAppendFunctionIT against the analytics-engine route now
passes 10/15 (was 0/15) with the SQL companion opensearch-project#5424
widening fix applied. The remaining 5 are pre-existing ARRAY<ANY>/UNKNOWN
substrait-encoding gaps (heterogeneous mvappend signatures, empty-array
default, filter-on-array predicate) tracked in this PR's "What's left"
section.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ae3fdc2 to 459bffc
`MVAppendFunctionImpl.updateMostGeneralType` used strict {@code Object.equals}
to compare each operand's component type against the running "most general"
type, falling back to Calcite's {@code ANY} on any mismatch. That's too
aggressive: {@code Object.equals} returns false for type pairs that differ
only in nullability tag (e.g. {@code array(1, 2)} synthesizes INTEGER NULLABLE
for its component while literal {@code 3} is INTEGER NOT NULL), and for
straightforwardly-widenable numerics like INTEGER + DOUBLE. The PPL UDF result
would then be {@code ARRAY<ANY>}.
The Calcite engine's enumerable runtime tolerates {@code ANY} because
{@code MVAppendImplementor.eval} processes elements through {@code Object} —
the declared element type is unused at execution time. The analytics-engine
route is stricter: substrait can't serialize {@code ANY}, so isthmus throws
{@code UnsupportedOperationException: Unable to convert the type ANY} during
the substrait conversion phase.
Widen with {@link RelDataTypeFactory#leastRestrictive} — the same routine
{@code SqlLibraryOperators.ARRAY} uses for its return-type inference. Falls
back to ANY only when {@code leastRestrictive} returns null (genuinely
incompatible operand types like INT + VARCHAR), preserving the original
behavior on those queries.
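The widening policy above can be illustrated with a toy model. This is not Calcite's `RelDataTypeFactory#leastRestrictive` (which operates on full `RelDataType`s, including nullability and precision); it is a hypothetical miniature lattice that shows only the decision shape — widen when a common type exists, fall back to ANY only when it does not — and it folds the companion's DECIMAL → DOUBLE promotion into the numeric case:

```java
import java.util.List;

public class WideningSketch {
    enum T { INTEGER, DOUBLE, DECIMAL, VARCHAR, ANY }

    // Toy stand-in for a pairwise leastRestrictive: equal types (including the
    // nullability-only mismatches, which collapse to equality here) pass
    // through; two numerics widen (DECIMAL folded into DOUBLE, an assumption
    // mirroring the companion's DECIMAL → DOUBLE promotion); anything else has
    // no common type, signalled by null.
    static T leastRestrictive(T a, T b) {
        if (a == b) return a;
        boolean aNum = a == T.INTEGER || a == T.DOUBLE || a == T.DECIMAL;
        boolean bNum = b == T.INTEGER || b == T.DOUBLE || b == T.DECIMAL;
        if (aNum && bNum) return T.DOUBLE;
        return null; // genuinely incompatible, e.g. INT + VARCHAR
    }

    // The fixed updateMostGeneralType policy: widen across all component
    // types, and use ANY only when widening is impossible.
    static T mostGeneral(List<T> componentTypes) {
        T acc = componentTypes.get(0);
        for (T next : componentTypes.subList(1, componentTypes.size())) {
            T widened = leastRestrictive(acc, next);
            if (widened == null) return T.ANY; // preserve the original fallback
            acc = widened;
        }
        return acc;
    }
}
```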
# Test plan
* {@code :core:test --tests "*MVAppend*"} — passes (no existing test asserted
on the {@code ANY} fallback).
* Companion to opensearch-project/OpenSearch#21554 — unblocks 8+ tests in
{@code CalciteMVAppendFunctionIT} force-routed through the analytics-engine
path that previously failed with "Unable to convert the type ANY".
Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ill the cluster

Substrait's plan validators (VariadicParameterConsistencyValidator,
RelOptUtil.eq via Litmus.THROW, etc.) throw AssertionError directly via
explicit `throw new AssertionError(...)` rather than via the `assert`
keyword, so the JVM -da flag doesn't gate them. When a malformed plan
triggers one inside a search-thread call to SubstraitRelVisitor.apply, the
AssertionError propagates uncaught up the analytics-engine fragment handler
stack, OpenSearchUncaughtExceptionHandler classifies it as fatal, and the
entire cluster JVM exits.

Wrap the visitor.apply call in a narrow try/catch that re-raises the
AssertionError as IllegalStateException with the original message and cause
preserved. The analytics-engine error path already buckets
IllegalStateException at the fragment boundary into a normal HTTP 500
response — the cluster stays up and the failure shows in the per-query
report instead.

This came up while diagnosing CalciteMVAppendFunctionIT failures: malformed
ARRAY<ANY> plans were taking down the cluster mid-test instead of producing
per-test failures, masking the underlying substrait conversion error.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
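The wrapping pattern itself is small. A self-contained sketch under assumed names (the real call site is `SubstraitRelVisitor.apply` inside the fragment handler; `convertOrFail` is a hypothetical helper for illustration):

```java
import java.util.function.Supplier;

public class AssertionBoundary {

    // Narrow boundary around a plan-conversion call. Explicit
    // `throw new AssertionError(...)` inside a library is not gated by the
    // JVM's -da flag, so translate it into an exception type the error path
    // already buckets into a normal per-query failure.
    static <T> T convertOrFail(Supplier<T> visitorApply) {
        try {
            return visitorApply.get();
        } catch (AssertionError e) {
            // Preserve both the original message and the cause chain.
            throw new IllegalStateException(
                "substrait conversion failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            convertOrFail(() -> {
                throw new AssertionError("Inconsistent variadic types");
            });
        } catch (IllegalStateException e) {
            // A normal, catchable exception: the uncaught-exception handler
            // never sees an AssertionError, so the JVM stays up.
            System.out.println(e.getMessage());
        }
    }
}
```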
459bffc to 1a0f571
|
❌ Gradle check result for 1a0f571: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…ST path

Self-contained QA ITs in sandbox/qa/analytics-engine-rest exercising the PPL
collection functions onboarded in this PR through POST /_analytics/ppl
against a parquet-backed `calcs` dataset, no SQL plugin checkout required.

ArrayFunctionIT (22 tests):

- array constructor (mixed-numeric BigDecimal → Double promotion + int+string)
- array_length
- mvindex range (array_slice — 0-based (start, length) → 1-based (start, end))
- mvindex single (array_element via ITEM rename)
- mvdedup (array_distinct)
- mvjoin (array_to_string rename)
- mvzip (Rust UDF, default + custom delimiter + nested)
- mvfind (Rust UDF, match / no-match / dynamic regex via concat() Sig bridge)
- split (returns array)

MVAppendFunctionIT (6 tests):

- uniform-typed scalar variadic (multiple, single, string)
- array operands (flattening, nested string arrays)
- VARCHAR field references via real calcs row

Tests gated on SQL companion opensearch-project#5424
(testMvappendWith{IntAndDouble, MixedArrayAndScalar, NumericArrays,
ComplexExpression}) are intentionally absent — they fail with "Unable to
convert the type ANY" until MVAppendFunctionImpl's leastRestrictive widening
+ DECIMAL→DOUBLE promotion + operand pre-cast is published as
unified-query-core. A top-of-class block lists them with a pointer back to
opensearch-project#5424.

Lambda-based functions (transform, mvmap, reduce, forall, exists, filter)
and empty-array operands are absent for the architectural reasons in this
PR's "What's left" section: substrait extension YAML doesn't support
declaring func<…> lambda-typed args, and array() defaults to ARRAY<UNKNOWN>,
which substrait can't encode without #5421.

Local verification (per `docs/dev/ppl-analytics-engine-routing.md` SOP):

- :sandbox:qa:analytics-engine-rest:integTest --tests "*ArrayFunctionIT" — 22/22
- :sandbox:qa:analytics-engine-rest:integTest --tests "*MVAppendFunctionIT" — 6/6
- :check -p sandbox — all 718 tasks green

Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
❌ Gradle check result for cf8ee09: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❕ Gradle check result for cf8ee09: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff              @@
## main #21554 +/- ##
============================================
- Coverage 73.48% 73.38% -0.11%
+ Complexity 74646 74544 -102
============================================
Files 5980 5980
Lines 338777 338777
Branches 48848 48848
============================================
- Hits 248964 248608 -356
- Misses 70026 70362 +336
- Partials    19787    19807      +20

☔ View full report in Codecov by Sentry.
|
Description
Onboards eight PPL collection functions to the analytics-engine route, plus the row-codec / decimal handling fixes, the `concat()` Sig bridge, and a Rust-side fix that's load-bearing for any of these UDFs to actually work end-to-end:

| PPL function | Calcite operator | DataFusion function | Adapter |
| --- | --- | --- | --- |
| `array(a, b, …)` | `PPLBuiltinOperators.ARRAY` | `make_array` | `MakeArrayAdapter` (rename + variadic operand widening) |
| `array_length(arr)` | `SqlLibraryOperators.ARRAY_LENGTH` | `array_length` | — |
| `mvindex(arr, from, to)` (range) | `SqlLibraryOperators.ARRAY_SLICE` | `array_slice` | `ArraySliceAdapter` (BIGINT index coerce + 0-based `(start, length)` → 1-based `(start, end)`) |
| `mvdedup` | `SqlLibraryOperators.ARRAY_DISTINCT` | `array_distinct` | — |
| `mvjoin(arr, sep)` | `SqlLibraryOperators.ARRAY_JOIN` | `array_to_string` | `ArrayToStringAdapter` (rename) |
| `mvindex(arr, N)` (single-element) | `SqlStdOperatorTable.ITEM` | `array_element` | `ArrayElementAdapter` (rename + BIGINT index coerce) |
| `mvzip(left, right [, sep])` | `PPLBuiltinOperators.MVZIP` (UDF) | `udf::mvzip` | `MvzipAdapter` |
| `mvfind(arr, regex)` | `PPLBuiltinOperators.MVFIND` (UDF) | `udf::mvfind` | `MvfindAdapter` |
| `mvappend(arg1, arg2, …)` | `PPLBuiltinOperators.MVAPPEND` (UDF) | `udf::mvappend` | `MvappendAdapter` (uniform-list reshape) |

`ARRAY_RETURNING_PROJECT_OPS` is a separate set from `STANDARD_PROJECT_OPS` so `FieldType.ARRAY` doesn't pollute filter / aggregate capabilities.

Critical Rust-side fix (`session_context.rs`)

`create_session_context` (the Rust-side builder behind `df_create_session_context`) built a fresh DataFusion `SessionContext` but never called `udf::register_all` on it. Every fragment query routed through `df_execute_with_context` reused that handle's `ctx` via `query_executor::execute_with_context`, so substrait function references to `mvappend` / `mvfind` / `mvzip` / `convert_tz` failed planning with `This feature is not implemented: Unsupported function name`. The matching `register_all` call exists on `execute_query` / `local_executor` / `indexed_executor` — this just brings the FFM session-context path to parity. Without this, none of the custom UDFs in this PR actually worked at runtime.

Custom Rust UDFs
DataFusion has no stdlib equivalent for `mvzip` (element-wise zip into strings), `mvfind` (regex match → first index), or `mvappend` (mixed array+scalar flatten with null filtering). All three onboard as Rust ScalarUDFs on the analytics-backend-datafusion plugin's session context, mirroring the existing `convert_tz` pattern. Each has unit tests (mvzip 7, mvfind 7, mvappend 6).
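For readers unfamiliar with mvzip: its element-wise semantics (pairing the i-th elements of two arrays into delimited strings) can be sketched in a few lines. This is a Java model of the behavior only, not the Rust `udf::mvzip` implementation; the stop-at-shorter-input and name choices here are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class MvzipSketch {

    // Element-wise zip of two string arrays into "left<delim>right" strings.
    // Assumption: pairing stops at the shorter input; the actual Rust UDF's
    // edge-case handling (nulls, length mismatch) may differ.
    static List<String> mvzip(List<String> left, List<String> right, String delim) {
        List<String> out = new ArrayList<>();
        int n = Math.min(left.size(), right.size());
        for (int i = 0; i < n; i++) {
            out.add(left.get(i) + delim + right.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Matches the PR's example: mvzip(['a','b'], ['1','2']) → ['a,1', 'b,2']
        System.out.println(mvzip(List.of("a", "b"), List.of("1", "2"), ","));
    }
}
```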
Substrait extension catalog

New `opensearch_array_functions.yaml` declares `make_array`, `array_length`, `array_slice`, `array_distinct`, `array_to_string`, `array_element`, `mvzip`, `mvfind`, `mvappend`. Loaded via `SimpleExtension.load("/opensearch_array_functions.yaml")` and merged in `DataFusionPlugin.loadSubstraitExtensions()`.

Operand-normalization adapters (5)
- `MakeArrayAdapter` — implements `ScalarFunctionAdapter` directly. PPL's `ArrayFunctionImpl` infers `ARRAY<commonElementType>` for the call's return type but does NOT widen individual operand types. So `array(1, 1.5)` produces a RexCall whose operands are `(INTEGER, DECIMAL(2,1))` but whose return type is `ARRAY<DOUBLE>`. Substrait's variadic-`any1` consistency validator throws a fatal `AssertionError` in that case. The adapter extracts the call's component type and CASTs each non-matching operand.
- `ArrayToStringAdapter` — declares a local `array_to_string` op and name-maps `SqlLibraryOperators.ARRAY_JOIN` → it.
- `ArraySliceAdapter` — two transforms: BIGINT index coercion and `(0-based start, length)` → `(1-based start, 1-based end inclusive)`. Without the latter, `mvindex(arr=[1..5], 1, 3)` returns `[1, 2, 3]` instead of the expected `[2, 3, 4]`.
- `ArrayElementAdapter` — renames `SqlStdOperatorTable.ITEM` → `array_element` and coerces the index operand to BIGINT.
- `MvappendAdapter` — wraps each scalar operand in a singleton `make_array(scalar)` call so substrait's variadic-`any1` sees a uniform `list<componentType>` shape across all variadic positions.
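The ArraySliceAdapter index arithmetic is small but easy to get backwards, so here it is in isolation. A hedged sketch (hypothetical helper name, not the adapter's actual code) of the documented `(0-based start, length)` → `(1-based start, inclusive end)` conversion:

```java
public class SliceIndexSketch {

    // Convert a 0-based (start, length) window into the 1-based inclusive
    // (start, end) pair that DataFusion's array_slice expects.
    static long[] toOneBasedInclusive(long start0, long length) {
        long start1 = start0 + 1;    // 0-based index → 1-based index
        long end1 = start0 + length; // inclusive end of the same window
        return new long[] { start1, end1 };
    }

    public static void main(String[] args) {
        // arr = [1, 2, 3, 4, 5]; window starting at 0-based index 1, length 3:
        // 1-based bounds (2, 4) select elements [2, 3, 4], matching the
        // expected mvindex result in the PR description.
        long[] bounds = toOneBasedInclusive(1, 3);
        System.out.println(bounds[0] + ".." + bounds[1]);
    }
}
```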
Sig bridges

`SqlLibraryOperators.CONCAT_FUNCTION` → `concat` — PPL's `concat()` is a function-form CONCAT (operator name `"CONCAT"`), distinct from `||` (`SqlStdOperatorTable.CONCAT`), which isthmus' default Sig table already binds to substrait `concat`. Without this bridge, `concat(a, b)` flowing into any analytics-engine query (e.g. `mvfind(arr, concat('ban', '.*'))`) fails substrait conversion with `Unable to convert call CONCAT(string, string)`.

RowResponseCodec list support + decimal handling

The row-oriented fragment-execution wire format ships each cell through OpenSearch's `writeGenericValue` / `readGenericValue` (preserves `List` values as `ArrayList<Object>`), then re-materializes them into a `VectorSchemaRoot` on the coordinator. Three bugs ate array values:

- `inferArrowType` walked rows for the first non-null cell and matched against `{Long, Integer, …, CharSequence, byte[], Number}`. `List` wasn't in the chain → fell through to `Utf8` — every array column became VARCHAR.
- `setVectorValue` for `VarCharVector` called `value.toString()`. For a `JsonStringArrayList` that returns the JSON form `[2,3,4]`, which then serialized as a JSON string in the final response.
- The `Number` fallback in `scalarArrowType` matched `BigDecimal` (extends `Number`) before `Double` / `Float` — encoding decimal cells as `Int(64)` and silently truncating fractional digits.

Fixes:

- `inferArrowType` → `inferField` (returns a full `Field`, builds `List<inner>` for list cells).
- A `ListVector` arm in `setVectorValue` that writes directly to the list's offset / validity buffers and the inner data vector — bypassing `UnionListWriter`'s tricky per-element `ArrowBuf` lifecycle.
- `BigDecimal` → `FloatingPoint(DOUBLE)` before the `Number` fallback — fixes `testArray` mixed-numeric `array(1, -1.5, 2, 1.0)` returning `[1, -1, 2, 1]`.
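The ordering pitfall in the third bug generalizes: in an `instanceof` chain, a subtype must be tested before its supertype. A hypothetical miniature of the corrected decision order (illustrative names, not `RowResponseCodec`'s actual code):

```java
import java.math.BigDecimal;
import java.util.List;

public class ScalarTypeSketch {
    enum ArrowKind { INT64, FLOAT64, UTF8, LIST }

    // Order matters twice here: List cells must be recognized before the
    // generic fallthrough to UTF8, and BigDecimal (which extends Number) must
    // be matched as floating point before the generic Number → INT64 branch,
    // or fractional digits are silently truncated.
    static ArrowKind infer(Object cell) {
        if (cell instanceof List) return ArrowKind.LIST;
        if (cell instanceof BigDecimal || cell instanceof Double || cell instanceof Float) {
            return ArrowKind.FLOAT64;
        }
        if (cell instanceof Number) return ArrowKind.INT64;
        return ArrowKind.UTF8;
    }

    public static void main(String[] args) {
        // With the Number branch first, a BigDecimal cell would come back INT64.
        System.out.println(infer(new BigDecimal("1.5")));
    }
}
```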
Plus a related Arrow gotcha: `ListVector.getObject` for a `VarCharVector` child returns elements typed as `org.apache.arrow.vector.util.Text` (not `String`). `ExprValueUtils.fromObjectValue` rejected those as "unsupported object class". `ArrowValues.toJavaValue` now normalizes `Text → String` for list cells.
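The normalization itself amounts to a per-element pass over the list cell. A sketch under assumptions: `ListCellNormalizer` is a hypothetical name, and since Arrow's `Text` is a `CharSequence`, a `StringBuilder` stands in for it below so the example stays self-contained:

```java
import java.util.ArrayList;
import java.util.List;

public class ListCellNormalizer {

    // Arrow's Text is a CharSequence but not a String, and downstream value
    // conversion only accepts String. Normalize any non-String CharSequence
    // element (the stand-in here for org.apache.arrow.vector.util.Text) via
    // toString(); everything else passes through unchanged.
    static List<Object> normalize(List<?> cell) {
        List<Object> out = new ArrayList<>(cell.size());
        for (Object element : cell) {
            if (element instanceof CharSequence && !(element instanceof String)) {
                out.add(element.toString());
            } else {
                out.add(element);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A Text-like element and a plain String both come out as String.
        System.out.println(normalize(List.of(new StringBuilder("a"), "b")));
    }
}
```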
Other plumbing

- `ArrowSchemaFromCalcite.toArrowField` — recurses into the component type for ARRAY → builds the matching `List<inner>` field.
- `commons-text` added to analytics-engine — Calcite's `SqlFunctions.<clinit>` references `LevenshteinDistance`.
- `jackson-datatype-jsr310` added to `arrow-flight-rpc` — arrow-vector's `JsonStringArrayList` eagerly registers `JavaTimeModule` on its ObjectMapper in `<clinit>`.
- `regex` 1.10 Cargo dep on the analytics-backend-datafusion crate — backs the `udf::mvfind` per-row regex match.

Current Status (empirically verified, post-rebase)
Tested per the Mustang + SQL plugin SOP — `:integ-test:integTestRemote` against an externally-managed cluster with the full sandbox plugin set (opensearch-job-scheduler, arrow-flight-rpc, analytics-engine, parquet-data-format, composite-engine, analytics-backend-lucene, analytics-backend-datafusion, opensearch-sql-plugin), `tests.analytics.parquet_indices=true`, `tests.analytics.force_routing=true`, with companion #5421 (merged) + #5424 applied on top of `feature/ppl-coverage-bundle`.

CalciteMVAppendFunctionIT (force-routed)

Newly passing: `testMvappendWith{MultipleElements, SingleElement, ArrayFlattening, MixedArrayAndScalar, StringValues, NestedArrays, RealFields, NumericArrays, ComplexExpression, IntAndDouble}`.

CalciteArrayFunctionIT (force-routed)

Newly passing across this PR's commits:

- `testArray*` (4) — array constructors incl. mixed-numeric decimals
- `testMvjoinWith*` (8) — mvjoin over arrays + empty array
- `testMvindex*` (9) — mvindex single-element + range, with semantic + JSON formatting fixes
- `testMvdedupWith*` (5)
- `testSplitWith*` (3)
- `testMvzip{Basic, WithCustomDelimiter, Nested}` (3)
- `testMvfindWith*` (9) — all 9 mvfind variants

What's left (follow-ups)
The 22 remaining failures (5 in MVAppend + 17 in ArrayFunction) split into clear buckets — none are regressions, all are pre-existing limits or out-of-scope work:

- Lambda-based functions (`transform`, `mvmap`, `reduce`, `forall`, `exists`, `filter`) — `CalciteArrayFunctionIT` (`testTransform*`, `testMvmap*`, `testReduce*`, `testForAll`, `testExists`, `testFilter`). Substrait extension YAML doesn't support declaring `func<…>` lambda-typed arguments. Confirmed: capability registration alone gets past the planner, isthmus emits `Expression.Lambda` correctly, but no Sig binding exists for `transform(list<…>, func<… -> …>)`. Multi-PR architectural work — needs either a substrait-spec change or per-shape higher-order Rust UDFs. Track separately.
- `mvappend` (no common element type) — `CalciteMVAppendFunctionIT`: `testMvappendWith{MixedTypes, FieldsAndLiterals, Null}`. The call widens to `ARRAY<ANY>` when `leastRestrictive` returns null (INT+VARCHAR, etc.). Substrait can't encode `ANY`; Arrow's Union arrays exist but `datafusion-functions-array` doesn't operate on them. Resolving requires either changing PPL's mvappend contract from heterogeneous `Object[]` to a uniform stringified type (breaks user expectations) or shipping Arrow Union arrays through the wire format (multi-PR). Document as known limit.
- Empty `array()` operand carried via column ref (`testMvappendWithEmptyArray`); also surfaces as 4 in ArrayFunctionIT (`testMvjoinWithEmptyArray`, `testMvdedupWithEmptyArray`, `testMvzip{WithEmptyArray, WithBothEmptyArrays}`, `testMvfindWithEmptyArray`) — all those 4 actually now pass with #5421 merged on the bundle branch. With `eval empty_arr = array(), result = mvappend(empty_arr, 1, 2)`, at MVAppend's type-inference site we see a `RexInputRef` to a column whose declared type is `ARRAY<VARCHAR>` (#5421 default), not the literal `array()`. Per-call detection at type-inference can't reach back through the project chain. Fixing on the analytics-engine route would need either data-flow analysis or changing `array()`'s element-type default to `NULL`/`UNKNOWN` (which substrait can't serialize either). Document.
- Filter on an ARRAY field (`testMvappendInWhereClause`) — `OpenSearchFilterRule.resolveViableBackends` extracts only the predicate's TOP operator (EQUALS) and checks it against the field's storage type (ARRAY); it doesn't walk into nested calls, so `array_length(arr) = 2` is rejected as "EQUALS on ARRAY field". Adding `ARRAY_LENGTH` to `STANDARD_FILTER_OPS` doesn't help — the rule's coarse-grained check is the actual blocker. Tracked separately as a planner refactor; not specific to mvappend.
- `testMvdedupPreservesOrder` — `array_distinct` likely doesn't preserve insertion order. Needs a custom Rust UDF or upstream contribution. Track separately.
- QA ITs in `sandbox/qa/analytics-engine-rest`.

Companion changes
Default empty array() return type to ARRAY&lt;VARCHAR&gt; sql#5421 (merged) — defaults the empty `array()` element type to `ARRAY<VARCHAR>` in PPL's `ArrayFunctionImpl` so the empty-call return type is substrait-serializable. Required for `testMvjoinWithEmptyArray`, `testMvdedupWithEmptyArray`, and the empty-array mvzip / mvfind variants to pass on the analytics-engine route.

Use leastRestrictive for mvappend element-type widening sql#5424 — `MVAppendFunctionImpl.updateMostGeneralType` widens via `RelDataTypeFactory.leastRestrictive` (with DECIMAL → DOUBLE promotion) instead of strict `Object.equals`, plus pre-casts each scalar operand to the call's element Java class in `MVAppendImplementor` so Avatica's `AbstractCursor.ArrayAccessor.DoubleAccessor.getDouble` (`(Double) value`) succeeds. Required for `testMvappendWith{MixedArrayAndScalar, ComplexExpression, IntAndDouble}` to pass on the analytics-engine route.

Test plan
- `./gradlew publishToMavenLocal -Dsandbox.enabled=true`.
- `./gradlew :sandbox:plugins:analytics-backend-datafusion:test -Dsandbox.enabled=true` — green after rebase.
- `cargo test --lib udf::mvzip` — 7/7.
- `cargo test --lib udf::mvfind` — 7/7.
- `cargo test --lib udf::mvappend` — 6/6.
- `-Dtests.jvm.argline="-Djava.library.path=… -Dopensearch.experimental.feature.pluggable.dataformat.enabled=true"`. Verified all 8 plugins picked up the rebuilt jars and the rebuilt native dylib.
- `:integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteMVAppendFunctionIT"` against the running cluster with `tests.analytics.{parquet_indices,force_routing}=true` — 10/15 passing (was 0/15).
- `:integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteArrayFunctionIT"` — 43/60 passing (was 1/60).
- `mvindex(arr=[1..5], 1, 3)` produces `[2,3,4]` (was `[1,2,3]`); negative `mvindex(arr, -3, -1)` produces `[3,4,5]`.
- `mvjoin(array(), '-')` returns `""` and `mvdedup(array())` returns `[]` (companion #5421 default).
- `mvzip(['a','b'], ['1','2'])` returns `['a,1', 'b,2']`; `mvfind(['apple','banana','apricot'], 'ban.*')` returns `1`.
- `array(1, -1.5, 2, 1.0)` returns `[1.0, -1.5, 2.0, 1.0]` (was `[1, -1, 2, 1]`).
- `mvfind(arr, concat('ban', '.*'))` returns `1` (was failing on `Unable to convert call CONCAT`).
- `mvappend(arr=array(1,2), 3, 4)` returns `[1, 2, 3, 4]` (was failing with `Unable to convert call mvappend(list<i32?>, i32?, i32?)` until the uniform-list reshape).
- `mvappend(1, 2.5)` returns `[1, 2.5]` end-to-end on the analytics-engine route (with #5424 widening + operand pre-cast).
- QA ITs in `sandbox/qa/analytics-engine-rest` — follow-up #6 above.