
[Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route #21554

Merged

mch2 merged 13 commits into opensearch-project:main from ahkcs:feature/mustang-array-functions on May 9, 2026

Conversation

@ahkcs (Contributor) commented May 7, 2026

Description

Onboards the PPL array constructor and eight multivalue (mv) collection functions to the analytics-engine route, plus row-codec / decimal-handling fixes, the concat() Sig bridge, and a Rust-side fix that is load-bearing for any of these UDFs to work end-to-end:

| Function | Calcite operator | Substrait → DataFusion | Adapter / Rust UDF |
| --- | --- | --- | --- |
| array(a, b, …) | PPLBuiltinOperators.ARRAY | make_array | MakeArrayAdapter (rename + variadic operand widening) |
| array_length(arr) | SqlLibraryOperators.ARRAY_LENGTH | array_length | none |
| array_slice / mvindex(arr, from, to) | SqlLibraryOperators.ARRAY_SLICE | array_slice | ArraySliceAdapter (BIGINT index coerce + 0-based-(start, length) → 1-based-(start, end)) |
| mvdedup | SqlLibraryOperators.ARRAY_DISTINCT | array_distinct | none |
| mvjoin(arr, sep) | SqlLibraryOperators.ARRAY_JOIN | array_to_string | ArrayToStringAdapter (rename) |
| mvindex(arr, N) single-element | SqlStdOperatorTable.ITEM | array_element | ArrayElementAdapter (rename + BIGINT index coerce) |
| mvzip(left, right [, sep]) | PPLBuiltinOperators.MVZIP (UDF) | custom Rust UDF udf::mvzip | MvzipAdapter |
| mvfind(arr, regex) | PPLBuiltinOperators.MVFIND (UDF) | custom Rust UDF udf::mvfind | MvfindAdapter |
| mvappend(arg1, arg2, …) | PPLBuiltinOperators.MVAPPEND (UDF) | custom Rust UDF udf::mvappend | MvappendAdapter (uniform-list reshape) |

ARRAY_RETURNING_PROJECT_OPS is a separate set from STANDARD_PROJECT_OPS so FieldType.ARRAY doesn't pollute filter / aggregate capabilities.

Critical Rust-side fix (session_context.rs)

create_session_context (the Rust-side builder behind df_create_session_context) built a fresh DataFusion SessionContext but never called udf::register_all on it. Every fragment query routed through df_execute_with_context reused that handle's ctx via query_executor::execute_with_context, so substrait function references to mvappend / mvfind / mvzip / convert_tz failed planning with "This feature is not implemented: Unsupported function name". The matching register_all call already exists on execute_query / local_executor / indexed_executor — this just brings the FFM session-context path to parity. Without it, none of the custom UDFs in this PR worked at runtime.
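
A minimal sketch of the parity fix, assuming the crate's existing udf::register_all helper takes a borrowed SessionContext (the udf module is this repo's own, not DataFusion stdlib):

```rust
use datafusion::prelude::SessionContext;

// Sketch only — mirrors the fix described above, not the exact source.
pub fn create_session_context() -> SessionContext {
    let ctx = SessionContext::new();
    // Previously missing on the FFM path: without this call, substrait
    // references to mvzip / mvfind / mvappend / convert_tz fail planning
    // on any query routed through df_execute_with_context.
    crate::udf::register_all(&ctx);
    ctx
}
```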

Custom Rust UDFs

DataFusion has no stdlib equivalent for mvzip (element-wise zip into strings), mvfind (regex match → first index), or mvappend (mixed array+scalar flatten with null filtering). All three onboard as Rust ScalarUDFs on the analytics-backend-datafusion plugin's session context, mirroring the existing convert_tz pattern. Each has unit tests (mvzip 7, mvfind 7, mvappend 6).
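
A per-row sketch of the three UDFs' semantics in plain Rust over already-materialized values (illustrative only — the real implementations are DataFusion ScalarUDFs over Arrow arrays, and mvzip_row / mvfind_row / mvappend_row are hypothetical names):

```rust
use regex::Regex;

// mvzip: pair up to min(len(left), len(right)) elements; null elements
// stringify as "" (mirroring Java's Objects.toString(elem, "")).
fn mvzip_row(left: &[Option<String>], right: &[Option<String>], sep: &str) -> Vec<String> {
    left.iter()
        .zip(right.iter()) // zip() truncates to the shorter side
        .map(|(l, r)| format!("{}{}{}", l.as_deref().unwrap_or(""), sep, r.as_deref().unwrap_or("")))
        .collect()
}

// mvfind: 0-based index of the first non-null element whose stringified form
// matches the regex (Regex::is_match is unanchored, like Java's Matcher.find).
fn mvfind_row(arr: &[Option<String>], pattern: &Regex) -> Option<i32> {
    arr.iter()
        .position(|e| e.as_deref().is_some_and(|s| pattern.is_match(s)))
        .map(|i| i as i32)
}

// mvappend: flatten the operands (uniformly list-shaped after the adapter's
// reshape), dropping null arguments and null elements inside array arguments.
fn mvappend_row(args: &[Vec<Option<String>>]) -> Vec<String> {
    args.iter().flatten().flatten().cloned().collect()
}
```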

Substrait extension catalog

New opensearch_array_functions.yaml declares make_array, array_length, array_slice, array_distinct, array_to_string, array_element, mvzip, mvfind, mvappend. Loaded via SimpleExtension.load("/opensearch_array_functions.yaml") and merged in DataFusionPlugin.loadSubstraitExtensions().

Operand-normalization adapters (5)

  • MakeArrayAdapter — implements ScalarFunctionAdapter directly. PPL's ArrayFunctionImpl infers ARRAY<commonElementType> for the call's return type but does NOT widen individual operand types. So array(1, 1.5) produces a RexCall whose operands are (INTEGER, DECIMAL(2,1)) but whose return type is ARRAY<DOUBLE>. Substrait's variadic-any1 consistency validator throws a fatal AssertionError in that case. The adapter extracts the call's component type and CASTs each non-matching operand.
  • ArrayToStringAdapter — declares a local array_to_string op and name-maps SqlLibraryOperators.ARRAY_JOIN → it.
  • ArraySliceAdapter — two transforms: BIGINT index coercion and (0-based start, length) → (1-based start, 1-based end inclusive); see the sketch after this list. Without the latter, mvindex(arr=[1..5], 1, 3) returns [1, 2, 3] instead of the expected [2, 3, 4].
  • ArrayElementAdapter — renames SqlStdOperatorTable.ITEM → array_element and coerces the index operand to BIGINT.
  • MvappendAdapter — wraps each scalar operand in a singleton make_array(scalar) call so substrait's variadic-any1 sees a uniform list<componentType> shape across all variadic positions.
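
The ArraySliceAdapter's index rewrite, reduced to its arithmetic (a Rust sketch for concreteness — the actual adapter rewrites Calcite RexCall operands in Java; to_datafusion_slice is a hypothetical name):

```rust
// PPL/Calcite ARRAY_SLICE carries (0-based start, length); DataFusion's
// array_slice wants (1-based start, inclusive 1-based end).
fn to_datafusion_slice(start0: i64, length: i64) -> (i64, i64) {
    let start1 = start0 + 1;    // 0-based → 1-based
    let end1 = start0 + length; // == start1 + (length - 1), inclusive
    (start1, end1)
}

fn main() {
    // mvindex(arr=[1..5], 1, 3) lowers to ARRAY_SLICE(arr, 1, 3):
    // start at 0-based position 1, take 3 elements → expected [2, 3, 4].
    assert_eq!(to_datafusion_slice(1, 3), (2, 4)); // array_slice(arr, 2, 4)
}
```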

Sig bridges

  • SqlLibraryOperators.CONCAT_FUNCTION → concat — PPL's concat() is a function-form CONCAT (operator name "CONCAT"), distinct from || (SqlStdOperatorTable.CONCAT), which isthmus' default Sig table already binds to substrait concat. Without this bridge, concat(a, b) flowing into any analytics-engine query (e.g. mvfind(arr, concat('ban', '.*'))) fails substrait conversion with "Unable to convert call CONCAT(string, string)".

RowResponseCodec list support + decimal handling

The row-oriented fragment-execution wire format ships each cell through OpenSearch's writeGenericValue / readGenericValue (preserves List values as ArrayList<Object>), then re-materializes them into a VectorSchemaRoot on the coordinator. Three bugs ate array values:

  1. inferArrowType walked rows for the first non-null cell and matched against {Long, Integer, …, CharSequence, byte[], Number}. List wasn't in the chain → fell through to Utf8 — every array column became VARCHAR.
  2. setVectorValue for VarCharVector called value.toString(). For a JsonStringArrayList that returns the JSON form [2,3,4], which then serialized as a JSON string in the final response.
  3. The Number fallback in scalarArrowType matched BigDecimal (extends Number) before Double/Float — encoding decimal cells as Int(64) and silently truncating fractional digits (see the sketch after this list).
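
Why bug 3 silently corrupts decimals: forcing a fractional cell through a 64-bit integer column truncates toward zero. A plain-Rust illustration of the arithmetic (not the codec itself):

```rust
fn main() {
    let cell = -1.5_f64;        // a BigDecimal-valued cell in the row codec
    let as_int64 = cell as i64; // what the Number → Int(64) fallback produced
    assert_eq!(as_int64, -1);   // array(1, -1.5, 2, 1.0) read back [1, -1, 2, 1]
    // Encoding as FloatingPoint(DOUBLE) instead preserves -1.5 — see the fix below.
}
```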

Fixes:

  • Replace inferArrowType with inferField (returns a full Field, builds List<inner> for list cells).
  • Add a ListVector arm to setVectorValue that writes directly to the list's offset / validity buffers and the inner data vector — bypassing UnionListWriter's tricky per-element ArrowBuf lifecycle.
  • Promote BigDecimal to FloatingPoint(DOUBLE) before the Number fallback — fixes testArray mixed-numeric array(1, -1.5, 2, 1.0) returning [1, -1, 2, 1].

Plus a related Arrow gotcha: ListVector.getObject for a VarCharVector child returns elements typed as org.apache.arrow.vector.util.Text (not String). ExprValueUtils.fromObjectValue rejected those as "unsupported object class". ArrowValues.toJavaValue now normalizes Text → String for list cells.

Other plumbing

  • ArrowSchemaFromCalcite.toArrowField — recurses into the component type for ARRAY → builds the matching List<inner> field.
  • commons-text to analytics-engine — Calcite's SqlFunctions.<clinit> references LevenshteinDistance.
  • jackson-datatype-jsr310 to arrow-flight-rpc — arrow-vector's JsonStringArrayList eagerly registers JavaTimeModule on its ObjectMapper in <clinit>.
  • regex 1.10 Cargo dep on the analytics-backend-datafusion crate — backs the udf::mvfind per-row regex match.

Current Status (empirically verified, post-rebase)

Tested per the Mustang + SQL plugin SOP: :integ-test:integTestRemote against an externally-managed cluster with the full sandbox plugin set (opensearch-job-scheduler, arrow-flight-rpc, analytics-engine, parquet-data-format, composite-engine, analytics-backend-lucene, analytics-backend-datafusion, opensearch-sql-plugin), tests.analytics.parquet_indices=true, tests.analytics.force_routing=true, with companion #5421 (merged) + #5424 applied on top of feature/ppl-coverage-bundle.

CalciteMVAppendFunctionIT (force-routed)

| Status | Before | After |
| --- | --- | --- |
| Passed | 0/15 | 10/15 |
| Failed | 15/15 | 5/15 |

Newly passing: testMvappendWith{MultipleElements, SingleElement, ArrayFlattening, MixedArrayAndScalar, StringValues, NestedArrays, RealFields, NumericArrays, ComplexExpression, IntAndDouble}.

CalciteArrayFunctionIT (force-routed)

| Status | Before | After |
| --- | --- | --- |
| Passed | 1/60 | 43/60 |
| Failed | 59/60 | 17/60 |

Newly passing across this PR's commits:

  • testArray* (4) — array constructors incl. mixed-numeric decimals
  • testMvjoinWith* (8) — mvjoin over arrays + empty array
  • testMvindex* (9) — mvindex single-element + range, with semantic + JSON formatting fixes
  • testMvdedupWith* (5)
  • testSplitWith* (3)
  • testMvzip{Basic, WithCustomDelimiter, Nested} (3)
  • testMvfindWith* (9) — all 9 mvfind variants

What's left (follow-ups)

The 22 remaining failures (5 in MVAppend + 17 in ArrayFunction) split into clear buckets — none are regressions, all are pre-existing limits or out-of-scope work:

| # | Bucket | Tests | Status | Owner / next step |
| --- | --- | --- | --- | --- |
| 1 | Lambda functions (transform, mvmap, reduce, forall, exists, filter) | 16 in CalciteArrayFunctionIT (testTransform*, testMvmap*, testReduce*, testForAll, testExists, testFilter) | Out of scope | Substrait extension YAML doesn't support declaring func<…> lambda-typed arguments. Confirmed: capability registration alone gets past the planner, isthmus emits Expression.Lambda correctly, but no Sig binding exists for transform(list<…>, func<… -> …>). Multi-PR architectural work — needs either a substrait-spec change or per-shape higher-order Rust UDFs. Track separately. |
| 2 | Genuinely heterogeneous mvappend (no common element type) | 3 in CalciteMVAppendFunctionIT: testMvappendWith{MixedTypes, FieldsAndLiterals, Null} | Architectural limit | Calcite legitimately widens to ARRAY<ANY> when leastRestrictive returns null (INT+VARCHAR, etc.). Substrait can't encode ANY; Arrow's Union arrays exist but datafusion-functions-array doesn't operate on them. Resolving requires either changing PPL's mvappend contract from heterogeneous-Object[] to a uniform stringified type (breaks user expectations) or shipping Arrow Union arrays through the wire format (multi-PR). Document as known limit. |
| 3 | Empty-array() operand carried via column ref | 1 in MVAppend (testMvappendWithEmptyArray); also surfaces as 4 in ArrayFunctionIT (testMvjoinWithEmptyArray, testMvdedupWithEmptyArray, testMvzip{WithEmptyArray, WithBothEmptyArrays}, testMvfindWithEmptyArray) — all of those now pass with #5421 merged on the bundle branch | Architectural limit (mvappend) / passes elsewhere | eval empty_arr = array(), result = mvappend(empty_arr, 1, 2) — at MVAppend's type-inference site we see a RexInputRef to a column whose declared type is ARRAY<VARCHAR> (#5421 default), not the literal array(). Per-call detection at type-inference can't reach back through the project chain. Fixing on the analytics-engine route would need either data-flow analysis or changing array()'s element-type default to NULL/UNKNOWN (which substrait can't serialize either). Document. |
| 4 | Filter predicate on ARRAY field | 1 in MVAppend (testMvappendInWhereClause) | Filter-rule capability gap | OpenSearchFilterRule.resolveViableBackends extracts only the predicate's TOP operator (EQUALS) and checks it against the field's storage type (ARRAY); it doesn't walk into nested calls, so array_length(arr) = 2 is rejected as "EQUALS on ARRAY field". Adding ARRAY_LENGTH to STANDARD_FILTER_OPS doesn't help — the rule's coarse-grained check is the actual blocker. Tracked separately as a planner refactor; not specific to mvappend. |
| 5 | testMvdedupPreservesOrder | 1 in ArrayFunctionIT | Likely DataFusion stdlib behavior gap | DataFusion's array_distinct likely doesn't preserve insertion order. Needs a custom Rust UDF or upstream contribution. Track separately. |
| 6 | Self-contained QA IT in sandbox/qa/analytics-engine-rest | — | Open follow-up | Mirror the SQL-plugin ITs for the functions onboarded here so the analytics-engine sister plugins exercise the same surface independently of the SQL plugin's IT runner. |

Companion changes

  • Default empty array() return type to ARRAY<VARCHAR> sql#5421 (merged) — defaults empty array() element type to ARRAY<VARCHAR> in PPL's ArrayFunctionImpl so the empty-call return type is substrait-serializable. Required for testMvjoinWithEmptyArray, testMvdedupWithEmptyArray, and the empty-array mvzip / mvfind variants to pass on the analytics-engine route.

  • Use leastRestrictive for mvappend element-type widening sql#5424 — MVAppendFunctionImpl.updateMostGeneralType widens via RelDataTypeFactory.leastRestrictive (with DECIMAL → DOUBLE promotion) instead of strict Object.equals, plus pre-casts each scalar operand to the call's element Java class in MVAppendImplementor so Avatica's AbstractCursor.ArrayAccessor.DoubleAccessor.getDouble ((Double) value) succeeds. Required for testMvappendWith{MixedArrayAndScalar, ComplexExpression, IntAndDouble} to pass on the analytics-engine route.

Test plan

  • Rebuilt analytics-framework, analytics-backend-datafusion (Java + Rust), analytics-engine, arrow-flight-rpc bundles via ./gradlew publishToMavenLocal -Dsandbox.enabled=true.
  • ./gradlew :sandbox:plugins:analytics-backend-datafusion:test -Dsandbox.enabled=true — green after rebase.
  • cargo test --lib udf::mvzip — 7/7. cargo test --lib udf::mvfind — 7/7. cargo test --lib udf::mvappend — 6/6.
  • Started cluster per Mustang + SQL plugin SOP: full plugin set + -Dtests.jvm.argline="-Djava.library.path=… -Dopensearch.experimental.feature.pluggable.dataformat.enabled=true". Verified all 8 plugins picked up the rebuilt jars and the rebuilt native dylib.
  • :integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteMVAppendFunctionIT" against running cluster with tests.analytics.{parquet_indices,force_routing}=true — 10/15 passing (was 0/15).
  • :integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteArrayFunctionIT" — 43/60 passing (was 1/60).
  • Verified mvindex(arr=[1..5], 1, 3) produces [2,3,4] (was [1,2,3]); negative mvindex(arr, -3, -1) produces [3,4,5].
  • Verified mvjoin(array(), '-') returns "" and mvdedup(array()) returns [] (companion #5421 default).
  • Verified mvzip(['a','b'], ['1','2']) returns ['a,1', 'b,2']; mvfind(['apple','banana','apricot'], 'ban.*') returns 1.
  • Verified array(1, -1.5, 2, 1.0) returns [1.0, -1.5, 2.0, 1.0] (was [1, -1, 2, 1]).
  • Verified mvfind(arr, concat('ban', '.*')) returns 1 (was failing on Unable to convert call CONCAT).
  • Verified mvappend(arr=array(1,2), 3, 4) returns [1, 2, 3, 4] (was failing with Unable to convert call mvappend(list<i32?>, i32?, i32?) until uniform-list reshape).
  • Verified mvappend(1, 2.5) returns [1, 2.5] end-to-end on the analytics-engine route (with #5424 widening + operand pre-cast).
  • Self-contained QA IT in sandbox/qa/analytics-engine-rest — follow-up #6 above.

@ahkcs ahkcs requested a review from a team as a code owner May 7, 2026 22:36
@github-actions (bot) commented May 7, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ae3fdc2.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/arrow-flight-rpc/build.gradle | 40 | high | New Maven dependency added: com.fasterxml.jackson.datatype:jackson-datatype-jsr310. Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |
| sandbox/plugins/analytics-backend-datafusion/build.gradle | 77 | high | New Maven dependency added: com.fasterxml.jackson.datatype:jackson-datatype-jsr310 (compileOnly). Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |
| sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml | 52 | high | New Rust crate dependency added: regex = "1.10". Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 3 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@mch2 (Member) commented May 8, 2026

@ahkcs Thanks Kai, can you pls fix this one:

* What went wrong:
Execution failed for task ':sandbox:plugins:analytics-engine:dependencyLicenses'.
> Missing SHA for commons-text-1.11.0.jar. Run "gradle updateSHAs" to create them

looks like we need to add this dependency's license files.

ahkcs added a commit to ahkcs/OpenSearch that referenced this pull request May 8, 2026
The `dependencyLicenses` precommit task scans `licenses/` for a `<jar>.sha1`
sibling per bundled dependency. Two deps added in this PR were missing them:

  * `commons-text-1.11.0` in analytics-engine — needs sha1 + LICENSE +
    NOTICE (no shared `commons-text-*` license files yet in this plugin).
    Apache 2.0; LICENSE and NOTICE extracted from the released jar.
  * `jackson-datatype-jsr310-2.21.3` in arrow-flight-rpc — sha1 only.
    arrow-flight-rpc's `dependencyLicenses` already maps `jackson-.*` to
    the shared `jackson-LICENSE` / `jackson-NOTICE` files via
    `mapping from: /jackson-.*/, to: 'jackson'`, so no new license/notice
    files are needed.

Plus googleJavaFormat reflow on `ArraySliceAdapter` and `DataFusionPlugin`
that spotlessCheck flagged in precommit.

Verified `:plugins:arrow-flight-rpc:precommit`,
`:sandbox:plugins:analytics-engine:precommit`, and
`:sandbox:plugins:analytics-backend-datafusion:precommit` all succeed.

Addresses review feedback on opensearch-project#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs (Contributor, Author) commented May 8, 2026

Fixed in aa2fb0a. Added the SHA + LICENSE + NOTICE for commons-text-1.11.0 in analytics-engine, and the SHA for jackson-datatype-jsr310-2.21.3 in arrow-flight-rpc (the shared jackson-LICENSE / jackson-NOTICE already cover it via the existing /jackson-.*/jackson mapping). Verified all three plugins' :precommit passes locally.

@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch 2 times, most recently from d2f73b7 to fa534e7 on May 8, 2026 06:38
ahkcs added a commit to opensearch-project/sql that referenced this pull request May 8, 2026
* Default empty array() return type to ARRAY<VARCHAR>

PPL's `array()` no-arg form delegates to Calcite's `SqlLibraryOperators.ARRAY`
return-type inference, which returns ARRAY<NULL> for an empty operand list and
ARRAY<UNKNOWN> when all operands are typeless nulls. Both markers are fine for
the v2 engine — `ArrayImplementor.internalCast` only consumes the element type
when there are elements to cast, so an empty result Object list flows straight
through to ExprCollectionValue regardless of declared element type.

The analytics-engine route is stricter. When isthmus walks a RexCall like
`mvjoin(array(), '-')`, it reaches its first operand's type and feeds it to
`io.substrait.isthmus.TypeConverter.toSubstrait`, which throws
`UnsupportedOperationException: Unable to convert the type UNKNOWN`. Substrait
has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't
serialize the call at all. Two PPL ITs hit this directly:

  * `CalciteArrayFunctionIT.testMvjoinWithEmptyArray`
  * `CalciteArrayFunctionIT.testMvdedupWithEmptyArray`

Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives
the call a substrait-serializable type without affecting any value
computation: the result list is empty either way.

# Test plan

* Unit tests: `:core:test --tests "*ArrayFunction*"` — passes locally (no
  existing tests asserted on the empty-array element type).
* IT: `CalciteArrayFunctionIT` force-routed through the analytics-engine path
  via opensearch-project/OpenSearch#21554's plugin set —
  testMvjoinWithEmptyArray and testMvdedupWithEmptyArray now pass (were UNKNOWN
  type errors); pass-rate moved 26/60 → 28/60.

Companion to opensearch-project/OpenSearch#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Add unit tests for empty/UNKNOWN ARRAY → VARCHAR fallback

Cover the four shapes that exercise the return-type inference path
introduced in 666dc0e:

  * array()                  — 0 operands, fallback fires → ARRAY<VARCHAR>
  * array(NULL)              — typeless-null operand, fallback fires → ARRAY<VARCHAR>
  * array(1)                 — INTEGER operand, fallback does NOT fire → ARRAY<INTEGER>
  * array('a', 'b')          — VARCHAR operands, fallback does NOT fire → ARRAY<VARCHAR>

The third case is the regression guard requested by review — confirms
concrete element types pass through unchanged and the fallback is scoped
strictly to the {@code NULL}/{@code UNKNOWN} markers.

The harness uses Calcite's {@link ExplicitOperatorBinding} bound to
{@link SqlLibraryOperators#ARRAY} so the inference's internal
{@code SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(...)}
call resolves the same operator the production code delegates to —
mocking {@code SqlOperatorBinding} directly hits NPEs deep inside
Calcite's least-restrictive-type computation.

Addresses review feedback on #5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Trim inline comment per review feedback

The "why" lives in the PR description; the inline comment now points
there instead of duplicating it. Addresses dai-chen's review note on
#5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs changed the title [Analytics Backend / DataFusion] Wire PPL array constructor + array_length / array_slice / array_distinct / mvjoin [Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route May 8, 2026
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from 44d604e to a652406 on May 8, 2026 20:23
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from a652406 to ae3fdc2 on May 8, 2026 23:29
@mch2 mch2 added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 9, 2026
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for ae3fdc2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions (bot) commented May 9, 2026

❌ Gradle check result for ae3fdc2: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ahkcs added 9 commits May 9, 2026 10:07
…ength / array_slice / array_distinct / mvjoin

Onboards the PPL `array(a, b, …)` constructor and four array-consuming
functions to the analytics-engine route by mapping their Calcite lowering
targets through Substrait to DataFusion's native make_array / array_length /
array_slice / array_distinct / array_to_string.

Same templated shape as the `replace` PR (opensearch-project#21527), with two extensions:

  ScalarFunction enum constants (5)
    + STANDARD_PROJECT_OPS / ARRAY_RETURNING_PROJECT_OPS membership
    + opensearch_array_functions.yaml extension entries
    + ADDITIONAL_SCALAR_SIGS Calcite-op→Substrait-name bridges
    + scalarFunctionAdapters() entries for the 3 functions that need
      operand normalization
    = onboarded to the analytics route.

Capability lookup at OpenSearchProjectRule keys on the call's return type;
for array-returning functions (`array(...)`, `array_slice`, `array_distinct`)
the return type resolves to `SqlTypeName.ARRAY`, which previously hit
`default → null` in `FieldType.fromSqlTypeName` and emptied the viable-backend
list before the registration could match.

  * `FieldType.ARRAY` added to the analytics SPI enum.
  * `SqlTypeName.ARRAY → FieldType.ARRAY` mapping in `fromSqlTypeName`.
  * `ARRAY_RETURNING_PROJECT_OPS` registered against `Set.of(FieldType.ARRAY)`
    only — separate from `STANDARD_PROJECT_OPS` so `FieldType.ARRAY` doesn't
    pollute filter / aggregate capabilities (no meaningful semantics over
    array-typed values there).
  * `ArrowSchemaFromCalcite.toArrowField` recurses into the component type
    to build the matching Arrow `List<inner>` field — without this the result
    schema would have a bare `List` with no element field and the backend's
    Arrow IPC reader would fail to bind result columns.

Substrait's standard catalog has no array_* entries, so isthmus'
`RexExpressionConverter` would fail with "Unable to convert call …" on every
array call. New `opensearch_array_functions.yaml` declares:

  * `make_array(any1, …)` → `list<any1>` (variadic, min: 0).
  * `array_length(list<any1>)` → `i64?`.
  * `array_slice(list<any1>, i64, i64)` → `list<any1>` (with i32 fallback).
  * `array_distinct(list<any1>)` → `list<any1>`.
  * `array_to_string(list<any1>, string)` → `string?` (with varchar fallback).

Loaded via `SimpleExtension.load("/opensearch_array_functions.yaml")` and
merged into the plugin's extension collection in
`DataFusionPlugin.loadSubstraitExtensions()`.

Substrait's call-conversion path (and DataFusion's signature matcher) is
strict about operand types in ways Calcite's PPL lowering doesn't naturally
satisfy. Three adapters bridge the gap:

  * `MakeArrayAdapter` — implements `ScalarFunctionAdapter` directly
    (not `AbstractNameMappingAdapter`). PPL's `ArrayFunctionImpl` infers
    `ARRAY<commonElementType>` for the call's return type but does NOT
    widen the individual operand types. So `array(1, 1.5)` produces a
    RexCall whose operands are `(INTEGER, DECIMAL(2,1))` but whose return
    type is `ARRAY<DOUBLE>`. Substrait's variadic-`any1` consistency
    validator throws an `AssertionError` in that case (not a recoverable
    exception — it fatally exits the search-thread JVM). The adapter
    extracts the call's component type and CASTs each non-matching
    operand to it before emission.
  * `ArrayToStringAdapter` — declares a local `array_to_string` op and
    name-maps `SqlLibraryOperators.ARRAY_JOIN` → it.
  * `ArraySliceAdapter` — passes the `ARRAY_SLICE` call through unchanged
    but coerces the index operands (positions 1, 2, optional 3) to
    `BIGINT`. PPL's parser types positive integer literals as
    `DECIMAL(20,0)`; DataFusion's `array_slice` signature accepts only
    integer indexes and refuses to coerce decimal arguments.

Two third-party dependencies that surfaced as fatal `NoClassDefFoundError`
during execution of array-returning calls:

  * `commons-text` to analytics-engine — Calcite's `SqlFunctions` class
    statically references `org.apache.commons.text.similarity.LevenshteinDistance`.
    Without it, any Calcite RelNode walk that touches `SqlFunctions.<clinit>`
    poisons the search-thread JVM.
  * `jackson-datatype-jsr310` to **arrow-flight-rpc** (the parent plugin
    that bundles `arrow-vector`). `arrow-vector`'s `JsonStringArrayList`
    eagerly registers `JavaTimeModule` on its ObjectMapper in `<clinit>`,
    so any reader of an Arrow `ListVector` (i.e. every array-returning
    DataFusion call flowing through analytics-engine) hits a fatal
    NoClassDefFoundError. The dep belongs on arrow-flight-rpc's classpath
    because that plugin defines arrow-vector's classloader; bundling it
    in analytics-backend-datafusion (the child plugin) is invisible to
    arrow-vector. Marked `compileOnly` here to avoid jar-hell with
    arrow-flight-rpc's `api` dependency.

  * Before: 1/60 (testArrayWithMix only — exercises an error path that
    fails before the ARRAY capability lookup).
  * After:  9/60.
    Newly passing: testArray, testArrayWithString, testArrayLength,
    testMvjoinWithStringArray, testMvjoinWithStringifiedNumbers,
    testMvjoinWithMixedStringValues, testMvjoinWithStringBooleans,
    testMvjoinWithSpecialDelimiters, testMvjoinWithArrayFromRealFields,
    testMvjoinWithMultipleRealFields.

The remaining 51 failures fall into three buckets:

  * 50 — out-of-scope S1+ functions (`mvfind`, `mvzip`, `reduce`, `transform`,
    `forall`, `filter`, `exists`, `ITEM`). These are PPL UDFs without direct
    DataFusion equivalents and need either lambda-substrait wiring or
    custom UDF registration on the Rust side.
  * 5  — `testMvindexRange*` family. PPL's `mvindex(arr, from, to)` lowers
    to `ARRAY_SLICE(arr, from+1, to+1)` (1-based shift) but the lowering
    is missing the +1, so DataFusion's 1-based array_slice returns a
    window shifted by one. Fix belongs in the SQL plugin's PPL→Calcite
    lowering layer.
  * 1  — `testMvindexRangeMixed` JSON formatting mismatch (test code
    expects bare `[a,b,c]` but the response is `\"[\\\"a\\\",\\\"b\\\",\\\"c\\\"]\"`).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…th) → 1-based-(start, end) for DataFusion

Calcite's `SqlLibraryOperators.ARRAY_SLICE` is the Spark / Hive flavor —
0-based start, third arg is the length-of-elements to take. PPL's
`MVIndexFunctionImp.resolveRange` (in the SQL plugin) emits this form,
e.g. `mvindex(arr=[1..5], 1, 3)` → `ARRAY_SLICE(arr, 1, 3)` meaning
"start at 0-based position 1, take 3 elements" → expected `[2, 3, 4]`.

DataFusion's native `array_slice` is the Postgres / Snowflake flavor —
1-based start, third arg is the inclusive end-index. So the same call
`array_slice(arr, 1, 3)` returns elements at 1-based positions 1..3 →
`[1, 2, 3]`. Off-by-one across every `mvindex` range query.

Convert the operands in the adapter rather than the SQL plugin's PPL
lowering, because the lowering's existing semantics are correct for
Calcite's local executor (used by every non-analytics path); the bug is
only in the bridge to DataFusion.

  start' = start + 1
  end'   = start + length    (== start + 1 + (length - 1))

`MVIndexFunctionImp` already normalizes negative indexes to non-negative
0-based positions before invoking ARRAY_SLICE (it uses `arrayLen + idx`),
so the arithmetic above applies uniformly.
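
Sketched in Rust (the normalization itself lives in the SQL plugin's MVIndexFunctionImp, not this adapter; the length arithmetic is inferred from the example above):

```rust
// Negative indexes normalize to 0-based positions via arrayLen + idx before
// ARRAY_SLICE is emitted with (0-based start, length = to - from + 1).
fn normalize(idx: i64, array_len: i64) -> i64 {
    if idx < 0 { array_len + idx } else { idx }
}

fn main() {
    // mvindex(arr=[1..5], -3, -1): from → 2, to → 4, length = 3, so
    // ARRAY_SLICE(arr, 2, 3) → array_slice(arr, 3, 5) → [3, 4, 5].
    assert_eq!(normalize(-3, 5), 2);
    assert_eq!(normalize(-1, 5), 4);
}
```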

Empirically: `mvindex(arr=[1..5], 1, 3)` now returns the correct values
`[2, 3, 4]` (was `[1, 2, 3]`); negative form `mvindex(arr, -3, -1)`
returns `[3, 4, 5]` (was `[2, 3]`); mixed `mvindex(arr, -4, 2)` returns
`[2, 3]` matching the PPL spec.

The 5 `testMvindexRange*` tests still don't pass on the IT, but for an
unrelated reason — array-typed result values are being returned as
JSON-stringified scalars (`"[2,3,4]"`) instead of typed arrays. That's a
response-formatting issue affecting every array-returning test (also
`testArray`, `testArrayWithString`) and lives in a different code path;
it'll be addressed separately.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…t) → DataFusion array_element

PPL's `mvindex(arr, N)` single-element form lowers (in `MVIndexFunctionImp.resolveSingleElement`)
to Calcite's `SqlStdOperatorTable.ITEM` operator with a 1-based index (already
converted from PPL's 0-based input). DataFusion's native single-element array
accessor is `array_element` (also 1-based), so a name-mapping adapter + yaml
extension entry are sufficient.

Templated shape:

  ScalarFunction.ITEM (SqlKind.ITEM)
    + STANDARD_PROJECT_OPS membership (returns the array's element type, which
      resolves through the existing FieldType.fromSqlTypeName → SUPPORTED_FIELD_TYPES
      capability lookup for non-array element types — array-of-array is rare in
      PPL and not exercised by the current test surface)
    + scalarFunctionAdapters() entry → ArrayElementAdapter
        ↳ rewrites SqlStdOperatorTable.ITEM to a locally-declared SqlFunction
          named "array_element"
        ↳ coerces the index operand to BIGINT (PPL's parser produces DECIMAL
          for positive integer literals; DataFusion's array_element rejects
          DECIMAL indexes, same as array_slice)
    + ADDITIONAL_SCALAR_SIGS bridge for the locally-declared op
    + opensearch_array_functions.yaml extension entry:
        array_element(list<any1>, i64) → any1?

# Pass-rate (CalciteArrayFunctionIT, force-routed)

  * Before this commit: 9/60.
  * After this commit:  12/60.
    Newly passing: testMvindexSingleElementPositive,
    testMvindexSingleElementNegative,
    testMvindexSingleElementNegativeMiddle.

The other 3 tests that hit the ITEM rejection (testMvfindWith*) are
multi-step queries where ITEM is one node in a tree that also includes
unrelated S1+ functions (mvfind/mvzip/etc.); they remain blocked by
the upstream functions, not by ITEM itself.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ithout JSON-stringifying

The row-oriented fragment-execution wire format (`FragmentExecutionResponse`,
used when arrow-flight streaming is disabled — every single-node test cluster
today) shipped each cell through OpenSearch's `writeGenericValue` /
`readGenericValue`, which preserves `List` values as `ArrayList<Object>`. On
the coordinator side, `RowResponseCodec.decode` then re-materialized the rows
into a `VectorSchemaRoot` for `Iterable<VectorSchemaRoot>`-style consumers.

Two bugs in that re-materialization were eating array values:

1. `inferArrowType` walked rows for the first non-null cell and matched
   against {Long, Integer, …, CharSequence, byte[], Number}. {@code List}
   wasn't in the chain, so it fell through to {@code break} and the
   fallback {@link ArrowType.Utf8} — every array column became a VARCHAR
   column.
2. `setVectorValue` for {@link VarCharVector} called {@code value.toString()}.
   For a {@code JsonStringArrayList} that returns the JSON form
   {@code "[2,3,4]"}, which then got serialized as a JSON string in the
   final response. Tests like {@code testMvindexRangePositive} saw their
   array result come back as a string `"[2,3,4]"` instead of an array
   `[2, 3, 4]`.

Fix:

* Replace {@code inferArrowType} with {@code inferField} that returns a
  full {@link Field}. For {@code List} cells, build a list field with the
  inner element type inferred from the first non-null element (with a
  fallback that scans later rows in case the first list is empty/all-null).
* Add a {@code ListVector} arm to {@code setVectorValue} that delegates to
  a new {@code writeListValue}. The writer bypasses {@link UnionListWriter}
  entirely — it writes directly to the list's offset / validity buffers and
  to the inner data vector via the inner vector's typed `setSafe`. The
  writer-based API requires per-element `ArrowBuf` allocations for varchar
  elements that are easy to leak or use-after-free; the direct path is
  simpler and avoids both classes of bug.

Plus a separate Arrow gotcha that surfaced once arrays started flowing
through correctly:

* {@code ListVector.getObject} for a {@code VarCharVector} child returns a
  {@code JsonStringArrayList} whose elements are Arrow's {@link Text} class,
  not Java {@link String}. {@code ExprValueUtils.fromObjectValue} doesn't
  recognize {@code Text} and threw "unsupported object class
  org.apache.arrow.vector.util.Text". {@code ArrowValues.toJavaValue} now
  mirrors its top-level VarChar branch for list cells: when a list value
  comes back from a {@code ListVector}, normalize each {@code Text} element
  to a {@link String} before handing the list upward.

  * Before: 12/60 (mvindex range tests still showed expected-vs-actual
    diff because `[2,3,4]` came back as a JSON string, not an array).
  * After:  26/60.

  Newly passing:
    testMvindexRangePositive, testMvindexRangeNegative, testMvindexRangeMixed,
    testMvindexRangeFirstThree, testMvindexRangeLastThree,
    testMvindexRangeSingleElement,
    testMvdedupWithDuplicates, testMvdedupWithAllDuplicates,
    testMvdedupWithNoDuplicates, testMvdedupWithStrings,
    testArrayWithString,
    testSplitWithSemicolonDelimiter, testSplitWithMultiCharDelimiter,
    testSplitWithEmptyDelimiter.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
The `dependencyLicenses` precommit task scans `licenses/` for a `<jar>.sha1`
sibling per bundled dependency. Two deps added in this PR were missing them:

  * `commons-text-1.11.0` in analytics-engine — needs sha1 + LICENSE +
    NOTICE (no shared `commons-text-*` license files yet in this plugin).
    Apache 2.0; LICENSE and NOTICE extracted from the released jar.
  * `jackson-datatype-jsr310-2.21.3` in arrow-flight-rpc — sha1 only.
    arrow-flight-rpc's `dependencyLicenses` already maps `jackson-.*` to
    the shared `jackson-LICENSE` / `jackson-NOTICE` files via
    `mapping from: /jackson-.*/, to: 'jackson'`, so no new license/notice
    files are needed.

Plus googleJavaFormat reflow on `ArraySliceAdapter` and `DataFusionPlugin`
that spotlessCheck flagged in precommit.

Verified `:plugins:arrow-flight-rpc:precommit`,
`:sandbox:plugins:analytics-engine:precommit`, and
`:sandbox:plugins:analytics-backend-datafusion:precommit` all succeed.

Addresses review feedback on opensearch-project#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… inference

{@code RowResponseCodec.scalarArrowType} ordered its instanceof checks
{Long, Integer, Short, Byte, Double, Float, Boolean, CharSequence, byte[],
Number(fallback) → Int(64)}. BigDecimal extends {@link Number} but isn't any
of the typed scalar arms, so it fell through to the {@code Number} fallback
and got encoded as a 64-bit integer column — silently truncating fractional
digits.

This bites PPL flows whose common element type is {@code DECIMAL} (e.g.
{@code array(1, -1.5, 2, 1.0)} — the v2-side {@code ArrayImplementor.internalCast}
explicitly maps the DECIMAL target to BigDecimal cells). The element values
{@code -1.5} and {@code 1.0} round to {@code -1} and {@code 1} when forced
through Int(64), so the array reads back as {@code [1, -1, 2, 1]} instead of
{@code [1, -1.5, 2, 1.0]}.

Promote BigDecimal cells to FloatingPoint(DOUBLE) — same precision the v2
engine uses for decimal-typed PPL results, so behavior matches across both
execution paths. The list writer's {@code Float8Vector} arm already uses
{@code ((Number) element).doubleValue()}, which correctly extracts the
fractional value from a BigDecimal.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
PPL `mvzip(left, right [, sep])` element-wise zips two arrays into a list of
strings, joined per pair by a separator (default `,`). DataFusion has no
stdlib equivalent — `array_concat` is end-to-end concatenation, and Substrait's
lambda support is too thin for a transform/zip rewrite — so this onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context and wires the Java side to route to it.

Templated shape (extends the existing pattern from convert_tz):

  Rust side:
    udf::mvzip::MvzipUdf — Signature::user_defined; coerce_types pins the
      first two args to ListArray and the optional 3rd to Utf8; invoke_with_args
      iterates per row, takes min(len(left), len(right)) elements, stringifies
      each (matching `Objects.toString(elem, "")` for null elements), and
      builds a List<Utf8>. Defensive Null-element-type arm handles the empty
      array case before the SQL-plugin VARCHAR-default kicks in.
    Registered on each session context via udf::register_all alongside
    convert_tz. 7 unit tests cover the basic / custom-sep / truncation /
    null-element / null-array / empty-array / numeric-array shapes.

  Java side:
    ScalarFunction.MVZIP enum entry (SqlKind.OTHER_FUNCTION; resolves through
      identifier-name valueOf("MVZIP") since PPL's MVZipFunctionImpl registers
      under the function name "mvzip").
    MvzipAdapter — locally-declared SqlFunction("mvzip") + ADDITIONAL_SCALAR_SIGS
      bridge so isthmus emits a Substrait scalar function call with the exact
      name the Rust UDF is registered under.
    DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
      (returns ARRAY<VARCHAR>, registered against FieldType.ARRAY); adapter
      registration in scalarFunctionAdapters().
    opensearch_array_functions.yaml: two impls for arity-2 and arity-3.

  * Before: 28/60.
  * After:  34/60.

  Newly passing — all 5 testMvzip* variants:
    testMvzipBasic, testMvzipWithCustomDelimiter, testMvzipNested,
    testMvzipWithEmptyArray, testMvzipWithBothEmptyArrays.

  (Test count delta is +6 because the test class also exercises mvzip in 1
  other test under a different name, picked up by the same fix.)

This PR's run also picks up the SQL-plugin companion #5421 which defaults
empty `array()` to ARRAY<VARCHAR>. Without that companion the testMvzipWith*EmptyArray
variants would still fail — substrait would reject the input ARRAY<NULL>
type before reaching the UDF. The Rust UDF's Null-element arm exists as a
defensive backstop in case the call ever reaches it with a null-typed list.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
PPL `mvfind(arr, regex)` finds the 0-based index of the first array element
matching a regex pattern (Java `Matcher.find` substring-match semantics), or
NULL if no match. DataFusion has no stdlib equivalent, and rewriting in terms
of array_position requires per-element regex evaluation that's only
expressible with substrait lambda support — out of scope here. Onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context, mirroring the mvzip/convert_tz pattern.

Templated shape:

  Rust side:
    udf::mvfind::MvfindUdf — Signature::user_defined; coerce_types pins arg 0
      to a list type and arg 1 to Utf8; invoke_with_args walks each row and
      finds the first non-null element whose stringified form matches the
      regex via Rust's `regex` crate (`Regex::is_match` is unanchored, same
      as Java's `Matcher.find`). Scalar pattern operands compile once up
      front and surface invalid-regex errors at plan time (mirrors the SQL
      plugin's plan-time `tryCompileLiteralPattern`); column-valued patterns
      compile per row and yield NULL for invalid patterns. Supports list
      element types Utf8 / Int{8,16,32,64} / UInt{8,16,32,64} / Float{32,64}
      / Boolean / Null. 7 unit tests cover the basic-match / no-match /
      null-array / empty-array / null-element / numeric-array / unanchored
      shapes.
    Registered on each session context via udf::register_all alongside
    convert_tz and mvzip.

  Java side:
    ScalarFunction.MVFIND enum entry (SqlKind.OTHER_FUNCTION; resolves
      through identifier-name valueOf("MVFIND") since PPL's
      MVFindFunctionImpl registers under the function name "mvfind").
    MvfindAdapter — locally-declared SqlFunction("mvfind") +
      ADDITIONAL_SCALAR_SIGS bridge so isthmus emits a Substrait scalar
      function call with the exact name the Rust UDF is registered under.
    DataFusionAnalyticsBackendPlugin: STANDARD_PROJECT_OPS membership
      (returns INTEGER, registered against the existing scalar
      SUPPORTED_FIELD_TYPES); adapter registration in
      scalarFunctionAdapters().
    opensearch_array_functions.yaml: arity-2 impl returning `i32?`.

  * Before: 34/60.
  * After:  42/60.

  Newly passing — 8 of 9 testMvfind* variants:
    testMvfindWithMatch, testMvfindWithFirstMatch, testMvfindWithMultipleMatches,
    testMvfindWithNoMatch, testMvfindWithEmptyArray, testMvfindWithNumericArray,
    testMvfindWithCaseInsensitive, testMvfindWithComplexRegex.

  Remaining mvfind failure:
    testMvfindWithDynamicRegex — fails with "Unable to convert call
    CONCAT(string, string)" because the test computes the pattern via
    `concat('ban', '.*')` and substrait can't bind the CONCAT call. This is a
    separate analytics-engine CONCAT type-conversion issue, not mvfind-specific.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… UDF

PPL `mvappend(arg1, arg2, …)` flattens a mixed list of array and scalar
arguments into one array, dropping null arguments and null elements within
array arguments. DataFusion's `array_concat` is the closest stdlib match but
only accepts arrays (not mixed array+scalar) and preserves nulls — different
semantics. Onboards as a custom Rust ScalarUDF on the analytics-backend-datafusion
plugin's session context, mirroring the mvzip / mvfind pattern.

Templated shape:

  Rust side:
    udf::mvappend::MvappendUdf — Signature::user_defined; per-row walk over
      operands, skipping NULL args and NULL elements inside array args, with
      explicit Arrow type arms for {Int8/16/32/64, UInt8/16/32/64,
      Float32/64, Boolean, Utf8/LargeUtf8/Utf8View}. The string arms output
      List<Utf8> or List<Utf8View> depending on the inferred element type so
      the result schema matches what `return_type` declared (DataFusion's
      execution-time schema check rejects mismatches). Defensive Null
      element-type arm covers the empty-array shape. 6 unit tests.
    Registered on each session context via udf::register_all.

  Java side:
    ScalarFunction.MVAPPEND enum entry (SqlKind.OTHER_FUNCTION; resolves
      through identifier-name valueOf("MVAPPEND")).
    MvappendAdapter — locally-declared SqlFunction("mvappend") +
      ADDITIONAL_SCALAR_SIGS bridge. Casts every scalar operand to the
      call's array component type and every array operand to
      ARRAY<componentType> before substrait emission, so the UDF sees a
      single uniform element type across all positions.
    DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
      (returns ARRAY<commonType>); adapter registration in
      scalarFunctionAdapters().
    opensearch_array_functions.yaml: variadic min:1 entry with `list<any1?>`
      return type.

  * Before: 0/15.
  * After:  6/15.

  Newly passing:
    testMvappendWithMultipleElements, testMvappendWithSingleElement,
    testMvappendWithArrayFlattening, testMvappendWithStringValues,
    testMvappendWithNestedArrays, testMvappendWithRealFields.

  * 8 tests fail with "Unable to convert the type ANY". Root cause is
    PPL's MVAppendFunctionImpl.updateMostGeneralType using strict
    Object.equals on each pair of operand types, returning Calcite's
    ANY type when any two don't match — including when they only differ
    in nullability tag (a literal 3 is INTEGER NOT NULL but the
    component type of `array(1, 2)` is INTEGER NULLABLE). Substrait
    can't serialize ANY. The fix belongs in the SQL plugin's
    MVAppendFunctionImpl (use typeFactory.leastRestrictive instead of
    Object.equals) and isn't addressed here.
  * testMvappendInWhereClause — uses `where array_length(combined) = 2`
    which the analytics-engine planner rejects with "No backend can
    evaluate filter predicate [EQUALS] on fields [combined:ARRAY]".
    Filter-side capability gap unrelated to mvappend.
  * testMvappendWithComplexExpression — fails substrait conversion on
    a nested mvappend call ("Unable to convert call mvappend(list, …)"),
    likely the same nullability widening pattern flowing through nested
    calls. Same upstream fix applies.

  Unchanged at 43/60 — mvappend isn't exercised there.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added 2 commits May 9, 2026 10:11
… lists; add Decimal128 element support

Two follow-ons to the initial mvappend onboarding (40b2161), both surfaced
once the SQL companion opensearch-project#5424 (`MVAppendFunctionImpl.leastRestrictive`) let
homogeneous-type calls reach substrait conversion.

# Uniform-list operand reshape

Substrait's variadic-`any1` argument shape requires every operand at the same
variadic position to share a type. PPL's `mvappend(arg, …)` accepts a mix of
bare scalars and arrays, which substrait's signature matcher rejected with
`Unable to convert call mvappend(list<i32?>, i32?, i32?)`.

`MvappendAdapter` now wraps each scalar operand in a singleton
`make_array(scalar)` call (using the locally-declared `MakeArrayAdapter.LOCAL_MAKE_ARRAY_OP`)
so by the time the substrait converter sees the operands they're uniformly
`list<componentType>`. The yaml impl was correspondingly tightened from
`args: [{ value: any1 }] variadic` to `args: [{ value: list<any1?> }] variadic`.

Rust UDF (`udf::mvappend`) keeps its scalar-handling branch intact as a
defensive fallback, but in practice every operand it sees is a list now.
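
Shape-wise, the reshape does the following (plain-Rust sketch; the real rewrite happens over Calcite RexCall operands in the Java adapter, and Operand / reshape are hypothetical names):

```rust
// After the rewrite, every variadic position carries the same list<T> shape,
// which is what substrait's variadic-any1 consistency check requires.
enum Operand {
    Scalar(Option<i64>),
    List(Vec<Option<i64>>),
}

fn reshape(ops: Vec<Operand>) -> Vec<Vec<Option<i64>>> {
    ops.into_iter()
        .map(|op| match op {
            Operand::Scalar(v) => vec![v], // wrap: scalar → make_array(scalar)
            Operand::List(l) => l,         // lists pass through unchanged
        })
        .collect()
}
```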

# Decimal128 element type

Calcite's leastRestrictive widening on INT + DECIMAL produces DECIMAL(p, s)
which substrait converts to Decimal128(p, s); the Java adapter casts every
operand's element type to that. The Rust UDF needed an explicit
`DataType::Decimal128(p, s)` branch — Decimal128Builder requires
`.with_precision_and_scale(p, s)` configuration before use, and Decimal128Array
elements are read via the `i128`-valued `value(i)` accessor (not via the
generic `build!` macro).
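
A minimal arrow-rs sketch of those two quirks (hedged — builder APIs vary across arrow versions):

```rust
use arrow::array::{Array, Decimal128Array, Decimal128Builder};

fn main() {
    // The builder must be configured with (precision, scale) up front, or the
    // produced array's DataType won't match what return_type declared.
    let mut builder = Decimal128Builder::new()
        .with_precision_and_scale(10, 2)
        .expect("valid (p, s)");
    builder.append_value(150_i128); // 1.50 at scale 2
    builder.append_null();
    let arr: Decimal128Array = builder.finish();

    // Elements read back as raw i128 at the declared scale.
    assert_eq!(arr.value(0), 150);
    assert!(arr.is_null(1));
}
```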

# Pass-rate (CalciteMVAppendFunctionIT, force-routed, with companion opensearch-project#5424 applied)

  * Before this commit: 6/15 (initial mvappend onboarding).
  * After this commit:  10/15.

  Newly passing:
    testMvappendWithMixedArrayAndScalar (uniform-list reshape),
    testMvappendWithComplexExpression (uniform-list reshape),
    testMvappendWithIntAndDouble (Decimal128 element),
    testMvappendWithNumericArrays (Decimal128 element).

  Remaining 5 failures:
    * testMvappendWithMixedTypes / WithFieldsAndLiterals / WithEmptyArray /
      WithNull — call legitimately widens to ARRAY<ANY> because operands
      contain pairs of types with no common widened type (INT + VARCHAR).
      The Calcite engine handles ANY via Object generic dispatch; substrait
      can't encode it. Out of scope without changing PPL UDF semantics.
    * testMvappendInWhereClause — uses `where array_length(combined) = 2`
      which the analytics-engine planner rejects with "No backend can
      evaluate filter predicate [EQUALS] on fields [combined:ARRAY]".
      Filter-side capability gap unrelated to mvappend.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… contexts

create_session_context (the Rust-side builder behind df_create_session_context)
built a fresh DataFusion SessionContext but never called udf::register_all on
it. Every fragment query routed through df_execute_with_context reused that
handle's ctx via query_executor::execute_with_context, so substrait function
references to mvappend / mvfind / mvzip / convert_tz failed planning with
"This feature is not implemented: Unsupported function name". The matching
register_all call exists in execute_query / local_executor / indexed_executor
— this just brings the FFM session-context path to parity.

Verified: CalciteMVAppendFunctionIT against the analytics-engine route now
passes 10/15 (was 0/15) with the SQL companion opensearch-project#5424 widening fix applied.
The remaining 5 are pre-existing ARRAY<ANY>/UNKNOWN substrait-encoding gaps
(heterogeneous mvappend signatures, empty-array default, filter-on-array
predicate) tracked in this PR's "What's left" section.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from ae3fdc2 to 459bffc on May 9, 2026 17:12
ahkcs added a commit to ahkcs/sql that referenced this pull request May 9, 2026
`MVAppendFunctionImpl.updateMostGeneralType` used strict {@code Object.equals}
to compare each operand's component type against the running "most general"
type, falling back to Calcite's {@code ANY} on any mismatch. That's too
aggressive: {@code Object.equals} returns false for type pairs that differ
only in nullability tag (e.g. {@code array(1, 2)} synthesizes INTEGER NULLABLE
for its component while literal {@code 3} is INTEGER NOT NULL), and for
straightforwardly-widenable numerics like INTEGER + DOUBLE. The PPL UDF result
would then be {@code ARRAY<ANY>}.

The Calcite engine's enumerable runtime tolerates {@code ANY} because
{@code MVAppendImplementor.eval} processes elements through {@code Object} —
the declared element type is unused at execution time. The analytics-engine
route is stricter: substrait can't serialize {@code ANY}, so isthmus throws
{@code UnsupportedOperationException: Unable to convert the type ANY} during
the substrait conversion phase.

Widen with {@link RelDataTypeFactory#leastRestrictive} — the same routine
{@code SqlLibraryOperators.ARRAY} uses for its return-type inference. Falls
back to ANY only when {@code leastRestrictive} returns null (genuinely
incompatible operand types like INT + VARCHAR), preserving the original
behavior on those queries.

# Test plan

* {@code :core:test --tests "*MVAppend*"} — passes (no existing test asserted
  on the {@code ANY} fallback).
* Companion to opensearch-project/OpenSearch#21554 — unblocks 8+ tests in
  {@code CalciteMVAppendFunctionIT} force-routed through the analytics-engine
  path that previously failed with "Unable to convert the type ANY".

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ill the cluster

Substrait's plan validators (VariadicParameterConsistencyValidator,
RelOptUtil.eq via Litmus.THROW, etc.) throw AssertionError directly via
explicit `throw new AssertionError(...)` rather than via the `assert`
keyword, so the JVM -da flag doesn't gate them. When a malformed plan
triggers one inside a search-thread call to SubstraitRelVisitor.apply,
the AssertionError propagates uncaught up the analytics-engine fragment
handler stack, OpenSearchUncaughtExceptionHandler classifies it as fatal,
and the entire cluster JVM exits.

Wrap the visitor.apply call in a narrow try/catch that re-raises the
AssertionError as IllegalStateException with the original message and
cause preserved. The analytics-engine error path already buckets
IllegalStateException at the fragment boundary into a normal HTTP 500
response — the cluster stays up and the failure shows in the per-query
report instead.

This came up while diagnosing CalciteMVAppendFunctionIT failures: malformed
ARRAY<ANY> plans were taking down the cluster mid-test instead of producing
per-test failures, masking the underlying substrait conversion error.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from 459bffc to 1a0f571 on May 9, 2026 17:55
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for 1a0f571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


…ST path

Self-contained QA ITs in sandbox/qa/analytics-engine-rest exercising the
PPL collection functions onboarded in this PR through POST /_analytics/ppl
against a parquet-backed `calcs` dataset, no SQL plugin checkout required.

ArrayFunctionIT (22 tests):
  - array constructor (mixed-numeric BigDecimal → Double promotion + int+string)
  - array_length
  - mvindex range (array_slice — 0-based-(start, length) → 1-based-(start, end))
  - mvindex single (array_element via ITEM rename)
  - mvdedup (array_distinct)
  - mvjoin (array_to_string rename)
  - mvzip (Rust UDF, default + custom delimiter + nested)
  - mvfind (Rust UDF, match / no-match / dynamic regex via concat() Sig bridge)
  - split (returns array)

MVAppendFunctionIT (6 tests):
  - uniform-typed scalar variadic (multiple, single, string)
  - array operands (flattening, nested string arrays)
  - VARCHAR field references via real calcs row

Tests gated on SQL companion opensearch-project#5424 (testMvappendWith{IntAndDouble,
MixedArrayAndScalar, NumericArrays, ComplexExpression}) are intentionally
absent — they fail with "Unable to convert the type ANY" until
MVAppendFunctionImpl's leastRestrictive widening + DECIMAL→DOUBLE
promotion + operand pre-cast is published as unified-query-core. A
top-of-class block lists them with a pointer back to opensearch-project#5424.

Lambda-based functions (transform, mvmap, reduce, forall, exists, filter)
and empty-array operands are absent for the architectural reasons in this
PR's "What's left" section: substrait extension YAML doesn't support
declaring func<…> lambda-typed args, and array() defaults to ARRAY<UNKNOWN>
which substrait can't encode without #5421.

Local verification (per `docs/dev/ppl-analytics-engine-routing.md` SOP):
- :sandbox:qa:analytics-engine-rest:integTest --tests "*ArrayFunctionIT" — 22/22
- :sandbox:qa:analytics-engine-rest:integTest --tests "*MVAppendFunctionIT" — 6/6
- :check -p sandbox — all 718 tasks green

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for cf8ee09: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions (bot) commented May 9, 2026

❕ Gradle check result for cf8ee09: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov (bot) commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.38%. Comparing base (8f72a95) to head (cf8ee09).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21554      +/-   ##
============================================
- Coverage     73.48%   73.38%   -0.11%     
+ Complexity    74646    74544     -102     
============================================
  Files          5980     5980              
  Lines        338777   338777              
  Branches      48848    48848              
============================================
- Hits         248964   248608     -356     
- Misses        70026    70362     +336     
- Partials      19787    19807      +20     


@mch2 mch2 merged commit 4dea6c4 into opensearch-project:main May 9, 2026
27 of 29 checks passed
