
[Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route #21554

Merged

mch2 merged 13 commits into opensearch-project:main from ahkcs:feature/mustang-array-functions on May 9, 2026

Conversation

@ahkcs (Contributor) commented May 7, 2026

Description

Onboards the PPL array constructor and eight multivalue (mv) collection functions to the analytics-engine route, plus row-codec / decimal-handling fixes, the concat() Sig bridge, and a Rust-side fix that is load-bearing for any of these UDFs to work end-to-end:

| Function | Calcite operator | Substrait → DataFusion | Adapter / Rust UDF |
| --- | --- | --- | --- |
| array(a, b, …) | PPLBuiltinOperators.ARRAY | make_array | MakeArrayAdapter (rename + variadic operand widening) |
| array_length(arr) | SqlLibraryOperators.ARRAY_LENGTH | array_length | none |
| array_slice / mvindex(arr, from, to) | SqlLibraryOperators.ARRAY_SLICE | array_slice | ArraySliceAdapter (BIGINT index coerce + 0-based-(start, length) → 1-based-(start, end)) |
| mvdedup | SqlLibraryOperators.ARRAY_DISTINCT | array_distinct | none |
| mvjoin(arr, sep) | SqlLibraryOperators.ARRAY_JOIN | array_to_string | ArrayToStringAdapter (rename) |
| mvindex(arr, N) single-element | SqlStdOperatorTable.ITEM | array_element | ArrayElementAdapter (rename + BIGINT index coerce) |
| mvzip(left, right [, sep]) | PPLBuiltinOperators.MVZIP (UDF) | custom Rust UDF udf::mvzip | MvzipAdapter |
| mvfind(arr, regex) | PPLBuiltinOperators.MVFIND (UDF) | custom Rust UDF udf::mvfind | MvfindAdapter |
| mvappend(arg1, arg2, …) | PPLBuiltinOperators.MVAPPEND (UDF) | custom Rust UDF udf::mvappend | MvappendAdapter (uniform-list reshape) |

ARRAY_RETURNING_PROJECT_OPS is a separate set from STANDARD_PROJECT_OPS so FieldType.ARRAY doesn't pollute filter / aggregate capabilities.

Critical Rust-side fix (session_context.rs)

create_session_context (the Rust-side builder behind df_create_session_context) built a fresh DataFusion SessionContext but never called udf::register_all on it. Every fragment query routed through df_execute_with_context reused that handle's ctx via query_executor::execute_with_context, so substrait function references to mvappend / mvfind / mvzip / convert_tz failed planning with "This feature is not implemented: Unsupported function name". The matching register_all call already exists on execute_query / local_executor / indexed_executor — this just brings the FFM session-context path to parity. Without it, none of the custom UDFs in this PR worked at runtime.
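
A minimal sketch of the parity fix, assuming the crate's existing udf::register_all helper takes a borrowed SessionContext (the udf module is this repo's own, not DataFusion stdlib):

```rust
use datafusion::prelude::SessionContext;

// Sketch only — mirrors the fix described above, not the exact source.
pub fn create_session_context() -> SessionContext {
    let ctx = SessionContext::new();
    // Previously missing on the FFM path: without this call, substrait
    // references to mvzip / mvfind / mvappend / convert_tz fail planning
    // on any query routed through df_execute_with_context.
    crate::udf::register_all(&ctx);
    ctx
}
```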

Custom Rust UDFs

DataFusion has no stdlib equivalent for mvzip (element-wise zip into strings), mvfind (regex match → first index), or mvappend (mixed array+scalar flatten with null filtering). All three onboard as Rust ScalarUDFs on the analytics-backend-datafusion plugin's session context, mirroring the existing convert_tz pattern. Each has unit tests (mvzip 7, mvfind 7, mvappend 6).
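
A per-row sketch of the three UDFs' semantics in plain Rust over already-materialized values (illustrative only — the real implementations are DataFusion ScalarUDFs over Arrow arrays, and mvzip_row / mvfind_row / mvappend_row are hypothetical names):

```rust
use regex::Regex;

// mvzip: pair up to min(len(left), len(right)) elements; null elements
// stringify as "" (mirroring Java's Objects.toString(elem, "")).
fn mvzip_row(left: &[Option<String>], right: &[Option<String>], sep: &str) -> Vec<String> {
    left.iter()
        .zip(right.iter()) // zip() truncates to the shorter side
        .map(|(l, r)| format!("{}{}{}", l.as_deref().unwrap_or(""), sep, r.as_deref().unwrap_or("")))
        .collect()
}

// mvfind: 0-based index of the first non-null element whose stringified form
// matches the regex (Regex::is_match is unanchored, like Java's Matcher.find).
fn mvfind_row(arr: &[Option<String>], pattern: &Regex) -> Option<i32> {
    arr.iter()
        .position(|e| e.as_deref().is_some_and(|s| pattern.is_match(s)))
        .map(|i| i as i32)
}

// mvappend: flatten the operands (uniformly list-shaped after the adapter's
// reshape), dropping null arguments and null elements inside array arguments.
fn mvappend_row(args: &[Vec<Option<String>>]) -> Vec<String> {
    args.iter().flatten().flatten().cloned().collect()
}
```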

Substrait extension catalog

New opensearch_array_functions.yaml declares make_array, array_length, array_slice, array_distinct, array_to_string, array_element, mvzip, mvfind, mvappend. Loaded via SimpleExtension.load("/opensearch_array_functions.yaml") and merged in DataFusionPlugin.loadSubstraitExtensions().

Operand-normalization adapters (5)

  • MakeArrayAdapter — implements ScalarFunctionAdapter directly. PPL's ArrayFunctionImpl infers ARRAY<commonElementType> for the call's return type but does NOT widen individual operand types. So array(1, 1.5) produces a RexCall whose operands are (INTEGER, DECIMAL(2,1)) but whose return type is ARRAY<DOUBLE>. Substrait's variadic-any1 consistency validator throws a fatal AssertionError in that case. The adapter extracts the call's component type and CASTs each non-matching operand.
  • ArrayToStringAdapter — declares a local array_to_string op and name-maps SqlLibraryOperators.ARRAY_JOIN → it.
  • ArraySliceAdapter — two transforms: BIGINT index coercion and (0-based start, length) → (1-based start, 1-based end inclusive); see the sketch after this list. Without the latter, mvindex(arr=[1..5], 1, 3) returns [1, 2, 3] instead of the expected [2, 3, 4].
  • ArrayElementAdapter — renames SqlStdOperatorTable.ITEM → array_element and coerces the index operand to BIGINT.
  • MvappendAdapter — wraps each scalar operand in a singleton make_array(scalar) call so substrait's variadic-any1 sees a uniform list<componentType> shape across all variadic positions.
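
The ArraySliceAdapter's index rewrite, reduced to its arithmetic (a Rust sketch for concreteness — the actual adapter rewrites Calcite RexCall operands in Java; to_datafusion_slice is a hypothetical name):

```rust
// PPL/Calcite ARRAY_SLICE carries (0-based start, length); DataFusion's
// array_slice wants (1-based start, inclusive 1-based end).
fn to_datafusion_slice(start0: i64, length: i64) -> (i64, i64) {
    let start1 = start0 + 1;    // 0-based → 1-based
    let end1 = start0 + length; // == start1 + (length - 1), inclusive
    (start1, end1)
}

fn main() {
    // mvindex(arr=[1..5], 1, 3) lowers to ARRAY_SLICE(arr, 1, 3):
    // start at 0-based position 1, take 3 elements → expected [2, 3, 4].
    assert_eq!(to_datafusion_slice(1, 3), (2, 4)); // array_slice(arr, 2, 4)
}
```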

Sig bridges

  • SqlLibraryOperators.CONCAT_FUNCTION → concat — PPL's concat() is a function-form CONCAT (operator name "CONCAT"), distinct from || (SqlStdOperatorTable.CONCAT), which isthmus' default Sig table already binds to substrait concat. Without this bridge, concat(a, b) flowing into any analytics-engine query (e.g. mvfind(arr, concat('ban', '.*'))) fails substrait conversion with "Unable to convert call CONCAT(string, string)".

RowResponseCodec list support + decimal handling

The row-oriented fragment-execution wire format ships each cell through OpenSearch's writeGenericValue / readGenericValue (preserves List values as ArrayList<Object>), then re-materializes them into a VectorSchemaRoot on the coordinator. Three bugs ate array values:

  1. inferArrowType walked rows for the first non-null cell and matched against {Long, Integer, …, CharSequence, byte[], Number}. List wasn't in the chain → fell through to Utf8 — every array column became VARCHAR.
  2. setVectorValue for VarCharVector called value.toString(). For a JsonStringArrayList that returns the JSON form [2,3,4], which then serialized as a JSON string in the final response.
  3. The Number fallback in scalarArrowType matched BigDecimal (extends Number) before Double/Float — encoding decimal cells as Int(64) and silently truncating fractional digits (see the sketch after this list).
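
Why bug 3 silently corrupts decimals: forcing a fractional cell through a 64-bit integer column truncates toward zero. A plain-Rust illustration of the arithmetic (not the codec itself):

```rust
fn main() {
    let cell = -1.5_f64;        // a BigDecimal-valued cell in the row codec
    let as_int64 = cell as i64; // what the Number → Int(64) fallback produced
    assert_eq!(as_int64, -1);   // array(1, -1.5, 2, 1.0) read back [1, -1, 2, 1]
    // Encoding as FloatingPoint(DOUBLE) instead preserves -1.5 — see the fix below.
}
```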

Fixes:

  • Replace inferArrowType with inferField (returns a full Field, builds List<inner> for list cells).
  • Add a ListVector arm to setVectorValue that writes directly to the list's offset / validity buffers and the inner data vector — bypassing UnionListWriter's tricky per-element ArrowBuf lifecycle.
  • Promote BigDecimal to FloatingPoint(DOUBLE) before the Number fallback — fixes testArray mixed-numeric array(1, -1.5, 2, 1.0) returning [1, -1, 2, 1].

Plus a related Arrow gotcha: ListVector.getObject for a VarCharVector child returns elements typed as org.apache.arrow.vector.util.Text (not String). ExprValueUtils.fromObjectValue rejected those as "unsupported object class". ArrowValues.toJavaValue now normalizes Text → String for list cells.

Other plumbing

  • ArrowSchemaFromCalcite.toArrowField — recurses into the component type for ARRAY → builds the matching List<inner> field.
  • commons-text to analytics-engine — Calcite's SqlFunctions.<clinit> references LevenshteinDistance.
  • jackson-datatype-jsr310 to arrow-flight-rpc — arrow-vector's JsonStringArrayList eagerly registers JavaTimeModule on its ObjectMapper in <clinit>.
  • regex 1.10 Cargo dep on the analytics-backend-datafusion crate — backs the udf::mvfind per-row regex match.

Current Status (empirically verified, post-rebase)

Tested per the Mustang + SQL plugin SOP: :integ-test:integTestRemote against an externally-managed cluster with the full sandbox plugin set (opensearch-job-scheduler, arrow-flight-rpc, analytics-engine, parquet-data-format, composite-engine, analytics-backend-lucene, analytics-backend-datafusion, opensearch-sql-plugin), tests.analytics.parquet_indices=true, tests.analytics.force_routing=true, with companion #5421 (merged) + #5424 applied on top of feature/ppl-coverage-bundle.

CalciteMVAppendFunctionIT (force-routed)

| Status | Before | After |
| --- | --- | --- |
| Passed | 0/15 | 10/15 |
| Failed | 15/15 | 5/15 |

Newly passing: testMvappendWith{MultipleElements, SingleElement, ArrayFlattening, MixedArrayAndScalar, StringValues, NestedArrays, RealFields, NumericArrays, ComplexExpression, IntAndDouble}.

CalciteArrayFunctionIT (force-routed)

| Status | Before | After |
| --- | --- | --- |
| Passed | 1/60 | 43/60 |
| Failed | 59/60 | 17/60 |

Newly passing across this PR's commits:

  • testArray* (4) — array constructors incl. mixed-numeric decimals
  • testMvjoinWith* (8) — mvjoin over arrays + empty array
  • testMvindex* (9) — mvindex single-element + range, with semantic + JSON formatting fixes
  • testMvdedupWith* (5)
  • testSplitWith* (3)
  • testMvzip{Basic, WithCustomDelimiter, Nested} (3)
  • testMvfindWith* (9) — all 9 mvfind variants

What's left (follow-ups)

The 22 remaining failures (5 in MVAppend + 17 in ArrayFunction) split into clear buckets — none are regressions, all are pre-existing limits or out-of-scope work:

| # | Bucket | Tests | Status | Owner / next step |
| --- | --- | --- | --- | --- |
| 1 | Lambda functions (transform, mvmap, reduce, forall, exists, filter) | 16 in CalciteArrayFunctionIT (testTransform*, testMvmap*, testReduce*, testForAll, testExists, testFilter) | Out of scope | Substrait extension YAML doesn't support declaring func<…> lambda-typed arguments. Confirmed: capability registration alone gets past the planner, isthmus emits Expression.Lambda correctly, but no Sig binding exists for transform(list<…>, func<… -> …>). Multi-PR architectural work — needs either a substrait-spec change or per-shape higher-order Rust UDFs. Track separately. |
| 2 | Genuinely heterogeneous mvappend (no common element type) | 3 in CalciteMVAppendFunctionIT: testMvappendWith{MixedTypes, FieldsAndLiterals, Null} | Architectural limit | Calcite legitimately widens to ARRAY<ANY> when leastRestrictive returns null (INT+VARCHAR, etc.). Substrait can't encode ANY; Arrow's Union arrays exist but datafusion-functions-array doesn't operate on them. Resolving requires either changing PPL's mvappend contract from heterogeneous-Object[] to a uniform stringified type (breaks user expectations) or shipping Arrow Union arrays through the wire format (multi-PR). Document as known limit. |
| 3 | Empty-array() operand carried via column ref | 1 in MVAppend (testMvappendWithEmptyArray); also surfaces as 4 in ArrayFunctionIT (testMvjoinWithEmptyArray, testMvdedupWithEmptyArray, testMvzip{WithEmptyArray, WithBothEmptyArrays}, testMvfindWithEmptyArray) — all of those now pass with #5421 merged on the bundle branch | Architectural limit (mvappend) / passes elsewhere | eval empty_arr = array(), result = mvappend(empty_arr, 1, 2) — at MVAppend's type-inference site we see a RexInputRef to a column whose declared type is ARRAY<VARCHAR> (#5421 default), not the literal array(). Per-call detection at type-inference can't reach back through the project chain. Fixing on the analytics-engine route would need either data-flow analysis or changing array()'s element-type default to NULL/UNKNOWN (which substrait can't serialize either). Document. |
| 4 | Filter predicate on ARRAY field | 1 in MVAppend (testMvappendInWhereClause) | Filter-rule capability gap | OpenSearchFilterRule.resolveViableBackends extracts only the predicate's TOP operator (EQUALS) and checks it against the field's storage type (ARRAY); it doesn't walk into nested calls, so array_length(arr) = 2 is rejected as "EQUALS on ARRAY field". Adding ARRAY_LENGTH to STANDARD_FILTER_OPS doesn't help — the rule's coarse-grained check is the actual blocker. Tracked separately as a planner refactor; not specific to mvappend. |
| 5 | testMvdedupPreservesOrder | 1 in ArrayFunctionIT | Likely DataFusion stdlib behavior gap | DataFusion's array_distinct likely doesn't preserve insertion order. Needs a custom Rust UDF or upstream contribution. Track separately. |
| 6 | Self-contained QA IT in sandbox/qa/analytics-engine-rest | — | Open follow-up | Mirror the SQL-plugin ITs for the functions onboarded here so the analytics-engine sister plugins exercise the same surface independently of the SQL plugin's IT runner. |

Companion changes

  • Default empty array() return type to ARRAY<VARCHAR> sql#5421 (merged) — defaults empty array() element type to ARRAY<VARCHAR> in PPL's ArrayFunctionImpl so the empty-call return type is substrait-serializable. Required for testMvjoinWithEmptyArray, testMvdedupWithEmptyArray, and the empty-array mvzip / mvfind variants to pass on the analytics-engine route.

  • Use leastRestrictive for mvappend element-type widening sql#5424 — MVAppendFunctionImpl.updateMostGeneralType widens via RelDataTypeFactory.leastRestrictive (with DECIMAL → DOUBLE promotion) instead of strict Object.equals, plus pre-casts each scalar operand to the call's element Java class in MVAppendImplementor so Avatica's AbstractCursor.ArrayAccessor.DoubleAccessor.getDouble ((Double) value) succeeds. Required for testMvappendWith{MixedArrayAndScalar, ComplexExpression, IntAndDouble} to pass on the analytics-engine route.

Test plan

  • Rebuilt analytics-framework, analytics-backend-datafusion (Java + Rust), analytics-engine, arrow-flight-rpc bundles via ./gradlew publishToMavenLocal -Dsandbox.enabled=true.
  • ./gradlew :sandbox:plugins:analytics-backend-datafusion:test -Dsandbox.enabled=true — green after rebase.
  • cargo test --lib udf::mvzip — 7/7. cargo test --lib udf::mvfind — 7/7. cargo test --lib udf::mvappend — 6/6.
  • Started cluster per Mustang + SQL plugin SOP: full plugin set + -Dtests.jvm.argline="-Djava.library.path=… -Dopensearch.experimental.feature.pluggable.dataformat.enabled=true". Verified all 8 plugins picked up the rebuilt jars and the rebuilt native dylib.
  • :integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteMVAppendFunctionIT" against running cluster with tests.analytics.{parquet_indices,force_routing}=true — 10/15 passing (was 0/15).
  • :integ-test:integTestRemote --tests "org.opensearch.sql.calcite.remote.CalciteArrayFunctionIT" — 43/60 passing (was 1/60).
  • Verified mvindex(arr=[1..5], 1, 3) produces [2,3,4] (was [1,2,3]); negative mvindex(arr, -3, -1) produces [3,4,5].
  • Verified mvjoin(array(), '-') returns "" and mvdedup(array()) returns [] (companion #5421 default).
  • Verified mvzip(['a','b'], ['1','2']) returns ['a,1', 'b,2']; mvfind(['apple','banana','apricot'], 'ban.*') returns 1.
  • Verified array(1, -1.5, 2, 1.0) returns [1.0, -1.5, 2.0, 1.0] (was [1, -1, 2, 1]).
  • Verified mvfind(arr, concat('ban', '.*')) returns 1 (was failing on Unable to convert call CONCAT).
  • Verified mvappend(arr=array(1,2), 3, 4) returns [1, 2, 3, 4] (was failing with Unable to convert call mvappend(list<i32?>, i32?, i32?) until uniform-list reshape).
  • Verified mvappend(1, 2.5) returns [1, 2.5] end-to-end on the analytics-engine route (with #5424 widening + operand pre-cast).
  • Self-contained QA IT in sandbox/qa/analytics-engine-rest — follow-up #6 above.

@ahkcs ahkcs requested a review from a team as a code owner May 7, 2026 22:36
@github-actions (bot) commented May 7, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ae3fdc2.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| plugins/arrow-flight-rpc/build.gradle | 40 | high | New Maven dependency added: com.fasterxml.jackson.datatype:jackson-datatype-jsr310. Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |
| sandbox/plugins/analytics-backend-datafusion/build.gradle | 77 | high | New Maven dependency added: com.fasterxml.jackson.datatype:jackson-datatype-jsr310 (compileOnly). Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |
| sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml | 52 | high | New Rust crate dependency added: regex = "1.10". Per mandatory supply chain policy, all dependency additions must be flagged for maintainer verification regardless of apparent legitimacy. |

The table above displays the top 10 most important findings.

Total: 3 | Critical: 0 | High: 3 | Medium: 0 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@mch2 (Member) commented May 8, 2026

@ahkcs Thanks Kai, can you pls fix this one:

* What went wrong:
Execution failed for task ':sandbox:plugins:analytics-engine:dependencyLicenses'.
> Missing SHA for commons-text-1.11.0.jar. Run "gradle updateSHAs" to create them

looks like we need to add this dependency's license files.

ahkcs added a commit to ahkcs/OpenSearch that referenced this pull request May 8, 2026
The `dependencyLicenses` precommit task scans `licenses/` for a `<jar>.sha1`
sibling per bundled dependency. Two deps added in this PR were missing them:

  * `commons-text-1.11.0` in analytics-engine — needs sha1 + LICENSE +
    NOTICE (no shared `commons-text-*` license files yet in this plugin).
    Apache 2.0; LICENSE and NOTICE extracted from the released jar.
  * `jackson-datatype-jsr310-2.21.3` in arrow-flight-rpc — sha1 only.
    arrow-flight-rpc's `dependencyLicenses` already maps `jackson-.*` to
    the shared `jackson-LICENSE` / `jackson-NOTICE` files via
    `mapping from: /jackson-.*/, to: 'jackson'`, so no new license/notice
    files are needed.

Plus googleJavaFormat reflow on `ArraySliceAdapter` and `DataFusionPlugin`
that spotlessCheck flagged in precommit.

Verified `:plugins:arrow-flight-rpc:precommit`,
`:sandbox:plugins:analytics-engine:precommit`, and
`:sandbox:plugins:analytics-backend-datafusion:precommit` all succeed.

Addresses review feedback on opensearch-project#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs (Contributor, Author) commented May 8, 2026

Fixed in aa2fb0a. Added the SHA + LICENSE + NOTICE for commons-text-1.11.0 in analytics-engine, and the SHA for jackson-datatype-jsr310-2.21.3 in arrow-flight-rpc (the shared jackson-LICENSE / jackson-NOTICE already cover it via the existing /jackson-.*/jackson mapping). Verified all three plugins' :precommit passes locally.

@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch 2 times, most recently from d2f73b7 to fa534e7 on May 8, 2026 06:38
ahkcs added a commit to opensearch-project/sql that referenced this pull request May 8, 2026
* Default empty array() return type to ARRAY<VARCHAR>

PPL's `array()` no-arg form delegates to Calcite's `SqlLibraryOperators.ARRAY`
return-type inference, which returns ARRAY<NULL> for an empty operand list and
ARRAY<UNKNOWN> when all operands are typeless nulls. Both markers are fine for
the v2 engine — `ArrayImplementor.internalCast` only consumes the element type
when there are elements to cast, so an empty result Object list flows straight
through to ExprCollectionValue regardless of declared element type.

The analytics-engine route is stricter. When isthmus walks a RexCall like
`mvjoin(array(), '-')`, it reaches its first operand's type and feeds it to
`io.substrait.isthmus.TypeConverter.toSubstrait`, which throws
`UnsupportedOperationException: Unable to convert the type UNKNOWN`. Substrait
has no on-wire encoding for NULL/UNKNOWN element types, so the planner can't
serialize the call at all. Two PPL ITs hit this directly:

  * `CalciteArrayFunctionIT.testMvjoinWithEmptyArray`
  * `CalciteArrayFunctionIT.testMvdedupWithEmptyArray`

Substituting VARCHAR when the inferred element type is NULL or UNKNOWN gives
the call a substrait-serializable type without affecting any value
computation: the result list is empty either way.

# Test plan

* Unit tests: `:core:test --tests "*ArrayFunction*"` — passes locally (no
  existing tests asserted on the empty-array element type).
* IT: `CalciteArrayFunctionIT` force-routed through the analytics-engine path
  via opensearch-project/OpenSearch#21554's plugin set —
  testMvjoinWithEmptyArray and testMvdedupWithEmptyArray now pass (were UNKNOWN
  type errors); pass-rate moved 26/60 → 28/60.

Companion to opensearch-project/OpenSearch#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Add unit tests for empty/UNKNOWN ARRAY → VARCHAR fallback

Cover the four shapes that exercise the return-type inference path
introduced in 666dc0e:

  * array()                  — 0 operands, fallback fires → ARRAY<VARCHAR>
  * array(NULL)              — typeless-null operand, fallback fires → ARRAY<VARCHAR>
  * array(1)                 — INTEGER operand, fallback does NOT fire → ARRAY<INTEGER>
  * array('a', 'b')          — VARCHAR operands, fallback does NOT fire → ARRAY<VARCHAR>

The third case is the regression guard requested by review — confirms
concrete element types pass through unchanged and the fallback is scoped
strictly to the {@code NULL}/{@code UNKNOWN} markers.

The harness uses Calcite's {@link ExplicitOperatorBinding} bound to
{@link SqlLibraryOperators#ARRAY} so the inference's internal
{@code SqlLibraryOperators.ARRAY.getReturnTypeInference().inferReturnType(...)}
call resolves the same operator the production code delegates to —
mocking {@code SqlOperatorBinding} directly hits NPEs deep inside
Calcite's least-restrictive-type computation.

Addresses review feedback on #5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

* Trim inline comment per review feedback

The "why" lives in the PR description; the inline comment now points
there instead of duplicating it. Addresses dai-chen's review note on
#5421.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs changed the title [Analytics Backend / DataFusion] Wire PPL array constructor + array_length / array_slice / array_distinct / mvjoin [Analytics Backend / DataFusion] Onboard PPL array constructor and 8 multivalue (mv) functions to analytics-engine route May 8, 2026
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from 44d604e to a652406 on May 8, 2026 20:23
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from a652406 to ae3fdc2 on May 8, 2026 23:29
@mch2 mch2 added skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis. skip-diff-reviewer Maintainer to skip code-diff-reviewer check, after reviewing issues in AI analysis. labels May 9, 2026
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for ae3fdc2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions (bot) commented May 9, 2026

❌ Gradle check result for ae3fdc2: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

ahkcs added 9 commits May 9, 2026 10:07
…ength / array_slice / array_distinct / mvjoin

Onboards the PPL `array(a, b, …)` constructor and four array-consuming
functions to the analytics-engine route by mapping their Calcite lowering
targets through Substrait to DataFusion's native make_array / array_length /
array_slice / array_distinct / array_to_string.

Same templated shape as the `replace` PR (opensearch-project#21527), with two extensions:

  ScalarFunction enum constants (5)
    + STANDARD_PROJECT_OPS / ARRAY_RETURNING_PROJECT_OPS membership
    + opensearch_array_functions.yaml extension entries
    + ADDITIONAL_SCALAR_SIGS Calcite-op→Substrait-name bridges
    + scalarFunctionAdapters() entries for the 3 functions that need
      operand normalization
    = onboarded to the analytics route.

Capability lookup at OpenSearchProjectRule keys on the call's return type;
for array-returning functions (`array(...)`, `array_slice`, `array_distinct`)
the return type resolves to `SqlTypeName.ARRAY`, which previously hit
`default → null` in `FieldType.fromSqlTypeName` and emptied the viable-backend
list before the registration could match.

  * `FieldType.ARRAY` added to the analytics SPI enum.
  * `SqlTypeName.ARRAY → FieldType.ARRAY` mapping in `fromSqlTypeName`.
  * `ARRAY_RETURNING_PROJECT_OPS` registered against `Set.of(FieldType.ARRAY)`
    only — separate from `STANDARD_PROJECT_OPS` so `FieldType.ARRAY` doesn't
    pollute filter / aggregate capabilities (no meaningful semantics over
    array-typed values there).
  * `ArrowSchemaFromCalcite.toArrowField` recurses into the component type
    to build the matching Arrow `List<inner>` field — without this the result
    schema would have a bare `List` with no element field and the backend's
    Arrow IPC reader would fail to bind result columns.

Substrait's standard catalog has no array_* entries, so isthmus'
`RexExpressionConverter` would fail with "Unable to convert call …" on every
array call. New `opensearch_array_functions.yaml` declares:

  * `make_array(any1, …)` → `list<any1>` (variadic, min: 0).
  * `array_length(list<any1>)` → `i64?`.
  * `array_slice(list<any1>, i64, i64)` → `list<any1>` (with i32 fallback).
  * `array_distinct(list<any1>)` → `list<any1>`.
  * `array_to_string(list<any1>, string)` → `string?` (with varchar fallback).

Loaded via `SimpleExtension.load("/opensearch_array_functions.yaml")` and
merged into the plugin's extension collection in
`DataFusionPlugin.loadSubstraitExtensions()`.

Substrait's call-conversion path (and DataFusion's signature matcher) is
strict about operand types in ways Calcite's PPL lowering doesn't naturally
satisfy. Three adapters bridge the gap:

  * `MakeArrayAdapter` — implements `ScalarFunctionAdapter` directly
    (not `AbstractNameMappingAdapter`). PPL's `ArrayFunctionImpl` infers
    `ARRAY<commonElementType>` for the call's return type but does NOT
    widen the individual operand types. So `array(1, 1.5)` produces a
    RexCall whose operands are `(INTEGER, DECIMAL(2,1))` but whose return
    type is `ARRAY<DOUBLE>`. Substrait's variadic-`any1` consistency
    validator throws an `AssertionError` in that case (not a recoverable
    exception — it fatally exits the search-thread JVM). The adapter
    extracts the call's component type and CASTs each non-matching
    operand to it before emission.
  * `ArrayToStringAdapter` — declares a local `array_to_string` op and
    name-maps `SqlLibraryOperators.ARRAY_JOIN` → it.
  * `ArraySliceAdapter` — passes the `ARRAY_SLICE` call through unchanged
    but coerces the index operands (positions 1, 2, optional 3) to
    `BIGINT`. PPL's parser types positive integer literals as
    `DECIMAL(20,0)`; DataFusion's `array_slice` signature accepts only
    integer indexes and refuses to coerce decimal arguments.

Two third-party dependencies that surfaced as fatal `NoClassDefFoundError`
during execution of array-returning calls:

  * `commons-text` to analytics-engine — Calcite's `SqlFunctions` class
    statically references `org.apache.commons.text.similarity.LevenshteinDistance`.
    Without it, any Calcite RelNode walk that touches `SqlFunctions.<clinit>`
    poisons the search-thread JVM.
  * `jackson-datatype-jsr310` to **arrow-flight-rpc** (the parent plugin
    that bundles `arrow-vector`). `arrow-vector`'s `JsonStringArrayList`
    eagerly registers `JavaTimeModule` on its ObjectMapper in `<clinit>`,
    so any reader of an Arrow `ListVector` (i.e. every array-returning
    DataFusion call flowing through analytics-engine) hits a fatal
    NoClassDefFoundError. The dep belongs on arrow-flight-rpc's classpath
    because that plugin defines arrow-vector's classloader; bundling it
    in analytics-backend-datafusion (the child plugin) is invisible to
    arrow-vector. Marked `compileOnly` here to avoid jar-hell with
    arrow-flight-rpc's `api` dependency.

  * Before: 1/60 (testArrayWithMix only — exercises an error path that
    fails before the ARRAY capability lookup).
  * After:  9/60.
    Newly passing: testArray, testArrayWithString, testArrayLength,
    testMvjoinWithStringArray, testMvjoinWithStringifiedNumbers,
    testMvjoinWithMixedStringValues, testMvjoinWithStringBooleans,
    testMvjoinWithSpecialDelimiters, testMvjoinWithArrayFromRealFields,
    testMvjoinWithMultipleRealFields.

The remaining 51 failures fall into three buckets:

  * 50 — out-of-scope S1+ functions (`mvfind`, `mvzip`, `reduce`, `transform`,
    `forall`, `filter`, `exists`, `ITEM`). These are PPL UDFs without direct
    DataFusion equivalents and need either lambda-substrait wiring or
    custom UDF registration on the Rust side.
  * 5  — `testMvindexRange*` family. PPL's `mvindex(arr, from, to)` lowers
    to `ARRAY_SLICE(arr, from+1, to+1)` (1-based shift) but the lowering
    is missing the +1, so DataFusion's 1-based array_slice returns a
    window shifted by one. Fix belongs in the SQL plugin's PPL→Calcite
    lowering layer.
  * 1  — `testMvindexRangeMixed` JSON formatting mismatch (test code
    expects bare `[a,b,c]` but the response is `\"[\\\"a\\\",\\\"b\\\",\\\"c\\\"]\"`).

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…th) → 1-based-(start, end) for DataFusion

Calcite's `SqlLibraryOperators.ARRAY_SLICE` is the Spark / Hive flavor —
0-based start, third arg is the length-of-elements to take. PPL's
`MVIndexFunctionImp.resolveRange` (in the SQL plugin) emits this form,
e.g. `mvindex(arr=[1..5], 1, 3)` → `ARRAY_SLICE(arr, 1, 3)` meaning
"start at 0-based position 1, take 3 elements" → expected `[2, 3, 4]`.

DataFusion's native `array_slice` is the Postgres / Snowflake flavor —
1-based start, third arg is the inclusive end-index. So the same call
`array_slice(arr, 1, 3)` returns elements at 1-based positions 1..3 →
`[1, 2, 3]`. Off-by-one across every `mvindex` range query.

Convert the operands in the adapter rather than the SQL plugin's PPL
lowering, because the lowering's existing semantics are correct for
Calcite's local executor (used by every non-analytics path); the bug is
only in the bridge to DataFusion.

  start' = start + 1
  end'   = start + length    (== start + 1 + (length - 1))

`MVIndexFunctionImp` already normalizes negative indexes to non-negative
0-based positions before invoking ARRAY_SLICE (it uses `arrayLen + idx`),
so the arithmetic above applies uniformly.
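
Sketched in Rust (the normalization itself lives in the SQL plugin's MVIndexFunctionImp, not this adapter; the length arithmetic is inferred from the example above):

```rust
// Negative indexes normalize to 0-based positions via arrayLen + idx before
// ARRAY_SLICE is emitted with (0-based start, length = to - from + 1).
fn normalize(idx: i64, array_len: i64) -> i64 {
    if idx < 0 { array_len + idx } else { idx }
}

fn main() {
    // mvindex(arr=[1..5], -3, -1): from → 2, to → 4, length = 3, so
    // ARRAY_SLICE(arr, 2, 3) → array_slice(arr, 3, 5) → [3, 4, 5].
    assert_eq!(normalize(-3, 5), 2);
    assert_eq!(normalize(-1, 5), 4);
}
```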

Empirically: `mvindex(arr=[1..5], 1, 3)` now returns the correct values
`[2, 3, 4]` (was `[1, 2, 3]`); negative form `mvindex(arr, -3, -1)`
returns `[3, 4, 5]` (was `[2, 3]`); mixed `mvindex(arr, -4, 2)` returns
`[2, 3]` matching the PPL spec.

The 5 `testMvindexRange*` tests still don't pass on the IT, but for an
unrelated reason — array-typed result values are being returned as
JSON-stringified scalars (`"[2,3,4]"`) instead of typed arrays. That's a
response-formatting issue affecting every array-returning test (also
`testArray`, `testArrayWithString`) and lives in a different code path;
it'll be addressed separately.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…t) → DataFusion array_element

PPL's `mvindex(arr, N)` single-element form lowers (in `MVIndexFunctionImp.resolveSingleElement`)
to Calcite's `SqlStdOperatorTable.ITEM` operator with a 1-based index (already
converted from PPL's 0-based input). DataFusion's native single-element array
accessor is `array_element` (also 1-based), so a name-mapping adapter + yaml
extension entry are sufficient.

Templated shape:

  ScalarFunction.ITEM (SqlKind.ITEM)
    + STANDARD_PROJECT_OPS membership (returns the array's element type, which
      resolves through the existing FieldType.fromSqlTypeName → SUPPORTED_FIELD_TYPES
      capability lookup for non-array element types — array-of-array is rare in
      PPL and not exercised by the current test surface)
    + scalarFunctionAdapters() entry → ArrayElementAdapter
        ↳ rewrites SqlStdOperatorTable.ITEM to a locally-declared SqlFunction
          named "array_element"
        ↳ coerces the index operand to BIGINT (PPL's parser produces DECIMAL
          for positive integer literals; DataFusion's array_element rejects
          DECIMAL indexes, same as array_slice)
    + ADDITIONAL_SCALAR_SIGS bridge for the locally-declared op
    + opensearch_array_functions.yaml extension entry:
        array_element(list<any1>, i64) → any1?

# Pass-rate (CalciteArrayFunctionIT, force-routed)

  * Before this commit: 9/60.
  * After this commit:  12/60.
    Newly passing: testMvindexSingleElementPositive,
    testMvindexSingleElementNegative,
    testMvindexSingleElementNegativeMiddle.

The other 3 tests that hit the ITEM rejection (testMvfindWith*) are
multi-step queries where ITEM is one node in a tree that also includes
unrelated S1+ functions (mvfind/mvzip/etc.); they remain blocked by
the upstream functions, not by ITEM itself.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ithout JSON-stringifying

The row-oriented fragment-execution wire format (`FragmentExecutionResponse`,
used when arrow-flight streaming is disabled — every single-node test cluster
today) shipped each cell through OpenSearch's `writeGenericValue` /
`readGenericValue`, which preserves `List` values as `ArrayList<Object>`. On
the coordinator side, `RowResponseCodec.decode` then re-materialized the rows
into a `VectorSchemaRoot` for `Iterable<VectorSchemaRoot>`-style consumers.

Two bugs in that re-materialization were eating array values:

1. `inferArrowType` walked rows for the first non-null cell and matched
   against {Long, Integer, …, CharSequence, byte[], Number}. {@code List}
   wasn't in the chain, so it fell through to {@code break} and the
   fallback {@link ArrowType.Utf8} — every array column became a VARCHAR
   column.
2. `setVectorValue` for {@link VarCharVector} called {@code value.toString()}.
   For a {@code JsonStringArrayList} that returns the JSON form
   {@code "[2,3,4]"}, which then got serialized as a JSON string in the
   final response. Tests like {@code testMvindexRangePositive} saw their
   array result come back as a string `"[2,3,4]"` instead of an array
   `[2, 3, 4]`.

Fix:

* Replace {@code inferArrowType} with {@code inferField} that returns a
  full {@link Field}. For {@code List} cells, build a list field with the
  inner element type inferred from the first non-null element (with a
  fallback that scans later rows in case the first list is empty/all-null).
* Add a {@code ListVector} arm to {@code setVectorValue} that delegates to
  a new {@code writeListValue}. The writer bypasses {@link UnionListWriter}
  entirely — it writes directly to the list's offset / validity buffers and
  to the inner data vector via the inner vector's typed `setSafe`. The
  writer-based API requires per-element `ArrowBuf` allocations for varchar
  elements that are easy to leak or use-after-free; the direct path is
  simpler and avoids both classes of bug.

Plus a separate Arrow gotcha that surfaced once arrays started flowing
through correctly:

* {@code ListVector.getObject} for a {@code VarCharVector} child returns a
  {@code JsonStringArrayList} whose elements are Arrow's {@link Text} class,
  not Java {@link String}. {@code ExprValueUtils.fromObjectValue} doesn't
  recognize {@code Text} and threw "unsupported object class
  org.apache.arrow.vector.util.Text". {@code ArrowValues.toJavaValue} now
  mirrors its top-level VarChar branch for list cells: when a list value
  comes back from a {@code ListVector}, normalize each {@code Text} element
  to a {@link String} before handing the list upward.

  * Before: 12/60 (mvindex range tests still showed expected-vs-actual
    diff because `[2,3,4]` came back as a JSON string, not an array).
  * After:  26/60.

  Newly passing:
    testMvindexRangePositive, testMvindexRangeNegative, testMvindexRangeMixed,
    testMvindexRangeFirstThree, testMvindexRangeLastThree,
    testMvindexRangeSingleElement,
    testMvdedupWithDuplicates, testMvdedupWithAllDuplicates,
    testMvdedupWithNoDuplicates, testMvdedupWithStrings,
    testArrayWithString,
    testSplitWithSemicolonDelimiter, testSplitWithMultiCharDelimiter,
    testSplitWithEmptyDelimiter.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
The `dependencyLicenses` precommit task scans `licenses/` for a `<jar>.sha1`
sibling per bundled dependency. Two deps added in this PR were missing them:

  * `commons-text-1.11.0` in analytics-engine — needs sha1 + LICENSE +
    NOTICE (no shared `commons-text-*` license files yet in this plugin).
    Apache 2.0; LICENSE and NOTICE extracted from the released jar.
  * `jackson-datatype-jsr310-2.21.3` in arrow-flight-rpc — sha1 only.
    arrow-flight-rpc's `dependencyLicenses` already maps `jackson-.*` to
    the shared `jackson-LICENSE` / `jackson-NOTICE` files via
    `mapping from: /jackson-.*/, to: 'jackson'`, so no new license/notice
    files are needed.

Plus googleJavaFormat reflow on `ArraySliceAdapter` and `DataFusionPlugin`
that spotlessCheck flagged in precommit.

Verified `:plugins:arrow-flight-rpc:precommit`,
`:sandbox:plugins:analytics-engine:precommit`, and
`:sandbox:plugins:analytics-backend-datafusion:precommit` all succeed.

Addresses review feedback on opensearch-project#21554.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… inference

{@code RowResponseCodec.scalarArrowType} ordered its instanceof checks
{Long, Integer, Short, Byte, Double, Float, Boolean, CharSequence, byte[],
Number(fallback) → Int(64)}. BigDecimal extends {@link Number} but isn't any
of the typed scalar arms, so it fell through to the {@code Number} fallback
and got encoded as a 64-bit integer column — silently truncating fractional
digits.

This bites PPL flows whose common element type is {@code DECIMAL} (e.g.
{@code array(1, -1.5, 2, 1.0)} — the v2-side {@code ArrayImplementor.internalCast}
explicitly maps the DECIMAL target to BigDecimal cells). The element values
{@code -1.5} and {@code 1.0} round to {@code -1} and {@code 1} when forced
through Int(64), so the array reads back as {@code [1, -1, 2, 1]} instead of
{@code [1, -1.5, 2, 1.0]}.

Promote BigDecimal cells to FloatingPoint(DOUBLE) — same precision the v2
engine uses for decimal-typed PPL results, so behavior matches across both
execution paths. The list writer's {@code Float8Vector} arm already uses
{@code ((Number) element).doubleValue()}, which correctly extracts the
fractional value from a BigDecimal.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
PPL `mvzip(left, right [, sep])` element-wise zips two arrays into a list of
strings, joined per pair by a separator (default `,`). DataFusion has no
stdlib equivalent — `array_concat` is end-to-end concatenation, and Substrait's
lambda support is too thin for a transform/zip rewrite — so this onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context and wires the Java side to route to it.

Templated shape (extends the existing pattern from convert_tz):

  Rust side:
    udf::mvzip::MvzipUdf — Signature::user_defined; coerce_types pins the
      first two args to ListArray and the optional 3rd to Utf8; invoke_with_args
      iterates per row, takes min(len(left), len(right)) elements, stringifies
      each (matching `Objects.toString(elem, "")` for null elements), and
      builds a List<Utf8>. Defensive Null-element-type arm handles the empty
      array case before the SQL-plugin VARCHAR-default kicks in.
    Registered on each session context via udf::register_all alongside
    convert_tz. 7 unit tests cover the basic / custom-sep / truncation /
    null-element / null-array / empty-array / numeric-array shapes.

  Java side:
    ScalarFunction.MVZIP enum entry (SqlKind.OTHER_FUNCTION; resolves through
      identifier-name valueOf("MVZIP") since PPL's MVZipFunctionImpl registers
      under the function name "mvzip").
    MvzipAdapter — locally-declared SqlFunction("mvzip") + ADDITIONAL_SCALAR_SIGS
      bridge so isthmus emits a Substrait scalar function call with the exact
      name the Rust UDF is registered under.
    DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
      (returns ARRAY<VARCHAR>, registered against FieldType.ARRAY); adapter
      registration in scalarFunctionAdapters().
    opensearch_array_functions.yaml: two impls for arity-2 and arity-3.

  * Before: 28/60.
  * After:  34/60.

  Newly passing — all 5 testMvzip* variants:
    testMvzipBasic, testMvzipWithCustomDelimiter, testMvzipNested,
    testMvzipWithEmptyArray, testMvzipWithBothEmptyArrays.

  (Test count delta is +6 because the test class also exercises mvzip in 1
  other test under a different name, picked up by the same fix.)

This PR's run also picks up the SQL-plugin companion #5421 which defaults
empty `array()` to ARRAY<VARCHAR>. Without that companion the testMvzipWith*EmptyArray
variants would still fail — substrait would reject the input ARRAY<NULL>
type before reaching the UDF. The Rust UDF's Null-element arm exists as a
defensive backstop in case the call ever reaches it with a null-typed list.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
PPL `mvfind(arr, regex)` finds the 0-based index of the first array element
matching a regex pattern (Java `Matcher.find` substring-match semantics), or
NULL if no match. DataFusion has no stdlib equivalent, and rewriting in terms
of array_position requires per-element regex evaluation that's only
expressible with substrait lambda support — out of scope here. Onboards a
custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session
context, mirroring the mvzip/convert_tz pattern.

Templated shape:

  Rust side:
    udf::mvfind::MvfindUdf — Signature::user_defined; coerce_types pins arg 0
      to a list type and arg 1 to Utf8; invoke_with_args walks each row and
      finds the first non-null element whose stringified form matches the
      regex via Rust's `regex` crate (`Regex::is_match` is unanchored, same
      as Java's `Matcher.find`). Scalar pattern operands compile once up
      front and surface invalid-regex errors at plan time (mirrors the SQL
      plugin's plan-time `tryCompileLiteralPattern`); column-valued patterns
      compile per row and yield NULL for invalid patterns. Supports list
      element types Utf8 / Int{8,16,32,64} / UInt{8,16,32,64} / Float{32,64}
      / Boolean / Null. 7 unit tests cover the basic-match / no-match /
      null-array / empty-array / null-element / numeric-array / unanchored
      shapes.
    Registered on each session context via udf::register_all alongside
    convert_tz and mvzip.

  Java side:
    ScalarFunction.MVFIND enum entry (SqlKind.OTHER_FUNCTION; resolves
      through identifier-name valueOf("MVFIND") since PPL's
      MVFindFunctionImpl registers under the function name "mvfind").
    MvfindAdapter — locally-declared SqlFunction("mvfind") +
      ADDITIONAL_SCALAR_SIGS bridge so isthmus emits a Substrait scalar
      function call with the exact name the Rust UDF is registered under.
    DataFusionAnalyticsBackendPlugin: STANDARD_PROJECT_OPS membership
      (returns INTEGER, registered against the existing scalar
      SUPPORTED_FIELD_TYPES); adapter registration in
      scalarFunctionAdapters().
    opensearch_array_functions.yaml: arity-2 impl returning `i32?`.

  * Before: 34/60.
  * After:  42/60.

  Newly passing — 8 of 9 testMvfind* variants:
    testMvfindWithMatch, testMvfindWithFirstMatch, testMvfindWithMultipleMatches,
    testMvfindWithNoMatch, testMvfindWithEmptyArray, testMvfindWithNumericArray,
    testMvfindWithCaseInsensitive, testMvfindWithComplexRegex.

  Remaining mvfind failure:
    testMvfindWithDynamicRegex — fails with "Unable to convert call
    CONCAT(string, string)" because the test computes the pattern via
    `concat('ban', '.*')` and substrait can't bind the CONCAT call. This is a
    separate analytics-engine CONCAT type-conversion issue, not mvfind-specific.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… UDF

PPL `mvappend(arg1, arg2, …)` flattens a mixed list of array and scalar
arguments into one array, dropping null arguments and null elements within
array arguments. DataFusion's `array_concat` is the closest stdlib match but
only accepts arrays (not mixed array+scalar) and preserves nulls — different
semantics. Onboards as a custom Rust ScalarUDF on the analytics-backend-datafusion
plugin's session context, mirroring the mvzip / mvfind pattern.

Templated shape:

  Rust side:
    udf::mvappend::MvappendUdf — Signature::user_defined; per-row walk over
      operands, skipping NULL args and NULL elements inside array args, with
      explicit Arrow type arms for {Int8/16/32/64, UInt8/16/32/64,
      Float32/64, Boolean, Utf8/LargeUtf8/Utf8View}. The string arms output
      List<Utf8> or List<Utf8View> depending on the inferred element type so
      the result schema matches what `return_type` declared (DataFusion's
      execution-time schema check rejects mismatches). Defensive Null
      element-type arm covers the empty-array shape. 6 unit tests.
    Registered on each session context via udf::register_all.

  Java side:
    ScalarFunction.MVAPPEND enum entry (SqlKind.OTHER_FUNCTION; resolves
      through identifier-name valueOf("MVAPPEND")).
    MvappendAdapter — locally-declared SqlFunction("mvappend") +
      ADDITIONAL_SCALAR_SIGS bridge. Casts every scalar operand to the
      call's array component type and every array operand to
      ARRAY<componentType> before substrait emission, so the UDF sees a
      single uniform element type across all positions.
    DataFusionAnalyticsBackendPlugin: ARRAY_RETURNING_PROJECT_OPS membership
      (returns ARRAY<commonType>); adapter registration in
      scalarFunctionAdapters().
    opensearch_array_functions.yaml: variadic min:1 entry with `list<any1?>`
      return type.

  * Before: 0/15.
  * After:  6/15.

  Newly passing:
    testMvappendWithMultipleElements, testMvappendWithSingleElement,
    testMvappendWithArrayFlattening, testMvappendWithStringValues,
    testMvappendWithNestedArrays, testMvappendWithRealFields.

  * 8 tests fail with "Unable to convert the type ANY". Root cause is
    PPL's MVAppendFunctionImpl.updateMostGeneralType using strict
    Object.equals on each pair of operand types, returning Calcite's
    ANY type when any two don't match — including when they only differ
    in nullability tag (a literal 3 is INTEGER NOT NULL but the
    component type of `array(1, 2)` is INTEGER NULLABLE). Substrait
    can't serialize ANY. The fix belongs in the SQL plugin's
    MVAppendFunctionImpl (use typeFactory.leastRestrictive instead of
    Object.equals) and isn't addressed here.
  * testMvappendInWhereClause — uses `where array_length(combined) = 2`
    which the analytics-engine planner rejects with "No backend can
    evaluate filter predicate [EQUALS] on fields [combined:ARRAY]".
    Filter-side capability gap unrelated to mvappend.
  * testMvappendWithComplexExpression — fails substrait conversion on
    a nested mvappend call ("Unable to convert call mvappend(list, …)"),
    likely the same nullability widening pattern flowing through nested
    calls. Same upstream fix applies.

  Unchanged at 43/60 — mvappend isn't exercised there.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
ahkcs added 2 commits May 9, 2026 10:11
… lists; add Decimal128 element support

Two follow-ons to the initial mvappend onboarding (40b2161), both surfaced
once the SQL companion opensearch-project#5424 (`MVAppendFunctionImpl.leastRestrictive`) let
homogeneous-type calls reach substrait conversion.

# Uniform-list operand reshape

Substrait's variadic-`any1` argument shape requires every operand at the same
variadic position to share a type. PPL's `mvappend(arg, …)` accepts a mix of
bare scalars and arrays, which substrait's signature matcher rejected with
`Unable to convert call mvappend(list<i32?>, i32?, i32?)`.

`MvappendAdapter` now wraps each scalar operand in a singleton
`make_array(scalar)` call (using the locally-declared `MakeArrayAdapter.LOCAL_MAKE_ARRAY_OP`)
so by the time the substrait converter sees the operands they're uniformly
`list<componentType>`. The yaml impl was correspondingly tightened from
`args: [{ value: any1 }] variadic` to `args: [{ value: list<any1?> }] variadic`.

Rust UDF (`udf::mvappend`) keeps its scalar-handling branch intact as a
defensive fallback, but in practice every operand it sees is a list now.
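
Shape-wise, the reshape does the following (plain-Rust sketch; the real rewrite happens over Calcite RexCall operands in the Java adapter, and Operand / reshape are hypothetical names):

```rust
// After the rewrite, every variadic position carries the same list<T> shape,
// which is what substrait's variadic-any1 consistency check requires.
enum Operand {
    Scalar(Option<i64>),
    List(Vec<Option<i64>>),
}

fn reshape(ops: Vec<Operand>) -> Vec<Vec<Option<i64>>> {
    ops.into_iter()
        .map(|op| match op {
            Operand::Scalar(v) => vec![v], // wrap: scalar → make_array(scalar)
            Operand::List(l) => l,         // lists pass through unchanged
        })
        .collect()
}
```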

# Decimal128 element type

Calcite's leastRestrictive widening on INT + DECIMAL produces DECIMAL(p, s)
which substrait converts to Decimal128(p, s); the Java adapter casts every
operand's element type to that. The Rust UDF needed an explicit
`DataType::Decimal128(p, s)` branch — Decimal128Builder requires
`.with_precision_and_scale(p, s)` configuration before use, and Decimal128Array
elements are read via the `i128`-valued `value(i)` accessor (not via the
generic `build!` macro).
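
A minimal arrow-rs sketch of those two quirks (hedged — builder APIs vary across arrow versions):

```rust
use arrow::array::{Array, Decimal128Array, Decimal128Builder};

fn main() {
    // The builder must be configured with (precision, scale) up front, or the
    // produced array's DataType won't match what return_type declared.
    let mut builder = Decimal128Builder::new()
        .with_precision_and_scale(10, 2)
        .expect("valid (p, s)");
    builder.append_value(150_i128); // 1.50 at scale 2
    builder.append_null();
    let arr: Decimal128Array = builder.finish();

    // Elements read back as raw i128 at the declared scale.
    assert_eq!(arr.value(0), 150);
    assert!(arr.is_null(1));
}
```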

# Pass-rate (CalciteMVAppendFunctionIT, force-routed, with companion opensearch-project#5424 applied)

  * Before this commit: 6/15 (initial mvappend onboarding).
  * After this commit:  10/15.

  Newly passing:
    testMvappendWithMixedArrayAndScalar (uniform-list reshape),
    testMvappendWithComplexExpression (uniform-list reshape),
    testMvappendWithIntAndDouble (Decimal128 element),
    testMvappendWithNumericArrays (Decimal128 element).

  Remaining 5 failures:
    * testMvappendWithMixedTypes / WithFieldsAndLiterals / WithEmptyArray /
      WithNull — call legitimately widens to ARRAY<ANY> because operands
      contain pairs of types with no common widened type (INT + VARCHAR).
      The Calcite engine handles ANY via Object generic dispatch; substrait
      can't encode it. Out of scope without changing PPL UDF semantics.
    * testMvappendInWhereClause — uses `where array_length(combined) = 2`
      which the analytics-engine planner rejects with "No backend can
      evaluate filter predicate [EQUALS] on fields [combined:ARRAY]".
      Filter-side capability gap unrelated to mvappend.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
… contexts

create_session_context (the Rust-side builder behind df_create_session_context)
built a fresh DataFusion SessionContext but never called udf::register_all on
it. Every fragment query routed through df_execute_with_context reused that
handle's ctx via query_executor::execute_with_context, so substrait function
references to mvappend / mvfind / mvzip / convert_tz failed planning with
"This feature is not implemented: Unsupported function name". The matching
register_all call exists in execute_query / local_executor / indexed_executor
— this just brings the FFM session-context path to parity.

Verified: CalciteMVAppendFunctionIT against the analytics-engine route now
passes 10/15 (was 0/15) with the SQL companion opensearch-project#5424 widening fix applied.
The remaining 5 are pre-existing ARRAY<ANY>/UNKNOWN substrait-encoding gaps
(heterogeneous mvappend signatures, empty-array default, filter-on-array
predicate) tracked in this PR's "What's left" section.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from ae3fdc2 to 459bffc on May 9, 2026 17:12
ahkcs added a commit to ahkcs/sql that referenced this pull request May 9, 2026
`MVAppendFunctionImpl.updateMostGeneralType` used strict {@code Object.equals}
to compare each operand's component type against the running "most general"
type, falling back to Calcite's {@code ANY} on any mismatch. That's too
aggressive: {@code Object.equals} returns false for type pairs that differ
only in nullability tag (e.g. {@code array(1, 2)} synthesizes INTEGER NULLABLE
for its component while literal {@code 3} is INTEGER NOT NULL), and for
straightforwardly-widenable numerics like INTEGER + DOUBLE. The PPL UDF result
would then be {@code ARRAY<ANY>}.

The Calcite engine's enumerable runtime tolerates {@code ANY} because
{@code MVAppendImplementor.eval} processes elements through {@code Object} —
the declared element type is unused at execution time. The analytics-engine
route is stricter: substrait can't serialize {@code ANY}, so isthmus throws
{@code UnsupportedOperationException: Unable to convert the type ANY} during
the substrait conversion phase.

Widen with {@link RelDataTypeFactory#leastRestrictive} — the same routine
{@code SqlLibraryOperators.ARRAY} uses for its return-type inference. Falls
back to ANY only when {@code leastRestrictive} returns null (genuinely
incompatible operand types like INT + VARCHAR), preserving the original
behavior on those queries.

# Test plan

* {@code :core:test --tests "*MVAppend*"} — passes (no existing test asserted
  on the {@code ANY} fallback).
* Companion to opensearch-project/OpenSearch#21554 — unblocks 8+ tests in
  {@code CalciteMVAppendFunctionIT} force-routed through the analytics-engine
  path that previously failed with "Unable to convert the type ANY".

Signed-off-by: Kai Huang <ahkcs@amazon.com>
…ill the cluster

Substrait's plan validators (VariadicParameterConsistencyValidator,
RelOptUtil.eq via Litmus.THROW, etc.) throw AssertionError directly via
explicit `throw new AssertionError(...)` rather than via the `assert`
keyword, so the JVM -da flag doesn't gate them. When a malformed plan
triggers one inside a search-thread call to SubstraitRelVisitor.apply,
the AssertionError propagates uncaught up the analytics-engine fragment
handler stack, OpenSearchUncaughtExceptionHandler classifies it as fatal,
and the entire cluster JVM exits.

Wrap the visitor.apply call in a narrow try/catch that re-raises the
AssertionError as IllegalStateException with the original message and
cause preserved. The analytics-engine error path already buckets
IllegalStateException at the fragment boundary into a normal HTTP 500
response — the cluster stays up and the failure shows in the per-query
report instead.

This came up while diagnosing CalciteMVAppendFunctionIT failures: malformed
ARRAY<ANY> plans were taking down the cluster mid-test instead of producing
per-test failures, masking the underlying substrait conversion error.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the feature/mustang-array-functions branch from 459bffc to 1a0f571 on May 9, 2026 17:55
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for 1a0f571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


…ST path

Self-contained QA ITs in sandbox/qa/analytics-engine-rest exercising the
PPL collection functions onboarded in this PR through POST /_analytics/ppl
against a parquet-backed `calcs` dataset, no SQL plugin checkout required.

ArrayFunctionIT (22 tests):
  - array constructor (mixed-numeric BigDecimal → Double promotion + int+string)
  - array_length
  - mvindex range (array_slice — 0-based-(start, length) → 1-based-(start, end))
  - mvindex single (array_element via ITEM rename)
  - mvdedup (array_distinct)
  - mvjoin (array_to_string rename)
  - mvzip (Rust UDF, default + custom delimiter + nested)
  - mvfind (Rust UDF, match / no-match / dynamic regex via concat() Sig bridge)
  - split (returns array)

MVAppendFunctionIT (6 tests):
  - uniform-typed scalar variadic (multiple, single, string)
  - array operands (flattening, nested string arrays)
  - VARCHAR field references via real calcs row

Tests gated on SQL companion opensearch-project#5424 (testMvappendWith{IntAndDouble,
MixedArrayAndScalar, NumericArrays, ComplexExpression}) are intentionally
absent — they fail with "Unable to convert the type ANY" until
MVAppendFunctionImpl's leastRestrictive widening + DECIMAL→DOUBLE
promotion + operand pre-cast is published as unified-query-core. A
top-of-class block lists them with a pointer back to opensearch-project#5424.

Lambda-based functions (transform, mvmap, reduce, forall, exists, filter)
and empty-array operands are absent for the architectural reasons in this
PR's "What's left" section: substrait extension YAML doesn't support
declaring func<…> lambda-typed args, and array() defaults to ARRAY<UNKNOWN>
which substrait can't encode without #5421.

Local verification (per `docs/dev/ppl-analytics-engine-routing.md` SOP):
- :sandbox:qa:analytics-engine-rest:integTest --tests "*ArrayFunctionIT" — 22/22
- :sandbox:qa:analytics-engine-rest:integTest --tests "*MVAppendFunctionIT" — 6/6
- :check -p sandbox — all 718 tasks green

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@github-actions (bot) commented May 9, 2026

❌ Gradle check result for cf8ee09: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions (bot) commented May 9, 2026

❕ Gradle check result for cf8ee09: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov (bot) commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.38%. Comparing base (8f72a95) to head (cf8ee09).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21554      +/-   ##
============================================
- Coverage     73.48%   73.38%   -0.11%     
+ Complexity    74646    74544     -102     
============================================
  Files          5980     5980              
  Lines        338777   338777              
  Branches      48848    48848              
============================================
- Hits         248964   248608     -356     
- Misses        70026    70362     +336     
- Partials      19787    19807      +20     


@mch2 mch2 merged commit 4dea6c4 into opensearch-project:main May 9, 2026
27 of 29 checks passed
