[GLUTEN-8629][VL] Add RDDScanExec support to Velox backend #12077
New file (+121 lines):

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.gluten.execution

import org.apache.gluten.backendsapi.velox.VeloxValidatorApi
import org.apache.gluten.config.{GlutenConfig, VeloxConfig}

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.{RDDScanTransformer, SparkPlan}
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.sql.vectorized.ColumnarBatch

/**
 * Velox-backend implementation of RDDScanTransformer.
 *
 * Converts an RDD[InternalRow] into columnar batches using Velox's native row-to-columnar
 * conversion (the same JNI path as RowToVeloxColumnarExec).
 */
case class VeloxRDDScanTransformer(
    outputAttributes: Seq[Attribute],
    rdd: RDD[InternalRow],
    name: String,
    // Row-to-columnar conversion preserves data distribution, so we carry through
    // the original partitioning. This differs from CH, which uses UnknownPartitioning(0),
    // but is consistent with RowToVeloxColumnarExec's behavior.
    override val outputPartitioning: Partitioning,
    override val outputOrdering: Seq[SortOrder]
) extends RDDScanTransformer(outputAttributes, outputPartitioning, outputOrdering) {
```
Contributor: PR description contradicts validation logic for complex types.

Problem: The PR description states that validation "rejects complex types (ARRAY, MAP, STRUCT)", but the validation code accepts them.

Evidence:

```scala
case _: org.apache.spark.sql.types.ArrayType =>
case _: org.apache.spark.sql.types.MapType =>
case _: org.apache.spark.sql.types.StructType =>
```

These cases fall through to the accepting branch, so complex types pass validation.

Suggested fix: update the PR description to remove the claim that complex types are rejected.

Author: Good catch; updated the PR description. It now correctly states that complex types (Array, Map, Struct) are supported via the UnsafeRowFast::deserialize path, and only truly unsupported types trigger fallback.
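Since RDDScanExec is the node planned for a DataFrame built directly from an RDD, a quick way to exercise this path with nested types is a sketch like the following. This is illustrative, not from the PR; it assumes a Gluten-enabled Spark session with the Velox backend.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Nested types that, per the discussion above, should pass validation
// and be offloaded rather than trigger fallback.
val schema = StructType(Seq(
  StructField("tags", ArrayType(StringType)),
  StructField("attrs", MapType(StringType, IntegerType)),
  StructField("point", StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType))))))

val rows = spark.sparkContext.parallelize(
  Seq(Row(Seq("a", "b"), Map("k" -> 1), Row(1.0, 2.0))))

// createDataFrame over an RDD[Row] plans the scan as RDDScanExec,
// which this PR replaces with VeloxRDDScanTransformer.
spark.createDataFrame(rows, schema).show()
```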
```scala
  override def nodeName: String = name

  @transient override lazy val metrics: Map[String, SQLMetric] = Map(
    "numInputRows" -> SQLMetrics.createMetric(sparkContext, "number of input rows"),
    "numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
    "convertTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to convert")
  )

  override protected def doValidateInternal(): ValidationResult = {
    for (field <- schema.fields) {
      val reason = VeloxValidatorApi.validateSchema(field.dataType)
      if (reason.isDefined) {
        return ValidationResult.failed(reason.get)
      }
    }
    ValidationResult.succeeded
  }
```
Contributor: Metrics gap in BatchCarrierRow unwrap path.

Problem: When the RDD contains BatchCarrierRow instances, the unwrap branch emits batches without updating the node's metrics, so the Spark UI reports zero rows and batches for checkpointed data.

Evidence:

```scala
case _: BatchCarrierRow =>
  // No metrics updated here
  (Iterator.single(first) ++ iter).flatMap(row => BatchCarrierRow.unwrap(row))
```

Suggested fix:

```scala
case _: BatchCarrierRow =>
  (Iterator.single(first) ++ iter).flatMap { row =>
    BatchCarrierRow.unwrap(row).map { batch =>
      numOutputBatches += 1
      numInputRows += batch.numRows()
      batch
    }
  }
```

Author: Updated the BatchCarrierRow unwrap path to increment numOutputBatches and numInputRows per batch, so the Spark UI now shows correct metrics for checkpointed data. convertTime is intentionally omitted since no row-to-columnar conversion happens in this path.
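For completeness, here is a hedged sketch of how the fixed metrics could be observed from a test. It assumes a Gluten-enabled session, a checkpoint directory, and that the scan is actually offloaded; the expected counts are illustrative, not from the PR.

```scala
// Assumed setup: Gluten-enabled session, placeholder checkpoint path.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

val df = spark.range(100).checkpoint() // carrier-row scan, per the thread above
df.collect()

// Read back the node's metrics after execution; with the fix, both
// counters should be non-zero on the unwrap path.
df.queryExecution.executedPlan.collect {
  case scan: VeloxRDDScanTransformer => scan.metrics
}.headOption.foreach { m =>
  assert(m("numInputRows").value == 100)
  assert(m("numOutputBatches").value >= 1)
}
```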
```scala
  override def doExecuteColumnar(): RDD[ColumnarBatch] = {
```
Member: Will this handle rows coming from a checkpointed Gluten plan? Those would be BatchCarrierRow instances rather than plain InternalRows.

Author: Great catch; this is a real bug. If the upstream RDD was produced by a Gluten plan ending in VeloxColumnarToCarrierRowExec (e.g., via df.checkpoint()), the rows would be BatchCarrierRow instances and UnsafeProjection.apply() would throw. Fixed by peeking at the first row and branching: carrier rows are unwrapped directly via BatchCarrierRow.unwrap(), skipping row-to-columnar conversion entirely. This mirrors the CH pattern.
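A minimal reproduction sketch of the scenario described in the reply; the path and data are placeholders, and a Gluten session with the Velox backend is assumed.

```scala
import spark.implicits._

// checkpoint() materializes the plan; on a Gluten plan ending in
// VeloxColumnarToCarrierRowExec, the checkpointed RDD[InternalRow] holds
// BatchCarrierRow instances, which UnsafeProjection cannot consume.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // placeholder path
val checkpointed = Seq(1, 2, 3).toDF("id").checkpoint()

// The scan over the checkpointed data is planned as RDDScanExec; before
// the fix, offloading it would throw in the row-to-columnar path.
checkpointed.filter($"id" > 1).show()
```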
```scala
    val numInputRows = longMetric("numInputRows")
    val numOutputBatches = longMetric("numOutputBatches")
    val convertTime = longMetric("convertTime")
    val localSchema = this.schema
    val batchSize = GlutenConfig.get.maxBatchSize
    val batchBytes = VeloxConfig.get.veloxPreferredBatchBytes
    rdd.mapPartitions {
      iter =>
        if (iter.hasNext) {
          val first = iter.next()
          first match {
            case _: BatchCarrierRow =>
              // RDD already contains columnar batches wrapped as carrier rows
              // (e.g., from df.checkpoint() on a Gluten plan). Unwrap directly.
              (Iterator.single(first) ++ iter).flatMap {
                row =>
                  BatchCarrierRow.unwrap(row).map {
                    batch =>
                      numOutputBatches += 1
                      numInputRows += batch.numRows()
                      batch
                  }
              }
            case _ =>
              // Standard InternalRow path - convert via native row-to-columnar.
              RowToVeloxColumnarExec.toColumnarBatchIterator(
                Iterator.single(first) ++ iter,
                localSchema,
                numInputRows,
                numOutputBatches,
                convertTime,
                batchSize,
                batchBytes)
          }
        } else {
          Iterator.empty
        }
    }
  }

  override protected def withNewChildrenInternal(
      newChildren: IndexedSeq[SparkPlan]): SparkPlan = {
    assert(newChildren.isEmpty, "VeloxRDDScanTransformer is a leaf node")
    copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering)
  }
}

object VeloxRDDScanTransformer {
  def replace(plan: org.apache.spark.sql.execution.RDDScanExec): RDDScanTransformer =
```
Member: CH uses UnknownPartitioning(0) here. Is it safe to carry through the original partitioning instead?

Author: Valid concern. Row-to-columnar conversion doesn't change data distribution; it converts row format within each partition, preserving the partition layout. This is consistent with RowToVeloxColumnarExec, which also carries through the child's outputPartitioning. Added an inline comment explaining the rationale and the difference from CH's approach.
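To make the trade-off concrete, a hypothetical illustration (not from the PR) of where preserved partitioning pays off, assuming a Gluten session with a checkpoint directory set.

```scala
import spark.implicits._

// Data hash-partitioned on the key before checkpointing; the checkpointed
// scan reports that partitioning via outputPartitioning.
val partitioned = spark.range(1000).repartition($"id").checkpoint()

// With the original partitioning carried through, this aggregation needs
// no new Exchange; UnknownPartitioning(0) would force a reshuffle.
partitioned.groupBy($"id").count().collect()
```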
```scala
    VeloxRDDScanTransformer(
      plan.output,
      plan.inputRDD,
      plan.nodeName,
      plan.outputPartitioning,
      plan.outputOrdering)
}
```
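For context, a sketch of how the companion's replace method might be invoked from a plan-rewrite rule. The rule below is hypothetical, not Gluten's actual offload rule, and assumes Gluten's doValidate()/ok() validation API.

```scala
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{RDDScanExec, SparkPlan}

// Hypothetical rewrite rule: swap RDDScanExec for the transformer when
// validation passes, otherwise keep the vanilla row-based scan.
case class OffloadRDDScan() extends Rule[SparkPlan] {
  override def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    case scan: RDDScanExec =>
      val transformer = VeloxRDDScanTransformer.replace(scan)
      if (transformer.doValidate().ok()) transformer else scan
  }
}
```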
Contributor: Validation does not recurse into complex type element types.

Problem: The type allowlist checks top-level types only. An ArrayType(UnsupportedType) or MapType(StringType, UnsupportedType) would pass validation but could fail at native execution time. The CH backend avoids this by delegating to ConverterUtils.getTypeNode(), which validates recursively.

Suggested fix: delegate to VeloxValidatorApi for centralized type validation.

Author: Thanks, this is a great point. Replaced the manual allowlist with VeloxValidatorApi.validateSchema, which handles recursive validation for complex type elements and also catches variant shredded structs. This keeps validation logic centralized.
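To illustrate the recursion the reviewer asked for, here is a standalone sketch; it is not the PR's code, and isSupportedLeaf stands in for the real per-type allowlist.

```scala
import org.apache.spark.sql.types._

// Recursive descent: a type is valid only if every nested element, key,
// value, and struct field type is valid. Returns the first failure reason.
def validateType(dt: DataType, isSupportedLeaf: DataType => Boolean): Option[String] =
  dt match {
    case ArrayType(elementType, _) =>
      validateType(elementType, isSupportedLeaf)
    case MapType(keyType, valueType, _) =>
      validateType(keyType, isSupportedLeaf)
        .orElse(validateType(valueType, isSupportedLeaf))
    case StructType(fields) =>
      fields.iterator
        .map(f => validateType(f.dataType, isSupportedLeaf))
        .collectFirst { case Some(reason) => reason }
    case leaf if isSupportedLeaf(leaf) => None
    case leaf => Some(s"Unsupported type: $leaf")
  }
```

With this shape, an ArrayType wrapping an unsupported element is rejected with the same reason as the bare element type, which matches the recursive behavior the author attributes to VeloxValidatorApi.validateSchema.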