@@ -1205,6 +1205,11 @@ class VeloxSparkPlanExecApi extends SparkPlanExecApi with Logging {
   override def genColumnarRangeExec(rangeExec: RangeExec): ColumnarRangeBaseExec =
     ColumnarRangeExec(rangeExec.range)
 
+  override def isSupportRDDScanExec(plan: RDDScanExec): Boolean = true
+
+  override def getRDDScanTransform(plan: RDDScanExec): RDDScanTransformer =
+    VeloxRDDScanTransformer.replace(plan)
+
   override def genColumnarTailExec(limit: Int, child: SparkPlan): ColumnarCollectTailBaseExec =
     ColumnarCollectTailExec(limit, child)

@@ -0,0 +1,121 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.gluten.execution

import org.apache.gluten.backendsapi.velox.VeloxValidatorApi
import org.apache.gluten.config.{GlutenConfig, VeloxConfig}

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, SortOrder}
import org.apache.spark.sql.catalyst.plans.physical.Partitioning
import org.apache.spark.sql.execution.{RDDScanTransformer, SparkPlan}
import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
import org.apache.spark.sql.vectorized.ColumnarBatch

/**
* Velox-backend implementation of RDDScanTransformer.
*
* Converts an RDD[InternalRow] into columnar batches using Velox's native row-to-columnar
* conversion (same JNI path as RowToVeloxColumnarExec).
*/
case class VeloxRDDScanTransformer(
    outputAttributes: Seq[Attribute],
    rdd: RDD[InternalRow],
    name: String,
    // Row-to-columnar conversion preserves data distribution, so we carry through
    // the original partitioning. This differs from CH which uses UnknownPartitioning(0)
    // but is consistent with RowToVeloxColumnarExec's behavior.
    override val outputPartitioning: Partitioning,
    override val outputOrdering: Seq[SortOrder]
Contributor:
Validation does not recurse into complex type element types

Problem: The type allowlist checks top-level types only. An ArrayType(UnsupportedType) or MapType(StringType, UnsupportedType) would pass validation but could fail at native execution time. The CH backend avoids this by delegating to ConverterUtils.getTypeNode() which recursively validates.

Evidence:

case _: org.apache.spark.sql.types.ArrayType =>   // passes any ArrayType, no element check
case _: org.apache.spark.sql.types.MapType =>      // passes any MapType, no key/value check
case _: org.apache.spark.sql.types.StructType =>   // passes any StructType, no field check

Suggested Fix:

case a: org.apache.spark.sql.types.ArrayType =>
  validateType(a.elementType)
case m: org.apache.spark.sql.types.MapType =>
  validateType(m.keyType)
  validateType(m.valueType)
case s: org.apache.spark.sql.types.StructType =>
  s.fields.foreach(f => validateType(f.dataType))

Alternatively, delegate to VeloxValidatorApi for centralized type validation.
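The recursion the comment calls for can be sketched on a simplified stand-in for Spark's `DataType` hierarchy (the `DType` ADT and `validateType` below are illustrative, not Spark's actual API):

```scala
// Hypothetical mini type hierarchy standing in for Spark's DataType;
// only the recursion pattern matters here.
sealed trait DType
case object IntT extends DType
case object StrT extends DType
case object IntervalT extends DType // example of an unsupported type
case class ArrT(elem: DType) extends DType
case class MapDT(key: DType, value: DType) extends DType
case class StructDT(fields: Seq[DType]) extends DType

def validateType(t: DType): Option[String] = t match {
  case IntT | StrT  => None // supported leaf types
  case IntervalT    => Some(s"unsupported type: $t")
  case ArrT(e)      => validateType(e)                         // recurse into element
  case MapDT(k, v)  => validateType(k).orElse(validateType(v)) // recurse into key and value
  case StructDT(fs) => fs.iterator.map(validateType).collectFirst { case Some(r) => r }
}

// A top-level-only allowlist would accept ArrT(IntervalT); the recursive
// version rejects it: validateType(ArrT(IntervalT)) == Some("unsupported type: IntervalT")
```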

Author:

Thanks, this is a great point. Replaced the manual allowlist with VeloxValidatorApi.validateSchema, which handles recursive validation for complex type elements and also catches variant shredded structs. This keeps validation logic centralized.

) extends RDDScanTransformer(outputAttributes, outputPartitioning, outputOrdering) {

Contributor:

PR description contradicts validation logic for complex types

Problem: The PR description states "rejects complex types (ARRAY, MAP, STRUCT)" but doValidateInternal() explicitly accepts these types. The code is correct — Velox does support complex types via UnsafeRowFast::deserialize. The PR description should be updated to avoid misleading reviewers.

Evidence:

case _: org.apache.spark.sql.types.ArrayType =>
case _: org.apache.spark.sql.types.MapType =>
case _: org.apache.spark.sql.types.StructType =>

These cases fall through to ValidationResult.succeeded, meaning complex types are accepted.

Suggested Fix: Update the PR description to remove the claim that complex types are rejected, e.g.:

Supports all Velox-compatible types including complex types (Array, Map, Struct). Rejects only truly unsupported types (e.g., CalendarIntervalType) with clean fallback to vanilla Spark.

Author:

Good catch — updated the PR description. It now correctly states that complex types (Array, Map, Struct) are supported via the UnsafeRowFast::deserialize path, and only truly unsupported types trigger fallback.

  override def nodeName: String = name

  @transient override lazy val metrics: Map[String, SQLMetric] = Map(
    "numInputRows" -> SQLMetrics.createMetric(sparkContext, "number of input rows"),
    "numOutputBatches" -> SQLMetrics.createMetric(sparkContext, "number of output batches"),
    "convertTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to convert")
  )

  override protected def doValidateInternal(): ValidationResult = {
    for (field <- schema.fields) {
      val reason = VeloxValidatorApi.validateSchema(field.dataType)
      if (reason.isDefined) {
        return ValidationResult.failed(reason.get)
      }
    }
    ValidationResult.succeeded
Contributor:

Metrics gap in BatchCarrierRow unwrap path

Problem: When the RDD contains BatchCarrierRow instances (e.g., from df.checkpoint() on a Gluten plan), the code unwraps columnar batches directly without updating numInputRows, numOutputBatches, or convertTime. Spark UI will show zeros for this operator when processing checkpointed data, making performance debugging difficult.

Evidence:

case _: BatchCarrierRow =>
  // No metrics updated here
  (Iterator.single(first) ++ iter).flatMap(row => BatchCarrierRow.unwrap(row))

Suggested Fix:

case _: BatchCarrierRow =>
  (Iterator.single(first) ++ iter).flatMap { row =>
    BatchCarrierRow.unwrap(row).map { batch =>
      numOutputBatches += 1
      numInputRows += batch.numRows()
      batch
    }
  }

Author:

Updated the BatchCarrierRow unwrap path to increment numOutputBatches and numInputRows per batch, so Spark UI now shows correct metrics for checkpointed data. convertTime is intentionally omitted since no row-to-columnar conversion happens in this path.

  }

  override def doExecuteColumnar(): RDD[ColumnarBatch] = {
Member:

RowToVeloxColumnarExec.toColumnarBatchIterator does UnsafeProjection.apply(row), which throws on a BatchCarrierRow since PlaceholderRow's getters all throw UnsupportedOperationException. This can show up via df.checkpoint() or user code that does df.queryExecution.toRdd and re-wraps with LogicalRDD.fromDataset, when the upstream Gluten plan ends in VeloxColumnarToCarrierRowExec. CHRDDScanTransformer.scala L101-104 detects this and unwraps via findNextTerminalRow.batch(). Either mirror that, or fail fast with a clear error for carrier rows and add a checkpoint round-trip test to document the current behavior.

Author:

Great catch — this is a real bug. If the upstream RDD was produced by a Gluten plan ending in VeloxColumnarToCarrierRowExec (e.g., via df.checkpoint()), the rows would be BatchCarrierRow instances and UnsafeProjection.apply() would throw. Fixed by peeking at the first row and branching: carrier rows are unwrapped directly via BatchCarrierRow.unwrap(), skipping row-to-columnar conversion entirely. This mirrors the CH pattern.
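The peek-and-branch dispatch described in this fix can be sketched without Spark dependencies (PlainRow and CarrierRow below are illustrative stand-ins for InternalRow and BatchCarrierRow, not Gluten's actual classes; a "batch" is just a Seq[Int]):

```scala
sealed trait Row
case class PlainRow(value: Int) extends Row
case class CarrierRow(batch: Seq[Int]) extends Row // already-columnar data in row clothing

def toBatches(iter: Iterator[Row], batchSize: Int): Iterator[Seq[Int]] = {
  if (!iter.hasNext) return Iterator.empty
  val first = iter.next()                   // peek one row to pick the path
  val full = Iterator.single(first) ++ iter // re-prepend so nothing is lost
  first match {
    case _: CarrierRow =>
      // Already columnar: unwrap the carried batches, skip conversion entirely.
      full.collect { case CarrierRow(b) => b }
    case _ =>
      // Plain rows: group into batches (stands in for row-to-columnar conversion).
      full.collect { case PlainRow(v) => v }.grouped(batchSize)
  }
}
```

As in the real transformer, this assumes each partition is homogeneous: either all carrier rows or all plain rows, so inspecting only the first row is sufficient.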

    val numInputRows = longMetric("numInputRows")
    val numOutputBatches = longMetric("numOutputBatches")
    val convertTime = longMetric("convertTime")
    val localSchema = this.schema
    val batchSize = GlutenConfig.get.maxBatchSize
    val batchBytes = VeloxConfig.get.veloxPreferredBatchBytes
    rdd.mapPartitions { iter =>
      if (iter.hasNext) {
        val first = iter.next()
        first match {
          case _: BatchCarrierRow =>
            // RDD already contains columnar batches wrapped as carrier rows
            // (e.g., from df.checkpoint() on a Gluten plan). Unwrap directly.
            (Iterator.single(first) ++ iter).flatMap { row =>
              BatchCarrierRow.unwrap(row).map { batch =>
                numOutputBatches += 1
                numInputRows += batch.numRows()
                batch
              }
            }
          case _ =>
            // Standard InternalRow path - convert via native row-to-columnar.
            RowToVeloxColumnarExec.toColumnarBatchIterator(
              Iterator.single(first) ++ iter,
              localSchema,
              numInputRows,
              numOutputBatches,
              convertTime,
              batchSize,
              batchBytes)
        }
      } else {
        Iterator.empty
      }
    }
  }

  override protected def withNewChildrenInternal(
      newChildren: IndexedSeq[SparkPlan]): SparkPlan = {
    assert(newChildren.isEmpty, "VeloxRDDScanTransformer is a leaf node")
    copy(outputAttributes, rdd, name, outputPartitioning, outputOrdering)
  }
}

object VeloxRDDScanTransformer {
  def replace(plan: org.apache.spark.sql.execution.RDDScanExec): RDDScanTransformer =
Member:

CH uses UnknownPartitioning(0); we pass plan.outputPartitioning through. If the original RDDScanExec declares e.g. HashPartitioning, downstream Velox ops might skip a shuffle based on a hint we never verified survives the row→columnar conversion. Worth either justifying with a comment or aligning with CH.

Author:

Valid concern. Row-to-columnar conversion doesn't change data distribution — it converts row format within each partition, preserving the partition layout. This is consistent with RowToVeloxColumnarExec, which also carries through the child's outputPartitioning. Added an inline comment explaining the rationale and the difference from CH's approach.

    VeloxRDDScanTransformer(
      plan.output,
      plan.inputRDD,
      plan.nodeName,
      plan.outputPartitioning,
      plan.outputOrdering)
}
@@ -805,10 +805,10 @@ class MiscOperatorSuite extends VeloxWholeStageTransformerSuite with AdaptiveSpa
     if (isSparkVersionGE("4.1")) {
       assert(plan.find(_.getClass.getSimpleName == "OneRowRelationExec").isDefined)
     } else {
-      assert(plan.find(_.isInstanceOf[RDDScanExec]).isDefined)
+      // RDDScanExec is offloaded to VeloxRDDScanTransformer which handles R2C internally
+      assert(plan.find(_.isInstanceOf[VeloxRDDScanTransformer]).isDefined)
     }
     assert(plan.find(_.isInstanceOf[ProjectExecTransformer]).isDefined)
-    assert(plan.find(_.isInstanceOf[RowToVeloxColumnarExec]).isDefined)
   }

test("equal null safe") {