[spark] Support catalog-qualified CREATE TABLE LIKE #7924
Pull request overview
Adds Spark SQL parsing and test coverage to support CREATE TABLE LIKE when either the source or target table uses a catalog-qualified identifier (e.g. paimon.db.tbl), ensuring the command is correctly rewritten to Paimon’s CreateTableLike handling on Spark ≥ 3.4.
Changes:
- Extend the Paimon SQL extensions grammar + AST builder to parse CREATE TABLE LIKE and remap catalog-qualified identifiers into Spark’s CreateTableLikeCommand.
- Update the extensions parser to detect catalog-qualified CREATE TABLE LIKE statements and run them through the extensions pipeline (so rewrite rules apply).
- Add a new UT suite covering catalog-qualified target/source, IF NOT EXISTS, clause passthrough (USING, TBLPROPERTIES), and unsupported Hive storage syntax.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| paimon-spark/paimon-spark-ut/src/test/scala/org/apache/paimon/spark/sql/CatalogQualifiedCreateTableLikeTest.scala | Adds regression/behavior tests for catalog-qualified CREATE TABLE LIKE scenarios on Spark ≥ 3.4. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/PaimonSqlExtensionsAstBuilder.scala | Builds Spark’s CreateTableLikeCommand from a placeholder parse, then patches in catalog-qualified TableIdentifiers. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/AbstractPaimonSparkSqlExtensionsParser.scala | Detects catalog-qualified CREATE TABLE LIKE and routes it through the extensions parser + rewrite rules. |
| paimon-spark/paimon-spark-common/src/main/antlr4/org.apache.spark.sql.catalyst.parser.extensions/PaimonSqlExtensions.g4 | Extends the extensions grammar to recognize CREATE TABLE LIKE with relevant clauses. |
| paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/AbstractPaimonSparkSqlExtensionsParser.scala | Keeps Spark 4.0’s parser wrapper behavior consistent with the common module changes. |
JingsongLi
left a comment
Review
The approach of intercepting catalog-qualified CREATE TABLE LIKE in the extensions parser and delegating to Spark's parser for clause handling (via synthetic SQL with dummy names) is clever and avoids duplicating Spark's complex clause resolution logic.
Issues
1. Double parsing on every SQL statement that isn't a Paimon command.
isCatalogCreateTableLike performs tokenization + full parse for any non-Paimon-command SQL that contains CREATE TABLE, LIKE, and . tokens. This is a hot path — every DML/query statement goes through parsePlan. The maybeCreateTableLike heuristic helps, but it will still trigger for common patterns like:
CREATE TABLE foo AS SELECT * FROM catalog.db.table WHERE col LIKE '%x%'
This matches all four conditions: starts with CREATE TABLE, contains a LIKE token, contains a "." token. Then isParsedCatalogCreateTableLike does a full parse which will fail with PaimonParseException (caught and returns false). The failure path is cheap, but the token scan + exception-based control flow on every CTAS with a qualified source and a LIKE predicate is unnecessary overhead.
Consider tightening maybeCreateTableLike: e.g., check that LIKE appears after the second identifier (not inside a WHERE clause). Or check that the token immediately before LIKE is an identifier/dot (not a string literal).
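One way to read the suggestion above is to require that LIKE directly follows the target identifier, so a LIKE inside a WHERE clause never reaches the full parse. The following is only a rough sketch under that assumption: the object name, the regex tokenizer, and `maybeCatalogCreateTableLike` are hypothetical stand-ins, not the PR's code, and the real parser would work on ANTLR tokens instead.

```scala
object CreateTableLikePrefilter {
  // Hypothetical, simplified tokenizer: quoted strings, backquoted or bare
  // identifiers, and any other non-whitespace character as a single token.
  private val token = """'(?:[^'\\]|\\.)*'|`[^`]*`|[A-Za-z_][A-Za-z0-9_]*|\S""".r

  private def isIdentifier(t: String): Boolean =
    t.startsWith("`") || t.headOption.exists(c => c.isLetter || c == '_')

  /** Consumes `ident (. ident)*`; returns (number of parts, remaining tokens), or 0 parts on failure. */
  private def dropMultipart(tokens: List[String]): (Int, List[String]) = tokens match {
    case id :: "." :: more if isIdentifier(id) =>
      val (n, tail) = dropMultipart(more)
      if (n == 0) (0, tokens) else (n + 1, tail)
    case id :: rest if isIdentifier(id) => (1, rest)
    case _ => (0, tokens)
  }

  /** True only for CREATE TABLE [IF NOT EXISTS] <target> LIKE <source> with a 3+ part name on either side. */
  def maybeCatalogCreateTableLike(sql: String): Boolean = {
    def kw(t: String, k: String): Boolean = t.equalsIgnoreCase(k)
    val tokens = token.findAllIn(sql).toList
    val afterCreate = tokens match {
      case c :: t :: rest if kw(c, "CREATE") && kw(t, "TABLE") => Some(rest)
      case _ => None
    }
    val afterIfNotExists = afterCreate.map {
      case i :: n :: e :: rest if kw(i, "IF") && kw(n, "NOT") && kw(e, "EXISTS") => rest
      case other => other
    }
    afterIfNotExists.exists { rest =>
      val (targetParts, afterTarget) = dropMultipart(rest)
      afterTarget match {
        // LIKE must be the very next token after the target identifier.
        case like :: more if targetParts > 0 && kw(like, "LIKE") =>
          val (sourceParts, _) = dropMultipart(more)
          sourceParts > 0 && (targetParts >= 3 || sourceParts >= 3)
        case _ => false
      }
    }
  }
}
```

With this shape, the CTAS example from the review is rejected at the AS token without any exception-based control flow, while a genuine catalog-qualified CREATE TABLE LIKE still passes through to the full parse.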
2. isCatalogIdentifier assumes exactly 3 parts = catalog-qualified.
private def isCatalogIdentifier(identifier: MultipartIdentifierContext): Boolean = {
  identifier.parts.size() == 3
}
This breaks for tables in nested namespaces (e.g., catalog.ns1.ns2.table = 4 parts) and for tables in the default namespace where only 2 parts (catalog.table) might be used. More importantly, toTableIdentifier handles arbitrary lengths:
case parts =>
  TableIdentifier(parts.last, Some(parts.slice(1, parts.length - 1).mkString(".")), Some(parts.head))
So isCatalogIdentifier should be parts.size() >= 3 to catch all catalog-qualified cases.
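A minimal sketch of how the relaxed check and the existing mapping could line up. `TableId` is a hypothetical stand-in for Spark's TableIdentifier so the snippet runs without Spark on the classpath, and `isCatalogQualified` / `toTableId` are illustrative names, not the PR's methods.

```scala
object CatalogIdentifiers {
  // Hypothetical stand-in for Spark's TableIdentifier (table, database, catalog).
  final case class TableId(table: String, database: Option[String], catalog: Option[String])

  /** Three or more parts are treated as catalog-qualified, which covers nested namespaces. */
  def isCatalogQualified(parts: Seq[String]): Boolean = parts.size >= 3

  /** Mirrors the mapping quoted in the review: head = catalog, last = table, middle = namespace. */
  def toTableId(parts: Seq[String]): TableId = parts match {
    case Seq(table) => TableId(table, None, None)
    case Seq(db, table) => TableId(table, Some(db), None)
    case _ if parts.size >= 3 =>
      TableId(parts.last, Some(parts.slice(1, parts.length - 1).mkString(".")), Some(parts.head))
    case _ => throw new IllegalArgumentException("empty identifier")
  }
}
```

Note that a two-part name cannot be classified by length alone, since catalog.table and db.table look the same; this sketch treats it as db.table, and telling the two apart would need a catalog lookup.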
3. The grammar adds many tokens/rules that are unused except for passthrough.
The PR adds rowFormat, createFileFormat, storageHandler, fileFormat, locationSpec, propertyList, property, propertyKey, propertyValue, stringLit, plus 20+ new keywords. These are only needed so the ANTLR parser can successfully parse the full CREATE TABLE LIKE statement — but the visitor never visits them (only createTableLikeClausesText extracts them as raw text).
This works but bloats the grammar significantly. An alternative: use a greedy catch-all rule for the trailing clauses (everything after source=multipartIdentifier), since you just extract it as text anyway.
4. sparkCreateTableLikeCommand uses delegate parser with synthetic identifiers — fragile.
s"CREATE TABLE$ifNotExists __paimon_create_like_target LIKE __paimon_create_like_source${createTableLikeClausesText(ctx)}"
If the clauses contain __paimon_create_like_target or __paimon_create_like_source as string values (e.g., in TBLPROPERTIES), this could theoretically break. More practically, if Spark's parser adds new clauses or changes syntax in future versions, the reconstructed SQL may not parse correctly. The version guard (< "3.4") helps, but each new Spark version may need testing.
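If the theoretical collision is a concern, one option is to choose placeholder names that are guaranteed not to appear in the passthrough clause text. This is a sketch only: `freshPlaceholder` and `syntheticSql` are hypothetical helpers, and it assumes the clause text is available as a plain string that starts with a space.

```scala
object SyntheticCreateTableLike {
  /** Returns `base`, or `base_1`, `base_2`, ... until the name does not occur in `clauses`. */
  def freshPlaceholder(base: String, clauses: String): String =
    Iterator
      .from(0)
      .map(i => if (i == 0) base else s"${base}_$i")
      .find(name => !clauses.contains(name))
      .get

  /** Builds the synthetic statement and returns it with the placeholder names that were used. */
  def syntheticSql(ifNotExists: Boolean, clauses: String): (String, String, String) = {
    val target = freshPlaceholder("__paimon_create_like_target", clauses)
    val source = freshPlaceholder("__paimon_create_like_source", clauses)
    val ine = if (ifNotExists) " IF NOT EXISTS" else ""
    (s"CREATE TABLE$ine $target LIKE $source$clauses", target, source)
  }
}
```

This only addresses the name collision; it does not help with the second concern, that future Spark versions may change which clauses the delegate parser accepts.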
5. Duplicate AbstractPaimonSparkSqlExtensionsParser.scala in spark-4.0 and spark-common.
The diff shows identical changes in both paimon-spark-4.0 and paimon-spark-common. Is there a way to avoid this duplication? If the Spark 4.0 version must diverge, at least add a comment explaining why both files exist.
Minor
- nonReserved list update looks correct: all new keywords can still be used as identifiers
- The test coverage is solid: qualified target, qualified source, both qualified, IF NOT EXISTS, TBLPROPERTIES override, STORED AS rejection
- The applyParserRules refactoring is a nice cleanup
Summary
The PR works for the intended use case. Main concerns are the detection heuristic overhead on non-matching SQL and the rigid parts.size() == 3 check for catalog identification.
Purpose
This allows CREATE TABLE LIKE to resolve source and target tables through Paimon catalogs when either side uses a catalog-qualified identifier.
Examples:
Tests
CI