[spark] Support catalog-qualified CREATE TABLE LIKE #7924

Open
kerwin-zk wants to merge 3 commits into apache:master from kerwin-zk:support-catalog-qualified-create-table-like

kerwin-zk commented May 21, 2026

Purpose

This allows CREATE TABLE LIKE to resolve source and target tables through Paimon catalogs when either side uses a catalog-qualified identifier.

Examples:

-- target and source are both catalog-qualified
CREATE TABLE paimon.default.target_tbl LIKE paimon.default.source_tbl;

-- only the target table is catalog-qualified
CREATE TABLE paimon.default.target_tbl LIKE source_tbl;

-- only the source table is catalog-qualified
CREATE TABLE target_tbl LIKE paimon.default.source_tbl;

Tests

CI

kerwin-zk force-pushed the support-catalog-qualified-create-table-like branch 5 times, most recently from eceaf35 to d0202fe on May 21, 2026 13:23
YannByron requested a review from Copilot on May 22, 2026 08:46

Copilot AI left a comment

Pull request overview

Adds Spark SQL parsing and test coverage to support CREATE TABLE LIKE when either the source or target table uses a catalog-qualified identifier (e.g. paimon.db.tbl), ensuring the command is correctly rewritten to Paimon’s CreateTableLike handling on Spark ≥ 3.4.

Changes:

  • Extend the Paimon SQL extensions grammar + AST builder to parse CREATE TABLE LIKE and remap catalog-qualified identifiers into Spark’s CreateTableLikeCommand.
  • Update the extensions parser to detect catalog-qualified CREATE TABLE LIKE statements and run them through the extensions pipeline (so rewrite rules apply).
  • Add a new UT suite covering catalog-qualified target/source, IF NOT EXISTS, clause passthrough (USING, TBLPROPERTIES), and unsupported Hive storage syntax.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| paimon-spark/paimon-spark-ut/src/test/scala/org/apache/paimon/spark/sql/CatalogQualifiedCreateTableLikeTest.scala | Adds regression/behavior tests for catalog-qualified CREATE TABLE LIKE scenarios on Spark ≥ 3.4. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/PaimonSqlExtensionsAstBuilder.scala | Builds Spark’s CreateTableLikeCommand from a placeholder parse, then patches in catalog-qualified TableIdentifiers. |
| paimon-spark/paimon-spark-common/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/AbstractPaimonSparkSqlExtensionsParser.scala | Detects catalog-qualified CREATE TABLE LIKE and routes it through the extensions parser + rewrite rules. |
| paimon-spark/paimon-spark-common/src/main/antlr4/org.apache.spark.sql.catalyst.parser.extensions/PaimonSqlExtensions.g4 | Extends the extensions grammar to recognize CREATE TABLE LIKE with relevant clauses. |
| paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/AbstractPaimonSparkSqlExtensionsParser.scala | Keeps Spark 4.0’s parser wrapper behavior consistent with the common module changes. |

kerwin-zk force-pushed the support-catalog-qualified-create-table-like branch from d0202fe to f720582 on May 22, 2026 13:55

JingsongLi left a comment

Review

The approach of intercepting catalog-qualified CREATE TABLE LIKE in the extensions parser and delegating to Spark's parser for clause handling (via synthetic SQL with dummy names) is clever and avoids duplicating Spark's complex clause resolution logic.

Issues

1. Double parsing on every SQL statement that isn't a Paimon command.

isCatalogCreateTableLike performs tokenization + full parse for any non-Paimon-command SQL that contains CREATE TABLE, LIKE, and . tokens. This is a hot path — every DML/query statement goes through parsePlan. The maybeCreateTableLike heuristic helps, but it will still trigger for common patterns like:

CREATE TABLE foo AS SELECT * FROM catalog.db.table WHERE col LIKE '%x%'

This matches all four conditions: it starts with CREATE TABLE, contains a LIKE token, and contains a . token. isParsedCatalogCreateTableLike then runs a full parse, which fails with PaimonParseException (caught, returning false). The failure path is cheap, but the token scan plus exception-based control flow on every CTAS with a qualified source and a LIKE predicate is unnecessary overhead.

Consider tightening maybeCreateTableLike: e.g., check that LIKE appears after the second identifier (not inside a WHERE clause). Or check that the token immediately before LIKE is an identifier/dot (not a string literal).
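
A minimal sketch of the tightened pre-check suggested above, written without ANTLR so it runs on its own. The object name, the regex tokenizer, and the "three or more parts" threshold are all assumptions made for illustration; the real parser would walk its own token stream instead. The idea is to accept only statements shaped like CREATE TABLE [IF NOT EXISTS] name LIKE name, so that a LIKE inside a WHERE clause never reaches the full parse.

```scala
object CreateTableLikeHeuristic {
  // Hypothetical, simplified tokenizer: string literals, backquoted identifiers,
  // plain words, dots, and any other single non-space character.
  private val Token = """'(?:[^'\\]|\\.)*'|`[^`]*`|[A-Za-z_][A-Za-z0-9_]*|\.|\S""".r

  def tokens(sql: String): Vector[String] = Token.findAllIn(sql).toVector

  private def isIdentifier(t: String): Boolean =
    t.startsWith("`") || t.matches("[A-Za-z_][A-Za-z0-9_]*")

  // Consumes `identifier (. identifier)*` starting at `start`.
  // Returns (index of the token after the name, number of parts), or None.
  private def multipartName(ts: Vector[String], start: Int): Option[(Int, Int)] =
    if (start >= ts.length || !isIdentifier(ts(start))) None
    else {
      var i = start + 1
      var parts = 1
      while (i + 1 < ts.length && ts(i) == "." && isIdentifier(ts(i + 1))) {
        i += 2
        parts += 1
      }
      Some((i, parts))
    }

  // True only when LIKE directly follows the target name and at least one
  // of the two names has three or more parts (assumed to mean catalog-qualified).
  def maybeCatalogCreateTableLike(sql: String): Boolean = {
    val ts = tokens(sql)
    val upper = ts.map(_.toUpperCase)
    if (upper.take(2) != Vector("CREATE", "TABLE")) false
    else {
      val start = if (upper.slice(2, 5) == Vector("IF", "NOT", "EXISTS")) 5 else 2
      multipartName(ts, start) match {
        case Some((afterTarget, targetParts))
            if afterTarget < ts.length && upper(afterTarget) == "LIKE" =>
          multipartName(ts, afterTarget + 1).exists {
            case (_, sourceParts) => targetParts >= 3 || sourceParts >= 3
          }
        case _ => false
      }
    }
  }
}
```

With this shape check, the CTAS example above is rejected as soon as the token after the target name turns out to be AS rather than LIKE, without entering the exception-based path.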

2. isCatalogIdentifier assumes exactly 3 parts = catalog-qualified.

private def isCatalogIdentifier(identifier: MultipartIdentifierContext): Boolean = {
  identifier.parts.size() == 3
}

This breaks for tables in nested namespaces (e.g., catalog.ns1.ns2.table = 4 parts) and for tables in the default namespace where only 2 parts (catalog.table) might be used. More importantly, toTableIdentifier handles arbitrary lengths:

case parts =>
  TableIdentifier(parts.last, Some(parts.slice(1, parts.length - 1).mkString(".")), Some(parts.head))

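So isCatalogIdentifier should check parts.size() >= 3 so that nested namespaces are covered as well. A two-part name stays ambiguous (catalog.table vs. db.table) and would need separate handling.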
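
To illustrate how the relaxed check lines up with the splitting logic quoted above, here is a small self-contained sketch. TableId is a stand-in for Spark's TableIdentifier so the snippet runs without Spark on the classpath, and the one-part and two-part cases are assumptions (the quoted snippet only shows the general case).

```scala
// Stand-in for Spark's TableIdentifier (hypothetical, for illustration only).
final case class TableId(table: String, database: Option[String], catalog: Option[String])

object CatalogIdentifiers {
  // Relaxed check: three or more parts are treated as catalog-qualified.
  def isCatalogIdentifier(parts: Seq[String]): Boolean = parts.size >= 3

  // Mirrors the quoted general case: first part is the catalog, last part is
  // the table, and everything in between is joined into the namespace.
  def toTableId(parts: Seq[String]): TableId = {
    require(parts.nonEmpty, "identifier must have at least one part")
    parts match {
      case Seq(table)     => TableId(table, None, None)      // assumed behavior
      case Seq(db, table) => TableId(table, Some(db), None)  // assumed behavior
      case _ =>
        TableId(
          parts.last,
          Some(parts.slice(1, parts.length - 1).mkString(".")),
          Some(parts.head))
    }
  }
}
```

Under these assumptions, Seq("paimon", "ns1", "ns2", "t") is accepted by isCatalogIdentifier and maps to table t in namespace ns1.ns2 of catalog paimon, whereas the == 3 check would have rejected it.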

3. The grammar adds many tokens/rules that are unused except for passthrough.

The PR adds rowFormat, createFileFormat, storageHandler, fileFormat, locationSpec, propertyList, property, propertyKey, propertyValue, stringLit, plus 20+ new keywords. These are only needed so the ANTLR parser can successfully parse the full CREATE TABLE LIKE statement — but the visitor never visits them (only createTableLikeClausesText extracts them as raw text).

This works but bloats the grammar significantly. An alternative: use a greedy catch-all rule for the trailing clauses (everything after source=multipartIdentifier), since you just extract it as text anyway.

4. sparkCreateTableLikeCommand uses delegate parser with synthetic identifiers — fragile.

s"CREATE TABLE$ifNotExists __paimon_create_like_target LIKE __paimon_create_like_source${createTableLikeClausesText(ctx)}"

If the clauses contain __paimon_create_like_target or __paimon_create_like_source as string values (e.g., in TBLPROPERTIES), this could theoretically break. More practically, if Spark's parser adds new clauses or changes syntax in future versions, the reconstructed SQL may not parse correctly. The version guard (< "3.4") helps, but each new Spark version may need testing.
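
The following sketch shows roughly how such a synthetic statement could be assembled, plus a cheap guard for the placeholder-collision case mentioned above. Only the two placeholder names and the overall template come from the PR; the object name, method names, and the guard itself are hypothetical.

```scala
object SyntheticCreateTableLike {
  val TargetPlaceholder = "__paimon_create_like_target"
  val SourcePlaceholder = "__paimon_create_like_source"

  // Rebuilds a CREATE TABLE LIKE statement with dummy names so that a delegate
  // parser can resolve the trailing clauses; the real identifiers would be
  // patched into the resulting plan afterwards.
  def syntheticSql(ifNotExists: Boolean, clausesText: String): String = {
    val ine = if (ifNotExists) " IF NOT EXISTS" else ""
    val clauses = if (clausesText.trim.isEmpty) "" else " " + clausesText.trim
    s"CREATE TABLE$ine $TargetPlaceholder LIKE $SourcePlaceholder$clauses"
  }

  // Hypothetical guard: reports whether user-supplied clause text already
  // mentions one of the placeholder names.
  def mentionsPlaceholder(clausesText: String): Boolean =
    clausesText.contains(TargetPlaceholder) || clausesText.contains(SourcePlaceholder)
}
```

A caller could log or reject statements for which mentionsPlaceholder returns true, although whether a collision is actually harmful depends on how the identifiers are patched back in.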

5. Duplicate AbstractPaimonSparkSqlExtensionsParser.scala in spark-4.0 and spark-common.

The diff shows identical changes in both paimon-spark-4.0 and paimon-spark-common. Is there a way to avoid this duplication? If the Spark 4.0 version must diverge, at least add a comment explaining why both files exist.

Minor

  • nonReserved list update looks correct — all new keywords can still be used as identifiers
  • The test coverage is solid: qualified target, qualified source, both qualified, IF NOT EXISTS, TBLPROPERTIES override, STORED AS rejection
  • The applyParserRules refactoring is a nice cleanup

Summary

The PR works for the intended use case. Main concerns are the detection heuristic overhead on non-matching SQL and the rigid parts.size() == 3 check for catalog identification.

kerwin-zk force-pushed the support-catalog-qualified-create-table-like branch from 2ec62ab to f688995 on May 23, 2026 16:15
@YannByron

+1


private lazy val substitutor = new VariableSubstitution()
private lazy val astBuilder = new PaimonSqlExtensionsAstBuilder(delegate)
private val nonReservedIdentifierTokenTypes = Set(

What are these effects?
