
[Feature Request]: Hive-partitioned external tables #24320

@iamlinjunhong

Description


Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

Yes. Today MatrixOne can read Parquet from S3 via CREATE EXTERNAL TABLE ... URL s3option{...}, but cannot recognize Hive-style
  partitioned layouts. A typical TPC-DS / data lake setup looks like this:

  s3://nova-nt/user/hive/warehouse/training.db/catalog_returns/data/
    cr_returned_date_sk=2450815/part-00000.parquet
    cr_returned_date_sk=2450816/part-00000.parquet
    ...
    cr_returned_date_sk=2452641/part-00000.parquet    # 1,800+ daily partitions

  Problems this creates for users today:

  - No recursive discovery. filepath='.../catalog_returns/data/' returns nothing — users must hand-write glob patterns like
  .../data/*/*.
  - Partition key is invisible. cr_returned_date_sk exists only in the directory name; Spark/Hive strip it from the Parquet file to save
   space. MatrixOne has no way to expose it as a column.
  - No partition pruning. WHERE cr_returned_date_sk = 2450821 ends up scanning all 1,800+ partitions instead of opening one file.
  - Lakehouse interoperability gap. Trino, Spark, StarRocks, and Impala all read this layout natively. MatrixOne users either maintain a
   separate copy of the data or can't participate in a multi-engine lakehouse.

Describe the feature you'd like

Explicit, opt-in Hive-style partitioned external tables. DDL:

CREATE EXTERNAL TABLE catalog_returns (
    cr_item_sk          INT NOT NULL,
    cr_order_number     BIGINT NOT NULL,
    cr_net_loss         DECIMAL(7,2),
    cr_returned_date_sk INT            -- partition column
) URL s3option{
    "endpoint" = "https://storage.yandexcloud.net",
    "bucket" = "nova-nt",
    "filepath" = "user/hive/warehouse/training.db/catalog_returns/data/",
    "format" = "parquet",
    "hive_partitioning" = "true",
    "hive_partition_columns" = "cr_returned_date_sk"
};

Runtime behaviour:

  • Point filepath at the partition root; MatrixOne discovers key=value/ directories automatically (single-level or multi-level).
  • Partition columns behave like regular columns: DESCRIBE, SELECT *, WHERE, GROUP BY, ORDER BY, JOIN all work, even when the Parquet
    files don't physically contain the column (see the sketch after this list).
  • WHERE col = v and WHERE col IN (...) are pushed into directory discovery, so List/Get counts on the object store scale with the
    number of matching partitions rather than the total partition count.
  • Unsupported predicates (range, OR, cast, LIKE '%...%') fall back to row-level filtering; results are always correct.
  • Works identically against local filesystem (INFILE{...}) and S3-compatible object stores (MinIO, AWS S3, Yandex).
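
As a concrete illustration of the second bullet, the Go sketch below shows one way a path-derived partition value could be materialized
as a constant column for each batch read from Parquet. The PartitionValue type and the MaterializeIntColumn helper are hypothetical
names invented for this sketch; MatrixOne's real readers operate on its internal vector types rather than plain slices.

  package hivepart

  import (
      "fmt"
      "strconv"
  )

  // PartitionValue is one value parsed from a key=value path segment.
  type PartitionValue struct {
      Raw    string // textual value from the directory name
      IsNull bool   // true for __HIVE_DEFAULT_PARTITION__
  }

  // MaterializeIntColumn expands a path-derived value into a constant
  // column of length rowCount, so downstream operators see the partition
  // key exactly as if the Parquet file had stored it physically.
  // (Hypothetical helper: the real reader fills MatrixOne vectors.)
  func MaterializeIntColumn(v PartitionValue, rowCount int) ([]int64, []bool, error) {
      vals := make([]int64, rowCount)
      nulls := make([]bool, rowCount)
      if v.IsNull {
          for i := range nulls {
              nulls[i] = true
          }
          return vals, nulls, nil
      }
      n, err := strconv.ParseInt(v.Raw, 10, 64)
      if err != nil {
          return nil, nil, fmt.Errorf("partition value %q is not a valid INT", v.Raw)
      }
      for i := range vals {
          vals[i] = n
      }
      return vals, nulls, nil
  }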

Non-goals (explicit, not for this issue):

  • Apache Iceberg / Hudi / Delta Lake table formats
  • Hive Metastore / AWS Glue / Unity Catalog integration
  • Writes (INSERT / LOAD DATA into partitioned external tables)
  • Schema evolution
  • CSV / JSONLINE hive partitions (Parquet only in P0)

Describe implementation you've considered

A parallel discovery path separate from the existing ReadDir glob flow. Key pieces:

  • DiscoverHivePartitions (new): a recursive list-and-filter walk over key=value/ directories. For a declared N-level partitioning, the
    walk descends N times and then collects files. The resulting S3 request shape is List = matched_partitions + 1 and
    Get = matched_partitions.
  • Conservative three-state matching (MatchTrue / MatchFalse / MatchUnknown), sketched after this list. Only MatchFalse prunes. Unknown
    types (DECIMAL, DATE, TIMESTAMP, BOOL, collation-sensitive VARCHAR mismatches) are kept and evaluated by the row filter, so pruning
    can never produce a false negative.
  • Double filtering for correctness: partition filters live in both the directory pruner and the runtime row filter, so a buggy or
    conservative pruner cannot drop data that SQL semantics requires.
  • Partition column types persisted in CreateSql JSON (a HivePartitionColTypes snapshot taken at DDL time), not looked up from the
    catalog at compile time. This avoids an extra RPC and closes a race window with concurrent DDL.
  • Hard limits: warnPartitionCount = 5000 (WARN log), maxPartitionCount = 50000 (error), maxListCalls = 10000 (error). These protect the
    coordinator from pathological layouts; a guard sketch appears at the end of this section.
  • Compile integration: getHivePartitionFileList replaces ReadDir when hive_partitioning=true. A runFilePathFilters helper evaluates
    __mo_filepath LIKE ... predicates at compile time; unconsumed filters are appended back to row filters so nothing is ever silently
    dropped.
  • Path-value wins when a Parquet file happens to contain a physical column with the same name as a partition column (Hive/Spark
    convention).
  • HIVE_DEFAULT_PARTITION maps to SQL NULL on nullable columns and raises a constraint violation on NOT NULL columns.
  • URL decoding happens in exactly one place (ParseHivePartitionSegment; see the sketch after this list). Directory names containing a
    literal % are currently rejected with an explicit error (a known P0 limitation of the FileService path parser; the fix is deferred
    to a follow-up).
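
The two bullets about three-state matching and single-point URL decoding can be pictured with the self-contained Go sketch below. Only
the names ParseHivePartitionSegment, MatchTrue, MatchFalse, MatchUnknown, and the HIVE_DEFAULT_PARTITION-to-NULL behaviour come from
this issue; the Segment type and the concrete signatures are assumptions made for illustration.

  package hivepart

  import (
      "fmt"
      "net/url"
      "strconv"
      "strings"
  )

  // Hive's marker for a partition whose value was NULL at write time.
  const hiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

  // Segment is one decoded key=value directory level.
  type Segment struct {
      Key    string
      Value  string // decoded value; meaningless when IsNull is true
      IsNull bool   // true for __HIVE_DEFAULT_PARTITION__ (SQL NULL)
  }

  // ParseHivePartitionSegment decodes a single directory name such as
  // "cr_returned_date_sk=2450815". Per the design above, URL decoding
  // happens here and nowhere else, and the Hive default-partition
  // marker maps to SQL NULL.
  func ParseHivePartitionSegment(dir string) (Segment, error) {
      key, rawValue, ok := strings.Cut(dir, "=")
      if !ok || key == "" {
          return Segment{}, fmt.Errorf("not a hive partition directory: %q", dir)
      }
      if rawValue == hiveDefaultPartition {
          return Segment{Key: key, IsNull: true}, nil
      }
      value, err := url.PathUnescape(rawValue)
      if err != nil {
          // A literal % that does not form a valid escape sequence is
          // rejected with an explicit error (the known P0 limitation).
          return Segment{}, fmt.Errorf("cannot decode partition value %q: %w", rawValue, err)
      }
      return Segment{Key: key, Value: value}, nil
  }

  // MatchResult is the three-state verdict of the directory pruner.
  type MatchResult int

  const (
      MatchTrue MatchResult = iota
      MatchFalse
      MatchUnknown // keep the partition; defer to the row filter
  )

  // MatchIntEquality evaluates `col = want` for an INT partition column
  // against one decoded segment. Only a definite mismatch returns
  // MatchFalse; anything undecidable returns MatchUnknown, so the
  // pruner can never drop data that the row filter would have kept.
  func MatchIntEquality(seg Segment, want int64) MatchResult {
      if seg.IsNull {
          return MatchFalse // NULL = v is never true under SQL semantics
      }
      got, err := strconv.ParseInt(seg.Value, 10, 64)
      if err != nil {
          return MatchUnknown
      }
      if got == want {
          return MatchTrue
      }
      return MatchFalse
  }

Under WHERE cr_returned_date_sk = 2450821, the walk would keep only directories whose segment evaluates to MatchTrue or MatchUnknown,
which is what lets List/Get counts scale with the number of matching partitions rather than the total partition count.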

Explicitly out of scope for the first iteration: distributed per-file execution across CNs (forced param.Parallel = false), partition
metadata caching between queries, range-predicate directory pruning, and concurrent List calls.
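
To make the hard limits concrete, a discovery-budget check might look like the sketch below. The three constant values are the ones
quoted above; checkDiscoveryBudget itself is a hypothetical helper written for this issue, not code from the repository.

  package hivepart

  import (
      "fmt"
      "log"
  )

  // Limits quoted in this issue; the checking logic is a sketch.
  const (
      warnPartitionCount = 5000  // WARN log past this many partitions
      maxPartitionCount  = 50000 // hard error past this many partitions
      maxListCalls       = 10000 // hard error past this many List calls
  )

  // checkDiscoveryBudget runs after each List call during partition
  // discovery. It degrades gracefully: first a WARN log, then a hard
  // error, so a pathological layout cannot exhaust the coordinator.
  func checkDiscoveryBudget(partitions, listCalls int) error {
      if listCalls > maxListCalls {
          return fmt.Errorf("hive partition discovery exceeded %d List calls", maxListCalls)
      }
      if partitions > maxPartitionCount {
          return fmt.Errorf("hive partition count %d exceeds limit %d", partitions, maxPartitionCount)
      }
      if partitions > warnPartitionCount {
          log.Printf("WARN: hive partition count %d exceeds %d; discovery may be slow", partitions, warnPartitionCount)
      }
      return nil
  }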

Documentation, Adoption, Use Case, Migration Strategy

Additional information

hive_partition_user_guide_en.md

Metadata

Labels

kind/feature, priority/p0 (Critical feature that should be implemented in this version)
