feat(external): support Hive-style partitioned Parquet external tables by iamlinjunhong · Pull Request #24321 · matrixorigin/matrixone

iamlinjunhong · 2026-05-09T08:34:34Z

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

support Hive-style partitioned Parquet external tables

qodo-code-review · 2026-05-09T08:34:38Z

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

iamlinjunhong · 2026-05-11T06:58:48Z

Solid feature — the test coverage (hive_partition_test.go, hive_partition_coverage_test.go, BVT) is unusually thorough and the partition-pruning hookup into the optimizer is well-isolated. A few things worth addressing before merge; none are blockers.

Correctness / UX

1. __HIVE_DEFAULT_PARTITION__ unconditionally mapped to NULL (hive_partition_fill.go, the if strVal == HiveDefaultPartition block). The check fires for every column type, including VARCHAR / TEXT. A user whose string partition legitimately contains that literal token (e.g., data imported from a system that didn't treat it as a sentinel) will see those rows silently NULL-ified. Hive/Spark do the same thing globally, so this is defensible — but worth calling out in user docs and/or restricting the sentinel semantics to non-string types where the literal string could never otherwise be the partition value.

2. Zero-padded numeric partitions don't equate to unpadded predicates (matchPartitionValue in hive_partition.go). Directory month=01 compared against WHERE month=1 — if the partition column is typed INT, both coerce to 1 and prune correctly. But if typed VARCHAR, "01" != "1" and pruning misses. Test fixture zero_pad/ uses INT so this is fine in practice; worth a test case with VARCHAR to lock the behavior in both directions.

3. NOT NULL DEFAULT + __HIVE_DEFAULT_PARTITION__ error message shows the relative path but not the column default. Minor UX: the user sees "NOT NULL but directory has HIVE_DEFAULT_PARTITION" and has to guess whether to add NULL default or rename the directory. A short suggestion in the error message would help.

Hardening

4. URL-decoding path is dead code. hive_partition.go:~265 hard-rejects any directory name containing % before listing descends into it. ParseHivePartitionSegment calls url.PathUnescape on the value portion, but the guard above ensures no % ever reaches that code. Either remove the guard and let the URL-decoding actually work (the url_encoded fixture is labeled "P0 known limitation" — escalating it to P1 would be a nice follow-up), or remove the PathUnescape and comment that URL-encoding is unsupported. Having both is confusing.

5. Partition pruning only handles = and IN (tryExtractPredicate, hive_partition.go). <, <=, >, >=, BETWEEN, NOT IN all fall through to row-level filtering. Not a correctness issue (the predicate is also applied row-side), but the perf benefit of partitioning vanishes for range queries — which is exactly what partitions are usually for on date/year columns. Worth tracking as a follow-up.

6. collectBareColNames only recurses into Expr_F (hive_partition.go). Expressions wrapped in CAST(...) or inside CASE / subqueries don't surface their referenced columns, so pruning is silently skipped. A predicate like CAST(year AS INT) = 2024 won't prune. The row-level filter still catches it, so it's conservative — but the pruning surface is narrower than the design doc implies. Consider walking through CAST and CASE-WHEN nodes.

7. Serial directory listing, 10k list-call cap (DiscoverHivePartitions). On S3 at ~100ms/list a table with 5000 partitions takes ~8 min to plan. Not a bug — safe by design — but worth a bounded concurrent list (e.g., errgroup.Group with SetLimit(8)) as a follow-up. The TODO at hive_partition.go:~91 about re-invoking GetForETLWithType per recursion compounds this; caching the FS client across the recursion would help.

8. validateLiteralVecBinary hand-parses vector.Vec.MarshalBinary layout (hive_partition.go:~731). The format isn't versioned. If the vector binary layout changes, this silently returns false and partition pruning stops working with no observable error. Replace with vec.UnmarshalBinary + Length check.

9. parseHiveOptionKV defensive skip-guards against legacy JSON — if a legacy JSON has hive_partitioning=false but hive_partition_columns=year, the table ends up with columns but pruning disabled. No consumer guards against this inconsistent state (refreshPartitionValues checks HivePartitioning only). Consider rejecting the combination in validateAndSetHivePartitionOptions to fail fast.

Verified correct

Parser change (update.go, 14 lines): ExternParam gets HivePartitioning, HivePartitionCols, HivePartitionColTypes (with a local HivePartColType to avoid importing pkg/pb/plan). Plumbed end-to-end through build/compile/exec. Reasonable boundary.

Concurrency (types.go, 11 lines): new per-ExternalParam state is read/written by a single reader goroutine because param.Parallel = false is forced for hive in compile.go. Safe as long as that enforcement holds — worth a comment on each new field noting the Parallel=false dependency.

Hidden files: IsHiddenFile correctly skips _SUCCESS and .crc; non-parquet regular files are silently filtered. Consistent with Hive convention.

Path boundary check: ExtractPartitionValues normalizes via path.Clean then enforces HasPrefix(filePath, basePath+"/"). path.Clean collapses .. before the check, so basePath/../etc/passwd → /etc/passwd fails the prefix test. Consistent with non-hive external-table path handling.

done

aunjgr

Follow-up review. Several of my earlier concerns are resolved; a few remain.

Resolved (both #24321 and #24329)

URL-decoding dead code — PathUnescape removed; % in directory names rejected with an explicit error message.
collectBareColNames — CAST/CASE are Expr_F nodes, so they're now reached via the existing walk. Subqueries (Expr_Sub) still aren't walked but fall through to row filters (conservatively safe).
validateLiteralVecBinary replaced with vectorBinaryEnvelopeInBounds — bounds-only check, actual decode delegates to vector.UnmarshalBinary with Oid/Length post-check.
parseHiveOptionKV legacy-JSON — now documented with per-key skip-guards and regression comment.
NOT NULL DEFAULT + __HIVE_DEFAULT_PARTITION__ error message — now actionable.

Still pending

__HIVE_DEFAULT_PARTITION__ for VARCHAR columns (hive_partition_fill.go) — still unconditionally NULLs regardless of column type. String values that legitimately contain that literal are silently lost.
Zero-padded non-INT — INT works via numeric compare (tested); VARCHAR month='01' vs WHERE month='1' still won't prune. No test or note for the VARCHAR case.
Range predicate pruning (<, <=, >, >=, BETWEEN, NOT IN) — still not implemented; falls through to row-level filters. Acknowledged in SQL test comments as P0 scope.
Serial directory listing — discoverRecursive single-threaded, maxListCalls=10000 cap preserved. Acceptable for P0.

Important: #24321 vs #24329

I notice you opened #24329 with the same feature and a different file split (adds pkg/fileservice/path.go, drops pkg/sql/colexec/external/reader_parquet.go).

#24329 will regress behavior on the current tree. On main since commit a2ac9f22f (Feb 2026), scanParquetFile no longer exists; Parquet is dispatched through NewParquetReader in reader_parquet.go (external.go:153). The #24329 diff:

Places refreshPartitionValues / fillVirtualColumns / rowCountOnly logic inside scanParquetFile, which is gone post-rebase.
Deletes reader_parquet.go, which breaks external.go:153 and parquet_test.go:895,1572.

Net: after rebase, Hive virtual-column fill would never execute, and the build would break.

#24321 avoids this by editing both scanParquetFile and reader_parquet.go; only its reader_parquet.go edits are load-bearing post-rebase.

The fileservice/path.go change in #24329 (adding = to allowed chars) is already on main via #24021 (8bb4c4e05), so that edit is a no-op conflict.

Recommendation: merge this PR (#24321) and close #24329. If you want the path_test.go additions from #24329, cherry-pick the test into this PR.

aunjgr

Upgrading my prior comment to approve. On reflection the three pending items don't warrant blocking:

__HIVE_DEFAULT_PARTITION__ for VARCHAR — silently NULLs a literal that happens to match the Hive sentinel. Consistent with Hive/Spark semantics. Please document this in user-facing docs and file a follow-up to scope the sentinel to non-string types only.
Zero-padded VARCHAR — conservative miss in pruning, not a correctness issue. Row-level filter still runs. Follow-up.
Range-predicate pruning (<, <=, >, >=, BETWEEN, NOT IN) — perf only. Follow-up.

Please still address my comment on #24329 before landing this one — same feature shouldn't ship twice.

LGTM.

iamlinjunhong requested review from aunjgr, heni02 and ouyuanning as code owners May 9, 2026 08:34

iamlinjunhong temporarily deployed to ci May 9, 2026 08:34 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 9, 2026 08:34 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci May 9, 2026 08:34 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 9, 2026 08:35 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci May 9, 2026 08:35 — with GitHub Actions Inactive

matrix-meow added the size/XXL Denotes a PR that changes 2000+ lines label May 9, 2026

iamlinjunhong temporarily deployed to ci May 9, 2026 12:52 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 9, 2026 12:52 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci May 9, 2026 12:52 — with GitHub Actions Inactive

iamlinjunhong force-pushed the mp-hive branch from af8096f to c7d5dc2 Compare May 11, 2026 02:26

iamlinjunhong had a problem deploying to ci May 11, 2026 02:26 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci May 11, 2026 02:26 — with GitHub Actions Inactive

fix

1f0bffc

iamlinjunhong force-pushed the mp-hive branch from c7d5dc2 to 1f0bffc Compare May 11, 2026 06:58

iamlinjunhong temporarily deployed to ci May 11, 2026 06:58 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 11, 2026 06:58 — with GitHub Actions Error

iamlinjunhong temporarily deployed to ci May 11, 2026 06:58 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 11, 2026 06:58 — with GitHub Actions Failure

Merge branch 'main' into mp-hive

b8b2c75

iamlinjunhong temporarily deployed to ci May 11, 2026 08:01 — with GitHub Actions Inactive

iamlinjunhong had a problem deploying to ci May 11, 2026 08:01 — with GitHub Actions Failure

iamlinjunhong temporarily deployed to ci May 11, 2026 09:03 — with GitHub Actions Inactive

aunjgr reviewed May 11, 2026

View reviewed changes

aunjgr mentioned this pull request May 11, 2026

feat(external): support Hive-style partitioned Parquet external tables #24329

Open

7 tasks

aunjgr approved these changes May 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(external): support Hive-style partitioned Parquet external tables#24321

feat(external): support Hive-style partitioned Parquet external tables#24321
iamlinjunhong wants to merge 4 commits intomatrixorigin:mainfrom
iamlinjunhong:mp-hive

iamlinjunhong commented May 9, 2026

Uh oh!

qodo-code-review Bot commented May 9, 2026

Uh oh!

iamlinjunhong commented May 11, 2026

Correctness / UX

Hardening

Verified correct

Uh oh!

aunjgr left a comment

Uh oh!

aunjgr left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

iamlinjunhong commented May 9, 2026

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

Uh oh!

qodo-code-review Bot commented May 9, 2026

Uh oh!

iamlinjunhong commented May 11, 2026

Correctness / UX

Hardening

Verified correct

Uh oh!

aunjgr left a comment

Choose a reason for hiding this comment

Resolved (both #24321 and #24329)

Still pending

Important: #24321 vs #24329

Uh oh!

aunjgr left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants