feat(external): support Hive-style partitioned Parquet external tables #24321
iamlinjunhong wants to merge 4 commits into matrixorigin:main
aunjgr
left a comment
Follow-up review. Several of my earlier concerns are resolved; a few remain.
Resolved (both #24321 and #24329)
- URL-decoding dead code — `PathUnescape` removed; `%` in directory names is now rejected with an explicit error message.
- `collectBareColNames` — CAST/CASE are `Expr_F` nodes, so they're now reached via the existing walk. Subqueries (`Expr_Sub`) still aren't walked but fall through to row filters (conservatively safe).
- `validateLiteralVecBinary` replaced with `vectorBinaryEnvelopeInBounds` — a bounds-only check; the actual decode delegates to `vector.UnmarshalBinary` with an Oid/Length post-check.
- `parseHiveOptionKV` legacy-JSON handling — now documented with per-key skip-guards and a regression comment.
- `NOT NULL DEFAULT` + `__HIVE_DEFAULT_PARTITION__` error message — now actionable.
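For context on the resolved `%` handling, here is a minimal sketch of the check described above. The helper name is hypothetical, not the actual MatrixOne code; it only illustrates "reject `%` with an explicit error instead of URL-decoding":

```go
package main

import (
	"fmt"
	"strings"
)

// rejectPercentInPartitionDir is a hypothetical helper illustrating the
// resolved behavior: partition directory names are never URL-decoded;
// any '%' is rejected up front with an explicit error.
func rejectPercentInPartitionDir(dir string) error {
	if strings.Contains(dir, "%") {
		return fmt.Errorf("hive partition directory %q contains '%%': percent-encoded names are not supported", dir)
	}
	return nil
}

func main() {
	fmt.Println(rejectPercentInPartitionDir("month=01") == nil)    // true: plain name accepted
	fmt.Println(rejectPercentInPartitionDir("name=a%20b") != nil)  // true: encoded name rejected
}
```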
Still pending
- `__HIVE_DEFAULT_PARTITION__` for VARCHAR columns (hive_partition_fill.go) — still unconditionally NULLs regardless of column type. String values that legitimately contain that literal are silently lost.
- Zero-padded non-INT values — INT works via numeric compare (tested); for VARCHAR, `month='01'` vs `WHERE month='1'` still won't prune. No test or note for the VARCHAR case.
- Range-predicate pruning (`<`, `<=`, `>`, `>=`, `BETWEEN`, `NOT IN`) — still not implemented; falls through to row-level filters. Acknowledged in SQL test comments as P0 scope.
- Serial directory listing — `discoverRecursive` is single-threaded; the `maxListCalls=10000` cap is preserved. Acceptable for P0.
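To make the zero-padding point concrete, a small illustrative sketch (hypothetical helper names, not MatrixOne code): an INT predicate can be compared numerically, which normalizes `"01"` to `1`, while a VARCHAR predicate compares bytes, so `"01" != "1"` and the pruner cannot safely match the directory, leaving the work to row-level filters:

```go
package main

import (
	"fmt"
	"strconv"
)

// equalAsInt compares a partition directory value against a predicate
// value numerically, as for INT partition columns: "01" parses to 1.
func equalAsInt(dirVal, predVal string) bool {
	a, errA := strconv.ParseInt(dirVal, 10, 64)
	b, errB := strconv.ParseInt(predVal, 10, 64)
	return errA == nil && errB == nil && a == b
}

// equalAsVarchar compares byte-for-byte, as for VARCHAR partition
// columns: zero padding is significant, so "01" does not equal "1".
func equalAsVarchar(dirVal, predVal string) bool {
	return dirVal == predVal
}

func main() {
	fmt.Println(equalAsInt("01", "1"))     // true: month=01 matches WHERE month = 1
	fmt.Println(equalAsVarchar("01", "1")) // false: byte compare misses, no pruning
}
```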
Important: #24321 vs #24329
I notice you opened #24329 with the same feature and a different file split (it adds `pkg/fileservice/path.go` and drops `pkg/sql/colexec/external/reader_parquet.go`).
#24329 will regress behavior on the current tree. On main since commit a2ac9f22f (Feb 2026), `scanParquetFile` no longer exists; Parquet is dispatched through `NewParquetReader` in `reader_parquet.go` (`external.go:153`). The #24329 diff:
- Places the `refreshPartitionValues` / `fillVirtualColumns` / rowCountOnly logic inside `scanParquetFile`, which is gone post-rebase.
- Deletes `reader_parquet.go`, which breaks `external.go:153` and `parquet_test.go:895,1572`.
Net: after rebase, Hive virtual-column fill would never execute, and the build would break.
#24321 avoids this by editing both `scanParquetFile` and `reader_parquet.go`; only its `reader_parquet.go` edits are load-bearing post-rebase.
The `fileservice/path.go` change in #24329 (adding `=` to the allowed characters) is already on main via #24021 (8bb4c4e05), so that edit is a no-op conflict.
Recommendation: merge this PR (#24321) and close #24329. If you want the path_test.go additions from #24329, cherry-pick the test into this PR.
aunjgr
left a comment
Upgrading my prior comment to approve. On reflection, the three pending items don't warrant blocking:
- `__HIVE_DEFAULT_PARTITION__` for VARCHAR — silently NULLs a literal that happens to match the Hive sentinel. Consistent with Hive/Spark semantics. Please document this in user-facing docs and file a follow-up to scope the sentinel to non-string types only.
- Zero-padded VARCHAR — a conservative miss in pruning, not a correctness issue; the row-level filter still runs. Follow-up.
- Range-predicate pruning (`<`, `<=`, `>`, `>=`, `BETWEEN`, `NOT IN`) — perf only. Follow-up.
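The suggested follow-up for the sentinel could look roughly like the sketch below. The helper name and signature are hypothetical, not the current `hive_partition_fill.go` code; the point is only that the NULL substitution would be gated on the column not being a string type:

```go
package main

import "fmt"

// hiveDefaultPartition is the sentinel Hive writes for a NULL partition value.
const hiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

// partitionValueIsNull sketches the proposed follow-up (hypothetical helper):
// treat the sentinel as NULL only for non-string partition columns, so a
// VARCHAR value that legitimately equals the sentinel text is preserved
// instead of being silently nulled.
func partitionValueIsNull(raw string, isStringType bool) bool {
	return raw == hiveDefaultPartition && !isStringType
}

func main() {
	fmt.Println(partitionValueIsNull(hiveDefaultPartition, false)) // true: e.g. INT column, fill NULL
	fmt.Println(partitionValueIsNull(hiveDefaultPartition, true))  // false: VARCHAR keeps the literal
}
```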
Please still address my comment on #24329 before landing this one — same feature shouldn't ship twice.
LGTM.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24320
What this PR does / why we need it:
support Hive-style partitioned Parquet external tables