feat(cubesql): merge view joins on a shared cube member into a single CubeScan #10977
paveltiunov wants to merge 20 commits
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## master #10977 +/- ##
==========================================
+ Coverage 83.32% 83.54% +0.21%
==========================================
Files 255 256 +1
Lines 77261 78588 +1327
==========================================
+ Hits 64379 65655 +1276
- Misses 12882 12933 +51
Claude finished @paveltiunov's task in 2m 14s
Claude Code Review
Verdict: LGTM with a few nits. No correctness blockers found. 1 low (test gap), 2 nits (cleanliness/efficiency).
Summary: The change cleanly generalizes …
Findings:
- [Low] Test coverage gap for RIGHT JOIN and FULL JOIN.
- [Nit] Dangling egraph mutations on no-op outcome.
- [Nit] Composite-key cube name only carries the last column.
…eScan Generalize the push-down-cube-join rewrite so that a join between two CubeScans (typically views) on a dimension that resolves to the same underlying cube member is merged into a single CubeScan, just like the existing __cubeJoinField cube-to-cube join. A view dimension keeps its original cube.dimension path in alias_member, which is used to detect that both sides of the equi-join reference the same shared key. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Mirror the motivating query exactly: SELECT c.customer_city, measure(o.revenue), measure(c.avg_age) FROM customers_view c LEFT JOIN orders_view o ON o.customer_city = c.customer_city GROUP BY 1 and assert it merges into a single grouped multi-fact CubeScan. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
When merging a join between two views on a shared cube member, the downstream multi-fact query is rendered as a FULL OUTER JOIN over the shared key. To recover the requested join semantics, the rewrite now adds a measure 'set' filter on each side that must be present:
- INNER: both sides required
- LEFT: left side required
- RIGHT: right side required
- FULL: no extra filter
Branch presence is detected via a measure of the side (the grouping key is COALESCEd across sides downstream, so it cannot tell sides apart). Covered with left/inner group-by tests.
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…e filter
Detect side presence with the side's join-key dimension instead of an arbitrary measure. The join key is always available and is the actual shared-key marker, avoiding the nullable-measure caveat and the case where a side has no selected measure.
- LEFT: left join key must be set
- RIGHT: right join key must be set
- INNER: both join keys must be set
- FULL: no extra filter
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
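The join-type table in this commit message can be sketched as a small function. This is a minimal illustration, not the code in the PR: `JoinType` and `presence_filter_members` are hypothetical names, and the returned strings stand for the join-key dimensions that would receive a `set` filter.

```rust
/// SQL join type as written in the query (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
}

/// Returns the join-key dimensions that need a `set` filter so that a
/// FULL OUTER stitch over the shared key behaves like the requested join.
pub fn presence_filter_members(
    join_type: JoinType,
    left_key: &str,
    right_key: &str,
) -> Vec<String> {
    match join_type {
        // Both sides must be present.
        JoinType::Inner => vec![left_key.to_string(), right_key.to_string()],
        // Only the preserved side must be present.
        JoinType::Left => vec![left_key.to_string()],
        JoinType::Right => vec![right_key.to_string()],
        // The stitch is already a FULL OUTER JOIN, so nothing is added.
        JoinType::Full => Vec::new(),
    }
}
```

For the motivating LEFT JOIN, calling it with `customers_view.customer_city` as the left key and `orders_view.customer_city` as the right key yields only the left key, which matches the "left join key must be set" row.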
…n dimensions Make the merge gate explicit: the entire join key must resolve to dimensions (or time dimensions) on both sides and to the same underlying cube member. A join key that touches a measure/segment/etc. is rejected and the join falls back to normal (non-merged) handling. Add a negative test that joining two views on measures is not merged. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
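The merge gate described in this commit can be sketched as a predicate over the resolved join-key columns. The types below (`MemberKind`, `KeyColumn`) and the function name are assumptions made for illustration; the real rewrite works on e-graph nodes and `alias_member` metadata rather than on plain structs.

```rust
/// Kind of member a join-key column resolves to (hypothetical for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemberKind {
    Dimension,
    TimeDimension,
    Measure,
    Segment,
}

/// One side of one equi-join condition after resolution.
#[derive(Debug, Clone)]
pub struct KeyColumn {
    pub kind: MemberKind,
    /// Underlying `cube.member` path that the view member points at.
    pub underlying_member: String,
}

fn is_dimension(kind: MemberKind) -> bool {
    matches!(kind, MemberKind::Dimension | MemberKind::TimeDimension)
}

/// True only when every key pair is dimension-to-dimension and both sides
/// resolve to the same underlying cube member.
pub fn join_key_is_mergeable(left: &[KeyColumn], right: &[KeyColumn]) -> bool {
    !left.is_empty()
        && left.len() == right.len()
        && left.iter().zip(right.iter()).all(|(l, r)| {
            is_dimension(l.kind)
                && is_dimension(r.kind)
                && l.underlying_member == r.underlying_member
        })
}
```

A key that touches a measure or a segment makes the predicate return `false`, which corresponds to the join falling back to normal handling.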
- Add #[cfg(test)] to the test_cube_join_views module so it is not compiled into non-test builds (fixes unresolved pretty_assertions and unused-import errors under clippy -D warnings and the native builds).
- Remove the unused right_filters_var from push_down_cube_join.
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
…shift) Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
21aa80a to 9a682af
Claude finished @cursor[bot]'s task in 46s
Claude Code Review
Verdict: LGTM. No correctness blockers. The merge gate ("join key fully within dimensions, same underlying member on both sides") is conservative, so non-view/unrelated joins fall through unchanged. The …
Notes: 1 low, 2 nits; all re-affirmed from the prior review pass
- [Low] No test for RIGHT or FULL join semantics.
- [Nit] Egraph/subst mutation runs before the alias_to_cube validation.
- [Nit] Composite-key cube name only carries the last column.
Style: the verbose …
…join key
Move the shared-member view-join merge out of push_down_cube_join (which runs on the always-ungrouped raw join) into a new rule that matches an Aggregate over the join. The merge now only fires when:
- the query is grouped (an Aggregate sits over the join), and
- the GROUP BY is exactly the shared join key.
This rejects ungrouped queries (e.g. SELECT * over the join) and queries that group by a non-join-key dimension, both of which would otherwise produce an incorrect multi-fact pushdown. push_down_cube_join is restored to its original __cubeJoinField-only behavior.
Tests: grouped left/inner merge (with join-key set filters); ungrouped, group-by-mismatch, and measure-key joins are not merged.
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
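The two conditions in this commit can be sketched as a single check. The function name and the representation are assumptions for illustration: `group_by` is `None` for an ungrouped query, and both lists hold underlying member names.

```rust
use std::collections::BTreeSet;

/// Decides whether the shared-member merge may fire.
/// `group_by` is `None` when no Aggregate sits over the join.
pub fn should_merge(group_by: Option<&[&str]>, join_key: &[&str]) -> bool {
    match group_by {
        // Ungrouped query, e.g. SELECT * over the join: never merge.
        None => false,
        Some(columns) => {
            let grouped: BTreeSet<&str> = columns.iter().copied().collect();
            let key: BTreeSet<&str> = join_key.iter().copied().collect();
            // The GROUP BY must be exactly the shared join key.
            !key.is_empty() && grouped == key
        }
    }
}
```

Comparing sets rather than ordered lists is a choice made for this sketch, so that the order of GROUP BY expressions does not matter.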
Claude finished @cursor[bot]'s task in 1m 31s
Claude Code Review
Verdict: LGTM. No correctness blockers. The latest commit ( …
Findings: 0 high · 0 medium · 1 low · 2 nits, all left as inline comments. Re-affirms two prior threads (composite-key cube name, egraph mutation before final validation) against the new location; the RIGHT/FULL test gap is unchanged.
Notes:
- [Low] No RIGHT or FULL join test. Covered: grouped LEFT, grouped INNER, ungrouped/group-by-mismatch/measure-key (negatives). The semantics table lists FULL (no filters) and RIGHT (right-side …
- [Nit] Composite-key cube name keeps only the last column's cube.
- [Nit] Egraph/subst mutation runs before the iteration that decides success.
Style: the verbose …
The shared-member view-join merge produces a multi-fact query that is only handled correctly by the Tesseract SQL planner (FULL OUTER stitch over the shared key). Add an `enable_tesseract_sql_planner` config option (read from CUBEJS_TESSERACT_SQL_PLANNER) and only fire the rewrite when it is enabled. Add a test that the merge does not happen when Tesseract is disabled. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
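A minimal sketch of such a gate follows. Only the environment variable name comes from this commit; the function names and the accepted values ("true" and "1", case-insensitive) are assumptions, and the real CubeSQL config may parse the value differently.

```rust
use std::env;

/// Interprets a raw environment value as a boolean flag.
/// Assumption: only "true" and "1" (case-insensitive) enable the flag.
pub fn parse_bool_flag(raw: Option<&str>) -> bool {
    match raw {
        Some(value) => {
            let value = value.trim().to_ascii_lowercase();
            value == "true" || value == "1"
        }
        None => false,
    }
}

/// Reads CUBEJS_TESSERACT_SQL_PLANNER from the process environment.
pub fn tesseract_sql_planner_enabled() -> bool {
    parse_bool_flag(env::var("CUBEJS_TESSERACT_SQL_PLANNER").ok().as_deref())
}

/// The gate itself: the rewrite is skipped unless the planner is enabled.
pub fn may_merge_shared_member_join(tesseract_enabled: bool, key_is_mergeable: bool) -> bool {
    tesseract_enabled && key_is_mergeable
}
```

Keeping the parsing separate from the environment lookup makes the flag logic testable without mutating process state.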
Claude finished @cursor[bot]'s task in 1m 6s
Claude Code Review
Verdict: LGTM. New commit …
Findings since last review: 0 high · 0 medium · 0 low · 0 nits new. 3 prior unresolved threads (1 low, 2 nits) still apply against the new line numbers; not re-posting.
Re-affirmed prior threads (still open, no new comment):
On the new commit specifically: the gate is read once at rule-construction time ( …
Document that joining two views on a dimension that resolves to the same underlying cube member (grouped by that key) triggers a multi-fact query in the SQL API, including the join-type semantics (inner/left/right/full) and the Tesseract requirement. Add the behavior to the multi-fact views page and a cross-referencing section in the SQL API joins reference. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Claude finished @cursor[bot]'s task in 48s
Claude Code Review
Verdict: LGTM. Latest commit ( …
Findings on this turn: 0 high · 0 medium · 0 low · 0 nits new. Re-affirmed 3 prior unresolved threads (1 low, 2 nits); still applicable, not duplicating.
Docs review ( …
Re-affirmed prior threads (still open, no new comment):
…e/view A CubeScan can expose members from multiple cubes/views, so enforce that every join-key column on each side resolves to the same cube/view. A mismatch would make the merged join hint ambiguous, so such joins are no longer merged. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
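The single-cube-per-side check can be sketched as follows, assuming join-key columns are available as `view.member` paths. Both function names are hypothetical; the actual rule inspects the members recorded on each CubeScan.

```rust
/// Returns the single cube/view that every member path (`view.member`)
/// belongs to, or `None` when the list is empty, a path has no `.`,
/// or the members come from more than one cube/view.
pub fn single_source<'a>(members: &[&'a str]) -> Option<&'a str> {
    let mut source: Option<&'a str> = None;
    for &member in members {
        let (name, _) = member.split_once('.')?;
        match source {
            None => source = Some(name),
            Some(seen) if seen == name => {}
            // A second cube/view appeared: the join hint would be ambiguous.
            Some(_) => return None,
        }
    }
    source
}

/// Both sides of the join must resolve to exactly one cube/view each.
pub fn sides_are_unambiguous(left_keys: &[&str], right_keys: &[&str]) -> bool {
    single_source(left_keys).is_some() && single_source(right_keys).is_some()
}
```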
Claude finished @cursor[bot]'s task in 46s
Claude Code Review
Verdict: LGTM. The new commit …
Findings on this turn: 0 high · 0 medium · 0 low · 0 nits new. Resolved 1 prior thread (composite-key); 2 prior threads still apply (1 low, 1 nit); not re-posting.
Re-affirmed prior threads (still open, no new comment):
…l merge; test RIGHT/FULL
Address review nits:
- Construct the join-semantics set filters and mutate subst inside the innermost iteration, right before returning true, so a false return never leaves a stale subst entry or orphan filter e-nodes.
- Add RIGHT JOIN and FULL JOIN tests to lock in the join-type table (right join key set filter; no filter for full).
(The composite-key single-cube-per-side check was already added in a prior commit.)
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Claude finished @cursor[bot]'s task in 50s
Claude Code Review
Verdict: LGTM. New commit …
Findings on this turn: 0 high · 0 medium · 0 low · 0 nits new.
Minor observation, not blocking: the FULL test asserts …
…ush-down
Introduce a MultiFactJoinWrapper intermediate egraph node so the shared-member
view-join rewrite is no longer a single aggregate-bound rule. The rewrite now
splits into:
- shared-member-join-to-wrapper: Join(CubeScan, CubeScan) -> wrapper(CubeScan)
- shared-member-join-extend-wrapper: Join(wrapper(CubeScan), CubeScan) ->
wrapper(CubeScan), enabling joins of 3+ views
- multi-fact-join-wrapper-filter-push-down: Filter(wrapper) ->
wrapper(Filter), pushing WHERE/ON filters into the merged scan
- aggregate-multi-fact-join-wrapper: unwrap only when GROUP BY matches the
recorded join key
The wrapper records the join key (as underlying cube members) so the finalize
rule can verify the GROUP BY, while joins and filters compose beforehand.
Adds tests for 3-way and 4-way FULL joins, a WHERE filter, and an ON-clause
filter, in addition to the existing 2-way LEFT/INNER/RIGHT/FULL coverage.
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
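The rule split above can be illustrated with a toy tree rewriter. This is a sketch under strong assumptions: the real rules are e-graph rewrites inside CubeSQL, whereas the `Plan` enum, `rewrite`, and `same_set` below are hypothetical and only model the to-wrapper, extend-wrapper, and finalize steps (filter push-down is omitted).

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    CubeScan { members: Vec<String> },
    /// Equi-join whose key is given as the underlying cube member.
    Join { left: Box<Plan>, right: Box<Plan>, key: String },
    /// Merged scan plus the recorded join key.
    Wrapper { members: Vec<String>, join_key: Vec<String> },
    Aggregate { input: Box<Plan>, group_by: Vec<String> },
}

fn same_set(a: &[String], b: &[String]) -> bool {
    let a: BTreeSet<&String> = a.iter().collect();
    let b: BTreeSet<&String> = b.iter().collect();
    a == b
}

pub fn rewrite(plan: Plan) -> Plan {
    match plan {
        Plan::Join { left, right, key } => match (rewrite(*left), rewrite(*right)) {
            // to-wrapper: Join(CubeScan, CubeScan) -> Wrapper
            (Plan::CubeScan { members: mut merged }, Plan::CubeScan { members }) => {
                merged.extend(members);
                Plan::Wrapper { members: merged, join_key: vec![key] }
            }
            // extend-wrapper: Join(Wrapper, CubeScan) -> Wrapper (left-deep only)
            (Plan::Wrapper { members: mut merged, mut join_key }, Plan::CubeScan { members }) => {
                merged.extend(members);
                if !join_key.contains(&key) {
                    join_key.push(key);
                }
                Plan::Wrapper { members: merged, join_key }
            }
            // Any other shape is left as a regular join.
            (left, right) => Plan::Join { left: Box::new(left), right: Box::new(right), key },
        },
        // finalize: unwrap only when GROUP BY matches the recorded join key
        Plan::Aggregate { input, group_by } => match rewrite(*input) {
            Plan::Wrapper { members, join_key } if same_set(&group_by, &join_key) => {
                Plan::CubeScan { members }
            }
            other => Plan::Aggregate { input: Box::new(other), group_by },
        },
        other => other,
    }
}
```

Recording the join key on the wrapper is what lets the joins compose first and the GROUP BY check happen last; in this toy model a wrapper that fails the check simply stays under the aggregate, standing in for the fallback to standard join handling.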
Claude finished @cursor[bot]'s task in 2m 46s
Claude Code Review
Verdict: LGTM. New commits …
Findings on this turn: 0 high · 0 medium · 3 low · 1 nit, left inline.
Inline findings:
Other notes (not blocking, not posted inline):
- Remove duplicate #[allow(clippy::too_many_arguments)] on merge_shared_member_join
- Document the left-deep-only assumption of shared-member-join-extend-wrapper (right-associative a JOIN (b JOIN c) is not chained)
- Document that finalize only accepts plain-column GROUP BY (wrapped exprs like DATE_TRUNC fall back to standard handling)
- Add a 3-way LEFT join test pinning per-pass presence-filter accumulation through the extend-wrapper rule
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Claude finished @cursor[bot]'s task in 42s
Claude Code Review
Verdict: LGTM. New commit …
Findings on this turn: 0 high · 0 medium · 0 low · 0 nits new. All 4 prior open Claude threads resolved.
Nothing further to flag. The PR is in a good state.
A join written directly on DATE_TRUNC (ON DATE_TRUNC(g, a.ts) = DATE_TRUNC(g, b.ts)) is lowered by the planner to Filter(<eq>, CrossJoin(...)) rather than a column equi-join, so it never reached the shared-member merge. Add a shared-time-member-cross-join-to-wrapper rule that recognizes this shape, resolves both truncated columns to the same underlying time member at the same granularity, and merges into an INNER multi-fact CubeScan (both keys marked present). Grouping by DATE_TRUNC already worked via referenced-column collapse; the finalize comment is corrected accordingly. Adds time dimensions to the test views and tests for join-on-raw-time + GROUP BY DATE_TRUNC (LEFT) and join-on-DATE_TRUNC + GROUP BY DATE_TRUNC (INNER). Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
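The matching condition for this rule can be sketched as a comparison of the two truncated columns. `TruncatedColumn` and `shared_time_key` are hypothetical names, and comparing granularities case-insensitively is an assumption of this sketch rather than documented behavior.

```rust
/// One side of `DATE_TRUNC(g, col) = DATE_TRUNC(g, col)` after resolution.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TruncatedColumn {
    pub granularity: String,
    /// Underlying `cube.member` path of the truncated time column.
    pub underlying_member: String,
}

/// Returns the shared (member, granularity) join key, or `None` when the
/// two sides differ in member or granularity and must not be merged.
pub fn shared_time_key(
    left: &TruncatedColumn,
    right: &TruncatedColumn,
) -> Option<(String, String)> {
    let same_member = left.underlying_member == right.underlying_member;
    // Assumption: 'DAY' and 'day' denote the same granularity.
    let same_grain = left.granularity.eq_ignore_ascii_case(&right.granularity);
    if same_member && same_grain {
        Some((
            left.underlying_member.clone(),
            left.granularity.to_ascii_lowercase(),
        ))
    } else {
        None
    }
}
```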
I'll analyze this and get back to you.
Add a shared customer_state dimension and tests for joining two views on a composite key (customer_city + customer_state) and grouping by both, plus a negative test that a partial GROUP BY (only one of the two join keys) does not merge. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Claude finished @cursor[bot]'s task in 3m 39s
Claude Code Review
Verdict: LGTM. Two new commits since the last pass: …
Findings on this turn: 0 high · 0 medium · 2 low · 0 nits new. All 7 prior Claude threads resolved or no longer applicable.
Inline findings:
Other notes (not blocking, not posted inline):
..Default::default()
}
)
}
[low] Test name says INNER but the SQL uses LEFT JOIN.
The test asserts INNER semantics (set_filter("orders_view.created_at") AND set_filter("customers_view.created_at") — both sides marked present), but the SQL uses LEFT JOIN. The merge happens because DataFusion lowers a non-equi LEFT JOIN to Filter(<eq>, CrossJoin) upstream (see the existing comment at engine/df/optimizers/filter_push_down.rs:114 — "Current DataFusion version plans complex joins as Filter(CrossJoin)"), which silently strips the LEFT-outer semantics before this rule sees the plan.
It's a pre-existing upstream behavior, not introduced here, but as written the test reads as "writing LEFT JOIN on DATE_TRUNC returns INNER results" without flagging that. Suggest either using INNER JOIN in the SQL (matching the rule's docstring on members.rs:594-599 "Merge an INNER join expressed as a date-truncated equality"), or adding a comment to the test that the LEFT keyword is intentional to exercise the upstream-lowering path and the loss-of-LEFT is expected.
…sion The planner turns 'ON a.dim = b.dim AND DATE_TRUNC(g, a.ts) = DATE_TRUNC(g, b.ts)' into Filter(<trunc eq>, Join(a.dim = b.dim, ...)). The column join becomes a MultiFactJoinWrapper; add a multi-fact-join-wrapper-absorb-time-key rule that folds the truncated time member into the wrapper's recorded join key (marking both time columns present, since a post-join equality is INNER on that key) so a query grouped by both the time dimension and the dimension merges into one multi-fact CubeScan. Adds a test for the mixed DATE_TRUNC + dimension join. Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
Claude finished @cursor[bot]'s task in 2m 6s
Claude Code Review
Verdict: LGTM. New commit …
Findings on this turn: 0 high · 0 medium · 0 low · 0 nits new. Posted no new inline comments.
Re-affirmed prior open threads (still apply, no new comment):
Other notes (not blocking, not posted inline):
…P BY
Per review: the multi-fact stitch happens at the GROUP BY grain, so the join
key's granularity must equal the GROUP BY granularity for a time member.
Previously only the underlying member name was recorded, so joining on
DATE_TRUNC('month', ...) while grouping by DATE_TRUNC('day', ...) merged and
stitched at day grain, diverging from the written join.
Store join_members as (underlying member, Option<granularity>) on the wrapper
(None for plain dimensions, Some(grain) for DATE_TRUNC time keys), and at
finalize extract each GROUP BY expression's granularity from its original_expr
(rather than referenced_expr, which drops the grain) and require the full
(member, granularity) sets to match.
Replaces the raw-time-join test with a granularity-mismatch negative test.
Co-authored-by: Pavel Tiunov <pavel.tiunov@gmail.com>
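The strict finalize check described in this commit can be sketched as a set comparison over (member, granularity) pairs. `JoinMember` and `group_by_matches` are hypothetical names; the pair layout follows the commit message (None for plain dimensions, Some(grain) for DATE_TRUNC time keys).

```rust
use std::collections::BTreeSet;

/// (underlying member, optional granularity), as recorded on the wrapper.
pub type JoinMember = (String, Option<String>);

/// Strict match: the full (member, granularity) sets must be equal.
pub fn group_by_matches(join_members: &[JoinMember], group_by: &[JoinMember]) -> bool {
    let join: BTreeSet<&JoinMember> = join_members.iter().collect();
    let grouped: BTreeSet<&JoinMember> = group_by.iter().collect();
    !join.is_empty() && join == grouped
}
```

Under this rule a join on `DATE_TRUNC('month', ...)` grouped by `DATE_TRUNC('day', ...)` compares `Some("month")` against `Some("day")` and is rejected.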
Claude finished @cursor[bot]'s task in 2m 15s
Claude Code Review
Verdict: Request changes, 1 medium. Latest commit …
Findings on this turn: 0 high · 1 medium · 0 low · 0 nits. Resolved 1 prior thread ( …
Other notes (not blocking):
// already recorded on the left wrapper (for chained 3+ view joins).
let mut join_member_names: Vec<(String, Option<String>)> = shared_left_keys
    .iter()
    .map(|k| (resolve_underlying(k), None))
[medium] Granularity-strict matching silently breaks the documented "raw time column + GROUP BY DATE_TRUNC" shape.
merge_shared_member_join records every shared key as (name, None) (line 3378). finalize_shared_member_join now requires the GROUP BY pair to equal the join key exactly — a DATE_TRUNC('day', c.created_at) group-by extracts (created_at, Some("day")), which does not match (created_at, None). So a query like
LEFT JOIN o ON o.created_at = c.created_at
GROUP BY DATE_TRUNC('day', c.created_at)
no longer merges. That's exactly the first shape documented as supported in docs-mintlify/docs/data-modeling/multi-fact-views.mdx:374-381 ("Join on the raw time column, group by DATE_TRUNC") and was previously pinned by test_left_join_raw_time_group_by_date_trunc, which was deleted in 379a2a1 and replaced with a granularity-mismatch negative, so no test catches the regression.
If the strict (member, grain) match is the intended design (the GROUP BY grain must match the join grain — and a raw column join is "no grain" so it never matches a truncated GROUP BY), then the docs example needs to be removed/rewritten and the PR description's "Both shapes of the common time multi-fact pattern are supported" claim updated. Otherwise the resolution rule should treat a (name, None) join key on a time member as compatible with a (name, Some(grain)) GROUP BY (the user joined at the row grain and is free to group by any truncation), restoring the deleted test.
Either way worth aligning the code with the docs before this merges.
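The second option (treating a grain-less join key as compatible with any GROUP BY truncation) could look roughly like the sketch below. It is one possible reading of the suggestion, not code from the PR; `JoinMember`, `key_compatible`, and `group_by_matches_relaxed` are hypothetical names.

```rust
/// (underlying member, optional granularity), as recorded on the wrapper.
pub type JoinMember = (String, Option<String>);

/// A join key without a grain (raw column join) accepts any GROUP BY grain
/// on the same member; a join key with a grain requires an exact match.
pub fn key_compatible(join: &JoinMember, group: &JoinMember) -> bool {
    join.0 == group.0 && (join.1.is_none() || join.1 == group.1)
}

/// Every join key must be covered by a GROUP BY entry and vice versa.
pub fn group_by_matches_relaxed(join_members: &[JoinMember], group_by: &[JoinMember]) -> bool {
    !join_members.is_empty()
        && join_members
            .iter()
            .all(|j| group_by.iter().any(|g| key_compatible(j, g)))
        && group_by
            .iter()
            .all(|g| join_members.iter().any(|j| key_compatible(j, g)))
}
```

With this rule the raw-column join grouped by `DATE_TRUNC('day', ...)` is accepted, while a month-grain join grouped by day is still rejected.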

Summary
When the Tesseract SQL planner is enabled, the SQL API now merges a join of
views on a dimension that resolves to the same underlying cube member into a
single multi-fact `CubeScan`, instead of erroring out. The planner stitches the
fact groups with a `FULL OUTER JOIN` over the shared key; the requested SQL join
type (`INNER`/`LEFT`/`RIGHT`/`FULL`) is reconstructed with `set` filters on the
join-key dimension(s).
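Motivating query:
    SELECT c.customer_city, measure(o.revenue), measure(c.avg_age)
    FROM customers_view c
    LEFT JOIN orders_view o ON o.customer_city = c.customer_city
    GROUP BY 1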
How it works
The rewrite is a local CubeSQL e-graph rewrite, structured around a new
intermediate `MultiFactJoinWrapper` node that carries the join key (as
underlying cube members) so joins and filters can compose before the query is
finalized at the aggregate:
- `shared-member-join-to-wrapper`: `Join(CubeScan, CubeScan)` → `MultiFactJoinWrapper(CubeScan)`
- `shared-member-join-extend-wrapper`: `Join(MultiFactJoinWrapper(CubeScan), CubeScan)` → `MultiFactJoinWrapper(CubeScan)`, enabling joins of 3+ views (left-deep)
- `shared-time-member-cross-join-to-wrapper`: `Filter(DATE_TRUNC = DATE_TRUNC, CrossJoin(CubeScan, CubeScan))` → `MultiFactJoinWrapper(CubeScan)`, supporting joins written directly on `DATE_TRUNC` (which the planner lowers to a filtered cross join, i.e. `INNER`)
- `multi-fact-join-wrapper-filter-push-down`: `Filter(MultiFactJoinWrapper)` → `MultiFactJoinWrapper(Filter)`, pushing `WHERE`/`ON` filters into the merged scan
- `aggregate-multi-fact-join-wrapper`: unwraps the wrapper only when the `GROUP BY` exactly matches the recorded join key
Time dimensions
Both shapes of the common time multi-fact pattern are supported:
- Join on the raw time column and `GROUP BY DATE_TRUNC('day', …)` (grouped column emitted as a `timeDimensions` entry with granularity).
- Join on `DATE_TRUNC('day', a.ts) = DATE_TRUNC('day', b.ts)` (`INNER`), requiring both truncated columns to resolve to the same underlying time member at the same granularity.
Guards
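The merge only happens when:
- The Tesseract SQL planner is enabled (`CUBEJS_TESSERACT_SQL_PLANNER`).
- Both sides of the join key resolve to the same underlying cube member, and the join key is composed only of dimensions.
- The query is grouped by exactly the join key. Ungrouped queries (`SELECT *`) and group-by-a-different-dimension queries fall back to standard join handling (and error as before).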
Tests
`compile::test::test_cube_join_views`:
- `LEFT`/`INNER`/`RIGHT`/`FULL` join semantics
- 3-way and 4-way `FULL` joins; 3-way `LEFT` pinning per-pass presence filters
- `WHERE` filter pushed into the merged scan; `ON`-clause filter on a single fact
- Join on the raw time column + `GROUP BY DATE_TRUNC` (`LEFT`); join on `DATE_TRUNC` + `GROUP BY DATE_TRUNC` (`INNER`)
- Not merged: `SELECT *`, group-by non-join-key, join on a measure
Docs
`docs-mintlify/docs/data-modeling/multi-fact-views.mdx` and the SQL API joins
reference document the behavior, requirements, N-way joins, time-dimension
joins, filtering, and the join-type semantics table.