feat(cubestore): push Sort(fetch=N) to workers for ORDER BY group-by … by paveltiunov · Pull Request #11027 · cube-js/cube

paveltiunov · 2026-06-06T20:12:29Z

…column LIMIT queries

When a query has ORDER BY on GROUP BY columns with a LIMIT but doesn't match the index sort prefix (so the existing limit_and_reverse optimization cannot apply), propagate a worker_sort_and_limit through ClusterSendNode serialization to workers. Workers wrap their aggregate output with SortExec(fetch=N), bounding memory to O(limit) via DataFusion's heap-based sort instead of materializing all groups.

Check List

Tests have been run in packages where changes have been made if available
Linter has been run for changed code
Tests for the changes have been added if not covered yet
Docs have been added / updated if required

…column LIMIT queries When a query has ORDER BY on GROUP BY columns with a LIMIT but doesn't match the index sort prefix (so the existing limit_and_reverse optimization cannot apply), propagate a worker_sort_and_limit through ClusterSendNode serialization to workers. Workers wrap their aggregate output with SortExec(fetch=N), bounding memory to O(limit) via DataFusion's heap-based sort instead of materializing all groups. Co-authored-by: Cursor <cursoragent@cursor.com>

claude · 2026-06-06T20:12:44Z

Claude finished @paveltiunov's task in 4m 32s —— View job

Code Review

Verdict: Solid optimization with a sound core idea (push SortExec(fetch=N) to workers for bounded-memory top-K when ORDER BY hits GROUP BY columns). A few correctness concerns worth tightening before merge — 3 high, 2 medium, 2 low posted as inline comments.

Details — checklist, findings, scope

High

Schema-index mismatch risk — planning.rs:1001-1013. worker_sort_and_limit stores positions into the scan schema but the worker resolves them against the partial-aggregate output schema. Coincides today; brittle long-term. Store by column name instead.
nulls_first is hard-coded — planning.rs:1010. nulls_first = !ctx.sort_is_asc ignores the user's explicit NULLS FIRST/LAST. A router that keeps nulls and a worker that drops them can return wrong rows. Capture the real flag in sort_to_column_names or restrict push-down to default-NULL queries.
Two parallel push-down mechanisms — optimizations/mod.rs. push_sort_to_workers (post-physical) and worker_sort_and_limit (logical, serialized) overlap. Workers re-plan from the logical plan, so the physical path only affects local execution. Worth consolidating or documenting the topology each path serves.

Medium

push_sort_to_workers over-skips when limit_and_reverse is set — optimizations/mod.rs:285-295. A partial-prefix index match still benefits from the bounded sort; bailing out leaves the post-aggregate sort unbounded.
Behavior change in limit_and_reverse reversal — planning.rs:978-986. Now reachable with sort_is_asc=true (default) and empty ctx.sort, paired with the get_limit_for_pushdown relaxation that returns Some(limit) without ORDER BY. Verify all consumers of (limit, false) handle the "limit without explicit order" case deterministically; add a LIMIT N (no ORDER BY) cluster test.

Low

InlineAggregateMode::Single looks unreached — inline_aggregate/mod.rs. Plumbed end-to-end but no path in this PR appears to construct it. Either add coverage or note the intended emitter.
Tests rely on printer string format — sql/mod.rs:5314-5377. worker_plan.contains("Sort, fetch: 3") is fragile; add a NULL-handling test that locks in correctness for (2).

Notes

topk/plan.rs, panic.rs, pretty_printers.rs changes are straightforward plumbing — LGTM.
Removed comment in inline_aggregate_stream.rs is fine.
I posted 7 inline comments on the relevant lines.

· [Branch](https://github.com/cube-js/cube/tree/feat/cubestore-worker-sort-and-limit)

claude · 2026-06-06T20:16:02Z

+                            let output_schema = p.schema();
+                            // Map each sort column to its position in the worker output schema
+                            let mapped: Option<Vec<(usize, bool, bool)>> = sort_cols
+                                .iter()
+                                .map(|col_name| {
+                                    output_schema
+                                        .fields()
+                                        .iter()
+                                        .position(|f| f.name() == col_name)
+                                        .map(|pos| (pos, ctx.sort_is_asc, !ctx.sort_is_asc))
+                                })
+                                .collect();
+                            mapped.map(|cols| (cols, ctx.limit.unwrap()))


Schema-index mismatch risk. output_schema = p.schema() is the raw table-scan schema (here p is a LogicalPlan::TableScan), but at the worker, apply_worker_sort_and_limit resolves the column by index against w.input.schema(), which is the partial-aggregate output schema after push_aggregate_to_workers has rewritten the plan. These happen to coincide today because GROUP BY columns land at positions 0..N in both schemas, but the contract is fragile: any future projection or column reordering between the scan and the partial-aggregate output would silently sort by the wrong column.

Two ways to harden this:

Resolve by name in the worker (apply_worker_sort_and_limit): build sort exprs by searching schema for the field by stored col_name, not by index. Store (name, asc, nulls_first) in worker_sort_and_limit instead of (usize, bool, bool).

Or assert output_schema.field(pos).name() == col_name on both ends.

Option (1) is cleaner and matches how get_limit_for_pushdown already compares by name.

claude · 2026-06-06T20:16:13Z

+                                        .fields()
+                                        .iter()
+                                        .position(|f| f.name() == col_name)
+                                        .map(|pos| (pos, ctx.sort_is_asc, !ctx.sort_is_asc))


nulls_first heuristic doesn't match SQL defaults. Hard-coding nulls_first = !ctx.sort_is_asc means:

ASC → nulls last

DESC → nulls first

That matches DataFusion's default SortOptions, but ctx.sort is derived in sort_to_column_names, which discards each SortExpr's actual nulls_first flag. If a user wrote ORDER BY x ASC NULLS FIRST LIMIT N, the router's SortExec will use nulls_first = true while the worker's pushed-down SortExec(fetch=N) will use nulls_first = false — the worker can drop the very rows the router would keep, producing wrong results.

Either capture the actual nulls_first in sort_to_column_names/PlanningContext, or restrict the push-down to plans whose router sort matches the default.

claude · 2026-06-06T20:16:28Z

+/// When the router plan has `SortExec(fetch=N)` sorting by GROUP BY columns,
+/// and the worker uses `InlineAggregateExec` (streaming aggregate where groups don't overlap),
+/// push a matching `SortExec(fetch=N)` to the worker. DataFusion's SortExec with fetch uses
+/// a bounded heap, so this limits worker output to N rows with O(N) memory.
+fn push_sort_to_workers(
+    p: Arc<dyn ExecutionPlan>,
+) -> Result<Arc<dyn ExecutionPlan>, DataFusionError> {
+    let Some(sort) = p.as_any().downcast_ref::<SortExec>() else {
+        return Ok(p);
+    };
+    let Some(fetch) = sort.fetch() else {
+        return Ok(p);
+    };
+
+    let sort_exprs = sort.expr().clone();
+
+    // Walk down through Projection → InlineAggregate(Final) → ClusterSend
+    let sort_input = sort.input();
+    let (below_proj, col_mapping) = if let Some(proj) = sort_input.as_any().downcast_ref::<ProjectionExec>() {
+        // Map sort column indices through the projection
+        let mapping: Vec<Option<usize>> = proj.expr().iter().map(|(expr, _name)| {
+            expr.as_any()
+                .downcast_ref::<datafusion::physical_plan::expressions::Column>()
+                .map(|c| c.index())
+        }).collect();
+        (proj.input().clone(), Some(mapping))
+    } else {
+        (sort_input.clone(), None)
+    };
+
+    let Some(inline_final) = below_proj.as_any().downcast_ref::<InlineAggregateExec>() else {
+        return Ok(p);
+    };
+    if *inline_final.mode() != InlineAggregateMode::Final {
+        return Ok(p);
+    }
+    let group_key_len = inline_final.group_expr().expr().len();
+
+    // Translate sort expressions to pre-projection column indices
+    let translated_sort_exprs: Vec<_> = sort_exprs.iter().map(|se| {
+        if let Some(col) = se.expr.as_any().downcast_ref::<datafusion::physical_plan::expressions::Column>() {
+            let actual_idx = if let Some(ref mapping) = col_mapping {
+                mapping.get(col.index()).copied().flatten()
+            } else {
+                Some(col.index())
+            };
+            actual_idx.filter(|&idx| idx < group_key_len)
+        } else {
+            None
+        }
+    }).collect();
+
+    // All sort columns must be GROUP BY columns
+    if translated_sort_exprs.iter().any(|x| x.is_none()) {
+        return Ok(p);
+    }
+
+    // Find ClusterSendExec below InlineAggregate(Final), possibly through CheckMemoryExec
+    let final_input = inline_final.input();
+    let (cluster_send, through_check_memory) =
+        if let Some(cs) = final_input.as_any().downcast_ref::<ClusterSendExec>() {
+            (cs, false)
+        } else if let Some(cm) = final_input.as_any().downcast_ref::<CheckMemoryExec>() {
+            if let Some(cs) = cm.input.as_any().downcast_ref::<ClusterSendExec>() {
+                (cs, true)
+            } else {
+                return Ok(p);
+            }
+        } else {
+            return Ok(p);
+        };
+
+    // Don't override if limit_and_reverse is already set
+    if cluster_send.limit_and_reverse.is_some() {
+        return Ok(p);
+    }
+
+    let worker_input = &cluster_send.input_for_optimizations;
+
+    // Verify the worker has InlineAggregateExec(Partial) - confirms groups don't overlap
+    let has_inline_partial = worker_input
+        .as_any()
+        .downcast_ref::<InlineAggregateExec>()
+        .map_or(false, |ia| *ia.mode() == InlineAggregateMode::Partial);
+    if !has_inline_partial {
+        return Ok(p);
+    }
+
+    // Build sort expressions for the worker (same column indices, same options)
+    let worker_schema = worker_input.schema();
+    let worker_sort_exprs: Vec<_> = sort_exprs.iter().zip(translated_sort_exprs.iter()).map(|(se, &mapped_idx)| {
+        let idx = mapped_idx.unwrap();
+        datafusion::physical_expr::PhysicalSortExpr {
+            expr: Arc::new(datafusion::physical_plan::expressions::Column::new(
+                worker_schema.field(idx).name(),
+                idx,
+            )),
+            options: se.options,
+        }
+    }).collect();
+
+    // Wrap the worker plan: SortExec(fetch=N) → InlinePartialAggregate
+    let new_worker_input: Arc<dyn ExecutionPlan> = Arc::new(
+        SortExec::new(LexOrdering::new(worker_sort_exprs), worker_input.clone())
+            .with_fetch(Some(fetch)),
+    );
+
+    // Rebuild ClusterSendExec with the new worker input
+    let new_cluster_send: Arc<dyn ExecutionPlan> = Arc::new(
+        cluster_send.with_changed_schema(new_worker_input, cluster_send.required_input_ordering.clone()),
+    );
+
+    // Re-wrap with CheckMemoryExec if it was present
+    let new_final_child: Arc<dyn ExecutionPlan> = if through_check_memory {
+        final_input.clone().with_new_children(vec![new_cluster_send])?
+    } else {
+        new_cluster_send
+    };
+
+    // Rebuild InlineAggregate(Final) with new child
+    let new_inline_final: Arc<dyn ExecutionPlan> =
+        Arc::clone(&below_proj).with_new_children(vec![new_final_child])?;
+
+    // Rebuild Projection if present
+    let new_sort_input = if sort_input.as_any().downcast_ref::<ProjectionExec>().is_some() {
+        sort_input.clone().with_new_children(vec![new_inline_final])?
+    } else {
+        new_inline_final
+    };
+
+    // Rebuild SortExec with the new subtree
+    p.with_new_children(vec![new_sort_input])
+}
+
+/// Apply worker_sort_and_limit on a WorkerExec by wrapping its child
+/// (the partial aggregate) with SortExec(fetch=N). Must run AFTER
+/// push_aggregate_to_workers and replace_with_inline_aggregate.
+fn apply_worker_sort_and_limit(
+    p: Arc<dyn ExecutionPlan>,
+) -> Result<Arc<dyn ExecutionPlan>, DataFusionError> {
+    use crate::queryplanner::planning::WorkerExec;
+
+    let Some(w) = p.as_any().downcast_ref::<WorkerExec>() else {
+        return Ok(p);
+    };
+    let Some((sort_cols, limit)) = w.worker_sort_and_limit.as_ref() else {
+        return Ok(p);
+    };
+
+    let input = &w.input;
+    let schema = input.schema();
+    let sort_exprs: Vec<_> = sort_cols
+        .iter()
+        .map(|(col_idx, asc, nulls_first)| {
+            datafusion::physical_expr::PhysicalSortExpr {
+                expr: Arc::new(
+                    datafusion::physical_plan::expressions::Column::new(
+                        schema.field(*col_idx).name(),
+                        *col_idx,
+                    ),
+                ),
+                options: datafusion::arrow::compute::SortOptions {
+                    descending: !asc,
+                    nulls_first: *nulls_first,
+                },
+            }
+        })
+        .collect();
+    let sort_exec: Arc<dyn ExecutionPlan> = Arc::new(
+        SortExec::new(LexOrdering::new(sort_exprs), input.clone())
+            .with_fetch(Some(*limit)),


Two parallel mechanisms for the same optimization — please clarify or consolidate.

This PR introduces two distinct paths that both wrap the worker plan with SortExec(fetch=N):

push_sort_to_workers (this file, run from finalize_physical_plan): walks the router's already-built physical plan (SortExec → [Projection] → InlineAggregate(Final) → [CheckMemory] → ClusterSendExec) and rebuilds ClusterSendExec with a SortExec-wrapped input_for_optimizations.

worker_sort_and_limit (logical field on ClusterSendNode, serialized to workers, applied in apply_worker_sort_and_limit against WorkerExec): the worker re-plans the logical plan and wraps with SortExec when WorkerExec is built.

Workers re-plan from the serialized logical plan rather than receiving the router's physical plan, so path (1) doesn't actually reach a remote worker — it only affects local execution. Meanwhile (2) is the path that actually reaches remote workers via serialization.

A few concrete concerns:

Coverage gap: path (2) only fires when ChooseIndex builds the ClusterSendNode. Path (1) fires whenever the post-finalize plan matches the SortExec→…→ClusterSendExec shape. These shouldn't disagree, but they can.

Path (1) over-matches: there's no check that ctx.sort ever existed at logical-planning time. If a later physical-plan rule introduces a SortExec(fetch=N) for an internal reason (e.g. a TopK lowering), path (1) will still push it to the worker.

Path (1) skips when limit_and_reverse.is_some() but path (2) is gated on limit_and_reverse.is_none() at logical time. The skip is conservative-correct but the intent overlaps confusingly with the logical gate.

Consider either removing path (1) (since the logical-level field already covers the worker case end-to-end), or document very clearly which path fires in which deployment topology. Right now this is the single hardest piece of the PR to reason about.

claude · 2026-06-06T20:16:37Z

+        .downcast_ref::<InlineAggregateExec>()
+        .map_or(false, |ia| *ia.mode() == InlineAggregateMode::Partial);
+    if !has_inline_partial {
+        return Ok(p);
+    }
+
+    // Build sort expressions for the worker (same column indices, same options)
+    let worker_schema = worker_input.schema();
+    let worker_sort_exprs: Vec<_> = sort_exprs.iter().zip(translated_sort_exprs.iter()).map(|(se, &mapped_idx)| {
+        let idx = mapped_idx.unwrap();
+        datafusion::physical_expr::PhysicalSortExpr {


Don't skip when limit_and_reverse is set — it can pessimize. limit_and_reverse says "the scan can return at most limit rows from the index in sorted order." That limit is on the scan input, not on the post-aggregate output, and it only matches the index sort prefix. If the router still has a SortExec(fetch=M) above the final aggregate (e.g. because the ORDER BY columns and the index prefix only overlap partially), bailing out here leaves the final aggregate consuming every group with no fetch bound.

Recommend: rather than return Ok(p) when limit_and_reverse.is_some(), just continue with the push-down — SortExec(fetch=M) is correct on top of a limit-scan worker input too.

claude · 2026-06-06T20:16:46Z

                        let limit_and_reverse = if let Some(limit) = limit {
-                            Some((limit, !ctx.sort_is_asc))
+                            let reverse = if ctx.sort.is_some()
+                                && !ctx.sort.as_ref().unwrap().is_empty()
+                            {
+                                !ctx.sort_is_asc
+                            } else {
+                                false
+                            };
+                            Some((limit, reverse))


Subtle behavior change in limit_and_reverse reversal. Previously this was unconditionally Some((limit, !ctx.sort_is_asc)). Now the reverse flag is forced to false when ctx.sort is empty. That's only reachable because get_limit_for_pushdown was simultaneously relaxed (further down) to return Some(limit) even when ctx.sort is None. Without a fast scan to confirm: please double-check that any code path consuming limit_and_reverse (e.g. CubeTable read path, replace_suboptimal_merge_sorts) correctly handles the (limit, false) shape when the worker is NOT performing a reverse scan and the index has its own natural order — there's a risk of returning a non-deterministic "first N rows" silently.

A test for SELECT … FROM data WHERE … LIMIT N (no ORDER BY) on a clustered setup would lock in the new behavior.

claude · 2026-06-06T20:16:55Z

 pub enum InlineAggregateMode {
    Partial,
    Final,
+    Single,


InlineAggregateMode::Single is added end-to-end but I don't see anything in this PR that actually constructs a Single-mode AggregateExec — DataFusion only emits AggregateMode::Single for a one-pass aggregate which Cube's distributed planner typically rewrites into Partial+Final. Is this dead code today, or is there a path I'm missing? If it's anticipatory, a test that exercises it (or at least a comment noting which planner path produces it) would prevent the new branches in inline_aggregate_stream.rs from going untested.

claude · 2026-06-06T20:17:03Z

+                                        match &row.values()[1] {
+                                            TableValue::String(region) => {
+                                                assert_eq!(region, "East");
+                                            }
+                                            _ => panic!("expected string"),
+                                        }
+                                    }
+                                }
+
+                                // Test 4: ORDER BY 1 DESC with LIMIT on non-prefix column
+                                {
+                                    let result = service
+                                        .exec_query(
+                                            "EXPLAIN ANALYZE SELECT region, sum(amount) \
+                                             FROM foo.data GROUP BY 1 ORDER BY 1 DESC LIMIT 2",
+                                        )
+                                        .await?
+                                        .collect()
+                                        .await?;
+
+                                    let worker_row = &result.get_rows()[1];
+                                    let worker_plan = match &worker_row.values()[2] {
+                                        TableValue::String(s) => s.clone(),
+                                        _ => panic!("expected string"),
+                                    };
+                                    assert!(
+                                        worker_plan.contains("Sort, fetch: 2"),
+                                        "Worker should have Sort with fetch=2 for DESC. Plan: {}",
+                                        worker_plan
+                                    );
+                                }
+
+                                // Test 5: Verify correctness of ORDER BY 1 DESC LIMIT 2
+                                {
+                                    let result = service
+                                        .exec_query(
+                                            "SELECT region, sum(amount) \
+                                             FROM foo.data GROUP BY 1 ORDER BY 1 DESC LIMIT 2",
+                                        )
+                                        .await?
+                                        .collect()
+                                        .await?;
+
+                                    assert_eq!(result.len(), 2);
+                                    let regions: Vec<&str> = result
+                                        .get_rows()
+                                        .iter()
+                                        .map(|r| match &r.values()[0] {
+                                            TableValue::String(s) => s.as_str(),
+                                            _ => panic!("expected string"),
+                                        })
+                                        .collect();
+                                    assert_eq!(regions, vec!["West", "South"]);
+                                }
+
+                                Ok::<(), CubeError>(())
+                            })
+                            .await;
+                        Ok::<(), CubeError>(())
+                    })
+                    .await;
+                Ok::<(), CubeError>(())
+            })
+            .await;


Tests are tightly coupled to plan-printer output. Asserting worker_plan.contains("Sort, fetch: 3") will break the next time pp_phys_plan_indented changes its format. Consider extracting a small helper that parses out the SortExec node (or matches Sort, fetch: \d+ with a regex) and asserting on its parameters, so a cosmetic printer change doesn't break correctness tests.

Also missing: a test for the ORDER BY with nulls_first specified explicitly path — given the hard-coded nulls_first = !ctx.sort_is_asc in planning.rs (see comment on that line), the absence of a NULL-handling test is a real coverage gap, not just a polish issue.

paveltiunov requested a review from a team as a code owner June 6, 2026 20:12

github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 6, 2026

vercel Bot deployed to Preview June 6, 2026 20:14 View deployment

claude Bot reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cubestore): push Sort(fetch=N) to workers for ORDER BY group-by …#11027

feat(cubestore): push Sort(fetch=N) to workers for ORDER BY group-by …#11027
paveltiunov wants to merge 1 commit into
masterfrom
feat/cubestore-worker-sort-and-limit

paveltiunov commented Jun 6, 2026

Uh oh!

claude Bot commented Jun 6, 2026 •

edited

Loading

High

Medium

Low

Notes

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

claude Bot Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

paveltiunov commented Jun 6, 2026

Uh oh!

claude Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

High

Medium

Low

Notes

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 6, 2026 •

edited

Loading