feat(cubestore): pre-filter in-memory chunks on worker before IPC by waralexrom · Pull Request #11040 · cube-js/cube

waralexrom · 2026-06-08T22:53:18Z

Summary

Trims in-memory chunks on the worker by the dedup-safe unique-key pushable predicate before they are serialized and shipped over IPC to the select subprocess. Parquet data is already pruned by the predicate; in-memory chunks previously crossed IPC whole, so a partition that survives range-pruning still shipped rows of many keys that the query immediately discards. This closes that gap.

Changes

Extract the dedup-safe pushable-filter selection into a shared dedup_safe_unique_key_filter reused by the scan-time FilterExec and the new worker pre-filter, so the two cannot diverge.
Compute the per-index pushable predicate once at planning time (choose_index_ext, from the same filters that gate partition pruning), strip column qualifiers, and carry it proto-encoded in PlanningMeta.pushable_chunk_filters (1:1 with indices, #[serde(default)] for back-compat).
On the worker, decode the predicate and trim the loaded chunk batches between load_in_memory_chunks and serialization (trace op chunks.prefilter). The subprocess still re-applies the predicate, so this only reduces IPC payload and is never relied on for correctness.
A chunk referenced by more than one scan (self-join / self-union of one index) is left untrimmed: the subprocess shares batches by chunk id, and a scan with a different or no predicate would otherwise lose rows. Reference counting spans all index snapshots, not just predicate-bearing ones.
Best-effort: any decode/build/eval failure is logged and skipped rather than failing the query.
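
The shared-chunk rule above can be sketched with plain standard-library types. This is a minimal model, not the code from the PR: `Snapshot`, `Row`, `prefilter_chunks`, and the single-key equality predicate are all hypothetical stand-ins for the real index snapshots, record batches, and proto-encoded predicate.

```rust
use std::collections::HashMap;

/// Hypothetical row shape: (unique key, value).
type Row = (u32, i64);

/// Hypothetical stand-in for one index snapshot: the in-memory chunk ids it
/// scans and an optional key predicate (None = no pushable filter).
pub struct Snapshot {
    pub chunk_ids: Vec<u64>,
    pub key_filter: Option<u32>,
}

/// Trim each loaded chunk by its snapshot's predicate, unless the chunk is
/// referenced by more than one scan.
pub fn prefilter_chunks(snapshots: &[Snapshot], chunks: &mut HashMap<u64, Vec<Row>>) {
    // Count references across ALL snapshots, including those without a
    // predicate: an unfiltered scan still needs every row of a shared chunk.
    let mut refs: HashMap<u64, usize> = HashMap::new();
    for s in snapshots {
        for id in &s.chunk_ids {
            *refs.entry(*id).or_insert(0) += 1;
        }
    }

    for s in snapshots {
        // Snapshots without a predicate have nothing to trim.
        let Some(key) = s.key_filter else { continue };
        for id in &s.chunk_ids {
            if refs[id] > 1 {
                continue; // shared chunk: leave untrimmed
            }
            if let Some(rows) = chunks.get_mut(id) {
                rows.retain(|(k, _)| *k == key);
            }
        }
    }
}
```

In this model a chunk scanned by both a key-filtered and an unfiltered snapshot keeps all of its rows, which is the behaviour the cross-snapshot reference count is meant to guarantee.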

Testing

New regression test prefilter_chunks_shared_scan (self-union of one index, one branch key-filtered, the other not) — reproduced the shared-chunk bug (60 vs expected 130) before the cross-snapshot reference-count fix, green after.
filter_pushdown_unique_key, unique_key_and_multi_*, *_stream_table, limit_pushdown_unique_key — pass in-process and multi-process (real select-subprocess IPC), no regressions.
cargo build -p cubestore clean; cargo fmt --check passes (pre-commit hook).

Trim in-memory chunks by the dedup-safe unique-key pushable predicate on the worker before they are serialized and shipped over IPC to the select subprocess. The predicate is computed once at planning time (from the same filters that gate partition pruning), carried in PlanningMeta, and applied to the loaded batches; the subprocess still re-applies it, so this only reduces IPC payload and is never relied on for correctness. A chunk referenced by more than one scan (self-join/self-union of one index) is left untrimmed: the subprocess shares batches by chunk id and a scan with a different or no predicate would otherwise lose rows. The pre-filter is best-effort - any failure is logged and skipped rather than failing the query.

claude · 2026-06-08T22:53:34Z

Code review

Solid, well-scoped change. The shared-chunk reference count across all index snapshots (not just predicate-bearing ones) is the key correctness insight, and the new prefilter_chunks_shared_scan test pins it down precisely. Defense-in-depth (subprocess re-applies the predicate; planner-side encode failures degrade to no pre-filter) keeps the surface area for correctness regressions small.

A few minor observations — nothing blocking:

1. in_memory_chunk_filter_groups doc is slightly misleading — serialized_plan.rs:1086

The doc says the result is "grouped by the proto-encoded pushable predicate", but the implementation produces one entry per index snapshot — two snapshots with identical predicate bytes yield two entries, not one merged group. Not a correctness issue (downstream just iterates), but the comment overstates the grouping. Consider either deduping by bytes or rewording (e.g., "one entry per index snapshot that carries a pre-filter").
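
If deduping by bytes is the preferred option, it could look roughly like the sketch below. `SnapshotFilter` and `group_by_predicate` are hypothetical names, and the real function's inputs and return type may differ; the point is only that byte-identical predicates collapse into one group.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-snapshot input: proto-encoded predicate bytes (None when
/// the snapshot carries no pre-filter) and the in-memory chunk ids it scans.
pub struct SnapshotFilter {
    pub predicate: Option<Vec<u8>>,
    pub chunk_ids: Vec<u64>,
}

/// Merge snapshots whose predicates are byte-identical into a single group.
/// Snapshots without a predicate are skipped.
pub fn group_by_predicate(snapshots: &[SnapshotFilter]) -> Vec<(Vec<u8>, Vec<u64>)> {
    // BTreeMap keeps the output order deterministic (sorted by predicate bytes).
    let mut groups: BTreeMap<&[u8], Vec<u64>> = BTreeMap::new();
    for s in snapshots {
        if let Some(bytes) = &s.predicate {
            groups
                .entry(bytes.as_slice())
                .or_default()
                .extend(&s.chunk_ids);
        }
    }
    groups
        .into_iter()
        .map(|(bytes, ids)| (bytes.to_vec(), ids))
        .collect()
}
```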

2. phys.as_ref().unwrap() — query_executor.rs:2433

After the explicit if phys.is_none() build block, the immediate phys.as_ref().unwrap() is safe but slightly awkward. Binding the freshly built Arc directly avoids the unwrap:

let p = match phys.as_ref() {
    Some(p) => p.clone(),
    None => { /* build, set phys, return p */ }
};

Pure style.

3. Backward-compat direction is one-way

#[serde(default)] on pushable_chunk_filters covers old plan → new worker. The reverse (new router → old worker) relies on the deserializer ignoring unknown fields. That is serde's default for self-describing formats such as JSON, but positional, non-self-describing formats such as bincode do not skip unknown fields, so it is worth confirming for whatever transport actually serializes PlanningMeta in your mixed-version rollouts. If a rolling upgrade is in scope, a quick check would be cheap insurance.

4. Filter cost for low-selectivity predicates

filter_record_batch reallocates the entire batch even when the predicate selects most rows. For the intended use case (selective unique-key filters in multi-tenant scans) this is a clear win, but if predicates ever land here that match >90% of rows, the IPC-saving math flips. Not worth gating on, but a future selectivity heuristic could short-circuit when the mask popcount is close to row count.
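
A selectivity short-circuit of that kind might look like the following sketch. It models the batch as a slice and the predicate result as a `&[bool]` mask; `should_trim`, `maybe_filter`, and the `max_keep_ratio` knob are hypothetical and not part of the PR.

```rust
/// Decide whether trimming is worth the copy, given the predicate's boolean
/// mask. `max_keep_ratio` is a hypothetical tuning knob (e.g. 0.9).
pub fn should_trim(mask: &[bool], max_keep_ratio: f64) -> bool {
    if mask.is_empty() {
        return false;
    }
    // "Popcount" of the mask: how many rows the predicate keeps.
    let kept = mask.iter().filter(|m| **m).count();
    (kept as f64) / (mask.len() as f64) <= max_keep_ratio
}

/// Returns Some(trimmed rows) when trimming pays off, or None to signal that
/// the caller should ship the original batch unchanged.
pub fn maybe_filter<T: Clone>(rows: &[T], mask: &[bool], max_keep_ratio: f64) -> Option<Vec<T>> {
    assert_eq!(rows.len(), mask.len(), "mask must cover every row");
    if !should_trim(mask, max_keep_ratio) {
        return None;
    }
    Some(
        rows.iter()
            .zip(mask)
            .filter(|(_, m)| **m)
            .map(|(r, _)| r.clone())
            .collect(),
    )
}
```

Returning `None` rather than a copy keeps the non-selective path allocation-free, which is the case the heuristic is meant to protect.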

5. Test gap: pre-filter actually fires when it should

prefilter_chunks_shared_scan proves the skip path is correct (shared chunks aren't wrongly trimmed). It doesn't directly verify the active path trims anything — both branches produce the same final result whether the pre-filter ran or not, since the subprocess re-applies the predicate. A counter on chunks.prefilter (or an EXPLAIN ANALYZE DETAILED assertion now that #11016 landed) would catch a future refactor that silently disables the pre-filter. Not required for this PR.
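
One way to make the active path observable is a pair of row counters that a test can assert on. The sketch below is illustrative only: `PrefilterStats` and this `prefilter` function are hypothetical, and the real hook would more likely hang off the existing chunks.prefilter trace op.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counters a pre-filter pass could expose to tests.
#[derive(Default)]
pub struct PrefilterStats {
    pub rows_in: AtomicUsize,
    pub rows_out: AtomicUsize,
}

/// Trim rows to a single key and record how many rows went in and came out.
pub fn prefilter(rows: Vec<(u32, i64)>, key: u32, stats: &PrefilterStats) -> Vec<(u32, i64)> {
    stats.rows_in.fetch_add(rows.len(), Ordering::Relaxed);
    let kept: Vec<_> = rows.into_iter().filter(|(k, _)| *k == key).collect();
    stats.rows_out.fetch_add(kept.len(), Ordering::Relaxed);
    kept
}
```

A test can then assert that `rows_out` is strictly smaller than `rows_in` for a query that should be trimmed, independently of the final query result.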

Nits

query_executor.rs:2359 doc comment for dedup_safe_unique_key_filter is very good — the "commutes with last-row dedup" framing is exactly the right invariant to call out.
in_memory_chunk_filter_groups walks index_snapshots twice (once for chunk_counts, once for emit). Fine at current sizes; mentionable only if profiling ever points here.
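
For readers unfamiliar with the "commutes with last-row dedup" invariant from the first nit, here is a small self-contained illustration. It assumes a simplified model in which dedup keeps the row with the highest sequence number per unique key; `Row`, `last_row_dedup`, and `filter_rows` are hypothetical names, not functions from the PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical row shape: (unique key, sequence number, value).
type Row = (u32, u64, i64);

/// Keep, for every unique key, only the row with the highest sequence number.
pub fn last_row_dedup(rows: &[Row]) -> Vec<Row> {
    let mut last: BTreeMap<u32, Row> = BTreeMap::new();
    for &row in rows {
        let entry = last.entry(row.0).or_insert(row);
        if row.1 >= entry.1 {
            *entry = row;
        }
    }
    last.into_values().collect()
}

/// Keep the rows matching `pred`, preserving input order.
pub fn filter_rows(rows: &[Row], pred: impl Fn(&Row) -> bool) -> Vec<Row> {
    rows.iter().copied().filter(|r| pred(r)).collect()
}
```

In this model a predicate on the unique key either keeps or drops every version of a key together, so filtering before or after dedup gives the same rows. A predicate on a non-key column can drop the latest version while keeping an older one, so filtering first would resurrect a stale row.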

LGTM modulo the doc tweak on item 1.

codecov · 2026-06-08T23:17:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.51%. Comparing base (107e63f) to head (392e472).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #11040      +/-   ##
==========================================
+ Coverage   58.44%   58.51%   +0.06%     
==========================================
  Files         216      216              
  Lines       17248    17265      +17     
  Branches     3520     3524       +4     
==========================================
+ Hits        10080    10102      +22     
+ Misses       6660     6648      -12     
- Partials      508      515       +7

Flag	Coverage Δ
cube-backend	`58.51% <ø> (+0.06%)`	⬆️


waralexrom requested a review from a team as a code owner June 8, 2026 22:53

github-actions Bot added the "cube store" and "rust" labels Jun 8, 2026

vercel Bot deployed to Preview June 8, 2026 22:55

paveltiunov approved these changes Jun 8, 2026

waralexrom wants to merge 1 commit into master from cubestore-mem-chunks-filter
