Skip to content

feat(cubestore): pre-filter in-memory chunks on worker before IPC#11040

Open
waralexrom wants to merge 1 commit into
masterfrom
cubestore-mem-chunks-filter
Open

feat(cubestore): pre-filter in-memory chunks on worker before IPC#11040
waralexrom wants to merge 1 commit into
masterfrom
cubestore-mem-chunks-filter

Conversation

@waralexrom

@waralexrom waralexrom commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Trims in-memory chunks on the worker by the dedup-safe unique-key pushable predicate before they are serialized and shipped over IPC to the select subprocess. Parquet data is already pruned by the predicate; in-memory chunks previously crossed IPC whole, so a partition that survives range-pruning still shipped rows of many keys that the query immediately discards. This closes that gap.

Changes

  • Extract the dedup-safe pushable-filter selection into a shared dedup_safe_unique_key_filter reused by the scan-time FilterExec and the new worker pre-filter, so the two cannot diverge.
  • Compute the per-index pushable predicate once at planning time (choose_index_ext, from the same filters that gate partition pruning), strip column qualifiers, and carry it proto-encoded in PlanningMeta.pushable_chunk_filters (1:1 with indices, #[serde(default)] for back-compat).
  • On the worker, decode the predicate and trim the loaded chunk batches between load_in_memory_chunks and serialization (trace op chunks.prefilter). The subprocess still re-applies the predicate, so this only reduces IPC payload and is never relied on for correctness.
  • A chunk referenced by more than one scan (self-join / self-union of one index) is left untrimmed: the subprocess shares batches by chunk id, and a scan with a different or no predicate would otherwise lose rows. Reference counting spans all index snapshots, not just predicate-bearing ones.
  • Best-effort: any decode/build/eval failure is logged and skipped rather than failing the query.

Testing

  • New regression test prefilter_chunks_shared_scan (self-union of one index, one branch key-filtered, the other not) — reproduced the shared-chunk bug (60 vs expected 130) before the cross-snapshot reference-count fix, green after.
  • filter_pushdown_unique_key, unique_key_and_multi_*, *_stream_table, limit_pushdown_unique_key — pass in-process and multi-process (real select-subprocess IPC), no regressions.
  • cargo build -p cubestore clean; cargo fmt --check passes (pre-commit hook).

Trim in-memory chunks by the dedup-safe unique-key pushable predicate on
the worker before they are serialized and shipped over IPC to the select
subprocess. The predicate is computed once at planning time (from the same
filters that gate partition pruning), carried in PlanningMeta, and applied
to the loaded batches; the subprocess still re-applies it, so this only
reduces IPC payload and is never relied on for correctness.

A chunk referenced by more than one scan (self-join/self-union of one
index) is left untrimmed: the subprocess shares batches by chunk id and a
scan with a different or no predicate would otherwise lose rows. The
pre-filter is best-effort - any failure is logged and skipped rather than
failing the query.
@waralexrom waralexrom requested a review from a team as a code owner June 8, 2026 22:53
@github-actions github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels Jun 8, 2026
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Claude finished @waralexrom's task in 1m 37s —— View job


Code review

Solid, well-scoped change. The shared-chunk reference count across all index snapshots (not just predicate-bearing ones) is the key correctness insight, and the new prefilter_chunks_shared_scan test pins it down precisely. Defense-in-depth (subprocess re-applies the predicate; planner-side encode failures degrade to no pre-filter) keeps the surface area for correctness regressions small.

A few minor observations — nothing blocking:

1. in_memory_chunk_filter_groups doc is slightly misleadingserialized_plan.rs:1086

The doc says the result is "grouped by the proto-encoded pushable predicate", but the implementation produces one entry per index snapshot — two snapshots with identical predicate bytes yield two entries, not one merged group. Not a correctness issue (downstream just iterates), but the comment overstates the grouping. Consider either deduping by bytes or rewording (e.g., "one entry per index snapshot that carries a pre-filter").

2. phys.as_ref().unwrap()query_executor.rs:2433

After the explicit if phys.is_none() build block, the immediate phys.as_ref().unwrap() is safe but slightly awkward. Binding the freshly built Arc directly avoids the unwrap:

let p = match phys.as_ref() {
    Some(p) => p.clone(),
    None => { /* build, set phys, return p */ }
};

Pure style.

3. Backward-compat direction is one-way

#[serde(default)] on pushable_chunk_filters covers old plan → new worker. The reverse — new router → old worker — relies on serde's default of ignoring unknown fields, which is true for bincode-derived serializers but worth confirming for whatever transport actually serializes PlanningMeta in your mixed-version rollouts. If a rolling upgrade is in scope, a quick check would be cheap insurance.

4. Filter cost for low-selectivity predicates

filter_record_batch reallocates the entire batch even when the predicate selects most rows. For the intended use case (selective unique-key filters in multi-tenant scans) this is a clear win, but if predicates ever land here that match >90% of rows, the IPC-saving math flips. Not worth gating on, but a future selectivity heuristic could short-circuit when the mask popcount is close to row count.

5. Test gap: pre-filter actually fires when it should

prefilter_chunks_shared_scan proves the skip path is correct (shared chunks aren't wrongly trimmed). It doesn't directly verify the active path trims anything — both branches produce the same final result whether the pre-filter ran or not, since the subprocess re-applies the predicate. A counter on chunks.prefilter (or an EXPLAIN ANALYZE DETAILED assertion now that #11016 landed) would catch a future refactor that silently disables the pre-filter. Not required for this PR.

Nits

  • query_executor.rs:2359 doc comment for dedup_safe_unique_key_filter is very good — the "commutes with last-row dedup" framing is exactly the right invariant to call out.
  • in_memory_chunk_filter_groups walks index_snapshots twice (once for chunk_counts, once for emit). Fine at current sizes; mentionable only if profiling ever points here.

LGTM modulo the doc tweak on item 1.

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.51%. Comparing base (107e63f) to head (392e472).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11040      +/-   ##
==========================================
+ Coverage   58.44%   58.51%   +0.06%     
==========================================
  Files         216      216              
  Lines       17248    17265      +17     
  Branches     3520     3524       +4     
==========================================
+ Hits        10080    10102      +22     
+ Misses       6660     6648      -12     
- Partials      508      515       +7     
Flag Coverage Δ
cube-backend 58.51% <ø> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cube store Issues relating to Cube Store rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants