Rebase + resolve conflicts on PR #33 (sq-lcw.5 budget) by philcunliffe · Pull Request #36 · hyparam/squirreling

philcunliffe · 2026-05-06T21:25:01Z

Summary

Rebase + resolve conflicts on PR #33 (sq-lcw.5 budget)

Changes by Bead

Rebase + resolve conflicts on PR #33 (sq-lcw.5 budget): Rebased polecat/sq-lcw.5 onto origin/integration/batch-execution. Resolved conflicts in src/execute/sort.js (the Top-K path runs first; the full-sort path adds the budget call op?.addRow()) and in src/execute/execute.js (kept both the batchOps and budget imports). Force-pushed; CI on PR #33 ([Phase 5b] SqlExecutionBudget propagation through executor (sq-lcw.5)) is green (lint/typecheck/test all SUCCESS, mergeStateStatus=CLEAN).

Validation

Validation followed the refinery merge path before this draft PR was opened.

Risks / Follow-ups

Review the aggregated changes on integration/batch-execution before marking this PR ready.

#31)

Phase 5a establishes the columnar batch interface without changing any operator. Sources that implement scanBatches can hand columnar data to the engine; sources that only implement scan() are bridged via the row to batch adapter. Operator migration to native batch mode is Phase 5b.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
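To make the bridging idea concrete, here is a minimal sketch of a row to batch adapter and its inverse. It is not the code from this PR: the batch shape `{ rowCount, columns }`, the name `adaptRowsToBatches`, the `toBatch` helper, and the default batch size are assumptions for illustration, and the generators are synchronous for brevity, whereas the real sources are asynchronous.

```javascript
// Hypothetical batch shape (an assumption): { rowCount, columns: { [name]: Array } }

// Pivot a non-empty array of row objects into one columnar batch.
function toBatch(rows) {
  const columns = {}
  for (const name of Object.keys(rows[0])) {
    columns[name] = rows.map(row => row[name])
  }
  return { rowCount: rows.length, columns }
}

// Bridge a row-only source into batches of at most batchSize rows.
function* adaptRowsToBatches(rows, batchSize = 1024) {
  let buffer = []
  for (const row of rows) {
    buffer.push(row)
    if (buffer.length === batchSize) {
      yield toBatch(buffer)
      buffer = []
    }
  }
  // Flush the final, possibly short, batch.
  if (buffer.length > 0) yield toBatch(buffer)
}

// Inverse bridge: unpack batches back into row objects.
function* adaptBatchesToRows(batches) {
  for (const batch of batches) {
    const names = Object.keys(batch.columns)
    for (let i = 0; i < batch.rowCount; i++) {
      const row = {}
      for (const name of names) row[name] = batch.columns[name][i]
      yield row
    }
  }
}
```

With adapters of this kind, an operator written against one representation can consume a source that only offers the other, at the cost of a copy at the boundary.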

…w.2) (#32)

* Add scanBatches() to AsyncDataSource interface plus row/batch adapters

  Phase 5a establishes the columnar batch interface without changing any operator. Sources that implement scanBatches can hand columnar data to the engine; sources that only implement scan() are bridged via the row to batch adapter. Operator migration to native batch mode is Phase 5b.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Migrate scan/project/filter to batch mode (sq-lcw.2)

  Operators now produce ColumnBatch streams natively when the underlying data source provides scanBatches(). Each batch-aware operator exposes an optional batches() method on QueryResults; consumers that don't implement batch mode pull rows() at the boundary via adaptBatchesToRows.

  - executeScan: emits batches when the source has scanBatches; applies WHERE and LIMIT engine-side via filterBatches/limitBatches per the Phase 5a contract (scanBatches has no pushdown surface).
  - executeFilter: chains through batches when the child supports them. Preserves typed-array column types (Uint32Array stays Uint32Array) and forwards all-pass batches unchanged.
  - executeProject: chains through batches for "simple" SELECT lists (stars and bare identifiers) by aliasing column arrays, with no per-row copy. Complex expressions still go through the row-mode path.
  - limitBatches: slice-aware batch limiter that terminates iteration early once the limit is hit, so sources stop producing.

  Microbench (test/execute/batchBench.test.js): 100k Uint32Array rows filtered with `n < 50000` runs ~3.8-5.7x faster on the batch path (~1.5M rows/s) than the row-mode fallback (~0.3M rows/s).

  Row-mode fallback is unchanged for sources without scanBatches; all 1591 tests pass (1571 pre-existing + 20 new for batch operators).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
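The filter and limit behaviour described in this commit (typed-array preservation, forwarding all-pass batches unchanged, early termination) can be illustrated with a rough sketch. Only the names filterBatches and limitBatches come from the commit message. The signatures, the batch shape `{ rowCount, columns }`, the `predicate(batch, i)` callback, and the choice to drop empty batches are assumptions, and the generators are synchronous for brevity.

```javascript
// Hypothetical batch shape (an assumption):
// { rowCount, columns: { [name]: Array | TypedArray } }

function* filterBatches(batches, predicate) {
  for (const batch of batches) {
    // Collect the indices of the rows that pass the predicate.
    const keep = []
    for (let i = 0; i < batch.rowCount; i++) {
      if (predicate(batch, i)) keep.push(i)
    }
    if (keep.length === batch.rowCount) {
      yield batch // every row passes: forward the batch unchanged, no copy
      continue
    }
    if (keep.length === 0) continue // assumption: empty batches are dropped
    const columns = {}
    for (const [name, values] of Object.entries(batch.columns)) {
      // values.constructor is Array or a typed-array class, so a
      // Uint32Array column stays a Uint32Array after filtering.
      const out = new values.constructor(keep.length)
      for (let j = 0; j < keep.length; j++) out[j] = values[keep[j]]
      columns[name] = out
    }
    yield { rowCount: keep.length, columns }
  }
}

function* limitBatches(batches, limit) {
  let remaining = limit
  if (remaining <= 0) return
  for (const batch of batches) {
    if (batch.rowCount <= remaining) {
      remaining -= batch.rowCount
      yield batch
    } else {
      // Slice the last batch down to the rows still needed.
      const columns = {}
      for (const [name, values] of Object.entries(batch.columns)) {
        columns[name] = values.slice(0, remaining)
      }
      yield { rowCount: remaining, columns }
      remaining = 0
    }
    // Returning here closes the upstream generator, so it stops producing.
    if (remaining === 0) return
  }
}
```

In this sketch, early termination relies on the iterator protocol: leaving the `for...of` loop calls `return()` on the upstream generator, which is one way a limiter can signal a source to stop.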

Sorts feeding a small LIMIT now use a bounded max-heap of size limit + offset instead of buffering the full input. Memory stays O(k) regardless of input size, matching Phase 5b's bounded-memory goal for ORDER BY x LIMIT 100 over 100M-row scans.

Planner: SortNode gains an optional `limit` cap. The planSelect / planSet construction sets it when LIMIT is defined, LIMIT + OFFSET <= 10000, and no DISTINCT sits between Sort and Limit (DISTINCT can drop rows, so the cap would be unsafe).

Executor: a new TopKHeap maintains the worst-among-top-K candidate at the root. Sort keys are evaluated lazily, term by term, so multi-key ORDER BY keeps later (often expensive) terms unevaluated unless earlier terms tie, preserving the same cell-access economy the existing full-sort path guarantees. First-key evaluation is parallelized in chunks for streaming throughput. Falls back to the full-sort path when no limit hint is set.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
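For readers unfamiliar with the technique, the following is a minimal sketch of a bounded max-heap Top-K, not the TopKHeap from this commit. The class name `BoundedTopK`, the `topK` helper, and the eager `compare` callback are assumptions. Lazy per-term key evaluation and chunked parallelism are left out, and the order of tied rows is not guaranteed to match a stable full sort.

```javascript
// Hypothetical sketch: keep the k best items in a max-heap whose root is
// the worst retained candidate. compare(a, b) < 0 means a sorts before b.
class BoundedTopK {
  constructor(k, compare) {
    this.k = k
    this.compare = compare
    this.heap = []
  }

  push(item) {
    if (this.k <= 0) return
    if (this.heap.length < this.k) {
      this.heap.push(item)
      this.siftUp(this.heap.length - 1)
    } else if (this.compare(item, this.heap[0]) < 0) {
      // The new item beats the current worst: replace the root.
      this.heap[0] = item
      this.siftDown(0)
    }
  }

  swap(i, j) {
    const tmp = this.heap[i]
    this.heap[i] = this.heap[j]
    this.heap[j] = tmp
  }

  siftUp(i) {
    while (i > 0) {
      const parent = (i - 1) >> 1
      if (this.compare(this.heap[i], this.heap[parent]) <= 0) break
      this.swap(i, parent)
      i = parent
    }
  }

  siftDown(i) {
    const n = this.heap.length
    while (true) {
      let worst = i
      const left = 2 * i + 1
      const right = left + 1
      if (left < n && this.compare(this.heap[left], this.heap[worst]) > 0) worst = left
      if (right < n && this.compare(this.heap[right], this.heap[worst]) > 0) worst = right
      if (worst === i) break
      this.swap(i, worst)
      i = worst
    }
  }

  // Return the retained items in sort order.
  sorted() {
    return [...this.heap].sort(this.compare)
  }
}

// ORDER BY ... LIMIT limit OFFSET offset using O(limit + offset) memory.
function topK(rows, limit, offset, compare) {
  const heap = new BoundedTopK(limit + offset, compare)
  for (const row of rows) heap.push(row)
  return heap.sorted().slice(offset)
}
```

The heap is sized limit + offset because the rows skipped by OFFSET still have to be ranked before they can be discarded.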

Plumbs an optional SqlExecutionBudget through the executor context so materializing operators can detect runaway resource use and abort with a structured error indicating which limit was hit.

Budget fields:
- maxRowsToMaterialize: global row counter across operators
- maxHeapBytes: global byte counter
- maxIntermediateBytes: per-operator byte counter
- timeoutMs: wall-clock deadline
- allowDerivedColumnScan: opts out of the scalar aggregate scanColumn fast path

Operators that materialize now call context.budget?.operator(name).addRow() before pinning each row: Sort, HashAggregate, ScalarAggregate slow path, HashJoin and NestedLoopJoin (right side), PositionalJoin (both sides), Distinct, UNION/INTERSECT/EXCEPT, and Window's non-streaming path. Streaming loops (filterRows, scanColumnAggregate chunks) check the timeout deadline so long-running pure-stream queries also abort.

Errors throw a new SqlBudgetError with limit/value/max/operator fields. Budget passes through Subquery, LATERAL, and correlated subquery contexts.

28 new tests cover each limit reaching its threshold, the structured error shape, and the fast-path opt-out, and confirm that budget-free queries are unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
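A rough sketch of how such a budget might be wired follows. Only the names SqlBudgetError, maxRowsToMaterialize, timeoutMs, operator(name), addRow(), and the limit/value/max/operator error fields come from the commit message. The `createBudget` factory, the `materialize` helper, and the error message format are hypothetical, and the byte counters are omitted.

```javascript
// Structured error carrying the fields named in the commit message.
class SqlBudgetError extends Error {
  constructor({ limit, value, max, operator }) {
    super(`budget exceeded in ${operator}: ${limit} ${value} > ${max}`)
    this.name = 'SqlBudgetError'
    this.limit = limit
    this.value = value
    this.max = max
    this.operator = operator
  }
}

// Hypothetical factory (an assumption, not the PR's API).
function createBudget({ maxRowsToMaterialize = Infinity, timeoutMs = Infinity } = {}) {
  const start = Date.now()
  let rows = 0 // global counter shared by every operator
  return {
    operator(name) {
      return {
        addRow() {
          rows++
          if (rows > maxRowsToMaterialize) {
            throw new SqlBudgetError({
              limit: 'maxRowsToMaterialize', value: rows, max: maxRowsToMaterialize, operator: name,
            })
          }
          const elapsed = Date.now() - start
          if (elapsed > timeoutMs) {
            throw new SqlBudgetError({
              limit: 'timeoutMs', value: elapsed, max: timeoutMs, operator: name,
            })
          }
        },
      }
    },
  }
}

// Hypothetical materializing operator: charge the budget before pinning a row.
function materialize(rows, context, name = 'Sort') {
  const op = context.budget?.operator(name)
  const buffered = []
  for (const row of rows) {
    op?.addRow() // no-op when the query runs without a budget
    buffered.push(row)
  }
  return buffered
}
```

Because every call site uses optional chaining, a query that runs without a budget takes the same path as before, which is consistent with the claim that budget-free queries are unchanged.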

Adds a batch-mode fast path to executeScalarAggregate that consumes ColumnBatch streams from the child operator (Scan/Filter/Project) and computes aggregates directly off typed-array column data without per-row AsyncRow materialization.

Activates when the child exposes batches() and every output column is a simple aggregate (COUNT/SUM/AVG/MIN/MAX) on a plain identifier or COUNT(*). DISTINCT, FILTER, HAVING, expressions wrapping aggregates, and non-aggregate columns all bail out to the existing row-mode path, which now naturally bridges over batches() via adaptBatchesToRows.

Null-handling mirrors scanColumnAggregate: nulls skipped for every function; SUM/AVG additionally skip non-finite numerics; MIN/MAX work on any non-null value (strings included). SUM/AVG of empty input returns null per SQL spec.

The narrower scanColumn fast path still preempts the batch path for direct table scans without WHERE: column-scan can dispatch per-column reads, which are cheaper than batch iteration.

Tests in test/execute/batchAggregate.test.js cover:
- All five aggregate kinds on Uint32Array/Float64Array/string columns
- Null handling (nulls skipped, empty/all-filtered → null for SUM/AVG)
- Multi-aggregate single-pass (COUNT/SUM/AVG/MIN/MAX in one query)
- Fallback paths (DISTINCT, FILTER, expression wrapping, HAVING)
- scanColumn precedence verification
- Abort signal honoring between batches
- Microbench: SUM over 100k Uint32Array rows runs ~3.7x faster than row-mode (0.84M vs 0.23M rows/s locally)

This branch is stacked on polecat/sq-lcw.2 (PR #32, mr strategy, not yet merged). Rebases onto integration/batch-execution after #32 lands.

Full suite: 1610/1610 passing. Lint and tsc clean.
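As an illustration of the SUM/AVG null-handling rules above, here is a minimal sketch that folds a column across batches without building row objects. It is not the code from this PR: the function name `sumAndAvg` and the batch shape `{ rowCount, columns }` are assumptions, the loop is synchronous for brevity, and returning null when every value was skipped is a choice of this sketch rather than something the commit message states.

```javascript
// Hypothetical batch shape (an assumption):
// { rowCount, columns: { [name]: Array | TypedArray } }
function sumAndAvg(batches, column) {
  let sum = 0
  let count = 0
  for (const batch of batches) {
    const values = batch.columns[column]
    // Read straight from the column array: no per-row object is created.
    for (let i = 0; i < batch.rowCount; i++) {
      const v = values[i]
      // Skip nulls and non-finite numerics (NaN, Infinity, -Infinity).
      // Number.isFinite is also false for non-numbers, so this sketch
      // ignores strings and bigints.
      if (v == null || !Number.isFinite(v)) continue
      sum += v
      count++
    }
  }
  // Empty input yields null for SUM and AVG, as in SQL.
  return count === 0 ? { sum: null, avg: null } : { sum, avg: sum / count }
}
```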

philcunliffe and others added 5 commits May 6, 2026 14:06

philcunliffe wants to merge 5 commits into master from integration/batch-execution
