Rebase + resolve conflicts on PR #33 (sq-lcw.5 budget) #36
Draft
philcunliffe wants to merge 5 commits into
#31) Phase 5a establishes the columnar batch interface without changing any operator. Sources that implement scanBatches can hand columnar data to the engine; sources that only implement scan() are bridged via the row to batch adapter. Operator migration to native batch mode is Phase 5b. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…w.2) (#32)

* Add scanBatches() to AsyncDataSource interface plus row/batch adapters

Phase 5a establishes the columnar batch interface without changing any operator. Sources that implement scanBatches can hand columnar data to the engine; sources that only implement scan() are bridged via the row to batch adapter. Operator migration to native batch mode is Phase 5b.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Migrate scan/project/filter to batch mode (sq-lcw.2)

Operators now produce ColumnBatch streams natively when the underlying data source provides scanBatches(). Each batch-aware operator exposes an optional batches() method on QueryResults; consumers that don't implement batch mode pull rows() at the boundary via adaptBatchesToRows.

- executeScan: emits batches when source has scanBatches; applies WHERE and LIMIT engine-side via filterBatches/limitBatches per Phase 5a contract (scanBatches has no pushdown surface).
- executeFilter: chains through batches when child supports them. Preserves typed-array column types (Uint32Array stays Uint32Array) and forwards all-pass batches unchanged.
- executeProject: chains through batches for "simple" SELECT lists (stars and bare identifiers) by aliasing column arrays, with no per-row copy. Complex expressions still go through the row-mode path.
- limitBatches: slice-aware batch limiter that terminates iteration early once the limit is hit, so sources stop producing.

Microbench (test/execute/batchBench.test.js): 100k Uint32Array rows filtered with `n < 50000` runs ~3.8-5.7x faster on the batch path (~1.5M rows/s) than the row-mode fallback (~0.3M rows/s).

Row-mode fallback is unchanged for sources without scanBatches; all 1591 tests pass (1571 pre-existing + 20 new for batch operators).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
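For readers who have not opened the diff, the early-terminating limiter and the batch-to-row bridge described above might look roughly like the sketch below. It is not the code from this PR: the ColumnBatch shape (`rowCount` plus a `columns` map) is an assumption, and the two functions are simplified stand-ins that merely share names with the ones the commit message mentions.

```javascript
// Hypothetical ColumnBatch shape (an assumption, not the project's type):
//   { rowCount: number, columns: { [name]: Array | TypedArray } }

/** Forward batches until `limit` rows have been emitted, then stop pulling. */
async function* limitBatches(source, limit) {
  let remaining = limit;
  if (remaining <= 0) return;
  for await (const batch of source) {
    if (batch.rowCount <= remaining) {
      yield batch; // whole batch fits: forward it unchanged
      remaining -= batch.rowCount;
    } else {
      // Partial batch: subarray() keeps typed arrays typed without copying;
      // plain arrays are copied with slice().
      const columns = {};
      for (const [name, col] of Object.entries(batch.columns)) {
        columns[name] = ArrayBuffer.isView(col)
          ? col.subarray(0, remaining)
          : col.slice(0, remaining);
      }
      yield { rowCount: remaining, columns };
      remaining = 0;
    }
    // Returning here exits the for-await loop, which calls source.return()
    // so a generator-based source stops producing.
    if (remaining === 0) return;
  }
}

/** Bridge a batch stream to plain row objects at a row-mode boundary. */
async function* adaptBatchesToRows(source) {
  for await (const batch of source) {
    const names = Object.keys(batch.columns);
    for (let i = 0; i < batch.rowCount; i++) {
      const row = {};
      for (const name of names) row[name] = batch.columns[name][i];
      yield row;
    }
  }
}
```

With a source that yields batches of 4 rows, `limitBatches(source, 5)` forwards the first batch as-is, trims the second to one row, and never requests a third.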
Sorts feeding a small LIMIT now use a bounded max-heap of size limit + offset instead of buffering the full input. Memory stays O(k) regardless of input size, matching Phase 5b's bounded-memory goal for ORDER BY x LIMIT 100 over 100M-row scans.

Planner: SortNode gains an optional `limit` cap. The planSelect / planSet construction sets it when LIMIT is defined, LIMIT + OFFSET <= 10000, and no DISTINCT sits between Sort and Limit (DISTINCT can drop rows, so the cap would be unsafe).

Executor: a new TopKHeap maintains the worst-among-top-K candidate at the root. Sort keys are evaluated lazily, term by term, so multi-key ORDER BY keeps later (often expensive) terms unevaluated unless earlier terms tie, preserving the same cell-access economy the existing full-sort path guarantees. First-key evaluation is parallelized in chunks for streaming throughput. Falls back to the full-sort path when no limit hint is set.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
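As an illustration of the bounded max-heap idea, here is a minimal sketch. It is an assumption about the general technique, not the TopKHeap from this PR: the class body and the `compareByKeys` helper are hypothetical, keys are compared with plain `<` / `>`, key values are not cached, and neither tie ordering nor the chunked first-key evaluation is modelled.

```javascript
// Keeps the k smallest items under `compare`; the root is the worst item
// currently kept, so a new item only has to beat the root to get in.
class TopKHeap {
  constructor(k, compare) {
    this.k = k;
    this.compare = compare;
    this.items = [];
  }

  push(item) {
    const h = this.items;
    if (this.k <= 0) return;
    if (h.length < this.k) {
      h.push(item);
      this.#siftUp(h.length - 1);
    } else if (this.compare(item, h[0]) < 0) {
      h[0] = item; // evict the current worst
      this.#siftDown(0);
    }
  }

  /** Final ORDER BY output: at most k items, ascending under `compare`. */
  sorted() {
    return [...this.items].sort(this.compare);
  }

  #siftUp(i) {
    const h = this.items;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.compare(h[i], h[parent]) <= 0) break;
      [h[i], h[parent]] = [h[parent], h[i]];
      i = parent;
    }
  }

  #siftDown(i) {
    const h = this.items;
    for (;;) {
      let worst = i;
      const left = 2 * i + 1;
      const right = left + 1;
      if (left < h.length && this.compare(h[left], h[worst]) > 0) worst = left;
      if (right < h.length && this.compare(h[right], h[worst]) > 0) worst = right;
      if (worst === i) return;
      [h[i], h[worst]] = [h[worst], h[i]];
      i = worst;
    }
  }
}

// Term-by-term comparator: a later key function runs only when every
// earlier key compared equal for this pair of rows.
function compareByKeys(keyFns) {
  return (a, b) => {
    for (const key of keyFns) {
      const ka = key(a);
      const kb = key(b);
      if (ka < kb) return -1;
      if (ka > kb) return 1;
    }
    return 0;
  };
}
```

For `ORDER BY x LIMIT 10 OFFSET 5`, a caller would size the heap at 15, push every input row, and then drop the first 5 entries of `sorted()`.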
Plumbs an optional SqlExecutionBudget through the executor context so materializing operators can detect runaway resource use and abort with a structured error indicating which limit was hit.

Budget fields:

- maxRowsToMaterialize (global row counter across operators)
- maxHeapBytes (global byte counter)
- maxIntermediateBytes (per-operator byte counter)
- timeoutMs (wall-clock deadline)
- allowDerivedColumnScan (opts out of the scalar aggregate scanColumn fast path)

Operators that materialize now call context.budget?.operator(name).addRow() before pinning each row: Sort, HashAggregate, ScalarAggregate slow path, HashJoin and NestedLoopJoin (right side), PositionalJoin (both sides), Distinct, UNION/INTERSECT/EXCEPT, and Window's non-streaming path. Streaming loops (filterRows, scanColumnAggregate chunks) check the timeout deadline so long-running pure-stream queries also abort.

Errors throw a new SqlBudgetError with limit/value/max/operator fields. Budget passes through Subquery, LATERAL, and correlated subquery contexts.

28 new tests cover each limit reaching threshold, the structured error shape, fast-path opt-out, and confirm budget-free queries are unchanged.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
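To make the counter scoping concrete, a budget along these lines could be sketched as follows. This is a guess at the shape, not the PR's implementation: `createBudget`, `checkTimeout`, the injectable `now` clock, and the `byteEstimate` argument to `addRow` are all hypothetical; only the field names, the `operator(name).addRow()` call, and the limit/value/max/operator error fields come from the commit message. `allowDerivedColumnScan` is not modelled.

```javascript
class SqlBudgetError extends Error {
  constructor({ limit, value, max, operator }) {
    super(`SQL budget exceeded in ${operator}: ${limit} reached ${value} (max ${max})`);
    this.name = 'SqlBudgetError';
    this.limit = limit;
    this.value = value;
    this.max = max;
    this.operator = operator;
  }
}

// `now` is injectable so the wall-clock deadline can be tested deterministically.
function createBudget(options = {}, now = Date.now) {
  const {
    maxRowsToMaterialize = Infinity,
    maxHeapBytes = Infinity,
    maxIntermediateBytes = Infinity,
    timeoutMs,
  } = options;
  const startedAt = now();
  let totalRows = 0; // global across operators
  let totalBytes = 0; // global across operators

  const check = (limit, value, max, operator) => {
    if (value > max) throw new SqlBudgetError({ limit, value, max, operator });
  };

  return {
    // For streaming loops that never materialize rows.
    checkTimeout(operator) {
      if (timeoutMs !== undefined) {
        check('timeoutMs', now() - startedAt, timeoutMs, operator);
      }
    },
    operator(name) {
      let operatorBytes = 0; // per-operator counter
      return {
        addRow(byteEstimate = 0) {
          totalRows += 1;
          totalBytes += byteEstimate;
          operatorBytes += byteEstimate;
          check('maxRowsToMaterialize', totalRows, maxRowsToMaterialize, name);
          check('maxHeapBytes', totalBytes, maxHeapBytes, name);
          check('maxIntermediateBytes', operatorBytes, maxIntermediateBytes, name);
        },
      };
    },
  };
}
```

Because the call site uses optional chaining, `context.budget?.operator('Sort').addRow()` short-circuits entirely when no budget is set, which is consistent with the claim that budget-free queries are unchanged.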
Adds a batch-mode fast path to executeScalarAggregate that consumes ColumnBatch streams from the child operator (Scan/Filter/Project) and computes aggregates directly off typed-array column data without per-row AsyncRow materialization.

Activates when the child exposes batches() and every output column is a simple aggregate (COUNT/SUM/AVG/MIN/MAX) on a plain identifier or COUNT(*). DISTINCT, FILTER, HAVING, expressions wrapping aggregates, and non-aggregate columns all bail out to the existing row-mode path which now naturally bridges over batches() via adaptBatchesToRows.

Null-handling mirrors scanColumnAggregate: nulls skipped for every function; SUM/AVG additionally skip non-finite numerics; MIN/MAX work on any non-null value (strings included). SUM/AVG of empty input returns null per SQL spec.

The narrower scanColumn fast path still preempts the batch path for direct table scans without WHERE: column-scan can dispatch per-column reads which are cheaper than batch iteration.

Tests in test/execute/batchAggregate.test.js cover:

- All five aggregate kinds on Uint32Array/Float64Array/string columns
- Null handling (nulls skipped, empty/all-filtered → null for SUM/AVG)
- Multi-aggregate single-pass (COUNT/SUM/AVG/MIN/MAX in one query)
- Fallback paths (DISTINCT, FILTER, expression wrapping, HAVING)
- scanColumn precedence verification
- Abort signal honoring between batches
- Microbench: SUM over 100k Uint32Array rows runs ~3.7x faster than row-mode (0.84M vs 0.23M rows/s locally)

This branch is stacked on polecat/sq-lcw.2 (PR #32, mr strategy, not yet merged). Rebases onto integration/batch-execution after #32 lands.

Full suite: 1610/1610 passing. Lint and tsc clean.
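The null-handling rules above can be illustrated with a small single-pass accumulator. This is only a sketch under assumptions: the helper names are hypothetical, a column is taken to be a plain or typed array, AVG is assumed to divide by the count of finite numeric values, and how the real code treats NaN in MIN/MAX is not covered.

```javascript
// One accumulator per aggregated column; all five aggregates share a pass.
function createAccumulator() {
  return { rows: 0, count: 0, sum: 0, numericCount: 0, min: undefined, max: undefined };
}

// Fold one batch's column into the accumulator without building row objects.
function accumulateBatch(acc, column, rowCount) {
  acc.rows += rowCount; // COUNT(*) counts every row, nulls included
  for (let i = 0; i < rowCount; i++) {
    const v = column[i];
    if (v === null || v === undefined) continue; // nulls skipped everywhere
    acc.count++;
    if (typeof v === 'number' && Number.isFinite(v)) {
      acc.sum += v; // SUM/AVG only see finite numbers
      acc.numericCount++;
    }
    if (acc.min === undefined || v < acc.min) acc.min = v; // strings allowed
    if (acc.max === undefined || v > acc.max) acc.max = v;
  }
}

function finishAggregates(acc) {
  const hasNumeric = acc.numericCount > 0;
  return {
    countStar: acc.rows,
    count: acc.count,
    sum: hasNumeric ? acc.sum : null, // empty input yields null, not 0
    avg: hasNumeric ? acc.sum / acc.numericCount : null,
    min: acc.min ?? null,
    max: acc.max ?? null,
  };
}
```

A caller would invoke `accumulateBatch` once per batch from the child's stream and call `finishAggregates` after the last batch; checking an abort signal between those calls is left out here.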
Summary
Rebase + resolve conflicts on PR #33 (sq-lcw.5 budget)
Changes by Bead
Validation
Risks / Follow-ups
integration/batch-execution before marking this PR ready.