fix: track join_arrays memory in reservation after SMJ spill #21962
SubhamSinghal wants to merge 7 commits into apache:main from …
Conversation
Thanks for picking this up, the accounting fix itself reads cleanly and the …

A few things I wanted to ask about:

**Tests that fail without the fix?** Since this is a pure accounting fix with no observable behavior change, could you point to a test here that fails on … In …

**Test duplication** The three …

**Minor** …

Overall the change looks safe and the invariant is nice. Mainly curious about the test angle before signing off.
@mbutrovich Thanks for reviewing this PR. I have removed the redundant tests and added a unit test that always hits the spill path.
@SubhamSinghal
Thanks for working on this. I think there is still a memory accounting gap in the spill path that needs to be addressed before this can merge.
```rust
// usually succeed. If it fails, reserved_amount stays 0 -
// best-effort tracking, free_reservation will safely be a no-op.
let join_arrays_mem = buffered_batch.join_arrays_mem();
if self.reservation.try_grow(join_arrays_mem).is_ok() {
```
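The gap in this conditional pattern can be illustrated with a toy reservation type. `ToyReservation` and `best_effort_reserve` below are hypothetical stand-ins, not DataFusion's actual `MemoryReservation` API: when `try_grow` fails, the retained bytes are simply never recorded.

```rust
// Toy model of the best-effort pattern shown above. `ToyReservation`
// is a hypothetical stand-in, not DataFusion's MemoryReservation API.
struct ToyReservation {
    pool_limit: usize,
    reserved: usize,
}

impl ToyReservation {
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.reserved + bytes > self.pool_limit {
            Err("pool exhausted".to_string())
        } else {
            self.reserved += bytes;
            Ok(())
        }
    }
}

// Best-effort accounting: returns the amount actually recorded.
fn best_effort_reserve(res: &mut ToyReservation, join_arrays_mem: usize) -> usize {
    if res.try_grow(join_arrays_mem).is_ok() {
        join_arrays_mem
    } else {
        0 // the retained bytes become invisible to the pool
    }
}

fn main() {
    // Pool nearly full: the 40 retained bytes cannot be recorded.
    let mut res = ToyReservation { pool_limit: 100, reserved: 90 };
    let reserved_amount = best_effort_reserve(&mut res, 40);
    assert_eq!(reserved_amount, 0);
    assert_eq!(res.reserved, 90); // pool is unaware of the retained arrays
    println!("reserved_amount = {reserved_amount}");
}
```

In the failing branch the operator physically holds `join_arrays_mem` bytes while the pool's view is unchanged, which is the invariant violation discussed below.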
I think the spill path can still leave retained join key arrays invisible to the memory pool.

Right now, if the full-batch `try_grow(size_estimation)` fails because the pool is full, and the follow-up `try_grow(join_arrays_mem)` also fails, we spill the IPC batch but still push `buffered_batch` with `reserved_amount = 0`.

At that point the operator is still holding the retained `join_arrays`, but the pool is no longer aware of them when making spill decisions for other operators. This seems like the same invariant violation we were trying to avoid.

I think this can still happen with concurrent reservations, or when the memory limit is below a single join-array allocation; in those cases many skewed spilled batches could accumulate untracked memory.

Can we make the retained in-memory portion accounted for deterministically here? For example, by growing or resizing the reservation after the physical memory is retained, or by returning an error instead of continuing untracked.

It would also be great to add a regression test that covers the no-headroom path where `try_grow(join_arrays_mem)` fails, since the current test only exercises the successful reservation case.
Thanks for the review @kosiew. I've updated the implementation to use an unconditional `grow()` instead. The join key arrays are physically in memory; not tracking them gives the pool a stale view and could let concurrent operators over-allocate. Since `grow()` is infallible (it returns `()`), the accounting is now deterministic: there is no conditional path.
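The unconditional-grow design described above can be sketched with toy types. `Pool`, `BufferedBatchSketch`, `spill_and_account`, and `free` are illustrative stand-ins under assumed semantics (grow always records, shrink only releases what was grown), not DataFusion's actual API:

```rust
// Toy sketch of deterministic accounting: grow() always records the
// bytes (it may push usage past the soft limit), and free() shrinks
// by exactly what was recorded. Stand-in types, not DataFusion's API.
#[derive(Default)]
struct Pool {
    reserved: usize,
}

impl Pool {
    fn grow(&mut self, bytes: usize) {
        // Infallible: the bytes are physically allocated already,
        // so the pool must see them even if it is over its limit.
        self.reserved += bytes;
    }
    fn shrink(&mut self, bytes: usize) {
        assert!(bytes <= self.reserved, "shrink more than grown");
        self.reserved -= bytes;
    }
}

struct BufferedBatchSketch {
    join_arrays_mem: usize,
    reserved_amount: usize, // exactly what was grown for this batch
}

fn spill_and_account(pool: &mut Pool, mut batch: BufferedBatchSketch) -> BufferedBatchSketch {
    // After spilling the RecordBatch to disk, the join key arrays
    // remain in memory; record them unconditionally.
    pool.grow(batch.join_arrays_mem);
    batch.reserved_amount = batch.join_arrays_mem;
    batch
}

fn free(pool: &mut Pool, batch: &BufferedBatchSketch) {
    // Invariant: only shrink by what was actually grown.
    pool.shrink(batch.reserved_amount);
}

fn main() {
    let mut pool = Pool::default();
    let batch = spill_and_account(
        &mut pool,
        BufferedBatchSketch { join_arrays_mem: 64, reserved_amount: 0 },
    );
    assert_eq!(pool.reserved, 64); // retained arrays are visible
    free(&mut pool, &batch);
    assert_eq!(pool.reserved, 0); // fully released, no underflow
}
```

Because `free` shrinks by `reserved_amount` rather than re-deriving the size, the grow/shrink pair is symmetric by construction.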
Thanks for turning this around quickly, the force-grow version is easier to reason about than the conditional one. A few things I wanted to ask about before signing off:

**`grow()` can push over the pool limit** …

**Recomputing `join_arrays` memory** …

**`try_shrink` vs `shrink` on free** Given the invariant that we only shrink by what we grew, would …

**`peak_mem` assertion strength** The two new tests check …

**Perf-wise** I looked at this against #20729 and I don't think the concerns there apply here. The extra …
Which issue does this PR close?
Related to the TODO at `materializing_stream.rs:283` (from #17429): spilled `BufferedBatch` join key arrays are not tracked in the memory reservation.

Rationale for this change
When a `BufferedBatch` is spilled to disk in Sort Merge Join, only the `RecordBatch` data is written to the IPC file. The `join_arrays` (evaluated join key columns) remain in memory because the merge-scan comparator needs them to detect key group boundaries.

Before this fix, these in-memory `join_arrays` were invisible to the memory pool:

```
allocate_reservation():
  try_grow(size_estimation) → FAILS (pool full)
  spill batch to disk
  → join_arrays still in memory, but reservation was never grown
  → pool thinks 0 bytes are used for this batch

free_reservation():
  if InMemory → shrink(size_estimation)
  if Spilled  → no-op  ← correct (nothing was grown), but join_arrays are invisible
```
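The fix hinges on knowing how many bytes the retained join key arrays occupy. A stdlib-only sketch of such a helper follows; `JoinArray` is a hypothetical stand-in for an Arrow array (real code would sum the Arrow array buffer sizes):

```rust
// Hypothetical stand-in for an Arrow array: a value buffer plus a
// validity bitmap. Not the actual arrow-rs types.
struct JoinArray {
    values: Vec<u8>, // raw value buffer
    nulls: Vec<u8>,  // validity bitmap
}

impl JoinArray {
    fn memory_size(&self) -> usize {
        // Length of the allocated buffers; a real implementation
        // would report allocated buffer sizes from the Arrow arrays.
        self.values.len() + self.nulls.len()
    }
}

// Total in-memory footprint of the retained join key arrays,
// i.e. the amount a join_arrays_mem()-style helper would return.
fn join_arrays_mem(join_arrays: &[JoinArray]) -> usize {
    join_arrays.iter().map(|a| a.memory_size()).sum()
}

fn main() {
    let arrays = vec![
        JoinArray { values: vec![0u8; 1024], nulls: vec![0u8; 128] },
        JoinArray { values: vec![0u8; 512], nulls: vec![0u8; 64] },
    ];
    assert_eq!(join_arrays_mem(&arrays), 1024 + 128 + 512 + 64);
}
```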
With many spilled batches for a skewed key (e.g., millions of rows sharing the same join key), the untracked `join_arrays` memory accumulates. The memory pool cannot account for this when making spill decisions for concurrent operators.

What changes are included in this PR?
Memory accounting fix (`materializing_stream.rs`):

- Added a `reserved_amount` field to `BufferedBatch`: tracks how much memory was actually reserved in the pool for this batch
- Added a `join_arrays_mem()` helper: computes the total memory of the join key arrays
- `allocate_reservation()`: after spilling, calls `try_grow(join_arrays_mem)` to track the remaining in-memory data. If the pool is too tight for even that, `reserved_amount` stays 0 (best-effort, safe)
- `free_reservation()`: shrinks by `reserved_amount` instead of checking the `InMemory` variant. Invariant: only shrink by what was actually grown, so there is no underflow risk

Tests (`tests.rs`):

- `spill_many_batches_same_key`: 10+5 batches all sharing key=1, verifies correctness under heavy spilling
- `spill_string_join_keys`: Utf8 join keys to exercise a larger `join_arrays` footprint
- `spill_mixed_keys_some_match`: multiple distinct keys with partial matching, tests Full outer join NULL rows from spilled batches
- `spill_join_arrays_memory_accounting`: verifies the memory pool is fully released after the join completes (`memory_pool.reserved() == 0`) and `peak_mem_used > 0`
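The no-headroom regression idea raised in review (pool limit below a single join-array allocation) can be sketched with a toy pool that also tracks peak usage. This is a stdlib-only illustration of the assertions, not the DataFusion test harness:

```rust
// Sketch of the no-headroom regression idea: even when the pool
// limit is below a single join-array allocation, the retained bytes
// must still be visible to the pool. Toy types, not DataFusion APIs.
struct Pool {
    limit: usize,
    reserved: usize,
    peak: usize,
}

impl Pool {
    fn new(limit: usize) -> Self {
        Pool { limit, reserved: 0, peak: 0 }
    }
    fn grow(&mut self, bytes: usize) {
        self.reserved += bytes; // infallible: memory is already allocated
        self.peak = self.peak.max(self.reserved);
    }
    fn shrink(&mut self, bytes: usize) {
        self.reserved -= bytes;
    }
}

fn main() {
    // Pool limit smaller than one join-array allocation.
    let mut pool = Pool::new(32);
    let join_arrays_mem = 64;

    // Spill path: batch data goes to disk, join arrays stay resident
    // and are recorded even though this exceeds the limit.
    pool.grow(join_arrays_mem);
    assert!(pool.reserved > pool.limit); // over-limit, but visible

    // Join completes: everything grown is shrunk back.
    pool.shrink(join_arrays_mem);
    assert_eq!(pool.reserved, 0); // fully released
    assert!(pool.peak > 0);       // spill path really reserved memory
}
```

The final two assertions mirror the `memory_pool.reserved() == 0` and `peak_mem_used > 0` checks described for `spill_join_arrays_memory_accounting`.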
Yes. Four new tests added covering heavy spilling with same-key batches, string join keys, mixed keys with partial matching, and memory pool accounting verification.
Are there any user-facing changes?
No.