[VL] Restore hash shuffle reader payload merging #12097
Conversation
The payload merging in the shuffle reader was removed because its functionality duplicated VeloxResizeBatchesExec.
Another benefit of using …
@marin-ma I re-checked the current code paths. The issue we are trying to address here is not really a shuffle regression by itself. After #10499, the hash shuffle reader can emit one batch per payload, which keeps shuffle-read output batches small. I also re-checked the side-effect risk in this PR. The merge is local to the hash shuffle reader and capped by the configured batch size.
@zhli1142015 I still think the "fast path" is a duplication of VeloxResizeBatchesExec. If the goal is to have the shuffle reader produce larger output, can we simply follow the native implementation in VeloxResizeBatchesExec instead?
Gate the reader-side raw payload merge fast path behind a Velox config and document how it complements VeloxResizeBatchesExec. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
We are using reader-side raw payload merge mainly because it has lower cost than VeloxResizeBatchesExec: it merges plain hash shuffle payload buffers before Velox vectors are materialized, so it avoids the generic RowVector append/resizing overhead for this case. I think it is better to treat this as a fast path rather than a replacement for VeloxResizeBatchesExec. For completeness, users can still enable VeloxResizeBatchesExec separately to cover the generic cases that this raw-payload fast path intentionally does not handle, such as complex types or dictionary-encoded payloads.
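As a rough sketch of the gating idea, assuming hypothetical names (`ReaderOptions`, `enableRawPayloadMerge`, `useRawPayloadMergeFastPath`) rather than the actual Gluten/Velox config keys or reader classes:

```cpp
#include <cstdint>

// Placeholder option holder; the real Gluten/Velox config key and reader
// structures differ, this only illustrates the gating idea.
struct ReaderOptions {
  bool enableRawPayloadMerge{true};  // hypothetical flag for the fast path
  int64_t maxMergedBatchRows{4096};  // hypothetical target batch size
};

enum class PayloadKind { kPlain, kDictionary, kComplexType };

// The fast path only touches plain payloads and only when the gate is on;
// everything else keeps the one-batch-per-payload behavior, and users can
// additionally rely on VeloxResizeBatchesExec for those generic cases.
bool useRawPayloadMergeFastPath(const ReaderOptions& options, PayloadKind kind) {
  return options.enableRawPayloadMerge && kind == PayloadKind::kPlain;
}
```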
What changes are proposed in this pull request?
After #10499, the hash shuffle reader changed from potentially merging multiple payloads into larger batches to returning one batch per payload. That kept shuffle-read output batches small and increased downstream overhead.
Restore reader-side coalescing for mergeable plain hash shuffle payloads, but flush at Spark shuffle stream boundaries so payloads from different input streams are never combined. Keep dictionary and complex-type payloads unmerged, reset dictionary state per stream, and carry over a payload that would exceed the configured batch size.
Add stream-local merge tests covering multi-column primitive/bool/string/nullable data, per-stream merge boundaries, carry-over, dictionary and complex-type paths, and invalid batch sizes.
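For illustration only, here is a minimal C++ sketch of the stream-local merge policy described above; `Payload`, `PayloadKind`, `mergePayloads`, and `mergeStream` are simplified stand-ins (tracking row counts rather than the serialized payload buffers the real reader merges) and do not mirror the actual Gluten shuffle reader types.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical payload stand-in; the real reader works on serialized shuffle
// payload buffers, not plain row counts.
enum class PayloadKind { kPlain, kDictionary, kComplexType };

struct Payload {
  PayloadKind kind;
  int64_t numRows;
};

// Stands in for the buffer-level merge of plain payloads.
Payload mergePayloads(const std::vector<Payload>& parts) {
  int64_t rows = 0;
  for (const auto& p : parts) {
    rows += p.numRows;
  }
  return Payload{PayloadKind::kPlain, rows};
}

// Coalesces the plain payloads of one shuffle input stream up to maxBatchRows.
// Dictionary and complex-type payloads pass through unmerged, and a payload
// that would overflow the target size is carried over into the next output
// batch. Calling this per stream (and flushing at the end) keeps payloads from
// different Spark shuffle streams separate.
std::vector<Payload> mergeStream(const std::vector<Payload>& stream, int64_t maxBatchRows) {
  std::vector<Payload> out;
  std::vector<Payload> pending;
  int64_t pendingRows = 0;

  auto flushPending = [&]() {
    if (!pending.empty()) {
      out.push_back(mergePayloads(pending));
      pending.clear();
      pendingRows = 0;
    }
  };

  for (const auto& payload : stream) {
    if (payload.kind != PayloadKind::kPlain) {
      // Dictionary and complex-type payloads are never merged.
      flushPending();
      out.push_back(payload);
      continue;
    }
    if (pendingRows > 0 && pendingRows + payload.numRows > maxBatchRows) {
      // Emit what has accumulated; this payload is carried over.
      flushPending();
    }
    pending.push_back(payload);
    pendingRows += payload.numRows;
  }
  flushPending();  // stream boundary: never merge across streams
  return out;
}
```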

Current behavior: the hash shuffle reader returns one output batch per payload, so shuffle-read batches stay small and downstream operators pay extra per-batch overhead.
Expected behavior: mergeable plain hash shuffle payloads from the same input stream are coalesced up to the configured batch size before being returned, while dictionary and complex-type payloads remain unmerged.

How was this patch tested?
UT
Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot