[Spark] Add asynchronous Auto Compaction (PR 1: core async machinery)#7090
Open
mwc360 wants to merge 3 commits into
Open
[Spark] Add asynchronous Auto Compaction (PR 1: core async machinery)#7090mwc360 wants to merge 3 commits into
mwc360 wants to merge 3 commits into
Conversation
374e1a0 to
e63bd3f
Compare
Introduces an opt-in mode that runs Auto Compaction on a JVM-wide daemon
thread pool instead of inline on the writer thread, eliminating the
write-latency impact of AC for streaming and high-throughput workloads.
Default behavior is unchanged.
Components:
* AsyncAutoCompactService: singleton with a bounded ThreadPoolExecutor
(size = autoCompact.async.parallelism, queue = autoCompact.async.maxQueueSize)
and a per-table inflight counter. Submitting a CommittedTransaction is
fire-and-forget; the worker re-reads deltaLog.update() and re-runs
AutoCompact.compactIfNecessary against the fresh snapshot, so a peer
that already compacted causes the worker to skip cleanly.
- Sets/clears SparkSession.{active,default} on the worker thread so
downstream code paths that consult SparkSession.getActiveSession
(e.g. OptimizeExecutor, DeltaUDF) see the right session.
- Wraps execution in NonFatal try/catch and records a telemetry event;
writer threads are never observably affected by async failures.
- Backpressure: queue-full submissions are recorded as
delta.autoCompaction.async.dropped and silently declined.
* AutoCompactBase.run: branches on AsyncAutoCompactService.isEnabled.
Inline path is byte-identical to today when the flag is off.
* Per-table dedup: at most one running + one queued task per tableId.
Redundant submissions are recorded as
delta.autoCompaction.async.coalesced.
* New configs (DeltaSQLConf):
- autoCompact.async.enabled (default false; opt-in)
- autoCompact.async.parallelism (default 2)
- autoCompact.async.maxQueueSize (default 64)
- autoCompact.async.backpressure ("drop", reserved for future policies)
* Telemetry events: delta.autoCompaction.async.{submitted, coalesced,
dropped, completed, error}.
Tests: AutoCompactAsyncSuite (6 tests) covers default-off behaviour,
opt-in commit-after-write-returns, snapshot freshness on a fresh
snapshot, queue-full backpressure, and per-table coalescing under load.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Async Auto Compaction introduced a possible race against in-flight user
DML (MERGE/DELETE/UPDATE/OVERWRITE): an AC OPTIMIZE that wins the commit
race causes the user's DML to throw ConcurrentModificationException
after potentially minutes of Spark work. Even though Delta retries are
cheap at the metadata level, MERGE's retry harness only handles
CHECKPOINT_RDD_BLOCK_ID_NOT_FOUND, not CME, so users would see the
failure.
This change inverts the conflict: async AC yields to in-flight DML
rather than the other way around. AC is opportunistic background work;
user writes are the critical workload, so deferring AC by one cycle is
strictly preferable to surfacing an avoidable failure to a user query
that has already done expensive shuffles and DV writes.
Components:
* InflightDMLRegistry: JVM-wide singleton mapping tableId -> active DML
counter. Supports a single acquire-listener slot per table (single is
sufficient because the async-AC submission path's per-table dedup
guarantees <= 1 async AC task per table at a time).
* Three coordination points in AsyncAutoCompactService:
1. Dequeue-time check: cheap. If a DML is already active when the
worker dequeues, skip silently and let the DML's own post-commit
hook resubmit AC when it releases.
2. Mid-flight cancellation: worker tags its Spark jobs with a job
group and registers an acquire-listener. On DML acquire the
listener calls sparkContext.cancelJobGroup(jobGroupId,
interruptOnCancel=true), which surfaces as a SparkException on the
driver and is caught.
3. Pre-commit gate: a check inside OptimizeExecutor.commitAndRetry
(guarded by a ThreadLocal `asyncWorkerThread` flag so manual
OPTIMIZE is unaffected) throws AsyncAutoCompactCancelledException
before the first commit attempt if DML went active.
A re-check after listener install catches the narrow race where DML
acquires between the dequeue check and the listener installation.
* DML hook points: MergeIntoCommandBase.run, DeleteCommand.run,
UpdateCommand.run, and WriteIntoDelta.run (overwrite only -- APPEND
skips because appends never conflict with AC) wrap their bodies in
InflightDMLRegistry.acquire/release try/finally.
* Telemetry + observability:
- WARN logs on every yield path with MDC(METADATA_ID).
- Three telemetry events:
delta.autoCompaction.async.yieldedToDML.{atDequeue, midFlight,
preCommit}.
- Test-visible per-table per-kind yield counters.
Tests: AutoCompactDMLYieldSuite covers registry primitives (reference
counting, listener safety), end-to-end dequeue yield, end-to-end
pre-commit yield, manual OPTIMIZE bypassing the yield gate, MERGE
acquire/release lifecycle, and composition with the per-table async-AC
dedup.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…L yield Streaming MERGE/UPDATE/DELETE workloads commit small batches in tight loops. With async Auto Compaction enabled, every commit submits an AC task to the async pool; before that task can finish, the next DML batch arrives and the in-flight AC yields to it (at dequeue, mid-flight, or at pre-commit). The result is wasted Spark cycles: AC is submitted, scheduled, partially executed, and then aborted -- without making any progress on the table. This change adds a sticky per-table fallback: when an async AC task yields to DML for any reason, the table is marked as "inline-fallback" in a JVM-wide ConcurrentHashMap. Future writes to that table run AC inline on the writer's own commit path, so the work completes (no future writer exists yet to cancel it) and no Spark cycles are wasted on the async-submit-then-yield loop. Design choices: - Sticky forever (until JVM restart). On a streaming pipeline where the first yield happened, the next yield is overwhelmingly likely; recovery heuristics add complexity for no real benefit. - One yield is enough to engage the fallback. All three yield kinds (atDequeue, midFlight, preCommit) count. - Opt-out via the new conf spark.databricks.delta.autoCompact.async.fallbackToInline.enabled (default true). When false, behavior is identical to the prior async path -- the table keeps re-submitting and re-yielding. - One-shot telemetry: delta.autoCompaction.async.fallbackEngaged fires exactly once per table via putIfAbsent, so log volume is bounded regardless of write rate. - Per-table state: yielding table A does not put table B in inline mode. - Manual OPTIMIZE is unaffected (it does not consult the async path). Tests (AutoCompactInlineFallbackSuite): - After a single dequeue yield, the next AC eval on the same table runs inline (no async.submitted event; AC commit lands on the writer's own commit version). - fallbackEngaged telemetry fires exactly once per table. - Per-table isolation. - With the conf disabled, the next AC still submits async and yields. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
acce1cf to
13e4be6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which Delta project/connector is this regarding?
Description
Introduces an opt-in mode that runs Auto Compaction on a JVM-wide daemon thread pool instead of inline on the writer thread, eliminating the write-latency impact of AC for streaming and high-throughput workloads. Default behavior is unchanged.
Components:
AsyncAutoCompactService: singleton with a bounded ThreadPoolExecutor (size = autoCompact.async.parallelism, queue = autoCompact.async.maxQueueSize) and a per-table inflight counter. Submitting a CommittedTransaction is fire-and-forget; the worker re-reads deltaLog.update() and re-runs AutoCompact.compactIfNecessary against the fresh snapshot, so a peer that already compacted causes the worker to skip cleanly.
New session-only confs (no table property -- async is writer-mechanics, no need for table specifics):
Telemetry: delta.autoCompaction.async.{submitted,dropped,error,completed}.
Correctness: AutoCompactPartitionReserve still serializes overlapping AC runs because eligibility + reservation + optimize all execute on the same worker thread. OPTIMIZE conflict-retry (OptimizeExecutor.commitAndRetry) covers cross-JVM races as it does today. No new conflict class is introduced.
Shutdown drain and additional backpressure modes (block, fallback-inline) are intentionally reserved for follow-up PRs.
Resolves PR 1 of #7089
How was this patch tested?
Tests (AutoCompactAsyncSuite, 5 tests, all passing):
Perf benchmark:
Config: 50 sequential writes, 8 small files per write, sync
MIN_NUM_FILES=24, asyncMIN_NUM_FILES=0, async parallelism=2.Does this PR introduce any user-facing changes?
AutoCompactBase.run: branches on AsyncAutoCompactService.isEnabled. Inline path is byte-identical to today when the flag is off. Disabled by default, users can opt-in to it. It may make sense to make this the default mode in a future Delta major version.