feat: support Spark Structured Streaming writes #564
Draft
LuciferYang wants to merge 1 commit into
Closes lance-format#246.

Adds a Spark Structured Streaming sink for Lance. Each non-empty micro-batch produces a single Lance transaction stamped with (streamingQueryId, epochId) in its transaction properties; replay dedupe scans recent transaction history via DatasetDelta.listTransactions for an existing pair and skips the commit if it finds one. Empty epochs issue no transaction.

Append, Complete, and Update output modes are routed through SparkWriteBuilder (now also implementing SupportsStreamingUpdateAsAppend). Complete maps to a Lance Overwrite per epoch; Update is append-only per Spark's marker contract. CTAS / staged-commit flows are rejected with an actionable error since per-epoch commits are incompatible with the single-shot staged commit cadence.

User surface:
- `streamingQueryId` (required): globally unique idempotency key. Two queries sharing the same id would dedupe each other's epochs.
- `lance.streaming.dedupe.lookback.versions` (default 100, max 10000): how far back the replay scan looks before assuming the epoch is new. Raise it on high-churn tables; lower it to bound restart-time scans.

Transaction-property keys `lance.streaming.queryId` and `lance.streaming.epochId` are stamped on every commit and are part of the stability contract: external tooling can read them straight from Lance transaction history.

Notes / non-goals:
- Requires lance-core with the DatasetDelta JNI binding rename (lance-format/lance#6963); pom bumps lance.version to 7.1.0-beta.4.
- Streaming reads (MicroBatchStream) are not implemented yet; tracked separately.
- Row-level UPDATE/DELETE via position-delta is not exposed on the streaming path.
- The target Lance table must exist before the query starts; the sink does not auto-create.
- The sink does NOT pin the dataset version on the writer: every commit opens at the current latest so the dedupe scan window and the transaction's readVersion both reflect on-disk reality. A multi-epoch regression test (testMultipleEpochsOnSameSinkAdvanceVersionMonotonically) uses maxFilesPerTrigger=1 to share one sink across three epochs and asserts versions advance monotonically.

Test coverage:
- BaseStreamingWriteTest: 5 cases covering append happy path, missing streamingQueryId failure, replay dedupe, multi-epoch on one sink, empty-epoch no-op.
- SparkWriteTest: toStreaming returns LanceStreamingWrite when streamingQueryId is provided, throws IAE without it, rejects staged commits.

User-facing doc at docs/src/streaming.md covers semantics, output modes, exactly-once contract, bounded at-least-once fallback, and OPTIMIZE cadence guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
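
For reviewers who want the replay-dedupe idea in isolation, here is a minimal sketch of the bounded lookback scan. It is not the code in this PR: `ReplayDedupeSketch`, `alreadyCommitted`, and the list-of-property-maps model are hypothetical stand-ins for whatever `DatasetDelta.listTransactions` returns, and storing the epoch id as a decimal string is an assumption. Only the two property keys come from the description above.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only; a stand-in for the sink's replay check, not the PR's implementation. */
final class ReplayDedupeSketch {
  // Transaction-property keys named in the commit message.
  static final String QUERY_ID_KEY = "lance.streaming.queryId";
  static final String EPOCH_ID_KEY = "lance.streaming.epochId";

  /**
   * Returns true if one of the newest {@code lookback} transactions carries the
   * given (queryId, epochId) pair.
   *
   * @param history one property map per transaction, ordered oldest to newest
   *                (hypothetical model of the transaction history)
   */
  static boolean alreadyCommitted(
      List<Map<String, String>> history, String queryId, long epochId, int lookback) {
    // Assumption: the epoch id is stamped as a decimal string.
    String epoch = Long.toString(epochId);
    // Index of the oldest transaction still inside the lookback window.
    int oldest = Math.max(0, history.size() - lookback);
    // Walk newest to oldest so a recent replay is found quickly.
    for (int i = history.size() - 1; i >= oldest; i--) {
      Map<String, String> props = history.get(i);
      if (queryId.equals(props.get(QUERY_ID_KEY)) && epoch.equals(props.get(EPOCH_ID_KEY))) {
        return true;
      }
    }
    // Not found inside the window: the caller treats the epoch as new.
    return false;
  }
}
```

In this sketch, a pair that sits outside the window is reported as not committed, which is the bounded at-least-once fallback mentioned above: a replayed epoch older than the lookback would be written again.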
Closes #246.
Summary
Adds a Structured Streaming sink for Lance. Each non-empty micro-batch issues one Lance transaction stamped with `lance.streaming.queryId` and `lance.streaming.epochId` in its transaction properties; replay dedupe scans recent transaction history via `DatasetDelta.listTransactions` for a matching pair. Empty epochs issue no transaction.

`SparkWriteBuilder` now implements `SupportsStreamingUpdateAsAppend` and `LanceDataset` advertises `TableCapability.STREAMING_WRITE`. Append / Complete / Update flow through the existing V2 plumbing. CTAS and staged commits are rejected at `toStreaming()` with a clear error.

`streamingQueryId` is a required option: it is the idempotency key for the dedupe scan and must be globally unique across concurrent queries on the same table. `lance.streaming.dedupe.lookback.versions` (default 100, max 10000) caps the scan depth.

Why one transaction per epoch
PR #399 used a two-transaction Append + UpdateConfig design, which doubled manifest growth and per-epoch latency at scale. Stamping the identity inside the Append moves the dedupe signal into transaction history and keeps writes to one manifest update per epoch. The upstream JNI fix (#6963) is what makes the `DatasetDelta` scan actually buildable from Java.

Why the sink doesn't pin the dataset version
`LanceBatchWrite` pins `dataset.version()` in its constructor because it commits exactly once. Streaming reuses one sink across many epochs, so a pinned version would go stale immediately: the dedupe scan would point at the wrong range and the transaction's `readVersion` would lag. Every commit opens at the current latest instead. `testMultipleEpochsOnSameSinkAdvanceVersionMonotonically` is the regression test for this path.

Tests
`BaseStreamingWriteTest` covers append happy-path, missing `streamingQueryId`, replay dedupe, multi-epoch on one sink, and empty-epoch no-op. `SparkWriteTest` adds three assertions on `toStreaming`'s contract: returns `LanceStreamingWrite` when configured, throws `IllegalArgumentException` without `streamingQueryId`, throws `UnsupportedOperationException` for staged commits.

Per-module clean runs against `lance-core 7.1.0-beta.4`: `make lint` clean.

Out of scope
- Streaming reads (`MicroBatchStream`): follow-up.
- … `DatasetDelta` scan.

Docs at `docs/src/streaming.md`.
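
As a companion to the option contract in the Summary, here is a small sketch of how the two options could be validated. The class and method names (`StreamingOptionsSketch`, `requireQueryId`, `lookbackVersions`) are hypothetical, and two behaviours are assumptions rather than anything this PR states: out-of-range lookback values are rejected instead of clamped, and the minimum accepted value is 1. A plain `Map` stands in for Spark's option map.

```java
import java.util.Map;

/** Illustrative only; hypothetical validation of the two streaming options. */
final class StreamingOptionsSketch {
  // Option names and bounds as given in the PR description.
  static final String QUERY_ID = "streamingQueryId";
  static final String LOOKBACK = "lance.streaming.dedupe.lookback.versions";
  static final int DEFAULT_LOOKBACK = 100;
  static final int MAX_LOOKBACK = 10_000;

  /** Returns the idempotency key, or throws if it is missing or blank. */
  static String requireQueryId(Map<String, String> options) {
    String id = options.get(QUERY_ID);
    if (id == null || id.isBlank()) {
      throw new IllegalArgumentException(
          "Streaming writes require the '" + QUERY_ID + "' option");
    }
    return id;
  }

  /** Returns the dedupe scan depth, falling back to the default when unset. */
  static int lookbackVersions(Map<String, String> options) {
    String raw = options.get(LOOKBACK);
    if (raw == null) {
      return DEFAULT_LOOKBACK;
    }
    // NumberFormatException is an IllegalArgumentException, so bad input
    // surfaces the same way as an out-of-range value.
    int value = Integer.parseInt(raw.trim());
    // Assumption: reject rather than clamp, and require at least 1.
    if (value < 1 || value > MAX_LOOKBACK) {
      throw new IllegalArgumentException(
          LOOKBACK + " must be between 1 and " + MAX_LOOKBACK + ", got " + value);
    }
    return value;
  }
}
```

Failing fast on a missing `streamingQueryId` matches the `IllegalArgumentException` that `SparkWriteTest` asserts on; whether the real sink rejects or clamps an oversized lookback is not stated here.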