backfill: dedup against unflushed spool rows (#107) #114

Merged
philcunliffe merged 3 commits into master from fix/issue-107-backfill-dedup-spool on Jun 16, 2026

@philcunliffe

Root cause

hyp backfill deduped only against committed Iceberg part_ids via scanExistingPartIds (committed-partition scan in dataset.js), and was blind to the spool. When rows are captured live but still sitting unflushed in the spool — the window is the ~2-minute flush debounce, so this is deterministic, not a tight race — backfill cannot see them. It materializes its own copy of those messages, then the spool flushes its copy later, leaving two rows with the same part_id.

Fix

Fold the part_ids of rows pending in the spool into the backfill materializer's pre-write seen-set.

  • src/core/cache/spool.js — add a read-only readSpooledRows(tablePath) that parses the pending active.jsonl + rotated flush-*.jsonl envelopes ({version, columns, rows}) and yields rows. It never rotates files or advances flush progress, so it is safe to call alongside live capture; any error degrades to skipping that file/line. discoverSpoolTables is now exported.
  • src/core/cache/storage.js — add readSpooledRows(dataset, columns) to the extended storage service: discovers the dataset's spool tables (filtering by datasetForTablePath), strips internal fields, and projects to the requested columns for parity with readRows.
  • hypaware-core/plugins-workspace/ai-gateway/src/dataset.js — add scanSpooledPartIds(storage, seen) and invoke it only from createBackfillDedupe, once per run alongside the committed scan. Best-effort: a storage stub without the spool surface, or any read error, leaves the seen-set untouched rather than dropping rows (a dedupe miss is recoverable via compaction; dropping rows is not).
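The best-effort behavior described above can be sketched roughly as follows. This is an illustration, not the code in dataset.js: the function bodies, the `createBackfillDedupe(storage, committedPartIds)` signature, the dataset name, and the `part_id` values are all hypothetical, and `readSpooledRows` is assumed here to return a synchronous iterable (the real surface may be async). Legacy `message_id` + `part_index` matching is left out.

```javascript
// Illustrative sketch only; names mirror the PR, bodies are assumptions.
function scanSpooledPartIds(storage, seen, dataset = "example_dataset") {
  // A storage stub without the spool surface: nothing to fold in.
  if (!storage || typeof storage.readSpooledRows !== "function") return seen;

  // Collect into a scratch set first so a mid-stream read error
  // leaves the caller's seen-set untouched.
  const pending = new Set();
  try {
    for (const row of storage.readSpooledRows(dataset, ["part_id"])) {
      if (row && typeof row.part_id === "string") pending.add(row.part_id);
    }
  } catch {
    return seen; // degrade to committed-only dedup
  }
  for (const id of pending) seen.add(id);
  return seen;
}

// Hypothetical shape: committed ids and spooled ids unioned into one seen-set.
function createBackfillDedupe(storage, committedPartIds) {
  const seen = new Set(committedPartIds);
  scanSpooledPartIds(storage, seen); // once per run
  return (rows) =>
    rows.filter((row) => {
      if (seen.has(row.part_id)) return false;
      seen.add(row.part_id);
      return true;
    });
}

// "m1:0" is committed, "m2:0" is still in the spool, "m3:0" is new.
const storage = { readSpooledRows: () => [{ part_id: "m2:0" }] };
const dedupe = createBackfillDedupe(storage, ["m1:0"]);
const kept = dedupe([{ part_id: "m1:0" }, { part_id: "m2:0" }, { part_id: "m3:0" }]);
console.log(kept.map((row) => row.part_id)); // only "m3:0" survives
```

Buffering into a scratch set is one way to honor "leaves the seen-set untouched" on a partial read; the actual implementation may handle partial reads differently.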

Settle-path hazard avoided

scanExistingPartIds is also used by dedupeByPartId inside createSettleBatch, which runs during flush. The rows being settled at flush are the spool rows. If the spool scan were wired into the settle path, every row would match itself in the seen-set and be dropped — the flush would delete the very data it is committing. The spool scan is therefore strictly opt-in (backfill only), and both scanSpooledPartIds and dedupeByPartId carry comments stating this constraint.
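A minimal sketch of the hazard, assuming `dedupeByPartId` keeps only rows whose `part_id` is not already in the seen-set. The function body and the `part_id` values are hypothetical stand-ins, not the code in dataset.js.

```javascript
// Hypothetical stand-in for flush-time settlement dedupe.
function dedupeByPartId(rows, seen) {
  const kept = [];
  for (const row of rows) {
    if (seen.has(row.part_id)) continue; // already known: drop
    seen.add(row.part_id);
    kept.push(row);
  }
  return kept;
}

// At flush, the rows being settled are exactly the spool rows.
const spoolRows = [{ part_id: "m1:0" }, { part_id: "m1:1" }];
const committedIds = ["m0:0"];

// Committed-only seen-set (what the PR keeps): both rows are settled.
const safe = dedupeByPartId(spoolRows, new Set(committedIds));

// Seen-set also seeded from the spool (the hazard): every row matches itself.
const spoolIds = spoolRows.map((row) => row.part_id);
const hazardous = dedupeByPartId(spoolRows, new Set([...committedIds, ...spoolIds]));

console.log(safe.length, hazardous.length); // 2 0
```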

LLP

llp/0027-cache-settlement.decision.md — the "backfill-vs-spool same-id duplicates" open question is marked resolved (scan spooled rows in the materializer). @ref LLP 0027#open-questions annotations added above the spool-scan constructs in dataset.js and storage.js.

Tests

  • Materializer (test/plugins/ai-gateway-backfill-materializer.test.js): spooled rows are not re-materialized; only spool-overlapping parts are skipped; legacy spooled rows match via message_id+part_index; committed ∪ spool unioned into one seen-set; unreadable spool degrades to committed-only dedup; a stub without readSpooledRows still dedupes against committed. Existing no-spool tests stay green.
  • Storage (test/core/cache-storage.test.js): readSpooledRows yields unflushed rows then is empty after flush; projects to requested columns and filters by dataset; empty stream for an unknown dataset.

npm test (1155 pass / 0 fail / 1 skipped), npm run lint, and npm run typecheck all pass.

Note: the backfill_claude_fixture hermetic smoke is red on master independently of this change (it produces zero backfilled rows on a pristine checkout) and is unaffected here — this dedup change can only ever skip rows that already exist.

Fixes #107

🤖 Generated with Claude Code

`hyp backfill` deduped only against committed Iceberg `part_id`s
(`scanExistingPartIds`) and was blind to the spool. Rows captured live
but still sitting unflushed in the spool (debounce window, captured not
yet committed) were invisible, so backfill materialized its own copy and
the spool later flushed its copy — two rows with the same `part_id`.

Fix: fold spooled `part_id`s into the backfill materializer's pre-write
seen-set.

- spool.js: read-only `readSpooledRows(tablePath)` that parses the
  pending `active.jsonl` + `flush-*.jsonl` envelopes and yields rows; it
  never rotates files or advances flush progress, so it is safe beside
  live capture. `discoverSpoolTables` is now exported.
- storage.js: `readSpooledRows(dataset, columns)` on the extended storage
  service — discovers a dataset's spool tables, strips internal fields,
  projects to the requested columns (parity with `readRows`).
- dataset.js: `scanSpooledPartIds` folds spool `part_id`s into the
  seen-set, invoked ONLY from `createBackfillDedupe` (once per run,
  alongside the committed scan). Best-effort: a missing spool surface or
  read error leaves the seen-set untouched rather than dropping rows.

Settle-path hazard avoided: `dedupeByPartId` (flush-time settlement) is
deliberately left scanning committed partitions only. The rows it settles
at flush ARE the spool rows; seeding its seen-set with spool `part_id`s
would make every row match itself and be dropped — the flush would delete
the data it is committing. Both functions carry comments stating this.

LLP 0027: marks the "backfill-vs-spool same-id duplicates" open question
resolved (scan spooled rows in the materializer). `@ref LLP 0027` added
above the spool-scan constructs.

Tests: materializer tests for spool-overlap dedupe (full skip, partial
skip, legacy message_id+part_index recompose, committed∪spool union,
unreadable-spool degrade-to-committed, missing-surface skip) and storage
tests for `readSpooledRows` (pending-then-empty-after-flush, column
projection + dataset filtering, unknown dataset). Existing no-spool tests
stay green. Full `npm test`, lint, and typecheck pass.

Fixes #107

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: low
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

| Source | Finding (severity, evidence) | Intersects |
| --- | --- | --- |
| Codex | major/high: readSpooledRows envelope validation looser than streamFlushFile (spool.js:235, streaming-reader.js:106) | Yes: Targets (readSpooledRows), Risks (envelope-validation asymmetry) |
| Claude | No issues found | n/a |

Codex review

Fix Validations

Backfill dedupes against unflushed spool rows

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:449, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:554, src/core/cache/storage.js:300, test/plugins/ai-gateway-backfill-materializer.test.js:317
  • Assessment: Valid spooled rows now seed the backfill seen-set before materialized rows are returned, and tests cover full overlap, partial overlap, legacy message_id + part_index, read failure, and missing spool surface.

Settle-path self-drop hazard avoided

  • Status: correct
  • Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:322, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:325, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:454
  • Assessment: Flush-time dedupeByPartId still scans only committed partitions; scanSpooledPartIds is only invoked from backfill dedupe.

Findings

1) Behavioral Correctness

  • Severity: major
  • Confidence: high
  • Evidence: src/core/cache/spool.js:235, src/core/cache/spool.js:236, src/core/cache/streaming-reader.js:106, src/core/cache/streaming-reader.js:112
  • Why it matters: readSpooledRows can dedupe backfill against a parseable-but-unflushable spool line because it accepts {version:1, rows:[...]} without validating columns, while the actual flush reader rejects that same envelope as malformed and drops it.
  • Suggested fix: Make readSpooledRows use the same envelope validity contract as streamFlushFile: require Array.isArray(envelope.columns) before yielding rows, and add a storage/materializer test with a malformed parseable spool line proving backfill does not skip rows that flush cannot commit.
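The suggested contract could look roughly like this. It is a sketch under assumptions: `isFlushableEnvelope` and `readSpooledRowsFromText` are hypothetical names, the reader works on an in-memory JSONL string rather than spool files, rows are treated as opaque values, and `version === 1` is assumed to be the only accepted version.

```javascript
// Hypothetical validity check mirroring what the finding asks for:
// an envelope without an Array `columns` field is treated as malformed.
function isFlushableEnvelope(envelope) {
  return (
    envelope !== null &&
    typeof envelope === "object" &&
    envelope.version === 1 && // assumption: only version 1 is accepted
    Array.isArray(envelope.columns) &&
    Array.isArray(envelope.rows)
  );
}

// Read-only reader over JSONL text: unparseable or malformed lines are skipped.
function* readSpooledRowsFromText(jsonl) {
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let envelope;
    try {
      envelope = JSON.parse(line);
    } catch {
      continue; // not JSON: skip this line
    }
    if (!isFlushableEnvelope(envelope)) continue; // flush would drop it too
    yield* envelope.rows;
  }
}

const spoolText = [
  JSON.stringify({ version: 1, columns: ["part_id"], rows: [["m1:0"]] }),
  JSON.stringify({ version: 1, rows: [["m2:0"]] }), // parseable but no columns
  "not json at all",
].join("\n");

const visibleRows = [...readSpooledRowsFromText(spoolText)];
console.log(visibleRows.length); // 1
```

With this check in place, the row in the columns-less envelope never enters the backfill seen-set, so backfill still materializes it.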

No Finding

  1. Contract & Interface Fidelity
  2. Change Impact / Blast Radius
  3. Concurrency, Ordering & State Safety
  4. Error Handling & Resilience
  5. Security Surface
  6. Resource Lifecycle & Cleanup
  7. Release Safety
  8. Test Evidence Quality
  9. Architectural Consistency
  10. Debuggability & Operability

Evidence Bundle

  • Changed hot paths: createBackfillDedupe -> scanExistingPartIds + scanSpooledPartIds; createQueryStorageService.readSpooledRows; createCacheSpool.readSpooledRows; discoverSpoolTables.
  • Impacted callers: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:454, src/core/cache/storage.js:311, src/core/cache/spool.js:146.
  • Impacted tests: test/plugins/ai-gateway-backfill-materializer.test.js:317, test/plugins/ai-gateway-backfill-materializer.test.js:328, test/plugins/ai-gateway-backfill-materializer.test.js:339, test/plugins/ai-gateway-backfill-materializer.test.js:349, test/core/cache-storage.test.js:283, test/core/cache-storage.test.js:310.
  • Unresolved uncertainty: I did not run the test suite; I reviewed the supplied diff plus targeted caller/contract context.
Claude review

No issues found.


Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-114/dual-review/pr-114

philcunliffe and others added 2 commits June 16, 2026 12:00
…g flush

streamFlushFile treats a parseable envelope without an Array `columns`
field as malformed and drops it — its rows never reach a committed
partition. The backfill spool reader validated only `version` and `rows`,
so it would surface those rows and let backfill dedupe against them.
Backfill would then skip rows that flush will never commit, so those rows
would end up neither materialized nor committed.

Align readSpooledRows' envelope-validity contract with streamFlushFile
(require Array.isArray(columns)) and add a regression test feeding a
malformed parseable spool line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	llp/0027-cache-settlement.decision.md
@philcunliffe philcunliffe merged commit e3de836 into master Jun 16, 2026
6 checks passed
@philcunliffe philcunliffe deleted the fix/issue-107-backfill-dedup-spool branch June 16, 2026 20:11

Linked issue: Backfill doesn't dedup against the spool