backfill: dedup against unflushed spool rows (#107) by philcunliffe · Pull Request #114 · hyparam/hypaware

philcunliffe · 2026-06-16T17:21:33Z

Root cause

hyp backfill deduped only against committed Iceberg part_ids via scanExistingPartIds (committed-partition scan in dataset.js), and was blind to the spool. When rows are captured live but still sitting unflushed in the spool — the window is the ~2-minute flush debounce, so this is deterministic, not a tight race — backfill cannot see them. It materializes its own copy of those messages, then the spool flushes its copy later, leaving two rows with the same part_id.
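A minimal in-memory sketch of the failure mode described above. The row shape, the `m1:0`-style `part_id` values, and the helper name `backfillCommittedOnly` are illustrative assumptions, not the project's actual code.

```javascript
// Hypothetical model: three places a row can live.
const committed = [{ part_id: 'm1:0' }];   // already committed to the table
const spool = [{ part_id: 'm2:0' }];       // captured live, not yet flushed
const source = [{ part_id: 'm1:0' }, { part_id: 'm2:0' }, { part_id: 'm3:0' }];

// Pre-fix behavior: the seen-set is seeded from committed rows only.
function backfillCommittedOnly(sourceRows, committedRows) {
  const seen = new Set(committedRows.map((row) => row.part_id));
  return sourceRows.filter((row) => !seen.has(row.part_id));
}

const materialized = backfillCommittedOnly(source, committed);

// The spool flushes its own copy later, so the table ends up with both.
const table = [...committed, ...materialized, ...spool];

const counts = new Map();
for (const row of table) {
  counts.set(row.part_id, (counts.get(row.part_id) ?? 0) + 1);
}
const duplicates = [...counts].filter(([, n]) => n > 1).map(([id]) => id);
console.log(duplicates); // ['m2:0']
```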

Fix

Fold the part_ids of rows pending in the spool into the backfill materializer's pre-write seen-set.

src/core/cache/spool.js — add a read-only readSpooledRows(tablePath) that parses the pending active.jsonl + rotated flush-*.jsonl envelopes ({version, columns, rows}) and yields rows. It never rotates files or advances flush progress, so it is safe to call alongside live capture; any error degrades to skipping that file/line. discoverSpoolTables is now exported.
src/core/cache/storage.js — add readSpooledRows(dataset, columns) to the extended storage service: discovers the dataset's spool tables (filtering by datasetForTablePath), strips internal fields, and projects to the requested columns for parity with readRows.
hypaware-core/plugins-workspace/ai-gateway/src/dataset.js — add scanSpooledPartIds(storage, seen) and invoke it only from createBackfillDedupe, once per run alongside the committed scan. Best-effort: a storage stub without the spool surface, or any read error, leaves the seen-set untouched rather than dropping rows (a dedupe miss is recoverable via compaction; dropping rows is not).
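The best-effort behavior of the spool scan could look roughly like the sketch below. It is an assumption-laden illustration: `storage.readSpooledRows` is modeled as a synchronous iterable factory (the real surface may well be async), `'messages'` is a hypothetical dataset name, and the committed scan is replaced by a plain array.

```javascript
// Sketch only: signatures and dataset name are assumptions for illustration.
function scanSpooledPartIds(storage, seen) {
  // A storage stub without the spool surface: nothing to fold in.
  if (typeof storage?.readSpooledRows !== 'function') return seen;

  const pending = [];
  try {
    for (const row of storage.readSpooledRows('messages', ['part_id'])) {
      if (typeof row?.part_id === 'string') pending.push(row.part_id);
    }
  } catch {
    // Any read error leaves the seen-set untouched; a dedupe miss is
    // recoverable, dropping rows is not.
    return seen;
  }
  for (const id of pending) seen.add(id);
  return seen;
}

function createBackfillDedupe(storage, committedPartIds) {
  const seen = new Set(committedPartIds); // stands in for the committed scan
  scanSpooledPartIds(storage, seen);      // spool scan, backfill only
  return (rows) => rows.filter((row) => !seen.has(row.part_id));
}

const storage = { readSpooledRows: () => [{ part_id: 'm2:0' }] };
const dedupe = createBackfillDedupe(storage, ['m1:0']);
const kept = dedupe([{ part_id: 'm1:0' }, { part_id: 'm2:0' }, { part_id: 'm3:0' }]);
console.log(kept.map((row) => row.part_id)); // ['m3:0']
```

Buffering ids in `pending` before touching `seen` is one way to keep the set unchanged when a read fails partway through; the actual implementation may handle partial reads differently.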

Settle-path hazard avoided

scanExistingPartIds is also used by dedupeByPartId inside createSettleBatch, which runs during flush. The rows being settled at flush are the spool rows. If the spool scan were wired into the settle path, every row would match itself in the seen-set and be dropped — the flush would delete the very data it is committing. The spool scan is therefore strictly opt-in (backfill only), and both scanSpooledPartIds and dedupeByPartId carry comments stating this constraint.
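To make the hazard concrete, here is a small sketch with a simplified `dedupeByPartId`. The function body and row shape are assumptions for illustration; only the self-match effect is the point.

```javascript
// Simplified stand-in for flush-time dedupe: drop rows whose part_id is seen.
function dedupeByPartId(rows, seen) {
  const out = [];
  for (const row of rows) {
    if (seen.has(row.part_id)) continue;
    seen.add(row.part_id);
    out.push(row);
  }
  return out;
}

const spoolRows = [{ part_id: 'a' }, { part_id: 'b' }]; // rows being settled
const committedIds = ['a'];                             // already committed

// Committed-only seen-set: the genuinely new row survives.
const safe = dedupeByPartId(spoolRows, new Set(committedIds));

// Seen-set also seeded from the spool: every row matches itself.
const spoolIds = spoolRows.map((row) => row.part_id);
const hazardous = dedupeByPartId(spoolRows, new Set([...committedIds, ...spoolIds]));

console.log(safe.length, hazardous.length); // 1 0
```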

LLP

llp/0027-cache-settlement.decision.md — the "backfill-vs-spool same-id duplicates" open question is marked resolved (scan spooled rows in the materializer). @ref LLP 0027#open-questions annotations added above the spool-scan constructs in dataset.js and storage.js.

Tests

Materializer (test/plugins/ai-gateway-backfill-materializer.test.js): spooled rows are not re-materialized; only spool-overlapping parts are skipped; legacy spooled rows match via message_id+part_index; committed ∪ spool unioned into one seen-set; unreadable spool degrades to committed-only dedup; a stub without readSpooledRows still dedupes against committed. Existing no-spool tests stay green.
Storage (test/core/cache-storage.test.js): readSpooledRows yields unflushed rows then is empty after flush; projects to requested columns and filters by dataset; empty stream for an unknown dataset.

npm test (1155 pass / 0 fail / 1 skipped), npm run lint, and npm run typecheck all pass.

Note: the backfill_claude_fixture hermetic smoke is red on master independently of this change (it produces zero backfilled rows on a pristine checkout) and is unaffected here — this dedup change can only ever skip rows that already exist.

Fixes #107

🤖 Generated with Claude Code

`hyp backfill` deduped only against committed Iceberg `part_id`s (`scanExistingPartIds`) and was blind to the spool. Rows captured live but still sitting unflushed in the spool (debounce window, captured not yet committed) were invisible, so backfill materialized its own copy and the spool later flushed its copy, leaving two rows with the same `part_id`.

Fix: fold spooled `part_id`s into the backfill materializer's pre-write seen-set.

- spool.js: read-only `readSpooledRows(tablePath)` that parses the pending `active.jsonl` + `flush-*.jsonl` envelopes and yields rows; it never rotates files or advances flush progress, so it is safe beside live capture. `discoverSpoolTables` is now exported.
- storage.js: `readSpooledRows(dataset, columns)` on the extended storage service: discovers a dataset's spool tables, strips internal fields, projects to the requested columns (parity with `readRows`).
- dataset.js: `scanSpooledPartIds` folds spool `part_id`s into the seen-set, invoked ONLY from `createBackfillDedupe` (once per run, alongside the committed scan). Best-effort: a missing spool surface or read error leaves the seen-set untouched rather than dropping rows.

Settle-path hazard avoided: `dedupeByPartId` (flush-time settlement) is deliberately left scanning committed partitions only. The rows it settles at flush ARE the spool rows; seeding its seen-set with spool `part_id`s would make every row match itself and be dropped, so the flush would delete the data it is committing. Both functions carry comments stating this.

LLP 0027: marks the "backfill-vs-spool same-id duplicates" open question resolved (scan spooled rows in the materializer). `@ref LLP 0027` added above the spool-scan constructs.

Tests: materializer tests for spool-overlap dedupe (full skip, partial skip, legacy message_id+part_index recompose, committed∪spool union, unreadable-spool degrade-to-committed, missing-surface skip) and storage tests for `readSpooledRows` (pending-then-empty-after-flush, column projection + dataset filtering, unknown dataset). Existing no-spool tests stay green. Full `npm test`, lint, and typecheck pass.

Fixes #107

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe · 2026-06-16T18:27:56Z

Dual-agent review — `request_changes`

Verdict: request_changes
Risk class: low
Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Codex	major/high: `readSpooledRows` envelope validation looser than `streamFlushFile` (spool.js:235, streaming-reader.js:106)	Yes — Targets (`readSpooledRows`), Risks (envelope-validation asymmetry)
Claude	No issues found	(n/a)

Codex review

Fix Validations

Backfill dedupes against unflushed spool rows

Status: correct
Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:449, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:554, src/core/cache/storage.js:300, test/plugins/ai-gateway-backfill-materializer.test.js:317
Assessment: Valid spooled rows now seed the backfill seen-set before materialized rows are returned, and tests cover full overlap, partial overlap, legacy message_id + part_index, read failure, and missing spool surface.

Settle-path self-drop hazard avoided

Status: correct
Evidence: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:322, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:325, hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:454
Assessment: Flush-time dedupeByPartId still scans only committed partitions; scanSpooledPartIds is only invoked from backfill dedupe.

Findings

1) Behavioral Correctness

Severity: major
Confidence: high
Evidence: src/core/cache/spool.js:235, src/core/cache/spool.js:236, src/core/cache/streaming-reader.js:106, src/core/cache/streaming-reader.js:112
Why it matters: readSpooledRows can dedupe backfill against a parseable-but-unflushable spool line because it accepts {version:1, rows:[...]} without validating columns, while the actual flush reader rejects that same envelope as malformed and drops it.
Suggested fix: Make readSpooledRows use the same envelope validity contract as streamFlushFile: require Array.isArray(envelope.columns) before yielding rows, and add a storage/materializer test with a malformed parseable spool line proving backfill does not skip rows that flush cannot commit.
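A sketch of what the suggested envelope-validity contract might look like when reading spool text. This is not the project's reader: it works on a JSONL string rather than files, assumes envelope version 1 (as in the example in the finding), and assumes rows are objects carrying `part_id`.

```javascript
// Sketch only: accept an envelope only if flush would also accept it.
function isFlushableEnvelope(envelope) {
  return (
    envelope !== null &&
    typeof envelope === 'object' &&
    envelope.version === 1 &&            // assumed version, per the example
    Array.isArray(envelope.columns) &&   // the check the finding asks for
    Array.isArray(envelope.rows)
  );
}

function* readSpooledRowsFromText(jsonlText) {
  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    let envelope;
    try {
      envelope = JSON.parse(line);
    } catch {
      continue; // unparseable line: skip it
    }
    if (!isFlushableEnvelope(envelope)) continue; // flush would drop it too
    yield* envelope.rows;
  }
}

const spoolText = [
  JSON.stringify({ version: 1, rows: [{ part_id: 'bad:0' }] }), // no columns
  JSON.stringify({ version: 1, columns: ['part_id'], rows: [{ part_id: 'ok:0' }] }),
  'not json',
].join('\n');

const spooledIds = [...readSpooledRowsFromText(spoolText)].map((row) => row.part_id);
console.log(spooledIds); // ['ok:0']
```

With this contract, a row from the malformed line never enters the seen-set, so backfill still materializes it.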

No Finding

Contract & Interface Fidelity
Change Impact / Blast Radius
Concurrency, Ordering & State Safety
Error Handling & Resilience
Security Surface
Resource Lifecycle & Cleanup
Release Safety
Test Evidence Quality
Architectural Consistency
Debuggability & Operability

Evidence Bundle

Changed hot paths: createBackfillDedupe -> scanExistingPartIds + scanSpooledPartIds; createQueryStorageService.readSpooledRows; createCacheSpool.readSpooledRows; discoverSpoolTables.
Impacted callers: hypaware-core/plugins-workspace/ai-gateway/src/dataset.js:454, src/core/cache/storage.js:311, src/core/cache/spool.js:146.
Impacted tests: test/plugins/ai-gateway-backfill-materializer.test.js:317, test/plugins/ai-gateway-backfill-materializer.test.js:328, test/plugins/ai-gateway-backfill-materializer.test.js:339, test/plugins/ai-gateway-backfill-materializer.test.js:349, test/core/cache-storage.test.js:283, test/core/cache-storage.test.js:310.
Unresolved uncertainty: I did not run the test suite; I reviewed the supplied diff plus targeted caller/contract context.

Claude review

No issues found.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr-114/dual-review/pr-114

…g flush

streamFlushFile treats a parseable envelope without an Array `columns` field as malformed and drops it, so its rows never reach a committed partition. The backfill spool reader validated only `version` and `rows`, so it would surface those rows and let backfill dedupe against them, refusing to materialize rows flush will never commit. Align readSpooledRows' envelope-validity contract with streamFlushFile (require Array.isArray(columns)) and add a regression test feeding a malformed parseable spool line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe and others added 2 commits June 16, 2026 12:00

Merge remote-tracking branch 'origin/master' into merge-114

0367d7b

# Conflicts: # llp/0027-cache-settlement.decision.md

philcunliffe merged commit e3de836 into master Jun 16, 2026
6 checks passed

philcunliffe deleted the fix/issue-107-backfill-dedup-spool branch June 16, 2026 20:11

philcunliffe merged 3 commits into master from fix/issue-107-backfill-dedup-spool
