feat(reconciler): automatically retry failed ingestion logs#541
Open
MicahParks wants to merge 2 commits into
Open
feat(reconciler): automatically retry failed ingestion logs#541MicahParks wants to merge 2 commits into
MicahParks wants to merge 2 commits into
Conversation
Vulnerability Scan: Passed — diode-ingesterImage: No vulnerabilities found. Commit: d76065a |
Vulnerability Scan: Passed — diode-authImage:
Commit: d76065a |
Vulnerability Scan: Passed — diode-reconcilerImage: No vulnerabilities found. Commit: d76065a |
|
Go test coverage
Total coverage: 54.2% |
ae4770d to
fe0c634
Compare
When bulk-plan-apply failed, the ingestion log was left in FAILED and was terminal: nothing re-claimed it, and re-ingesting the same entity deduped against it, so the data was never applied and no error surfaced. Auto-apply tenants had no recovery path. With ENABLE_FAILED_RETRY on, a failed apply is parked in PENDING_RETRY with a jittered exponential backoff and re-claimed by the AutoApplyProcessor — in the same pipeline as fresh QUEUED work, FIFO by id so retries are processed in line rather than starved or contending for NetBox throughput — until it applies or exhausts its budget and is retired to terminal ERRORED. Off by default; when off, failures stay terminal FAILED exactly as before. - migration: add retry_count, next_retry_at + partial index on PENDING_RETRY - proto: add PENDING_RETRY state - ClaimQueuedForAutoApply claims QUEUED + due PENDING_RETRY in one FIFO batch - MarkIngestionLogRetry: increment, jittered exponential backoff, or retire to ERRORED - dedup excludes terminal ERRORED so a re-ingest after give-up re-queues - retry gated by RetryPolicy.Enabled, wired from ENABLE_FAILED_RETRY Pro inherits the new repository method via struct embedding, so it compiles on pin-bump with no changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fe0c634 to
01881cf
Compare
jajeffries
approved these changes
Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When
bulk-plan-applyfails during a reconciler run, the ingestion log is left in FAILED — which today is terminal:ClaimQueuedForAutoApplyclaims onlyQUEUED;ResetApplyingIngestionLogsonly recoversAPPLYING).FindPriorIngestionLogsByEntityHashesmatches onentity_hash+ branch with no state filter), so it's silently swallowed — data never applied, no error to the caller.Net effect for auto-apply tenants: a transient apply failure (NetBox redeploy, lock contention, timeout) permanently strands the data.
Fix — one pipeline, a new transient state
A failed apply (when retry is enabled) is parked in a new
PENDING_RETRYstate with a jittered exponential backoff, and the existingAutoApplyProcessorre-claims it once the backoff elapses — in the same pipeline as freshQUEUEDwork, so retries never spin up a second component contending with fresh ingest for NetBox throughput. It applies on a later attempt, or exhausts its budget and is retired to terminalERRORED.State semantics:
FAILED— terminal, exactly as todayPENDING_RETRY— re-claimed after backoffERRORED— terminalKey design points
ClaimQueuedForAutoApplynow selectsQUEUED OR (PENDING_RETRY AND backoff-elapsed)ordered byid— not fresh-first. A backoff-elapsed retry takes its place in line and is processed in turn, so retries are never starved when the queue never empties. Exponential backoff keeps the due retry set a trickle, so interleaving costs fresh throughput negligibly.next_retry_at = now + base·2^n, capped, ×random[0.5,1). The jitter de-synchronises herds — a mass failure backs many rows off to the same instant otherwise — and smooths the NetBox load when they retry.ERROREDso a manual re-ingest after the system gives up re-queues instead of deduping.PENDING_RETRY/FAILEDstill dedupe (the pipeline owns their recovery).ENABLE_FAILED_RETRY=false). When off, failures stay terminalFAILEDand noPENDING_RETRYrows exist, so the broadened claim is inert — behaviour is identical to today.Config
ENABLE_FAILED_RETRY=false,FAILED_RETRY_MAX_RETRIES=5,FAILED_RETRY_BASE_BACKOFF_SECONDS=30,FAILED_RETRY_MAX_BACKOFF_SECONDS=3600.Safety / blast radius
ingestion_logs.AUTO_APPLY_CHANGESETS=false) is untouched — those failures still surface as change sets for manual review.1→8and9→8claims are safe against the state-counts trigger (additions handled generically).ENABLE_FAILED_RETRY+ bumps the pin).Testing
go build/vet/golangci-lintclean; full./reconciler/...+./dbstore/...suites pass.MarkIngestionLogRetrywhen enabled and to terminalFAILEDwhen disabled; config→policy mapping incl. the enable gate.Ops/Repository, controllable fake plugin), run 5× per scenario:PENDING_RETRYrow is claimed alongside fresh despite 20 queued / batch 5 — confirms FIFO does not starve retries (5/5).Known limitation (rare)
If a single un-processable entity makes the plugin return a whole-batch 4xx/5xx (rather than the per-entity 207 it's designed to), that batch's rows are failed together and re-tried together. The plugin isolates essentially all data-level failures to 207, so this is a low-probability, bug-class event; bounded by backoff +
MAX_RETRIES. A size-1/bisection fallback on whole-batch retry errors is a sensible evidence-gated follow-up if the ERRORED metric ever shows it.🤖 Generated with Claude Code