feat(reconciler): automatically retry failed ingestion logs #541

Open
MicahParks wants to merge 2 commits into develop from fix/obs-3116-failed-retrier

MicahParks commented May 28, 2026
Contributor

Problem

When bulk-plan-apply fails during a reconciler run, the ingestion log is left in FAILED — which today is terminal:

  • Nothing re-claims FAILED rows (ClaimQueuedForAutoApply claims only QUEUED; ResetApplyingIngestionLogs only recovers APPLYING).
  • Re-ingesting the same entity dedupes against the FAILED row (FindPriorIngestionLogsByEntityHashes matches on entity_hash + branch with no state filter), so it's silently swallowed — data never applied, no error to the caller.
  • The reconciler API is read-only, so there's no manual reprocess path either.

Net effect for auto-apply tenants: a transient apply failure (NetBox redeploy, lock contention, timeout) permanently strands the data.

Fix — one pipeline, a new transient state

A failed apply (when retry is enabled) is parked in a new PENDING_RETRY state with a jittered exponential backoff, and the existing AutoApplyProcessor re-claims it once the backoff elapses. This happens in the same pipeline as fresh QUEUED work, so retries never spin up a second component contending with fresh ingest for NetBox throughput. The row either applies on a later attempt or exhausts its retry budget and is retired to terminal ERRORED.

State semantics:

| Situation | State |
| --- | --- |
| fail, retry off (default) | FAILED (terminal, exactly as today) |
| fail, retry on, budget remains | PENDING_RETRY (re-claimed after backoff) |
| fail, retry on, budget exhausted | ERRORED (terminal) |
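
The state semantics above can be sketched as a small routing function. This is only an illustration: the `State` type, the `stateAfterFailure` name, and the exact budget comparison (`retryCount >= maxRetries`) are assumptions, not the reconciler's actual code.

```go
package main

import "fmt"

// State names mirror those in the description; the Go type and the
// function below are hypothetical, not the reconciler's real API.
type State string

const (
	StateFailed       State = "FAILED"
	StatePendingRetry State = "PENDING_RETRY"
	StateErrored      State = "ERRORED"
)

// stateAfterFailure returns the state an ingestion log moves to after a
// failed apply. retryCount is the number of retries already consumed;
// treating retryCount >= maxRetries as "budget exhausted" is an assumption.
func stateAfterFailure(retryEnabled bool, retryCount, maxRetries int) State {
	if !retryEnabled {
		return StateFailed // retry off: terminal, as before
	}
	if retryCount >= maxRetries {
		return StateErrored // budget exhausted: terminal
	}
	return StatePendingRetry // parked until the backoff elapses
}

func main() {
	fmt.Println(stateAfterFailure(false, 0, 5)) // FAILED
	fmt.Println(stateAfterFailure(true, 2, 5))  // PENDING_RETRY
	fmt.Println(stateAfterFailure(true, 5, 5))  // ERRORED
}
```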

Key design points

  • One claim, FIFO by id. ClaimQueuedForAutoApply now selects QUEUED OR (PENDING_RETRY AND backoff-elapsed) ordered by id, not fresh-first. A backoff-elapsed retry takes its place in line and is processed in turn, so retries are never starved when the queue never empties. Exponential backoff keeps the set of due retries to a trickle, so interleaving them has a negligible cost to fresh throughput.
  • Jittered backoff. next_retry_at = now + min(base·2^n, max) × random[0.5, 1). The jitter de-synchronises herds (without it, a mass failure would back many rows off to the same instant) and smooths the NetBox load when they retry.
  • Dedup excludes terminal ERRORED so a manual re-ingest after the system gives up re-queues instead of deduping. PENDING_RETRY/FAILED still dedupe (the pipeline owns their recovery).
  • Off by default (ENABLE_FAILED_RETRY=false). When off, failures stay terminal FAILED and no PENDING_RETRY rows exist, so the broadened claim is inert — behaviour is identical to today.

Config

ENABLE_FAILED_RETRY=false, FAILED_RETRY_MAX_RETRIES=5, FAILED_RETRY_BASE_BACKOFF_SECONDS=30, FAILED_RETRY_MAX_BACKOFF_SECONDS=3600.

Safety / blast radius

  • Off by default; with it off, the only observable change is two inert columns on ingestion_logs.
  • Plan-only mode (AUTO_APPLY_CHANGESETS=false) is untouched — those failures still surface as change sets for manual review.
  • Concurrent 1→8 and 9→8 claims are safe against the state-counts trigger (additions handled generically).
  • diode-pro inherits the new repository method via struct embedding, so it compiles on pin-bump with no changes (retry stays off there until it wires ENABLE_FAILED_RETRY + bumps the pin).

Testing

  • go build/vet/golangci-lint clean; full ./reconciler/... + ./dbstore/... suites pass.
  • Unit tests: failure routes to MarkIngestionLogRetry when enabled and to terminal FAILED when disabled; config→policy mapping incl. the enable gate.
  • Local integration harness against a real Postgres (each branch's real migrations + the real Ops/Repository, controllable fake plugin), run 5× per scenario:
    • reproduce: develop 5/5 stranded in FAILED; this branch 5/5 recovers to APPLIED.
    • never-empty queue: a due PENDING_RETRY row is claimed alongside fresh despite 20 queued / batch 5 — confirms FIFO does not starve retries (5/5).

Known limitation (rare)

If a single un-processable entity makes the plugin return a whole-batch 4xx/5xx (rather than the per-entity 207 it's designed to), that batch's rows are failed together and re-tried together. The plugin isolates essentially all data-level failures to 207, so this is a low-probability, bug-class event; bounded by backoff + MAX_RETRIES. A size-1/bisection fallback on whole-batch retry errors is a sensible evidence-gated follow-up if the ERRORED metric ever shows it.

🤖 Generated with Claude Code

github-actions Bot commented May 28, 2026

Vulnerability Scan: Passed — diode-ingester

Image: diode-ingester:scan

No vulnerabilities found.

Commit: d76065a

github-actions Bot commented May 28, 2026

Vulnerability Scan: Passed — diode-auth

Image: diode-auth:scan

| Source | Library | CVE | Severity | Installed | Fixed | Title |
| --- | --- | --- | --- | --- | --- | --- |
| usr/bin/hydra | github.com/docker/docker | CVE-2026-34040 | 🟠 HIGH | v28.3.3+incompatible | 29.3.1 | Moby: Moby: Authorization bypass vulnerability |
| usr/bin/hydra | github.com/docker/docker | CVE-2026-33997 | 🟡 MEDIUM | v28.3.3+incompatible | 29.3.1 | moby: docker: github.com/moby/moby: Moby: Privilege validation bypass during plu |
| usr/bin/hydra | github.com/go-jose/go-jose/v3 | CVE-2026-34986 | 🟠 HIGH | v3.0.4 | 3.0.5 | github.com/go-jose/go-jose/v3: github.com/go-jose/go-jose/v4: Go JOSE: Denial of |
| usr/bin/hydra | github.com/jackc/pgx/v5 | CVE-2026-33816 | 🔴 CRITICAL | v5.7.5 | 5.9.0 | github.com/jackc/pgx/v5: github.com/jackc/pgx: Memory-safety vulnerability |
| usr/bin/hydra | github.com/jackc/pgx/v5 | CVE-2026-41889 | ⚪ LOW | v5.7.5 | 5.9.2 | github.com/jackc/pgx: golang: pgx: SQL injection via specific SQL query conditio |
| usr/bin/hydra | go.opentelemetry.io/otel | CVE-2026-29181 | 🟠 HIGH | v1.40.0 | 1.41.0 | CVE-2026-29181 affecting package azurelinux-image-tools for versions less than 1 |
| usr/bin/hydra | go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp | CVE-2026-39882 | 🟡 MEDIUM | v1.37.0 | 1.43.0 | OpenTelemetry-Go is the Go implementation of OpenTelemetry. Prior to 1 ... |
| usr/bin/hydra | go.opentelemetry.io/otel/sdk | CVE-2026-39883 | 🟠 HIGH | v1.40.0 | 1.43.0 | opentelemetry-go: BSD kenv command not using absolute path enables PATH hijackin |
| usr/bin/hydra | stdlib | CVE-2026-25679 | 🟠 HIGH | v1.26.0 | 1.25.8, 1.26.1 | net/url: Incorrect parsing of IPv6 host literals in net/url |
| usr/bin/hydra | stdlib | CVE-2026-27137 | 🟠 HIGH | v1.26.0 | 1.26.1 | crypto/x509: Incorrect enforcement of email constraints in crypto/x509 |
| usr/bin/hydra | stdlib | CVE-2026-32280 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/x509: crypto/tls: golang: Go: Denial of Service vulnerability in certific |
| usr/bin/hydra | stdlib | CVE-2026-32281 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/x509: golang: Go crypto/x509: Denial of Service via inefficient certifica |
| usr/bin/hydra | stdlib | CVE-2026-32283 | 🟠 HIGH | v1.26.0 | 1.25.9, 1.26.2 | crypto/tls: golang: Go crypto/tls: Denial of Service via multiple TLS 1.3 key up |
| usr/bin/hydra | stdlib | CVE-2026-33810 | 🟠 HIGH | v1.26.0 | 1.26.2 | crypto/x509: golang: Go crypto/x509: Certificate validation bypass due to incorr |
| usr/bin/hydra | stdlib | CVE-2026-33811 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | When using LookupCNAME with the cgo DNS resolver, a very long CNAME re ... |
| usr/bin/hydra | stdlib | CVE-2026-33814 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | When processing HTTP/2 SETTINGS frames, transport will enter an infini ... |
| usr/bin/hydra | stdlib | CVE-2026-39820 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Well-crafted inputs reaching ParseAddress, ParseAddressList, and Parse ... |
| usr/bin/hydra | stdlib | CVE-2026-39823 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | CVE-2026-27142 fixed a vulnerability in which URLs were not correctly ... |
| usr/bin/hydra | stdlib | CVE-2026-39825 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | ReverseProxy can forward queries containing parameters not visible to ... |
| usr/bin/hydra | stdlib | CVE-2026-39826 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | If a trusted template author were to write a <script> tag containing a ... |
| usr/bin/hydra | stdlib | CVE-2026-39836 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Panic in Dial and LookupPort when handling NUL byte on Windows in net |
| usr/bin/hydra | stdlib | CVE-2026-42499 | 🟠 HIGH | v1.26.0 | 1.25.10, 1.26.3 | Pathological inputs could cause DoS through consumePhrase when parsing ... |
| usr/bin/hydra | stdlib | CVE-2026-27142 | 🟡 MEDIUM | v1.26.0 | 1.25.8, 1.26.1 | html/template: URLs in meta content attribute actions are not escaped in html/te |
| usr/bin/hydra | stdlib | CVE-2026-32282 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | golang: internal/syscall/unix: Root.Chmod can follow symlinks out of the root |
| usr/bin/hydra | stdlib | CVE-2026-32288 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | archive/tar: golang: Go's archive/tar package: Denial of Service via maliciously |
| usr/bin/hydra | stdlib | CVE-2026-32289 | 🟡 MEDIUM | v1.26.0 | 1.25.9, 1.26.2 | html/template: golang: html/template: Cross-Site Scripting (XSS) via improper co |
| usr/bin/hydra | stdlib | CVE-2026-27138 | ⚪ LOW | v1.26.0 | 1.26.1 | crypto/x509: Panic in name constraint checking for malformed certificates in cry |
| usr/bin/hydra | stdlib | CVE-2026-27139 | ⚪ LOW | v1.26.0 | 1.25.8, 1.26.1 | os: FileInfo can escape from a Root in golang os module |

Commit: d76065a

github-actions Bot commented May 28, 2026

Vulnerability Scan: Passed — diode-reconciler

Image: diode-reconciler:scan

No vulnerabilities found.

Commit: d76065a

github-actions Bot commented May 28, 2026

Go test coverage

| STATUS | ELAPSED | PACKAGE | COVER | PASS | FAIL | SKIP |
| --- | --- | --- | --- | --- | --- | --- |
| 🟢 PASS | 1.56s | github.com/netboxlabs/diode/diode-server/auth | 44.7% | 42 | 0 | 0 |
| 🟢 PASS | 1.05s | github.com/netboxlabs/diode/diode-server/auth/cli | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.03s | github.com/netboxlabs/diode/diode-server/authutil | 82.8% | 5 | 0 | 0 |
| 🟢 PASS | 0.17s | github.com/netboxlabs/diode/diode-server/dbstore/postgres | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.10s | github.com/netboxlabs/diode/diode-server/entityhash | 79.2% | 13 | 0 | 0 |
| 🟢 PASS | 1.10s | github.com/netboxlabs/diode/diode-server/entitymatcher | 82.8% | 97 | 0 | 0 |
| 🟢 PASS | 0.09s | github.com/netboxlabs/diode/diode-server/errors | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.16s | github.com/netboxlabs/diode/diode-server/graph | 52.0% | 81 | 0 | 0 |
| 🟢 PASS | 1.03s | github.com/netboxlabs/diode/diode-server/grpckeepalive | 100.0% | 1 | 0 | 0 |
| 🟢 PASS | 1.42s | github.com/netboxlabs/diode/diode-server/ingester | 85.4% | 66 | 0 | 0 |
| 🟢 PASS | 1.11s | github.com/netboxlabs/diode/diode-server/matching | 94.1% | 66 | 0 | 0 |
| 🟢 PASS | 1.06s | github.com/netboxlabs/diode/diode-server/migrator | 70.4% | 4 | 0 | 0 |
| 🟢 PASS | 3.13s | github.com/netboxlabs/diode/diode-server/netboxdiodeplugin | 45.4% | 23 | 0 | 0 |
| 🟢 PASS | 0.16s | github.com/netboxlabs/diode/diode-server/pprof | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 5.09s | github.com/netboxlabs/diode/diode-server/reconciler | 70.2% | 96 | 0 | 0 |
| 🟢 PASS | 0.11s | github.com/netboxlabs/diode/diode-server/reconciler/changeset | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.06s | github.com/netboxlabs/diode/diode-server/reconciler/differ | 49.3% | 23 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/server | 85.7% | 14 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/diode/diode-server/strcase | 100.0% | 24 | 0 | 0 |
| 🟢 PASS | 1.02s | github.com/netboxlabs/diode/diode-server/telemetry | 28.0% | 26 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/diode/diode-server/telemetry/otel | 90.2% | 25 | 0 | 0 |
| 🟢 PASS | 0.09s | github.com/netboxlabs/diode/diode-server/tls | 0.0% | 0 | 0 | 0 |
| 🟢 PASS | 1.01s | github.com/netboxlabs/diode/diode-server/version | 100.0% | 2 | 0 | 0 |

Total coverage: 54.2%

MicahParks force-pushed the fix/obs-3116-failed-retrier branch from ae4770d to fe0c634 May 28, 2026 22:19
github-actions Bot added the documentation, markdown, and diode-proto labels May 28, 2026
MicahParks changed the title from "feat(reconciler): retry FAILED ingestion logs instead of stranding them (OBS-3116)" to "feat(reconciler): automatically retry failed ingestion logs" May 28, 2026

When bulk-plan-apply failed, the ingestion log was left in FAILED and was
terminal: nothing re-claimed it, and re-ingesting the same entity deduped
against it, so the data was never applied and no error surfaced. Auto-apply
tenants had no recovery path.

With ENABLE_FAILED_RETRY on, a failed apply is parked in PENDING_RETRY with a
jittered exponential backoff and re-claimed by the AutoApplyProcessor — in the
same pipeline as fresh QUEUED work, FIFO by id so retries are processed in line
rather than starved or contending for NetBox throughput — until it applies or
exhausts its budget and is retired to terminal ERRORED. Off by default; when
off, failures stay terminal FAILED exactly as before.

- migration: add retry_count, next_retry_at + partial index on PENDING_RETRY
- proto: add PENDING_RETRY state
- ClaimQueuedForAutoApply claims QUEUED + due PENDING_RETRY in one FIFO batch
- MarkIngestionLogRetry: increment, jittered exponential backoff, or retire to ERRORED
- dedup excludes terminal ERRORED so a re-ingest after give-up re-queues
- retry gated by RetryPolicy.Enabled, wired from ENABLE_FAILED_RETRY

Pro inherits the new repository method via struct embedding, so it compiles on
pin-bump with no changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MicahParks force-pushed the fix/obs-3116-failed-retrier branch from fe0c634 to 01881cf May 28, 2026 22:25
2 participants