feat(reconciler): automatically retry failed ingestion logs by MicahParks · Pull Request #541 · netboxlabs/diode

MicahParks · 2026-05-28T18:49:35Z

Problem

When bulk-plan-apply fails during a reconciler run, the ingestion log is left in FAILED — which today is terminal:

Nothing re-claims FAILED rows (ClaimQueuedForAutoApply claims only QUEUED; ResetApplyingIngestionLogs only recovers APPLYING).
Re-ingesting the same entity dedupes against the FAILED row (FindPriorIngestionLogsByEntityHashes matches on entity_hash + branch with no state filter), so it's silently swallowed — data never applied, no error to the caller.
The reconciler API is read-only, so there's no manual reprocess path either.

Net effect for auto-apply tenants: a transient apply failure (NetBox redeploy, lock contention, timeout) permanently strands the data.

Fix — one pipeline, a new transient state

A failed apply (when retry is enabled) is parked in a new PENDING_RETRY state with a jittered exponential backoff, and the existing AutoApplyProcessor re-claims it once the backoff elapses — in the same pipeline as fresh QUEUED work, so retries never spin up a second component contending with fresh ingest for NetBox throughput. It applies on a later attempt, or exhausts its budget and is retired to terminal ERRORED.

State semantics:

Situation	State
fail, retry off (default)	`FAILED` — terminal, exactly as today
fail, retry on, budget remains	`PENDING_RETRY` — re-claimed after backoff
fail, retry on, budget exhausted	`ERRORED` — terminal
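
The table above can be read as a small decision function. The sketch below is illustrative only: the `State` type, the function name `nextStateOnFailure`, and the convention that `retryCount` counts retries already consumed are assumptions, not taken from the PR's code.

```go
package main

import "fmt"

// State mirrors the state names in the PR description; the Go type is hypothetical.
type State string

const (
	StateFailed       State = "FAILED"
	StatePendingRetry State = "PENDING_RETRY"
	StateErrored      State = "ERRORED"
)

// nextStateOnFailure picks the state for a row whose apply just failed.
// Assumption: retryCount is the number of retries already consumed, and the
// budget is exhausted once it reaches maxRetries.
func nextStateOnFailure(retryEnabled bool, retryCount, maxRetries int) State {
	if !retryEnabled {
		return StateFailed // retry off: terminal, as before
	}
	if retryCount >= maxRetries {
		return StateErrored // budget exhausted: terminal
	}
	return StatePendingRetry // re-claimed after backoff
}

func main() {
	fmt.Println(nextStateOnFailure(false, 0, 5))
	fmt.Println(nextStateOnFailure(true, 2, 5))
	fmt.Println(nextStateOnFailure(true, 5, 5))
}
```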

Key design points

One claim, FIFO by id. ClaimQueuedForAutoApply now selects QUEUED OR (PENDING_RETRY AND backoff-elapsed) ordered by id — not fresh-first. A backoff-elapsed retry takes its place in line and is processed in turn, so retries are never starved when the queue never empties. Exponential backoff keeps the due retry set a trickle, so interleaving costs fresh throughput negligibly.
Jittered backoff. next_retry_at = now + min(base·2^n, cap) × random[0.5, 1). The jitter de-synchronises herds (without it, a mass failure would back many rows off to the same instant) and smooths the NetBox load when they retry.
Dedup excludes terminal ERRORED so a manual re-ingest after the system gives up re-queues instead of deduping. PENDING_RETRY/FAILED still dedupe (the pipeline owns their recovery).
Off by default (ENABLE_FAILED_RETRY=false). When off, failures stay terminal FAILED and no PENDING_RETRY rows exist, so the broadened claim is inert — behaviour is identical to today.

Config

ENABLE_FAILED_RETRY=false, FAILED_RETRY_MAX_RETRIES=5, FAILED_RETRY_BASE_BACKOFF_SECONDS=30, FAILED_RETRY_MAX_BACKOFF_SECONDS=3600.
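
A minimal sketch of how these variables might map onto a retry policy. Only `RetryPolicy.Enabled` and the environment variable names with their defaults come from the PR text; the other field names, the `loadRetryPolicy` and `envInt` helpers, and the fall-back-to-default behaviour on unparsable values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// RetryPolicy: the Enabled field is named in the PR; the other fields are hypothetical.
type RetryPolicy struct {
	Enabled     bool
	MaxRetries  int
	BaseBackoff time.Duration
	MaxBackoff  time.Duration
}

// envInt reads an integer variable, falling back to def when unset or unparsable.
func envInt(lookup func(string) (string, bool), key string, def int) int {
	if v, ok := lookup(key); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// loadRetryPolicy builds a policy from a lookup function such as os.LookupEnv.
func loadRetryPolicy(lookup func(string) (string, bool)) RetryPolicy {
	enabled := false // off by default
	if v, ok := lookup("ENABLE_FAILED_RETRY"); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			enabled = b
		}
	}
	return RetryPolicy{
		Enabled:     enabled,
		MaxRetries:  envInt(lookup, "FAILED_RETRY_MAX_RETRIES", 5),
		BaseBackoff: time.Duration(envInt(lookup, "FAILED_RETRY_BASE_BACKOFF_SECONDS", 30)) * time.Second,
		MaxBackoff:  time.Duration(envInt(lookup, "FAILED_RETRY_MAX_BACKOFF_SECONDS", 3600)) * time.Second,
	}
}

func main() {
	fmt.Printf("%+v\n", loadRetryPolicy(os.LookupEnv))
}
```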

Safety / blast radius

Off by default; with it off, the only observable change is two inert columns on ingestion_logs.
Plan-only mode (AUTO_APPLY_CHANGESETS=false) is untouched — those failures still surface as change sets for manual review.
Concurrent 1→8 and 9→8 claims are safe against the state-counts trigger (additions handled generically).
diode-pro inherits the new repository method via struct embedding, so it compiles on pin-bump with no changes (retry stays off there until it wires ENABLE_FAILED_RETRY + bumps the pin).

Testing

go build/vet/golangci-lint clean; full ./reconciler/... + ./dbstore/... suites pass.
Unit tests: failure routes to MarkIngestionLogRetry when enabled and to terminal FAILED when disabled; config→policy mapping incl. the enable gate.
Local integration harness against a real Postgres (each branch's real migrations + the real Ops/Repository, controllable fake plugin), run 5× per scenario:
- reproduce: develop 5/5 stranded in FAILED; this branch 5/5 recovers to APPLIED.
- never-empty queue: a due PENDING_RETRY row is claimed alongside fresh despite 20 queued / batch 5 — confirms FIFO does not starve retries (5/5).
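
The never-empty-queue scenario can be mimicked in memory. This is a sketch of the claim rule as described in the PR text (QUEUED, or PENDING_RETRY with elapsed backoff, ordered by id), not the real SQL behind ClaimQueuedForAutoApply; the `row` type, the helper names, the specific ids, and the `<=` comparison for "elapsed" are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// row is a hypothetical in-memory stand-in for an ingestion_logs record.
type row struct {
	ID          int64
	State       string
	NextRetryAt time.Time
}

// claimable reports whether a row is QUEUED, or PENDING_RETRY with its backoff elapsed.
func claimable(r row, now time.Time) bool {
	switch r.State {
	case "QUEUED":
		return true
	case "PENDING_RETRY":
		return !r.NextRetryAt.After(now)
	}
	return false
}

// claimBatch returns up to batch claimable IDs in ascending id order (FIFO by id).
func claimBatch(rows []row, now time.Time, batch int) []int64 {
	var ids []int64
	for _, r := range rows {
		if claimable(r, now) {
			ids = append(ids, r.ID)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	if len(ids) > batch {
		ids = ids[:batch]
	}
	return ids
}

// demo builds 20 QUEUED rows plus one due and one not-yet-due retry, then claims 5.
func demo() []int64 {
	now := time.Now()
	rows := []row{
		{ID: 3, State: "PENDING_RETRY", NextRetryAt: now.Add(-time.Minute)},
		{ID: 4, State: "PENDING_RETRY", NextRetryAt: now.Add(time.Hour)},
	}
	for id := int64(10); id < 30; id++ {
		rows = append(rows, row{ID: id, State: "QUEUED"})
	}
	return claimBatch(rows, now, 5)
}

func main() {
	fmt.Println(demo()) // prints [3 10 11 12 13]
}
```

The due retry (id 3) is claimed in the same batch as fresh work, while the retry still inside its backoff window (id 4) is skipped.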

Known limitation (rare)

If a single un-processable entity makes the plugin return a whole-batch 4xx/5xx (rather than the per-entity 207 it is designed to return), that batch's rows fail together and are retried together. The plugin isolates essentially all data-level failures to 207, so this is a low-probability, bug-class event, and it is bounded by the backoff and FAILED_RETRY_MAX_RETRIES. A size-1/bisection fallback on whole-batch retry errors is a sensible evidence-gated follow-up if the ERRORED metric ever shows it.

🤖 Generated with Claude Code

github-actions · 2026-05-28T18:51:06Z

Vulnerability Scan: Passed — diode-ingester

Image: diode-ingester:scan

No vulnerabilities found.

Commit: d76065a

github-actions · 2026-05-28T18:51:10Z

Vulnerability Scan: Passed — diode-auth

Image: diode-auth:scan

Source	Library	CVE	Severity	Installed	Fixed	Title
usr/bin/hydra	github.com/docker/docker	CVE-2026-34040	🟠 HIGH	v28.3.3+incompatible	29.3.1	Moby: Moby: Authorization bypass vulnerability
usr/bin/hydra	github.com/docker/docker	CVE-2026-33997	🟡 MEDIUM	v28.3.3+incompatible	29.3.1	moby: docker: github.com/moby/moby: Moby: Privilege validation bypass during plu
usr/bin/hydra	github.com/go-jose/go-jose/v3	CVE-2026-34986	🟠 HIGH	v3.0.4	3.0.5	github.com/go-jose/go-jose/v3: github.com/go-jose/go-jose/v4: Go JOSE: Denial of
usr/bin/hydra	github.com/jackc/pgx/v5	CVE-2026-33816	🔴 CRITICAL	v5.7.5	5.9.0	github.com/jackc/pgx/v5: github.com/jackc/pgx: Memory-safety vulnerability
usr/bin/hydra	github.com/jackc/pgx/v5	CVE-2026-41889	⚪ LOW	v5.7.5	5.9.2	github.com/jackc/pgx: golang: pgx: SQL injection via specific SQL query conditio
usr/bin/hydra	go.opentelemetry.io/otel	CVE-2026-29181	🟠 HIGH	v1.40.0	1.41.0	CVE-2026-29181 affecting package azurelinux-image-tools for versions less than 1
usr/bin/hydra	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp	CVE-2026-39882	🟡 MEDIUM	v1.37.0	1.43.0	OpenTelemetry-Go is the Go implementation of OpenTelemetry. Prior to 1 ...
usr/bin/hydra	go.opentelemetry.io/otel/sdk	CVE-2026-39883	🟠 HIGH	v1.40.0	1.43.0	opentelemetry-go: BSD kenv command not using absolute path enables PATH hijackin
usr/bin/hydra	stdlib	CVE-2026-25679	🟠 HIGH	v1.26.0	1.25.8, 1.26.1	net/url: Incorrect parsing of IPv6 host literals in net/url
usr/bin/hydra	stdlib	CVE-2026-27137	🟠 HIGH	v1.26.0	1.26.1	crypto/x509: Incorrect enforcement of email constraints in crypto/x509
usr/bin/hydra	stdlib	CVE-2026-32280	🟠 HIGH	v1.26.0	1.25.9, 1.26.2	crypto/x509: crypto/tls: golang: Go: Denial of Service vulnerability in certific
usr/bin/hydra	stdlib	CVE-2026-32281	🟠 HIGH	v1.26.0	1.25.9, 1.26.2	crypto/x509: golang: Go crypto/x509: Denial of Service via inefficient certifica
usr/bin/hydra	stdlib	CVE-2026-32283	🟠 HIGH	v1.26.0	1.25.9, 1.26.2	crypto/tls: golang: Go crypto/tls: Denial of Service via multiple TLS 1.3 key up
usr/bin/hydra	stdlib	CVE-2026-33810	🟠 HIGH	v1.26.0	1.26.2	crypto/x509: golang: Go crypto/x509: Certificate validation bypass due to incorr
usr/bin/hydra	stdlib	CVE-2026-33811	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	When using LookupCNAME with the cgo DNS resolver, a very long CNAME re ...
usr/bin/hydra	stdlib	CVE-2026-33814	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	When processing HTTP/2 SETTINGS frames, transport will enter an infini ...
usr/bin/hydra	stdlib	CVE-2026-39820	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	Well-crafted inputs reaching ParseAddress, ParseAddressList, and Parse ...
usr/bin/hydra	stdlib	CVE-2026-39823	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	CVE-2026-27142 fixed a vulnerability in which URLs were not correctly ...
usr/bin/hydra	stdlib	CVE-2026-39825	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	ReverseProxy can forward queries containing parameters not visible to ...
usr/bin/hydra	stdlib	CVE-2026-39826	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	If a trusted template author were to write a <script> tag containing a ...
usr/bin/hydra	stdlib	CVE-2026-39836	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	Panic in Dial and LookupPort when handling NUL byte on Windows in net
usr/bin/hydra	stdlib	CVE-2026-42499	🟠 HIGH	v1.26.0	1.25.10, 1.26.3	Pathological inputs could cause DoS through consumePhrase when parsing ...
usr/bin/hydra	stdlib	CVE-2026-27142	🟡 MEDIUM	v1.26.0	1.25.8, 1.26.1	html/template: URLs in meta content attribute actions are not escaped in html/te
usr/bin/hydra	stdlib	CVE-2026-32282	🟡 MEDIUM	v1.26.0	1.25.9, 1.26.2	golang: internal/syscall/unix: Root.Chmod can follow symlinks out of the root
usr/bin/hydra	stdlib	CVE-2026-32288	🟡 MEDIUM	v1.26.0	1.25.9, 1.26.2	archive/tar: golang: Go's archive/tar package: Denial of Service via maliciously
usr/bin/hydra	stdlib	CVE-2026-32289	🟡 MEDIUM	v1.26.0	1.25.9, 1.26.2	html/template: golang: html/template: Cross-Site Scripting (XSS) via improper co
usr/bin/hydra	stdlib	CVE-2026-27138	⚪ LOW	v1.26.0	1.26.1	crypto/x509: Panic in name constraint checking for malformed certificates in cry
usr/bin/hydra	stdlib	CVE-2026-27139	⚪ LOW	v1.26.0	1.25.8, 1.26.1	os: FileInfo can escape from a Root in golang os module

Commit: d76065a

github-actions · 2026-05-28T18:51:51Z

Vulnerability Scan: Passed — diode-reconciler

Image: diode-reconciler:scan

No vulnerabilities found.

Commit: d76065a

github-actions · 2026-05-28T18:52:15Z

Go test coverage

STATUS	ELAPSED	PACKAGE	COVER	PASS
🟢 PASS	1.56s	github.com/netboxlabs/diode/diode-server/auth	44.7%	42
🟢 PASS	1.05s	github.com/netboxlabs/diode/diode-server/auth/cli	0.0%	0
🟢 PASS	1.03s	github.com/netboxlabs/diode/diode-server/authutil	82.8%	5
🟢 PASS	0.17s	github.com/netboxlabs/diode/diode-server/dbstore/postgres	0.0%	0
🟢 PASS	1.10s	github.com/netboxlabs/diode/diode-server/entityhash	79.2%	13
🟢 PASS	1.10s	github.com/netboxlabs/diode/diode-server/entitymatcher	82.8%	97
🟢 PASS	0.09s	github.com/netboxlabs/diode/diode-server/errors	0.0%	0
🟢 PASS	1.16s	github.com/netboxlabs/diode/diode-server/graph	52.0%	81
🟢 PASS	1.03s	github.com/netboxlabs/diode/diode-server/grpckeepalive	100.0%	1
🟢 PASS	1.42s	github.com/netboxlabs/diode/diode-server/ingester	85.4%	66
🟢 PASS	1.11s	github.com/netboxlabs/diode/diode-server/matching	94.1%	66
🟢 PASS	1.06s	github.com/netboxlabs/diode/diode-server/migrator	70.4%	4
🟢 PASS	3.13s	github.com/netboxlabs/diode/diode-server/netboxdiodeplugin	45.4%	23
🟢 PASS	0.16s	github.com/netboxlabs/diode/diode-server/pprof	0.0%	0
🟢 PASS	5.09s	github.com/netboxlabs/diode/diode-server/reconciler	70.2%	96
🟢 PASS	0.11s	github.com/netboxlabs/diode/diode-server/reconciler/changeset	0.0%	0
🟢 PASS	1.06s	github.com/netboxlabs/diode/diode-server/reconciler/differ	49.3%	23
🟢 PASS	1.02s	github.com/netboxlabs/diode/diode-server/server	85.7%	14
🟢 PASS	1.01s	github.com/netboxlabs/diode/diode-server/strcase	100.0%	24
🟢 PASS	1.02s	github.com/netboxlabs/diode/diode-server/telemetry	28.0%	26
🟢 PASS	1.01s	github.com/netboxlabs/diode/diode-server/telemetry/otel	90.2%	25
🟢 PASS	0.09s	github.com/netboxlabs/diode/diode-server/tls	0.0%	0
🟢 PASS	1.01s	github.com/netboxlabs/diode/diode-server/version	100.0%	2

Total coverage: 54.2%

When bulk-plan-apply failed, the ingestion log was left in FAILED and was terminal: nothing re-claimed it, and re-ingesting the same entity deduped against it, so the data was never applied and no error surfaced. Auto-apply tenants had no recovery path.

With ENABLE_FAILED_RETRY on, a failed apply is parked in PENDING_RETRY with a jittered exponential backoff and re-claimed by the AutoApplyProcessor (in the same pipeline as fresh QUEUED work, FIFO by id so retries are processed in line rather than starved or contending for NetBox throughput) until it applies or exhausts its budget and is retired to terminal ERRORED. Off by default; when off, failures stay terminal FAILED exactly as before.

- migration: add retry_count, next_retry_at + partial index on PENDING_RETRY
- proto: add PENDING_RETRY state
- ClaimQueuedForAutoApply claims QUEUED + due PENDING_RETRY in one FIFO batch
- MarkIngestionLogRetry: increment, jittered exponential backoff, or retire to ERRORED
- dedup excludes terminal ERRORED so a re-ingest after give-up re-queues
- retry gated by RetryPolicy.Enabled, wired from ENABLE_FAILED_RETRY

Pro inherits the new repository method via struct embedding, so it compiles on pin-bump with no changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added go diode-server diode-reconciler labels May 28, 2026

MicahParks force-pushed the fix/obs-3116-failed-retrier branch from ae4770d to fe0c634 Compare May 28, 2026 22:19

github-actions Bot added documentation Improvements or additions to documentation markdown diode-proto labels May 28, 2026

MicahParks changed the title ~~feat(reconciler): retry FAILED ingestion logs instead of stranding them (OBS-3116)~~ feat(reconciler): automatically retry failed ingestion logs May 28, 2026

MicahParks force-pushed the fix/obs-3116-failed-retrier branch from fe0c634 to 01881cf Compare May 28, 2026 22:25

Remove comment

80dae05

MicahParks marked this pull request as ready for review May 28, 2026 22:28

MicahParks requested review from grant-nbl, jajeffries, leoparente, manrodrigues, marc-barry, mfiedorowicz and paulstuart as code owners May 28, 2026 22:28

jajeffries approved these changes Jun 1, 2026

MicahParks wants to merge 2 commits into develop from fix/obs-3116-failed-retrier

2 participants
