Recover orphaned running jobs at worker startup#2644

Draft
amitaibu wants to merge 3 commits into master from fix/stale-job-recovery-on-startup

Conversation

@amitaibu
Collaborator

Summary

When a worker process dies abnormally (RTS heap overflow panic, SIGKILL from the OOM killer, segfault, kernel panic) the Haskell exception machinery never runs, so rows it had picked up remain in job_status_running with the dead worker's UUID in locked_by. The existing periodic stale-job recovery loop only reclaims these rows after the configured staleJobTimeout (default 10 minutes) has elapsed, so a fast crash/restart loop in development can leave the queue blocked on every restart, and there's no signal to the developer about what happened.

This PR runs a stale-job sweep once at worker startup, before the dispatcher and PG listener are wired up, with environment-aware thresholds and a clear log line.
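A minimal sketch of what "environment-aware thresholds" means here, as pure selection logic. The names and the use of plain Int seconds (instead of IHP's actual NominalDiffTime) are illustrative, not the real IHP API:

```haskell
data Environment = Development | Production

-- Pick the staleness threshold for the boot sweep, or Nothing when the
-- user has disabled recovery by setting staleJobTimeout to Nothing.
bootSweepThreshold :: Environment -> Maybe Int -> Maybe Int
bootSweepThreshold _           Nothing        = Nothing      -- recovery disabled: skip silently
bootSweepThreshold Development (Just _)       = Just 0       -- single-worker dev server: sweep everything
bootSweepThreshold Production  (Just timeout) = Just timeout -- respect peers' in-flight jobs

main :: IO ()
main = do
    print (bootSweepThreshold Development (Just 600))  -- Just 0
    print (bootSweepThreshold Production  (Just 600))  -- Just 600
    print (bootSweepThreshold Production  Nothing)     -- Nothing
```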

Reproduction (the bug this fixes)

  1. Set GHCRTS=-M512M in any IHP app
  2. Write a job that allocates a few GB of intermediate Maps
  3. devenv up, observe ghci panic with "heap overflow"
  4. Inspect the job table: row stuck in job_status_running, no last_error, locked by the now-dead worker
  5. Restart devenv up and the row stays stuck for the full 10-minute staleJobTimeout window

Approach

  • New helper IHP.Job.Queue.recoverStaleJobsForTable :: Pool -> Text -> NominalDiffTime -> IO (Int, [UUID]). Same two-tier recovery as the existing recoverStaleJobs, but uses a CTE so RETURNING captures the pre-update locked_by values (a plain UPDATE ... RETURNING would observe the post-update NULL). Returns (count, previousWorkerUuids) so callers can log what happened. recoverStaleJobs is now a thin wrapper that discards the report.
  • New runBootStaleJobSweep in IHP.Job.Runner.WorkerLoop. Called once at the top of jobWorkerFetchAndRunLoop, before any STM/dispatcher/PG-listener setup, so no concurrent worker could have just legitimately locked a row.
  • Threshold:
    • Development (isDevelopment): 0 seconds — sweep everything. The dev server is single-worker, so any running row is from the previous now-dead process.
    • Production: the configured staleJobTimeout. Avoids stomping on a peer worker's in-flight job in multi-worker deployments.
  • Skipped silently when staleJobTimeout @job is Nothing (recovery disabled by the user).
  • Recovery does not bump attempts_count — the job never got to run cleanly, so the user's full attempt budget is preserved. Same as the existing periodic sweep.
  • Logs a single line Recovered N stale running job(s) at startup from previous worker(s): X, Y, Z so a developer iterating on a leaky job immediately sees what happened.
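The CTE shape described in the first bullet could look roughly like this (a hypothetical sketch: the real helper interpolates the table name and performs the same two-tier recovery as recoverStaleJobs, and the exact column set is inferred from the PR text):

```sql
-- Illustrative only: 'jobs' stands in for the interpolated table name,
-- '?' for the staleness threshold parameter.
WITH stale AS (
    SELECT id, locked_by
    FROM jobs
    WHERE status = 'job_status_running'
      AND locked_at < NOW() - (? * INTERVAL '1 second')
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status    = 'job_status_not_started',
    locked_by = NULL,
    locked_at = NULL
FROM stale
WHERE jobs.id = stale.id
-- 'stale.locked_by' still holds the pre-update value; a plain
-- 'UPDATE ... RETURNING locked_by' would return the freshly written NULL.
RETURNING jobs.id, stale.locked_by;
```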

Tests

ihp/Test/Test/JobQueueSpec.hs adds a new describe block, IHP.Job.Queue.recoverStaleJobsForTable, with three cases:

  1. Resets a stuck job_status_running row, returns its previous worker UUID, and a second sweep is a no-op.
  2. Returns previous worker UUIDs for multiple stale rows from different workers.
  3. Leaves rows alone when locked_at is younger than the threshold.

The existing createTestTable helper gained locked_at and last_error columns (additive, no impact on the existing trigger tests).

IHP.Job.Queue
  recreates missing triggers when poller repair is enabled [v]
  does not recreate missing triggers when poller repair is disabled [v]
  keeps polling after trigger repair errors and recovers once table exists again [v]
  recreates missing triggers for long table names when poller repair is enabled [v]
IHP.Job.Queue.recoverStaleJobsForTable
  resets a stuck running job and returns its previous worker uuid [v]
  returns previous worker uuids for multiple stale rows [v]
  leaves rows alone when their locked_at is younger than the threshold [v]

Finished in 7.85 seconds
7 examples, 0 failures

Future work (out of scope)

  • Postgres session-bound advisory locks tied to the worker connection. Postgres would release these automatically on connection drop, making the in-Haskell stale-job machinery redundant. More robust but requires a schema and protocol shift.
  • LISTEN/NOTIFY dead-man-switch: workers heartbeat via NOTIFY and peers reset rows whose worker UUID hasn't beaten in N seconds.

The startup-sweep fix in this PR solves the developer-facing pain with no schema changes; the larger redesigns can come later if the team wants them.

Test plan

  • ihp/Test/Test/JobQueueSpec.hs — 7/7 green against a local Postgres
  • ihp/Test/Test/Main.hs compiles cleanly (147 modules)
  • Manual repro: run an IHP app with a leaky job, kill the worker with kill -9, restart, verify the orphaned row is recovered immediately and logged

🤖 Generated with Claude Code

amitaibu and others added 3 commits April 15, 2026 16:27
When a worker process dies abnormally (RTS heap overflow, SIGKILL,
segfault) the Haskell exception machinery never runs, so rows it had
locked stay in 'job_status_running' with the dead worker's UUID. The
existing periodic recovery loop only reclaims them after the configured
'staleJobTimeout' (default 10 min) has elapsed, so a fast crash/restart
loop in development can leave the queue blocked on every restart.

Run a stale-job sweep once at worker startup, before the dispatcher and
PG listener are wired up. In Development the sweep uses a 0s threshold
since the previous worker is definitely dead. In Production it uses the
configured 'staleJobTimeout' to avoid stomping on a peer worker's
in-flight job in multi-worker deployments.

Also expose 'recoverStaleJobsForTable', which returns the count and the
prior 'locked_by' UUIDs (captured via a CTE so RETURNING sees the
pre-update value), so the boot sweep can log a clear "Recovered N stale
running job(s) from previous worker(s): X, Y, Z" line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

UUID is already re-exported via IHP.Prelude, so the explicit import is
redundant and trips -Wunused-imports / -Werror in the nix flake check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

UUID and sort are already re-exported via IHP.Prelude, so the explicit
imports trip -Werror=unused-imports in the nix flake check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Core Size & Compile Allocations Benchmark

Metric                 Baseline (master)      This PR                Change
Core size              11048302 bytes         11048313 bytes         0.0%
Compile allocations    27108899640 bytes      27109538424 bytes      0.0%

Core size within threshold
Compile allocations within threshold

HTTP Latency (GET /, 5000 reqs, 10 concurrent)

Metric    Baseline (master)    This PR    Change
Mean      3.04ms               3.17ms     4.3%
p50       2.90ms               3.10ms
p99       6.30ms               6.20ms
Min       0.50ms               0.60ms
Max       39.20ms              17.80ms
Req/s     3188                 3081

HTTP latency within threshold

Top 10 modules (this PR)

Module                        Size (bytes)
Web.Types.thr                 547347
Web.Routes.thr                423681
Web.FrontController.thr       384020
Web.View.Threads.Show.thr     304253
Web.Controller.Threads.thr    291653
Web.Controller.Comments.thr   277382
Admin.FrontController.thr     269137
Admin.Types.thr               264263
build.Generated.User.thr      259919
Admin.Routes.thr              258465

@amitaibu
Collaborator Author

Open question: should the recovery sweep bump attempts_count?

The current PR (and the existing periodic recoverStaleJobs) does not bump it. I want to flag this because I think it's wrong, and the fix is small.

The failure mode this misses: if a job deterministically crashes the worker (heap overflow, segfault, OOM kill on a leaky job), every restart hands the same poisoned row to a fresh worker who dies the same way. With attempts_count untouched, the loop is infinite. The queue stays poisoned forever and this PR has hidden the symptom (no more orphaned rows) without solving the underlying class of bugs.

Pros of bumping:

  • Bounds the crash loop. Default maxAttempts=10 + LinearBackoff 30s → poison-pill rows are permanently failed after ~5 minutes, queue self-heals, no operator intervention.
  • Symmetric with jobDidFail / jobDidTimeout, which already charge attempts on non-clean terminations. Recovery is the same situation seen from a different angle: "the worker that would have called jobDidFail couldn't, so we do it on its behalf."
  • Correct semantics for maxAttempts=1 non-idempotent jobs: if the worker died after side effects but before jobDidSucceed, we should give up rather than re-run.
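The "~5 minutes" figure in the first pro can be sanity-checked with quick arithmetic, assuming LinearBackoff 30s means a flat 30-second delay before each retry (if the delay instead scales with the attempt number, the bound grows to roughly 22 minutes but stays finite):

```haskell
-- Worst-case duration of a poison-pill crash loop before the row is
-- permanently failed, under the flat-delay assumption above.
main :: IO ()
main = do
    let maxAttempts  = 10 :: Int          -- IHP default, per the comment above
        backoffSecs  = 30
        retries      = maxAttempts - 1    -- the first attempt runs immediately
        totalSeconds = retries * backoffSecs
    print totalSeconds                    -- 270, i.e. ~4.5 minutes
```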

Cons of bumping:

  • A legitimate worker death (deploy SIGTERM not drained, noisy-neighbour OOM, hardware reboot) charges an attempt against an innocent job. With maxAttempts=10 you'd need 10 ill-timed restarts in a row to actually exhaust a real job — generous headroom.
  • Pedantically, "attempts" stops meaning "perform ran to completion". But jobDidTimeout already breaks that invariant, so this ship has sailed.

Why not: the "preserve the user's full attempt budget" intuition is appealing but makes the framework defenseless against poison pills — exactly the bug class this PR was opened to address.

Proposal: bump attempts_count in both the boot sweep and the existing periodic recoverStaleJobs (single recovery path, one definition of "an attempt"). Set last_error to something descriptive like Recovered after worker death (attempt N/M) so failed rows are greppable. ~15 LOC, no schema change.
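On the SQL side, the proposed change could look roughly like this (a hypothetical sketch, not the PR's code: the target status, the message format, and the '?' placeholders for maxAttempts and the threshold are my guesses):

```sql
UPDATE jobs
SET status         = 'job_status_retry',
    locked_by      = NULL,
    locked_at      = NULL,
    attempts_count = attempts_count + 1,
    -- RHS expressions see pre-update values, so 'attempts_count + 1'
    -- here matches the value being written.
    last_error     = 'Recovered after worker death (attempt '
                         || (attempts_count + 1) || '/' || ?::int || ')'
WHERE status = 'job_status_running'
  AND locked_at < NOW() - (? * INTERVAL '1 second');
```

Rows that have exhausted their attempt budget would then be failed by whatever logic already enforces maxAttempts on the retry path.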
