Recover orphaned running jobs at worker startup #2644
When a worker process dies abnormally (RTS heap overflow, SIGKILL, segfault) the Haskell exception machinery never runs, so rows it had locked stay in 'job_status_running' with the dead worker's UUID. The existing periodic recovery loop only reclaims them after the configured 'staleJobTimeout' (default 10 min) has elapsed, so a fast crash/restart loop in development can leave the queue blocked on every restart. Run a stale-job sweep once at worker startup, before the dispatcher and PG listener are wired up. In Development the sweep uses a 0s threshold since the previous worker is definitely dead. In Production it uses the configured 'staleJobTimeout' to avoid stomping on a peer worker's in-flight job in multi-worker deployments. Also expose 'recoverStaleJobsForTable', which returns the count and the prior 'locked_by' UUIDs (captured via a CTE so RETURNING sees the pre-update value), so the boot sweep can log a clear "Recovered N stale running job(s) from previous worker(s): X, Y, Z" line. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
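A minimal sketch of the environment split this commit describes (`bootSweepThreshold` is a hypothetical name for illustration, not the PR's):

```haskell
import Data.Time.Clock (NominalDiffTime)

-- Boot-sweep threshold per environment, as described in the commit
-- message above: Development sweeps immediately because the previous
-- single dev worker is certainly dead; Production waits out the
-- configured staleJobTimeout so a live peer's in-flight job isn't stolen.
bootSweepThreshold :: Bool -> NominalDiffTime -> NominalDiffTime
bootSweepThreshold isDevelopment staleJobTimeout
    | isDevelopment = 0
    | otherwise     = staleJobTimeout
```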
UUID is already re-exported via IHP.Prelude, so the explicit import is redundant and trips -Wunused-imports / -Werror in the nix flake check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UUID and sort are already re-exported via IHP.Prelude, so the explicit imports trip -Werror=unused-imports in the nix flake check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI benchmarks: Core Size & Compile Allocations; HTTP Latency (GET /, 5000 reqs, 10 concurrent); Top 10 modules (this PR). (Result tables omitted.)
Open question: should the recovery sweep bump `attempts_count`? The current PR (and the existing periodic sweep) resets orphaned rows without incrementing `attempts_count`. The failure mode this misses: if a job deterministically crashes the worker (heap overflow, segfault, OOM kill on a leaky job), every restart hands the same poisoned row to a fresh worker, who dies the same way. With `attempts_count` never incremented, `maxAttempts` can never trip, so the poison pill crash-loops forever.

Pros of bumping: a deterministically crashing job eventually exhausts its attempts and lands in `job_status_failed` instead of crash-looping every worker that picks it up.

Cons of bumping: a job that was merely interrupted (deploy restart, OOM unrelated to the job itself) is charged an attempt it never really used.

Why not: the "preserve the user's full attempt budget" intuition is appealing but makes the framework defenseless against poison pills — exactly the bug class this PR was opened to address. Proposal: bump `attempts_count` whenever the sweep reclaims a row.
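A minimal sketch of what the proposed bump could look like (hypothetical, not part of this PR: the table name and 10-minute threshold are inlined, the `JOB_STATUS` enum values follow IHP's job-table conventions, and `maxAttempts` is passed in because IHP keeps it as Haskell-side job config rather than a column):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module BumpSketch where

import qualified Database.PostgreSQL.Simple as PG

-- Charge one attempt per recovery; once the budget is exhausted the row
-- lands in job_status_failed instead of being handed to the next worker,
-- so a poison pill cannot crash-loop the queue forever.
sweepAndBump :: PG.Connection -> Int -> IO Int
sweepAndBump conn maxAttempts = fromIntegral <$> PG.execute conn
    "UPDATE jobs \
    \SET attempts_count = attempts_count + 1, \
    \    status = (CASE WHEN attempts_count + 1 >= ? \
    \              THEN 'job_status_failed' \
    \              ELSE 'job_status_retry' END)::JOB_STATUS, \
    \    locked_by = NULL, locked_at = NULL \
    \WHERE status = 'job_status_running' \
    \  AND locked_at < NOW() - interval '10 minutes'"
    (PG.Only maxAttempts)
```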
Summary
When a worker process dies abnormally (RTS heap overflow panic, SIGKILL from the OOM killer, segfault, kernel panic) the Haskell exception machinery never runs, so rows it had picked up remain in `job_status_running` with the dead worker's UUID in `locked_by`. The existing periodic stale-job recovery loop only reclaims these rows after the configured `staleJobTimeout` (default 10 minutes) has elapsed, so a fast crash/restart loop in development can leave the queue blocked on every restart, and there's no signal to the developer about what happened.

This PR runs a stale-job sweep once at worker startup, before the dispatcher and PG listener are wired up, with environment-aware thresholds and a clear log line.
Reproduction (the bug this fixes)
1. Set `GHCRTS=-M512M` in any IHP app
2. Enqueue a job that allocates without bound (e.g. building ever-growing `Map`s)
3. `devenv up`, observe ghci panic with "heap overflow"
4. The row is left in `job_status_running`, no `last_error`, locked by the now-dead worker
5. Restart `devenv up` and the row stays stuck for the full 10-minute `staleJobTimeout` window
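For illustration, a standalone program (not from the PR) that reproduces the leak in step 2 — run it with `GHCRTS=-M512M` and the RTS kills the process before any Haskell exception handler can run:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import qualified Data.Map.Strict as Map

-- Grows a Map without bound. Under GHCRTS=-M512M the RTS aborts the
-- process with a heap-overflow error, so (as described above) no
-- exception handler runs and the job row is never reset.
main :: IO ()
main = go Map.empty 0
  where
    go :: Map.Map Int Int -> Int -> IO ()
    go !m !n = go (Map.insert n n m) (n + 1)
```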
Approach

- New: `IHP.Job.Queue.recoverStaleJobsForTable :: Pool -> Text -> NominalDiffTime -> IO (Int, [UUID])`. Same two-tier recovery as the existing `recoverStaleJobs`, but uses a CTE so RETURNING captures the pre-update `locked_by` values (a plain `UPDATE ... RETURNING` would observe the post-update NULL). Returns `(count, previousWorkerUuids)` so callers can log what happened. `recoverStaleJobs` is now a thin wrapper that discards the report. (A sketch of the CTE shape follows this list.)
- New: `runBootStaleJobSweep` in `IHP.Job.Runner.WorkerLoop`. Called once at the top of `jobWorkerFetchAndRunLoop`, before any STM/dispatcher/PG-listener setup, so no concurrent worker could have just legitimately locked a row.
- Threshold in Development (`isDevelopment`): 0 seconds — sweep everything. The dev server is single-worker, so any running row is from the previous, now-dead process.
- Threshold in Production: the configured `staleJobTimeout`. Avoids stomping on a peer worker's in-flight job in multi-worker deployments.
- A table is skipped when its `staleJobTimeout @job` is `Nothing` (recovery disabled by the user).
- The sweep does not bump `attempts_count` — the job never got to run cleanly, so the user's full attempt budget is preserved. Same as the existing periodic sweep.
- Logs `Recovered N stale running job(s) at startup from previous worker(s): X, Y, Z` so a developer iterating on a leaky job immediately sees what happened.
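A hedged sketch of the CTE trick (illustrative only: the table name `jobs` is inlined, row locking is omitted, and the two-tier status logic is collapsed into a single reset — the real `recoverStaleJobsForTable` interpolates the job table name and keeps the existing behaviour):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module RecoverStaleSketch where

import Data.Pool (Pool, withResource)
import Data.Time.Clock (NominalDiffTime)
import Data.UUID (UUID)
import qualified Database.PostgreSQL.Simple as PG

-- RETURNING on a plain UPDATE reports the *post*-update row, so
-- locked_by would already be NULL there. Snapshotting the stale rows
-- in a CTE and joining against it lets RETURNING expose the pre-update
-- locked_by: the dead worker's UUID.
recoverStaleJobsSketch :: Pool PG.Connection -> NominalDiffTime -> IO (Int, [UUID])
recoverStaleJobsSketch pool threshold = withResource pool $ \conn -> do
    rows <- PG.query conn
        "WITH stale AS ( \
        \    SELECT id, locked_by FROM jobs \
        \    WHERE status = 'job_status_running' \
        \      AND locked_at < NOW() - make_interval(secs => ?) \
        \) \
        \UPDATE jobs \
        \SET status = 'job_status_retry', locked_by = NULL, locked_at = NULL \
        \FROM stale WHERE jobs.id = stale.id \
        \RETURNING stale.locked_by"
        (PG.Only (realToFrac threshold :: Double))
    let previousWorkers = [uuid | PG.Only uuid <- rows]
    pure (length previousWorkers, previousWorkers)
```

With the report in hand, `recoverStaleJobs` can remain a thin wrapper that simply discards the `(count, uuids)` pair.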
Tests

`ihp/Test/Test/JobQueueSpec.hs` adds a new `IHP.Job.Queue.recoverStaleJobsForTable` describe with three cases:

- Sweeps a stale `job_status_running` row, returns its previous worker UUID, and a second sweep is a no-op (a rough shape of this case is sketched after this section).
- Leaves rows alone when `locked_at` is younger than the threshold.
The existing `createTestTable` helper gained `locked_at` and `last_error` columns (additive, no impact on the existing trigger tests).
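The rough shape of the first case (hedged sketch: `withJobsTable` and `insertRunningRow` are hypothetical stand-ins for the spec's real fixtures against a local Postgres; only the `recoverStaleJobsForTable` signature comes from this PR):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module JobQueueSpecSketch where

import Data.Pool (Pool)
import Data.UUID (UUID)
import qualified Data.UUID.V4 as UUID
import qualified Database.PostgreSQL.Simple as PG
import Test.Hspec

import IHP.Job.Queue (recoverStaleJobsForTable)

-- Hypothetical fixtures standing in for the spec's real setup:
withJobsTable :: (Pool PG.Connection -> IO ()) -> IO ()
withJobsTable = undefined -- would create the test jobs table on a local Postgres

insertRunningRow :: Pool PG.Connection -> UUID -> IO ()
insertRunningRow = undefined -- would insert a job_status_running row locked by the UUID

spec :: Spec
spec = describe "IHP.Job.Queue.recoverStaleJobsForTable" $ do
    it "sweeps a stale running row, reports the dead worker, and is idempotent" $
        withJobsTable $ \pool -> do
            deadWorker <- UUID.nextRandom
            insertRunningRow pool deadWorker
            -- 0s threshold, as the boot sweep uses in Development
            (recovered, workers) <- recoverStaleJobsForTable pool "jobs" 0
            recovered `shouldBe` 1
            workers `shouldBe` [deadWorker]
            -- a second sweep must be a no-op
            (again, _) <- recoverStaleJobsForTable pool "jobs" 0
            again `shouldBe` 0
```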
Future work (out of scope)

- Worker heartbeats: each live worker beats periodically (e.g. via `NOTIFY`) and peers reset rows whose worker UUID hasn't beaten in N seconds (a hypothetical sketch follows below).

The startup-sweep fix in this PR solves the developer-facing pain with no schema changes; the larger redesigns can come later if the team wants them.
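For concreteness, one possible shape of the peer-side reset (purely hypothetical future work — the `worker_heartbeats` table, its columns, and the status transition are all assumptions, not part of this PR):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module HeartbeatSketch where

import qualified Database.PostgreSQL.Simple as PG

-- Peer-side reset: reclaim running rows whose worker has not beaten
-- within the grace window. worker_heartbeats(worker_id, last_beat) is
-- an assumed table; real bookkeeping might live in NOTIFY handlers.
resetRowsOfSilentWorkers :: PG.Connection -> Double -> IO Int
resetRowsOfSilentWorkers conn graceSeconds = fromIntegral <$> PG.execute conn
    "UPDATE jobs \
    \SET status = 'job_status_retry', locked_by = NULL, locked_at = NULL \
    \WHERE status = 'job_status_running' \
    \  AND locked_by NOT IN ( \
    \      SELECT worker_id FROM worker_heartbeats \
    \      WHERE last_beat > NOW() - make_interval(secs => ?))"
    (PG.Only graceSeconds)
```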
Test plan
- `ihp/Test/Test/JobQueueSpec.hs` — 7/7 green against a local Postgres
- `ihp/Test/Test/Main.hs` compiles cleanly (147 modules)
- Manual: `kill -9` the running worker mid-job, restart, verify the orphaned row is recovered immediately and logged

🤖 Generated with Claude Code