Recover orphaned running jobs at worker startup#2644

Draft
amitaibu wants to merge 3 commits into master from fix/stale-job-recovery-on-startup

Conversation

@amitaibu
Collaborator

Summary

When a worker process dies abnormally (RTS heap overflow panic, SIGKILL from the OOM killer, segfault, kernel panic) the Haskell exception machinery never runs, so rows it had picked up remain in job_status_running with the dead worker's UUID in locked_by. The existing periodic stale-job recovery loop only reclaims these rows after the configured staleJobTimeout (default 10 minutes) has elapsed, so a fast crash/restart loop in development can leave the queue blocked on every restart, and there's no signal to the developer about what happened.

This PR runs a stale-job sweep once at worker startup, before the dispatcher and PG listener are wired up, with environment-aware thresholds and a clear log line.
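A minimal sketch of what "environment-aware thresholds" means here, as pure selection logic. The names and the use of plain Int seconds (instead of IHP's actual NominalDiffTime) are illustrative, not the real IHP API:

```haskell
data Environment = Development | Production

-- Pick the staleness threshold for the boot sweep, or Nothing when the
-- user has disabled recovery by setting staleJobTimeout to Nothing.
bootSweepThreshold :: Environment -> Maybe Int -> Maybe Int
bootSweepThreshold _           Nothing        = Nothing      -- recovery disabled: skip silently
bootSweepThreshold Development (Just _)       = Just 0       -- single-worker dev server: sweep everything
bootSweepThreshold Production  (Just timeout) = Just timeout -- respect peers' in-flight jobs

main :: IO ()
main = do
    print (bootSweepThreshold Development (Just 600))  -- Just 0
    print (bootSweepThreshold Production  (Just 600))  -- Just 600
    print (bootSweepThreshold Production  Nothing)     -- Nothing
```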

Reproduction (the bug this fixes)

  1. Set GHCRTS=-M512M in any IHP app
  2. Write a job that allocates a few GB of intermediate Maps
  3. devenv up, observe ghci panic with "heap overflow"
  4. Inspect the job table: row stuck in job_status_running, no last_error, locked by the now-dead worker
  5. Restart devenv up and the row stays stuck for the full 10-minute staleJobTimeout window

Approach

  • New helper IHP.Job.Queue.recoverStaleJobsForTable :: Pool -> Text -> NominalDiffTime -> IO (Int, [UUID]). Same two-tier recovery as the existing recoverStaleJobs, but uses a CTE so RETURNING captures the pre-update locked_by values (a plain UPDATE ... RETURNING would observe the post-update NULL). Returns (count, previousWorkerUuids) so callers can log what happened. recoverStaleJobs is now a thin wrapper that discards the report.
  • New runBootStaleJobSweep in IHP.Job.Runner.WorkerLoop. Called once at the top of jobWorkerFetchAndRunLoop, before any STM/dispatcher/PG-listener setup, so no concurrent worker could have just legitimately locked a row.
  • Threshold:
    • Development (isDevelopment): 0 seconds — sweep everything. The dev server is single-worker, so any running row is from the previous now-dead process.
    • Production: the configured staleJobTimeout. Avoids stomping on a peer worker's in-flight job in multi-worker deployments.
  • Skipped silently when staleJobTimeout @job is Nothing (recovery disabled by the user).
  • Recovery does not bump attempts_count — the job never got to run cleanly, so the user's full attempt budget is preserved. Same as the existing periodic sweep.
  • Logs a single line Recovered N stale running job(s) at startup from previous worker(s): X, Y, Z so a developer iterating on a leaky job immediately sees what happened.
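The CTE shape described in the first bullet could look roughly like this (a hypothetical sketch: the real helper interpolates the table name and performs the same two-tier recovery as recoverStaleJobs, and the exact column set is inferred from the PR text):

```sql
-- Illustrative only: 'jobs' stands in for the interpolated table name,
-- '?' for the staleness threshold parameter.
WITH stale AS (
    SELECT id, locked_by
    FROM jobs
    WHERE status = 'job_status_running'
      AND locked_at < NOW() - (? * INTERVAL '1 second')
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status    = 'job_status_not_started',
    locked_by = NULL,
    locked_at = NULL
FROM stale
WHERE jobs.id = stale.id
-- 'stale.locked_by' still holds the pre-update value; a plain
-- 'UPDATE ... RETURNING locked_by' would return the freshly written NULL.
RETURNING jobs.id, stale.locked_by;
```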

Tests

ihp/Test/Test/JobQueueSpec.hs adds a new describe block, IHP.Job.Queue.recoverStaleJobsForTable, with three cases:

  1. Resets a stuck job_status_running row, returns its previous worker UUID, and a second sweep is a no-op.
  2. Returns previous worker UUIDs for multiple stale rows from different workers.
  3. Leaves rows alone when locked_at is younger than the threshold.

The existing createTestTable helper gained locked_at and last_error columns (additive, no impact on the existing trigger tests).

IHP.Job.Queue
  recreates missing triggers when poller repair is enabled [v]
  does not recreate missing triggers when poller repair is disabled [v]
  keeps polling after trigger repair errors and recovers once table exists again [v]
  recreates missing triggers for long table names when poller repair is enabled [v]
IHP.Job.Queue.recoverStaleJobsForTable
  resets a stuck running job and returns its previous worker uuid [v]
  returns previous worker uuids for multiple stale rows [v]
  leaves rows alone when their locked_at is younger than the threshold [v]

Finished in 7.85 seconds
7 examples, 0 failures

Future work (out of scope)

  • Postgres session-bound advisory locks tied to the worker connection. Postgres would release these automatically on connection drop, making the in-Haskell stale-job machinery redundant. More robust but requires a schema and protocol shift.
  • LISTEN/NOTIFY dead-man-switch: workers heartbeat via NOTIFY and peers reset rows whose worker UUID hasn't beaten in N seconds.

The startup-sweep fix in this PR solves the developer-facing pain with no schema changes; the larger redesigns can come later if the team wants them.

Test plan

  • ihp/Test/Test/JobQueueSpec.hs — 7/7 green against a local Postgres
  • ihp/Test/Test/Main.hs compiles cleanly (147 modules)
  • Manual repro: run an IHP app with a leaky job, kill the worker with kill -9, restart, verify the orphaned row is recovered immediately and logged

🤖 Generated with Claude Code

amitaibu and others added 3 commits April 15, 2026 16:27
When a worker process dies abnormally (RTS heap overflow, SIGKILL,
segfault) the Haskell exception machinery never runs, so rows it had
locked stay in 'job_status_running' with the dead worker's UUID. The
existing periodic recovery loop only reclaims them after the configured
'staleJobTimeout' (default 10 min) has elapsed, so a fast crash/restart
loop in development can leave the queue blocked on every restart.

Run a stale-job sweep once at worker startup, before the dispatcher and
PG listener are wired up. In Development the sweep uses a 0s threshold
since the previous worker is definitely dead. In Production it uses the
configured 'staleJobTimeout' to avoid stomping on a peer worker's
in-flight job in multi-worker deployments.

Also expose 'recoverStaleJobsForTable', which returns the count and the
prior 'locked_by' UUIDs (captured via a CTE so RETURNING sees the
pre-update value), so the boot sweep can log a clear "Recovered N stale
running job(s) from previous worker(s): X, Y, Z" line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

UUID is already re-exported via IHP.Prelude, so the explicit import is
redundant and trips -Wunused-imports / -Werror in the nix flake check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

UUID and sort are already re-exported via IHP.Prelude, so the explicit
imports trip -Werror=unused-imports in the nix flake check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Core Size & Compile Allocations Benchmark

Metric                 Baseline (master)      This PR                Change
Core size              11048302 bytes         11048313 bytes         0.0%
Compile allocations    27108899640 bytes      27109538424 bytes      0.0%

Core size within threshold
Compile allocations within threshold

HTTP Latency (GET /, 5000 reqs, 10 concurrent)

Metric    Baseline (master)    This PR    Change
Mean      3.04ms               3.17ms     4.3%
p50       2.90ms               3.10ms
p99       6.30ms               6.20ms
Min       0.50ms               0.60ms
Max       39.20ms              17.80ms
Req/s     3188                 3081

HTTP latency within threshold

Top 10 modules (this PR)

Module                        Size (bytes)
Web.Types.thr                 547347
Web.Routes.thr                423681
Web.FrontController.thr       384020
Web.View.Threads.Show.thr     304253
Web.Controller.Threads.thr    291653
Web.Controller.Comments.thr   277382
Admin.FrontController.thr     269137
Admin.Types.thr               264263
build.Generated.User.thr      259919
Admin.Routes.thr              258465

@amitaibu
Collaborator Author

Open question: should the recovery sweep bump attempts_count?

The current PR (and the existing periodic recoverStaleJobs) does not bump it. I want to flag this because I think it's wrong, and the fix is small.

The failure mode this misses: if a job deterministically crashes the worker (heap overflow, segfault, OOM kill on a leaky job), every restart hands the same poisoned row to a fresh worker who dies the same way. With attempts_count untouched, the loop is infinite. The queue stays poisoned forever and this PR has hidden the symptom (no more orphaned rows) without solving the underlying class of bugs.

Pros of bumping:

  • Bounds the crash loop. Default maxAttempts=10 + LinearBackoff 30s → poison-pill rows are permanently failed after ~5 minutes, queue self-heals, no operator intervention.
  • Symmetric with jobDidFail / jobDidTimeout, which already charge attempts on non-clean terminations. Recovery is the same situation seen from a different angle: "the worker that would have called jobDidFail couldn't, so we do it on its behalf."
  • Correct semantics for maxAttempts=1 non-idempotent jobs: if the worker died after side effects but before jobDidSucceed, we should give up rather than re-run.
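The "~5 minutes" figure in the first pro can be sanity-checked with quick arithmetic, assuming LinearBackoff 30s means a flat 30-second delay before each retry (if the delay instead scales with the attempt number, the bound grows to roughly 22 minutes but stays finite):

```haskell
-- Worst-case duration of a poison-pill crash loop before the row is
-- permanently failed, under the flat-delay assumption above.
main :: IO ()
main = do
    let maxAttempts  = 10 :: Int          -- IHP default, per the comment above
        backoffSecs  = 30
        retries      = maxAttempts - 1    -- the first attempt runs immediately
        totalSeconds = retries * backoffSecs
    print totalSeconds                    -- 270, i.e. ~4.5 minutes
```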

Cons of bumping:

  • A legitimate worker death (deploy SIGTERM not drained, noisy-neighbour OOM, hardware reboot) charges an attempt against an innocent job. With maxAttempts=10 you'd need 10 ill-timed restarts in a row to actually exhaust a real job — generous headroom.
  • Pedantically, "attempts" stops meaning "perform ran to completion". But jobDidTimeout already breaks that invariant, so this ship has sailed.

Why not: the "preserve the user's full attempt budget" intuition is appealing but makes the framework defenseless against poison pills — exactly the bug class this PR was opened to address.

Proposal: bump attempts_count in both the boot sweep and the existing periodic recoverStaleJobs (single recovery path, one definition of "an attempt"). Set last_error to something descriptive like Recovered after worker death (attempt N/M) so failed rows are greppable. ~15 LOC, no schema change.
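On the SQL side, the proposed change could look roughly like this (a hypothetical sketch, not the PR's code: the target status, the message format, and the '?' placeholders for maxAttempts and the threshold are my guesses):

```sql
UPDATE jobs
SET status         = 'job_status_retry',
    locked_by      = NULL,
    locked_at      = NULL,
    attempts_count = attempts_count + 1,
    -- RHS expressions see pre-update values, so 'attempts_count + 1'
    -- here matches the value being written.
    last_error     = 'Recovered after worker death (attempt '
                         || (attempts_count + 1) || '/' || ?::int || ')'
WHERE status = 'job_status_running'
  AND locked_at < NOW() - (? * INTERVAL '1 second');
```

Rows that have exhausted their attempt budget would then be failed by whatever logic already enforces maxAttempts on the retry path.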
