Skip to content

program: exit(0) directly in terminateWithSignal to avoid runtime exit(2)#337

Open
aron-muon wants to merge 1 commit into
buildbarn:mainfrom
aron-muon:aron/fix-terminate-with-signal-exit2
Open

program: exit(0) directly in terminateWithSignal to avoid runtime exit(2)#337
aron-muon wants to merge 1 commit into
buildbarn:mainfrom
aron-muon:aron/fix-terminate-with-signal-exit2

Conversation

@aron-muon
Copy link
Copy Markdown
Contributor

@aron-muon aron-muon commented May 13, 2026

Problem

terminateWithSignal() re-raises the original signal back to the process via signal.Reset() + process.Signal() to mimic a signal-style exit (e.g. 128+SIGTERM=143). But signal.Reset() does not install SIG_DFL, it only disables Go's user-channel routing. The runtime's signal trampoline is still installed and intercepts the raised signal, dispatching it to runtime.dieFromSignal():

raise(sig); osyield x3; setsig(sig, _SIG_DFL); raise(sig); osyield x3; exit(2)

Multi-goroutine programs running as PID 1 in a PID namespace reliably hit the exit(2) fall-through. bb_worker and other daemons exit code 2 on graceful SIGTERM despite a clean shutdown, pods end Failed/Error in k8s. The existing time.Sleep + os.Exit fallback was unreachable code (and time.Sleep(5) is 5 nanoseconds anyway).

Approach

Skip the signal-raise dance and exit with the right code directly. 128+signal is what POSIX shells and init systems derive from WIFSIGNALED, so it's user-visibly equivalent without going near Go's signal machinery.

Per review feedback: exit 128+signal is correct for one-shot tools (bb_copy, sync_jwks_to_configmap) where signal interruption should leave a non-zero status, but wrong for daemons where graceful shutdown should look successful to the supervisor (k8s pod phase Succeeded, systemd inactive (dead) without failure).

Add a variadic Option to RunMain:

  • Default, exit 128+signal (correct for one-shot CLI tools)
  • program.WithDaemonExit(), exit 0 (correct for long-running daemons)

Migrate this repo's daemons (bb_storage, bb_replicator) in the same PR; leave bb_copy and sync_jwks_to_configmap unchanged. Cross-repo daemons (bb_worker, bb_scheduler, bb_runner in bb-remote-execution; bb_autoscaler elsewhere) pick the daemon variant matching their lifecycle when they bump the dependency.

Refs

@aron-muon aron-muon changed the title program: exit(0) directly in terminateWithSignal to avoid runtime exi… program: exit(0) directly in terminateWithSignal to avoid runtime exit(2) May 13, 2026
@aron-muon aron-muon force-pushed the aron/fix-terminate-with-signal-exit2 branch 2 times, most recently from 212e498 to ec57cf3 Compare May 13, 2026 17:36
@aron-muon aron-muon marked this pull request as ready for review May 13, 2026 17:59
@EdSchouten
Copy link
Copy Markdown
Member

Even though returning 'exit 0' may be acceptable for daemons, program.RunMain() may also be used by one-shot processes like bb_copy and bb_autoscaler. I don't think we want to return an exit code of zero if those are terminated through signal delivery.

Per buildbarn#337 review: returning exit 0 on graceful signal-triggered
shutdown is correct for daemons (k8s pod phase Succeeded, systemd
inactive (dead) without failure), but wrong for one-shot CLI tools
(bb_copy, sync_jwks_to_configmap, etc.) where Ctrl-C should leave a
non-zero status so wrapper scripts know the work was interrupted
mid-run.

Add a variadic Option to RunMain. The default behaviour is the
one-shot semantic (exit 128+signal, the POSIX convention that bash,
zsh, systemd, and kubelet all derive from WIFSIGNALED). Daemons opt
in via program.WithDaemonExit() for the exit-0 behaviour.

terminateWithSignal still skips the signal.Reset() + process.Signal()
self-raise dance — that path is what triggered the runtime
dieFromSignal exit(2) trap on multi-goroutine programs running as
PID 1 in a PID namespace, and it adds nothing now that we exit with
the conventional code directly. The PID-1 reaper in
relaunchIfPID1 takes the same daemon flag so its exit-code policy
matches the original launch.

Migrate the daemons in this repo: bb_storage and bb_replicator. Leave
bb_copy and sync_jwks_to_configmap unchanged so they exit
128+signal on interruption. bb-remote-execution daemons (bb_worker,
bb_scheduler, bb_runner, bb_noop_worker, bb_virtual_tmp) and
external one-shots like bb_autoscaler will pick the variant matching
their lifecycle when they bump the dependency.

Refs:
  - golang/go#19326
  - golang/go#46321
@aron-muon aron-muon force-pushed the aron/fix-terminate-with-signal-exit2 branch from ec57cf3 to dda1059 Compare May 14, 2026 12:45
@aron-muon
Copy link
Copy Markdown
Contributor Author

Even though returning 'exit 0' may be acceptable for daemons, program.RunMain() may also be used by one-shot processes like bb_copy and bb_autoscaler. I don't think we want to return an exit code of zero if those are terminated through signal delivery.

Yeah, makes sense - new approach is in, thoughts on self selecting whether running as daemon vs default one shot process?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants