Skip to content

Forward termination signals to child process#332

Merged
EdSchouten merged 1 commit into
buildbarn:mainfrom
meroton:forward-signals-to-child
Mar 26, 2026
Merged

Forward termination signals to child process#332
EdSchouten merged 1 commit into
buildbarn:mainfrom
meroton:forward-signals-to-child

Conversation

@oscar-meroton
Copy link
Copy Markdown
Contributor

Commit d2a3a16,
"On Linux, relaunch the current process when running as PID 1 (#311)", relaunches the command in a child process and then waits for it to exit, if the parent process has PID 1. If the command is ran in a container, any termination signal sent to the container is forwarded to the parent process. This unfortunately prevents the persistent state files from being saved to disk if the container is killed, as the child process never receives the shutdown signal.
This commit forwards SIGINT and SIGTERM signals to the child process, which then saves the state files to disk, exits, and triggers a graceful exit in PID 1.

Commit d2a3a16,
"On Linux, relaunch the current process when running as PID 1 (buildbarn#311)",
relaunches the command in a child process and then waits for it to exit,
if the parent process has PID 1. If the command is ran in a container,
any termination signal sent to the container is forwarded to the parent
process. This unfortunately prevents the persistent state files from
being saved to disk if the container is killed, as the child process
never receives the shutdown signal.
This commit forwards `SIGINT` and `SIGTERM` signals to the child
process, which then saves the state files to disk, exits, and triggers
a graceful exit in PID 1.

Co-authored-by: Fredrik Medley <fredrik@meroton.com>
@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented Mar 24, 2026

Test

1 test target passed

Targets
//pkg/blobstore/sharding/integration:integration_test [k8-fastbuild]                 73ms

Total test execution time was 73ms. 29 tests (96.7%) were fully cached saving 5s.

@oscar-meroton
Copy link
Copy Markdown
Contributor Author

@EdSchouten,

  1. Do you have any thoughts on whether to forward any signal, or just the two listed in terminationSignals?
  2. Currently the error from syscall.Kill is not handled, should the program behave differently if an error is returned?

@EdSchouten
Copy link
Copy Markdown
Member

  1. Sticking to the termination signals is all right.
  2. What you have in terms of error handling is also all right!

\o/

@oscar-meroton
Copy link
Copy Markdown
Contributor Author

@EdSchouten Great! Is there anything else or can we go ahead and merge?

@EdSchouten
Copy link
Copy Markdown
Member

Oh sorry. Was waiting for CI to complete, and forgot to merge.

@EdSchouten EdSchouten merged commit d0c6f26 into buildbarn:main Mar 26, 2026
3 checks passed
@oscar-meroton oscar-meroton deleted the forward-signals-to-child branch March 26, 2026 17:09
aron-muon added a commit to aron-muon/bb-storage that referenced this pull request May 13, 2026
The previous fix bumped time.Sleep(5) -> time.Sleep(time.Second) and
os.Exit(1) -> os.Exit(0) on the assumption that the fallback was
firing on clean shutdown. A bb_worker termination test on staging
showed the program still exits with code 2, never reaching the
fallback at all.

Root cause traced via strace + Go runtime source:

terminateWithSignal() does signal.Reset(sig) followed by
process.Signal(sig) to make the program "die from the original
signal" so init systems see a signal-style exit. But signal.Reset()
only disables Go's user-channel routing for that signal — it does
NOT install SIG_DFL. The runtime's signal trampoline is still in
place and intercepts the raised signal, dispatching it to
runtime.dieFromSignal() (runtime/signal_unix.go).

dieFromSignal() then does:

    raise(sig); osyield x3; setsig(sig, _SIG_DFL); raise(sig); osyield x3
    exit(2)

PID 1 in a PID namespace and multi-goroutine Go programs reliably
hit the exit(2) fall-through. Our time.Sleep(time.Second) +
os.Exit(0) never executed because the runtime exited first. The
behaviour predates buildbarn#332 (signal forwarding to child); buildbarn#332 just made
it observable for bb_worker because graceful shutdown is now reachable.

Fix: skip the signal-raise dance entirely. We initiated this
shutdown intentionally — either an explicit signal from the parent
or all routines completed cleanly — so just exit 0. Container exit 0
is "Succeeded" in k8s, vs the spurious "Failed: Error" the runtime
exit(2) produced.

Tradeoff: callers no longer observe a signal-style exit (e.g. 143
for SIGTERM). For PID 1 / single-process supervisors that's a real
loss; for k8s/systemd-restartPolicy=Always it's a no-op since
exit-code-meaning is identical. Better to lose the cosmetic signal
exit code than to ship a fundamentally broken shutdown path.

Validated on staging:
  ghcr.io/aron-muon/bb-worker:20260513T141137Z-1d0616a (with sleep+exit(0))
  -> exit 2 across 4 termination tests
  ghcr.io/aron-muon/bb-worker:<next> (this commit)
  -> expected: exit 0

References:
  golang/go#19326
  golang/go#46321
  Go runtime: src/runtime/signal_unix.go dieFromSignal()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants