
rabbit_msg_store: terminate GC with exit signal during shutdown (backport #15498) (backport #16266) #16267

Merged
lhoguin merged 2 commits into v4.2.x from mergify/bp/v4.2.x/pr-16266 on Apr 29, 2026

Conversation


mergify bot commented Apr 29, 2026

Problem

When rabbit_msg_store shuts down, its terminate/2 callback calls rabbit_msg_store_gc:stop/1, which sends a gen_server2:call(stop, infinity) to the GC process. If the GC is blocked mid-handle_cast on disk I/O (for example, during compaction under disk pressure), the call blocks indefinitely. terminate/2 never reaches the code that writes the recovery files (file_summary.ets, msg_store_index.ets, clean.dot). Eventually the msg_store child's supervisor shutdown timeout expires (msg_store_shutdown_timeout, default 600s) and the supervisor kills the msg_store process. With no recovery files on disk, the next startup rebuilds indices from scratch by scanning every segment file, which is slow and expensive for large stores.
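The hang is ordinary gen_server behaviour and is easy to reproduce outside RabbitMQ. A minimal sketch (plain gen_server rather than gen_server2; the module name, messages and the sleep are made up for illustration):

```erlang
%% Minimal sketch, not RabbitMQ code: a gen_server runs one callback at a
%% time, so a synchronous 'stop' call is only served after the in-flight
%% handle_cast returns. The never-ending sleep stands in for blocked disk I/O.
-module(gc_stop_hang_sketch).
-behaviour(gen_server).
-export([start_link/0, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() -> gen_server:start_link(?MODULE, [], []).

init([]) -> {ok, #{}}.

handle_cast(compact, State) ->
    timer:sleep(infinity),                 %% "compaction" stuck on disk I/O
    {noreply, State}.

handle_call(stop, _From, State) ->
    {stop, normal, ok, State}.

demo() ->
    {ok, GC} = start_link(),
    gen_server:cast(GC, compact),          %% GC is now inside handle_cast
    %% This mirrors what rabbit_msg_store_gc:stop/1 effectively did from
    %% terminate/2: the call is queued behind the stuck cast and never
    %% returns, so the recovery files are never written.
    gen_server:call(GC, stop, infinity).
```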

We hit this in production on a broker under PerfTest load. The persistent message store for the / vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason killed. The GC process was blocked on disk I/O while disk free space was hovering near the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via rabbitmqctl stop.

Fix

Replace the synchronous rabbit_msg_store_gc:stop/1 call in terminate/2 with a new stop_gc/2 helper that monitors the GC, sends exit(GcPid, shutdown), and waits for the 'DOWN' message.

rabbit_msg_store_gc does not trap exits, so an exit signal terminates it immediately even if it is running inside a handle_cast callback on disk I/O. Exit signals are processed by the scheduler, not the user-level receive loop, so the signal preempts the blocked callback. The GC's terminate/2 is a no-op, so bypassing it has no side effect.
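The exit-signal behaviour can be checked from any Erlang shell; the snippet below is purely illustrative and unrelated to the message store:

```erlang
%% A non-trapping process is terminated by exit(Pid, shutdown) even if it
%% never reaches a receive: signals are handled by the runtime, not by the
%% process's own message loop.
Busy = spawn(fun Loop() -> Loop() end),    %% tight loop, no receive at all
MRef = erlang:monitor(process, Busy),
exit(Busy, shutdown),
receive {'DOWN', MRef, process, Busy, shutdown} -> stopped end.
%% Had Busy called process_flag(trap_exit, true), the signal would instead
%% be delivered as a {'EXIT', _, shutdown} message and the loop would keep
%% running, which is exactly why the old synchronous stop could hang.
```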

The wait for 'DOWN' is bounded by max(msg_store_shutdown_timeout - 60_000, 5_000) so that terminate/2 stays within the msg_store child's own supervisor shutdown timeout and leaves at least 60s for the remaining steps (syncing the current file, writing the file summary, tearing down ETS, writing recovery terms). If the shutdown signal does not produce a 'DOWN' in time, fall back to exit(GCPid, kill). kill cannot be trapped, so the inner receive has no timeout. A warning is logged on the fallback path so the case can be tracked in production logs.
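Put together, the sequence looks roughly like the sketch below. This is not the upstream stop_gc/2 verbatim: how the timeout is read (application:get_env/3 on the rabbit application) and the logging call (plain logger:warning/2) are assumptions made only to keep the example self-contained.

```erlang
%% Rough sketch of the stop_gc/2 sequence described above (not the actual
%% rabbit_msg_store code; config lookup and logging call are assumptions).
stop_gc(GCPid, Dir) ->
    ShutdownTimeout = application:get_env(rabbit, msg_store_shutdown_timeout,
                                          600_000),
    %% Stay within the msg_store child's own supervisor shutdown timeout and
    %% keep at least 60s for the rest of terminate/2 (sync, file summary,
    %% ETS teardown, recovery terms).
    Wait = max(ShutdownTimeout - 60_000, 5_000),
    MRef = erlang:monitor(process, GCPid),
    exit(GCPid, shutdown),
    receive
        {'DOWN', MRef, process, GCPid, _} ->
            ok
    after Wait ->
            %% Should never happen since the GC does not trap exits; log it
            %% so the fallback is visible in production, then kill. 'kill'
            %% cannot be trapped, so this inner receive needs no timeout.
            logger:warning("Message store ~ts: GC ~p did not stop within "
                           "~b ms after exit(shutdown), killing it",
                           [Dir, GCPid, Wait]),
            exit(GCPid, kill),
            receive
                {'DOWN', MRef, process, GCPid, _} ->
                    ok
            end
    end.
```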

Killing the GC mid-operation is safe with respect to message data:

- compact_file copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
- truncate_file only removes data that has already been compacted to earlier offsets.
- delete_file only deletes files with zero valid messages, enforced by assertions before the delete.

The unclean recovery path (build_index/3) rebuilds everything from the actual segment files on disk using scan_file_for_valid_messages, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files are fully consistent and the next startup recovers cleanly without a rebuild.

Tests

Two test cases in backing_queue_SUITE exercise the fix; a new rabbit_msg_store:gc_pid/1 helper exposes the GC pid to the suite:

- msg_store_gc_stuck_suspended suspends the GC with sys:suspend, terminates the persistent store via the supervisor, and verifies that the store recovers cleanly (successfully_recovered_state returns true) with all messages intact. This covers the case where the GC is blocked in its receive loop.
- msg_store_gc_stuck_mid_callback mocks compact_file/2 to block indefinitely inside the handle_cast callback, sends a compact cast to the GC to put it in the blocking callback, then terminates the store via the supervisor. This covers the scenario that motivated the PR: a GC stuck on disk I/O inside a callback.

Both tests confirm that exit(GcPid, shutdown) terminates the GC process regardless of what Erlang code it is running, that terminate/2 completes and writes the recovery files, and that the next startup recovers cleanly.
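The suspended case in particular relies on sys:suspend/1 pausing only the gen_server loop, not signal delivery. A generic shell-level illustration, reusing the gc_stop_hang_sketch module from the first sketch above (not the actual suite code):

```erlang
%% A sys-suspended, non-trapping gen_server is still terminated by an
%% exit(shutdown) signal. Started unlinked so the caller survives.
{ok, GC} = gen_server:start(gc_stop_hang_sketch, [], []),
sys:suspend(GC),                           %% GC no longer serves its mailbox
MRef = erlang:monitor(process, GC),
exit(GC, shutdown),
receive {'DOWN', MRef, process, GC, shutdown} -> stopped end.
```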


This is an automatic backport of pull request #15498 done by [Mergify](https://mergify.com).
This is an automatic backport of pull request #16266 done by [Mergify](https://mergify.com).

(cherry picked from commit 2a548bf)
(cherry picked from commit ec2c976)

The `exit(GCPid, shutdown)` signal is expected to terminate the GC
immediately because the GC does not trap exits. If it does not, the
fallback `exit(GCPid, kill)` path runs silently and the operator has
no visibility into a case that should, by design, never happen.

Emit a warning on the fallback path identifying the store directory
and the timeout that expired. This makes the fallback searchable in
fleet-wide logs so we can tell whether the bounded-wait was ever
actually needed in production.

Thread the `Dir` argument back into `stop_gc/2` for the log message.

(cherry picked from commit 91ec75f)
(cherry picked from commit 922c36e)
lhoguin merged commit 7501bf7 into v4.2.x on Apr 29, 2026
300 checks passed
lhoguin deleted the mergify/bp/v4.2.x/pr-16266 branch on April 29, 2026 at 11:08
lhoguin added this to the 4.2.7 milestone on Apr 29, 2026