Skip to content

Terminate when ra_systems_sup exits#15782

Draft
the-mikedavis wants to merge 1 commit intomainfrom
md/terminate-on-ra-systems-sup-exit
Draft

Terminate when ra_systems_sup exits#15782
the-mikedavis wants to merge 1 commit intomainfrom
md/terminate-on-ra-systems-sup-exit

Conversation

@the-mikedavis
Copy link
Copy Markdown
Collaborator

The goal of this change is to shut down Rabbit when the ra_systems_sup supervisor exits. If disk is completely exhausted then the ra_systems_sup supervisor can exit from the repeated enospc errors. If we do not also crash Rabbit, Rabbit continues on but Khepri is unavailable, so no incoming connections can successfully log in. Plus there are other effects like rabbit_vhost_process deleting a vhost because Khepri does not say that it exists.

Connects to rabbitmq/ra#585

@the-mikedavis the-mikedavis self-assigned this Mar 19, 2026
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an extra gen_server just to crash the Rabbit app seems a bit hacky to me, though it works. I'm sure there must be a better way to crash Rabbit when ra_systems_sup exits.

@the-mikedavis the-mikedavis marked this pull request as draft March 19, 2026 15:26
@the-mikedavis
Copy link
Copy Markdown
Collaborator Author

the-mikedavis commented Mar 19, 2026

Here is a full log from a make run-broker on the current main (4b59c19) run where ra_systems_sup exits from repeated enospc. It's easier to reproduce with a config like so:

raft.data_dir = /path/to/small/tmpfs/mount
# Tune down WAL file size. If there are a few large WAL
# files then recovery might take long enough that it avoids
# hitting max restart intensity.
raft.wal_max_size_bytes = 67108864

enospc-crash-log.txt

@kjnilsson kjnilsson requested a review from dumbbell April 14, 2026 19:06
@dumbbell
Copy link
Copy Markdown
Collaborator

Thanks, I will review it next week when I get back from holidays.

init(_) ->
process_flag(trap_exit, true),
Pid = whereis(ra_systems_sup),
true = link(Pid),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don’t you need to unlink in the terminate/2 callback? Otherwise, stopping rabbit will kill ra_systems_sup. I know that rabbit messes with its dependencies so it might not be a problem right now. But if one day, we fix rabbit, it will be one.

The goal of this change is to shut down Rabbit when the `ra_systems_sup`
supervisor exits. If disk is completely exhausted then the
`ra_systems_sup` supervisor can exit from the repeated `enospc` errors.
If we do not also crash Rabbit, Rabbit continues on but Khepri is
unavailable, so no incoming connections can successfully log in. Plus
there are other effects like `rabbit_vhost_process` deleting a vhost
because Khepri does not say that it exists.
@the-mikedavis the-mikedavis force-pushed the md/terminate-on-ra-systems-sup-exit branch from 56ddbbd to 8a3cd92 Compare April 20, 2026 14:25
true = link(Pid),
{ok, Pid, hibernate}.

handle_call(_Request, _From, State) -> {noreply, State}.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will leave the caller waiting for a reply, until timeout if it set one. I think you can reply with an ok just to avoid that. What do you think?

?MODULE_STRING ": Ra system supervisor exited with reason ~tp~n",
[Reason],
#{domain => ?RMQLOG_DOMAIN_GLOBAL}),
exit(E);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about returning {stop, Reason, State}? Would it have a different behaviour compared to exiting?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah actually, Rabbit doesn't exit unless we use exit/1 here. With {stop, Reason, State} Rabbit keeps going. I'm not really sure why. This is the part that seems pretty hacky to me 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants