rabbit_mnesia: add retries#15502

Closed
mkuratczyk wants to merge 1 commit into main from mnesia-retry

@mkuratczyk
Contributor

Occasionally, clustering fails with the log pasted below. I believe this is caused by parallel node startup, which sometimes leads to crashes.

Hopefully, retries will let us handle this more gracefully.

```
Feature flags: nodes `rmq-ct-cluster_size_3_2-2-21072@localhost` and `rmq-ct-cluster_size_3_2-1-21000@localhost` are compatible

Mnesia('rmq-ct-cluster_size_3_2-2-21072@localhost'): ** ERROR ** (ignoring core) ** FATAL ** mnesia_monitor crashed:
{{badmatch, <0.203.0>, Ref<0.1988436133.884998146.137464>}},
{mnesia_monitor, handle_info, 2, [{file, "mnesia_monitor.erl"}, {line, 583}]},
gen_server, try_handle_info, 3, [{file, "gen_server.erl"}, {line, 2434}]},
gen_server, handle_msg, 3, [{file, "gen_server.erl"}, {line, 2420}]},
proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 333}]}]}

Error in process <0.300.0> on node 'rmq-ct-cluster_size_3_2-2-21072@localhost' with exit value:
{badarg,[{erlang,send,
                [mnesia_locker,{release_tid,{tid,142,<24815.431.0>}}],
                [{error_info,#{module => erl_erts_errors}}]},
        {mnesia_locker,release_tid,1,[{file,"mnesia_locker.erl"},{line,128}]},
        {mnesia_tm,commit_participant,7,
                   [{file,"mnesia_tm.erl"},{line,1828}]}]}

Application mnesia exited with reason: stopped

BOOT FAILED
===========
Exception during startup:

Exit:{killed,{gen_server,call,[<0.280.0>,{negotiate_protocol,['rmq-ct-cluster_size_3_2-1-21000@localhost']},infinity]}}

   gen_server:call/3, line 1301
   mnesia_monitor:call/1, line 232
   rabbit_mnesia:-check_mnesia_consistency/2-fun-0-/2, line 1002
   rabbit_mnesia:with_running_or_clean_mnesia/1, line 1036
   rabbit_mnesia:check_cluster_consistency/2, line 719
   lists:foldl/3, line 2466
   rabbit_mnesia:check_cluster_consistency/0, line 680
   rabbit_prelaunch_cluster:setup/1, line 27
```
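
For readers unfamiliar with the approach, here is a minimal sketch of what "retries" could look like around a call that may exit during parallel startup. It is not the code from this PR: the module name `retry_sketch`, the function `with_retries/3`, and the attempt count and delay are all hypothetical and chosen only for illustration.

```erlang
-module(retry_sketch).
-export([with_retries/3]).

%% Hypothetical helper, not taken from rabbit_mnesia.
%% Runs Fun; if it raises any exception (for example an exit from
%% gen_server:call/3 because the callee was killed), sleeps DelayMs
%% and tries again, for at most Attempts attempts in total.
-spec with_retries(fun(() -> T), pos_integer(), non_neg_integer()) -> T.
with_retries(Fun, Attempts, DelayMs) when is_function(Fun, 0),
                                          is_integer(Attempts),
                                          Attempts >= 1 ->
    try
        Fun()
    catch
        Class:Reason:Stacktrace when Attempts =:= 1 ->
            %% Last attempt failed: re-raise the original exception unchanged.
            erlang:raise(Class, Reason, Stacktrace);
        _Class:_Reason ->
            timer:sleep(DelayMs),
            with_retries(Fun, Attempts - 1, DelayMs)
    end.
```

Under these assumptions, a caller might wrap the failing step from the stack trace above as `retry_sketch:with_retries(fun() -> rabbit_mnesia:check_cluster_consistency() end, 3, 1000)`. Catching every exception class is a deliberately blunt choice for a sketch; a real change would likely retry only on the specific exits seen in the log.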

@mkuratczyk marked this pull request as draft February 18, 2026 16:26
@mkuratczyk
Contributor Author

Comment thread deps/rabbit/src/rabbit_mnesia.erl Outdated
Comment thread deps/rabbit/src/rabbit_mnesia.erl
@mkuratczyk force-pushed the mnesia-retry branch 2 times, most recently from 02fb19c to 39d97cb Compare February 24, 2026 07:31
@michaelklishin
Collaborator

@mkuratczyk is this still relevant now that Mnesia was removed in main? The functions in question seem to be fairly Mnesia-specific to me.

@mkuratczyk
Contributor Author

It's 100% Mnesia specific indeed. It should prevent a very rare cluster startup failure, which I've seen both in CI and in "real-life" (when I deployed a cluster to Kubernetes, one of the nodes failed to start). I'm not precious about this change - we can just drop it, but technically it should resolve a rare issue in 4.2 and older (I have no way of reproducing the issue though).

@mkuratczyk
Contributor Author

Let's just close it. It's a very rare situation specific to Mnesia, which has been removed in 4.3.

@mkuratczyk closed this Apr 20, 2026