fix: re-gossip dead/suspect accusations on stale alive and unknown node by emam07 · Pull Request #345 · hashicorp/memberlist

emam07 · 2026-05-31T19:47:59Z

Problem

During mass cluster restarts, nodes can get permanently stuck as dead
in peers' views. The refutation mechanism never triggers because the
accusation (dead/suspect message) never reaches the restarted node.

Root cause — two silent drops in state.go:

Bug 1 (aliveNode, line ~1076): When a node receives
alive(node2, inc=1) but already holds dead(node2, inc=100), it
returns silently. It does not re-gossip the dead accusation back toward
the restarted node, so the node never calls refute().

Bug 2 (deadNode / suspectNode): When a dead or suspect message
arrives at a node that has never heard of the target (common in
freshly-joined nodes during a mass restart), it is silently dropped
instead of forwarded. This creates a gossip black hole that prevents
the accusation from propagating through nodes with incomplete views.
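The drop described in Bug 2 can be pictured with a toy model. This is only a sketch: `view`, `onDead`, and the `forwardUnknown` flag are hypothetical names invented for illustration, and the real `deadNode()` in state.go does considerably more (locking, refutation of accusations about the local node, broadcasting).

```go
package main

import "fmt"

// view is a toy membership table: node name -> last known incarnation.
// It is a stand-in for illustration, not the library's node map.
type view map[string]uint32

// onDead models how a peer treats an incoming dead message and reports
// whether the message is queued for further gossip.
// forwardUnknown=false mirrors the silent drop described above;
// forwardUnknown=true mirrors the forwarding this PR proposes.
func onDead(v view, name string, inc uint32, forwardUnknown bool) bool {
	known, ok := v[name]
	if !ok {
		// The peer has never heard of the target.
		return forwardUnknown
	}
	// Known node: an accusation older than the stored incarnation is ignored.
	return inc >= known
}

func main() {
	fresh := view{} // a freshly joined peer with an empty view
	fmt.Println("drop behavior forwards:", onDead(fresh, "node2", 100, false))
	fmt.Println("proposed behavior forwards:", onDead(fresh, "node2", 100, true))
}
```

In this model a freshly joined peer with an empty view never relays the accusation unless `forwardUnknown` is set, which is the "black hole" the description refers to.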

Both bugs together mean the restarted node broadcasts alive(inc=1)
indefinitely but no node ever sends back the dead(inc=100) accusation
it needs to refute. The node stays dead in affected peers' views
permanently — until an accidental TCP push/pull sync fixes it.

Reported in #311.

Fix

- aliveNode(): when a stale alive message is received for a dead or suspect node, re-queue the accusation for gossip so the restarted node can receive it and refute.
- deadNode() / suspectNode(): forward dead/suspect messages for unknown nodes rather than dropping them. The TransmitLimitedQueue already bounds retransmissions to RetransmitMult × log(N+1).
- Added [INFO] log lines when nodes are marked suspect or dead for operator visibility.
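To give a feel for the retransmission bound mentioned above, here is a minimal sketch of the calculation. It assumes the logarithm is base 10 and rounded up, and the multiplier of 4 is only an illustrative value; `retransmitLimit` here is a standalone function written for this example, not a call into the library.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitLimit returns how many times a single broadcast would be
// retransmitted in a cluster of n nodes, assuming the bound is
// retransmitMult * ceil(log10(n+1)).
func retransmitLimit(retransmitMult, n int) int {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return retransmitMult * int(nodeScale)
}

func main() {
	// Cluster sizes chosen away from exact powers of ten to keep the
	// floating-point rounding unambiguous.
	for _, n := range []int{10, 50, 1000, 5000} {
		fmt.Printf("n=%d limit=%d\n", n, retransmitLimit(4, n))
	}
}
```

The bound is per queued message, so every extra message that gets forwarded (for example, one for an unknown node) adds up to that many transmissions on each node that queues it.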

Tests

Four new tests in state_test.go:

| Test | What it proves |
| --- | --- |
| `TestMemberList_AliveNode_ReGossipsDeadAccusation` | Stale alive re-queues dead accusation |
| `TestMemberList_AliveNode_ReGossipsSuspectAccusation` | Stale alive re-queues suspect accusation |
| `TestMemberList_DeadNode_UnknownNode_ForwardsMessage` | Dead msg forwarded for unknown node |
| `TestMemberList_SuspectNode_UnknownNode_ForwardsMessage` | Suspect msg forwarded for unknown node |

All existing tests pass.

During mass cluster restarts a node can get permanently stuck as dead in peers' views because the refutation mechanism never triggers. Two bugs prevent the accusation from reaching the restarted node:

1. aliveNode() silently dropped stale alive messages (inc <= current) even when the local state was dead/suspect. It now re-queues the dead/suspect message so the restarted node can receive and refute it.
2. deadNode() and suspectNode() silently dropped messages about nodes not yet in the local map. Freshly-joined nodes during a mass restart act as a gossip black hole. They now forward the message so it can propagate through nodes with incomplete cluster views.

Adds [INFO] logs when nodes are marked suspect/dead for observability. Four new tests cover both bug scenarios directly.

Fixes hashicorp#311

hashicorp-cla-app · 2026-05-31T19:48:13Z

All committers have signed the CLA.

hashicorp-cla-app · 2026-05-31T19:48:13Z

Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement


ritikrajdev · 2026-06-02T07:46:44Z

CLA check is passing over here, you can proceed with reviewing the PR and further processes.

emam07 · 2026-06-05T07:04:55Z

The two failing tests (TestShuffleNodes and TestMemberList_ProbeNode_Awareness_MissedNack) are pre-existing flaky tests unrelated to this PR.

This branch only modifies state.go — neither failing test exercises that code path:

- TestShuffleNodes (util_test.go) tests shuffleNodes() in util.go, which uses rand.Shuffle. With 8 elements there is a ~1/40320 probability the shuffled order matches the original, a known statistical flake.
- TestMemberList_ProbeNode_Awareness_MissedNack is a timing-sensitive test that already uses iretry.Run() and is known to be flaky on loaded CI runners.
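
The 1/40320 figure is just 1/8!, assuming a uniform shuffle of 8 distinct elements. A small sketch of the arithmetic (`permutations` is a helper written for this example):

```go
package main

import "fmt"

// permutations returns n!, the number of equally likely orderings of n
// distinct elements under a uniform shuffle.
func permutations(n int) int {
	total := 1
	for i := 2; i <= n; i++ {
		total *= i
	}
	return total
}

func main() {
	// Exactly one of the n! orderings equals the original order.
	fmt.Printf("chance of an unchanged order: 1 in %d\n", permutations(8))
}
```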

Could you re-run the failed job? Happy to investigate further if it reproduces consistently.

tgross · 2026-06-05T13:11:14Z

@emam07 just a heads up that this is on my review queue. I'll try to get to it soon.

tgross

Hi @emam07! Overall I'm having trouble with this PR. As I've left in my comments, I don't think the characterization of the bug is quite accurate around the freshly-joined node issue. And I'm concerned about the added traffic this is going to generate in the case where a node is legitimately dead. Have you run with this patch in your own environment such that you can characterize that? It's pretty clear this was LLM-generated so it's unclear to me whether this is a drive-by patch for fun or whether you're tackling something you've seen yourself.

Also:

> The node stays dead in affected peers' views permanently — until an accidental TCP push/pull sync fixes it.

This is part of the purpose of the TCP push/pull sync! So the cluster shouldn't get itself into a permanently wedged state where nodes never come back, but nodes may take a long time to recover if huge chunks of the cluster have dropped state all at once.

tgross · 2026-07-01T14:12:01Z

 	// Update the state
 	state.Incarnation = s.Incarnation
 	state.State = StateSuspect
+	m.logger.Printf("[INFO] memberlist: Marking %s as suspect (incarnation: %d, from: %s)", s.Node, s.Incarnation, s.From)


These logs are going to be extremely noisy in real-world large clusters (e.g., thousands of Consul nodes) and balloon operator costs. Let's remove all these new logs.

tgross · 2026-07-01T14:31:11Z

+	// If we've never heard about this node before, forward the suspect message
+	// anyway so it can propagate through nodes that may know the target. During
+	// mass restarts, freshly joined nodes have incomplete cluster views and
+	// silently dropping the message here creates a gossip black hole.
 	if !ok {
+		m.logger.Printf("[WARN] memberlist: Forwarding suspect msg for unknown node %s (inc: %d, from: %s)",
+			s.Node, s.Incarnation, s.From)
+		m.encodeAndBroadcast(s.Node, suspectMsg, s)
 		return
 	}


It's not clear to me why this part is needed, and despite the retransmit limit this seems like it's going to increase cluster traffic unnecessarily when a node has actually left permanently.

In the SWIM protocol the Confirm message ("Refute" in this library) overrides all Alive or Suspect messages regardless of incarnation. That's why you made the fix in aliveNode(): the restarted node is going to broadcast its initial alive message with incarnation=1, get told its incarnation is stale, and then increment its incarnation appropriately for the next message. So the nodes in the "black hole" will converge on getting an alive message for that node, with incarnation higher than any dead/suspect message they've dropped anyways.
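
The incarnation bump described above can be sketched as follows. This is a simplified model, not the library's `refute()`: the `node` type is hypothetical and only tracks a counter, with no locking or broadcasting.

```go
package main

import "fmt"

// node is a toy stand-in for the local member; it only tracks its own
// incarnation number.
type node struct {
	incarnation uint32
}

// refute models the reaction to an accusation (suspect or dead) about this
// node: move to an incarnation strictly greater than the accused one, so the
// alive message sent afterwards overrides the accusation.
func (n *node) refute(accusedInc uint32) uint32 {
	n.incarnation++
	if accusedInc >= n.incarnation {
		n.incarnation = accusedInc + 1
	}
	return n.incarnation
}

func main() {
	restarted := &node{incarnation: 1}
	// The restarted node learns it was declared dead at incarnation 100.
	fmt.Println("next alive incarnation:", restarted.refute(100))
}
```

Under this model any peer that dropped a dead/suspect message at incarnation 100 would still accept the follow-up alive message, because its incarnation is higher than the accusation's.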

tgross · 2026-07-01T14:34:29Z

 }
+
+// ---------------------------------------------------------------------------
+// Tests for the mass-restart incarnation race condition fix


"race condition"?

tgross · 2026-07-01T14:38:23Z

+// Scenario: A node is declared dead at a high incarnation (e.g. 100).  The
+// node then restarts and re-announces itself at incarnation 1.  Two separate
+// bugs prevented the dead accusation from ever reaching the restarted node:
+//
+//   Bug 1 (aliveNode): a stale alive(inc=1) arriving at a node that already
+//   holds dead(inc=100) was silently dropped — the dead message was never
+//   re-gossiped, so the restarted node never learned it needed to refute.
+//
+//   Bug 2 (deadNode / suspectNode): a dead/suspect message arriving at a node
+//   that has never heard of the target was silently dropped instead of being
+//   forwarded, creating a gossip black hole in freshly-joined nodes.
+// ---------------------------------------------------------------------------


Having entirely separate tests as you've done here undermines the reasoning for why you need both fixes. The tests exercise the low-level behavior you've explained in the PR description but don't prove that you've fixed the actual user-facing behavior we care about.
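
For illustration, the kind of end-to-end check being asked for might look like the toy exchange below. It is only a sketch under strong assumptions: one peer, loss-free delivery, no TCP push/pull sync (which would eventually repair the view anyway), and hypothetical names (`peer`, `onAlive`, `recovers`) that do not exist in the library. A real test would drive actual Memberlist instances.

```go
package main

import "fmt"

type status int

const (
	alive status = iota
	dead
)

// peer is one member's view of the restarted node.
type peer struct {
	st  status
	inc uint32
}

// onAlive applies an alive message. If the message is stale and reGossip is
// enabled, it returns the stored accusation's incarnation with sent=true.
func (p *peer) onAlive(inc uint32, reGossip bool) (accusedInc uint32, sent bool) {
	if inc > p.inc {
		p.st, p.inc = alive, inc
		return 0, false
	}
	if reGossip && p.st == dead {
		return p.inc, true
	}
	return 0, false
}

// recovers plays a few rounds between a restarted node and a peer that holds
// dead(inc=100), and reports whether the peer ends up seeing the node alive.
func recovers(reGossip bool) bool {
	p := &peer{st: dead, inc: 100}
	nodeInc := uint32(1) // the restarted node starts over at incarnation 1
	for round := 0; round < 3; round++ {
		accusedInc, sent := p.onAlive(nodeInc, reGossip)
		if sent && accusedInc >= nodeInc {
			nodeInc = accusedInc + 1 // refute with a higher incarnation
		}
	}
	return p.st == alive
}

func main() {
	fmt.Println("without re-gossip:", recovers(false))
	fmt.Println("with re-gossip:", recovers(true))
}
```

The assertion is on the outcome (the peer's final view) rather than on whether a particular message was queued.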

emam07 requested a review from a team as a code owner May 31, 2026 19:48

tgross added this to Nomad - Community Issues Triage Jun 1, 2026

github-project-automation Bot moved this to Needs Triage in Nomad - Community Issues Triage Jun 1, 2026

fix: convert if/else if to switch to satisfy staticcheck QF1003

0385495

tgross self-requested a review June 5, 2026 13:10

tgross self-assigned this Jun 5, 2026

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jun 5, 2026

tgross requested changes Jul 1, 2026

emam07 wants to merge 2 commits into hashicorp:master from emam07:fix/incarnation-mass-restart
