Skip to content

Cluster roots runner#69179

Open
dwoz wants to merge 2 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2
Open

Cluster roots runner#69179
dwoz wants to merge 2 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 16, 2026

What does this PR do?

Add cluster roots runner to manually sync file/pillar roots.

Daniel A. Wozniak added 2 commits May 16, 2026 01:02
Operator-driven content fan-out for file_roots and pillar_roots.
Run on whichever master holds the canonical content; the cluster
fans out from there to every peer over the encrypted cluster pub
bus.  Same transport, chunk format, and apply path as the join-time
bulk state-sync.

How it works

1. Runner fires 'cluster/runner/sync_roots' to the local event
   bus with the requested channel list.
2. MasterPubServerChannel.publish_payload intercepts that tag --
   instead of broadcasting as a regular cluster event, it spawns
   _run_root_sync_to_peers(channels).
3. For each peer pusher:
   * Allocate a fresh session id.
   * Send a 'cluster/peer/sync-roots-begin' event so the receiver
     pre-registers a state-sync session -- same contract as the
     join-reply flow, but the receiver's on_complete is a no-op
     (this is an ad-hoc push, not a Raft-learner bootstrap).
   * Stream the requested channels in the standard state-sync chunk
     format.
4. Receivers apply chunks via the existing _apply_state_sync_chunk
   path -- no new install code; the channels are file_roots and
   pillar_roots and install_root_chunk handles both.

Live smoke verified on a 5-master cluster: added new content to m1's
srv/salt (including a nested subdirectory) and srv/pillar after the
cluster was up, ran cluster.sync_roots, all 4 peers had the new
files within ~8 seconds.  Nested directories preserved.

What's NOT in this slice

* No completion feedback -- runner returns immediately with
  status='fan-out initiated'.  Operators tail each peer's log for
  the 'state-sync ... installed N items' lines to confirm delivery.
* No re-sync trigger -- operator-driven only.  No filesystem watcher,
  no periodic poll.

Tests

* test_sync_roots_rejects_invalid_roots -- input validation
* test_sync_roots_no_cluster_id_is_skip -- non-cluster master returns
  a structured skip rather than firing a meaningless event
* test_sync_roots_fires_local_event -- happy path: event fires with
  the resolved channel list
* test_sync_roots_file_only_filters_channels -- channel filter
  honoured
End-to-end integration coverage on the 3-master isolated-FS cluster.
Complements the unit tests in
tests/pytests/unit/runners/test_cluster_runner.py (event firing,
input validation) with full daemon-driven runs that exercise the
runner -> master event bus -> publish_payload intercept ->
_run_root_sync_to_peers -> peer state-sync chunk install path.

Two scenarios:

* test_isolated_sync_roots_runner_propagates_content
  After the cluster is steady-state, write new content to master_1's
  file_roots and pillar_roots, run salt-run cluster.sync_roots, poll
  master_2 and master_3 until both files appear (or 30s elapses),
  and assert the marker round-trips through encrypted state-sync.
  Distinct from test_isolated_late_joiner_receives_file_and_pillar_roots
  which covers JOIN-time bulk sync.

* test_isolated_sync_roots_runner_file_only
  Pins the channels= filter: roots=file syncs only file_roots; the
  pillar tree on peers remains untouched.  Operator escape hatch
  against accidentally fanning out secret pillar data when only an
  SLS update is intended.

Pre-existing trivial change: the commented-out warning log line at
the top of publish_payload (one of those //log it all// debug
crutches) is removed since the function already gets per-event log
output via the cluster-event broadcast path and the targeted
publish_payload branches.

Diagnosis note for future debuggers

While bringing these tests up I hit a confusing failure where the
runner fired the event, master_1's daemon broadcast it as
cluster/event/127.0.0.1/cluster/runner/sync_roots, but the
intercept never ran.  Root cause: salt-factories had cached
/tmp/stsuite/scripts/cli_salt_master.py pointing at a *different*
worktree (saw a stale CODE_DIR=.../masterbug entry), so the test
daemons loaded an older copy of salt that didn't have the
publish_payload intercept.  Clearing /tmp/stsuite/scripts fixed it.
Worth knowing if you see 'feature works in unit test but
integration test silently misses it' — check the factory cache
first.
@dwoz dwoz requested a review from a team as a code owner May 16, 2026 08:12
@dwoz dwoz added the test:full Run the full test suite label May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:coverage test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant