Cluster roots runner #69179
Open
dwoz wants to merge 2 commits into
added 2 commits
May 16, 2026 01:02
Operator-driven content fan-out for file_roots and pillar_roots.
Run on whichever master holds the canonical content; the cluster
fans out from there to every peer over the encrypted cluster pub
bus. Same transport, chunk format, and apply path as the join-time
bulk state-sync.
How it works
1. Runner fires 'cluster/runner/sync_roots' to the local event
   bus with the requested channel list.
2. MasterPubServerChannel.publish_payload intercepts that tag --
   instead of broadcasting as a regular cluster event, it spawns
   _run_root_sync_to_peers(channels).
3. For each peer pusher:
   * Allocate a fresh session id.
   * Send a 'cluster/peer/sync-roots-begin' event so the receiver
     pre-registers a state-sync session -- same contract as the
     join-reply flow, but the receiver's on_complete is a no-op
     (this is an ad-hoc push, not a Raft-learner bootstrap).
   * Stream the requested channels in the standard state-sync chunk
     format.
4. Receivers apply chunks via the existing _apply_state_sync_chunk
   path -- no new install code; the channels are file_roots and
   pillar_roots and install_root_chunk handles both.
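The four steps above can be sketched as a small in-memory simulation. This is not the code in this PR: FakePeer, the 'chunk' tag, and the payload dict shapes are hypothetical stand-ins chosen for illustration. Only the two event tags and the 'fan-out initiated' status string come from the description.

```python
import uuid
from dataclasses import dataclass, field

SYNC_TAG = "cluster/runner/sync_roots"       # tag named in the description
BEGIN_TAG = "cluster/peer/sync-roots-begin"  # tag named in the description
CHUNK_TAG = "chunk"                          # hypothetical tag for this sketch


@dataclass
class FakePeer:
    """Hypothetical stand-in for a peer pusher plus its receiver."""

    name: str
    sessions: dict = field(default_factory=dict)   # session id -> channels
    installed: list = field(default_factory=list)  # (channel, path) pairs

    def send(self, tag, data):
        if tag == BEGIN_TAG:
            # Receiver pre-registers the session before any chunk arrives.
            self.sessions[data["session"]] = list(data["channels"])
        elif tag == CHUNK_TAG:
            # Apply only chunks for a pre-registered session and channel.
            if data["channel"] in self.sessions.get(data["session"], ()):
                self.installed.append((data["channel"], data["path"]))


def publish_payload(tag, data, peers, content):
    """Intercept the sync tag; any other tag would be a regular broadcast."""
    if tag != SYNC_TAG:
        return "broadcast"
    for peer in peers:
        session = uuid.uuid4().hex  # fresh session id per peer
        peer.send(BEGIN_TAG, {"session": session, "channels": data["channels"]})
        for channel in data["channels"]:
            for path in content.get(channel, []):
                peer.send(
                    CHUNK_TAG,
                    {"session": session, "channel": channel, "path": path},
                )
    return "fan-out initiated"


# Usage: push only file_roots to two peers.
peers = [FakePeer("m2"), FakePeer("m3")]
content = {
    "file_roots": ["top.sls", "nested/init.sls"],
    "pillar_roots": ["secret.sls"],
}
status = publish_payload(SYNC_TAG, {"channels": ["file_roots"]}, peers, content)
```

Each peer gets its own session id, and a chunk is applied only if its session was announced first, which mirrors the begin-then-stream contract described in step 3.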
Smoke-tested live on a 5-master cluster: added new content to m1's
srv/salt (including a nested subdirectory) and srv/pillar after the
cluster was up, then ran cluster.sync_roots; all 4 peers had the new
files within ~8 seconds. Nested directories were preserved.
What's NOT in this slice
* No completion feedback -- runner returns immediately with
status='fan-out initiated'. Operators tail each peer's log for
the 'state-sync ... installed N items' lines to confirm delivery.
* No re-sync trigger -- operator-driven only. No filesystem watcher,
no periodic poll.
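Since the runner gives no completion feedback, an operator could script the log check instead of tailing by hand. The sketch below is only an illustration: the description quotes the log line as 'state-sync ... installed N items', so the regex is an assumption about the exact wording, and the sample log lines are made up for the example.

```python
import re

# Assumed line shape, based only on 'state-sync ... installed N items'.
INSTALLED_RE = re.compile(r"state-sync\b.*\binstalled (\d+) items")


def installed_counts(log_lines):
    """Return the item count from every matching log line, in order."""
    counts = []
    for line in log_lines:
        match = INSTALLED_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def delivery_confirmed(logs_by_peer):
    """Map peer name -> True when its log has at least one install line."""
    return {
        peer: bool(installed_counts(lines))
        for peer, lines in logs_by_peer.items()
    }


# Usage with made-up log lines (the real format may differ).
sample_logs = {
    "m2": ["[INFO] state-sync session abc installed 3 items"],
    "m3": ["[INFO] unrelated line"],
}
confirmed = delivery_confirmed(sample_logs)
```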
Tests
* test_sync_roots_rejects_invalid_roots -- input validation
* test_sync_roots_no_cluster_id_is_skip -- non-cluster master returns
a structured skip rather than firing a meaningless event
* test_sync_roots_fires_local_event -- happy path: event fires with
the resolved channel list
* test_sync_roots_file_only_filters_channels -- channel filter
honoured
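The behaviour these four tests pin down can be sketched as a standalone function. This is a hedged approximation, not the runner in this PR: the accepted roots values other than 'file', the 'all' default, the shape of the error and skip results, and the fire_event callback signature are all assumptions. The event tag and the 'fan-out initiated' status come from the description.

```python
SYNC_TAG = "cluster/runner/sync_roots"  # tag named in the description

# Assumed mapping from the roots argument to state-sync channels.
CHANNELS = {"file": ["file_roots"], "pillar": ["pillar_roots"]}


def sync_roots(opts, fire_event, roots="all"):
    """Validate input, skip on non-cluster masters, else fire one event."""
    if roots == "all":
        channels = CHANNELS["file"] + CHANNELS["pillar"]
    elif roots in CHANNELS:
        channels = list(CHANNELS[roots])
    else:
        # Input validation: reject anything outside the known values.
        return {"status": "error", "reason": f"invalid roots: {roots!r}"}
    if not opts.get("cluster_id"):
        # Structured skip instead of firing a meaningless event.
        return {"status": "skipped", "reason": "no cluster_id configured"}
    fire_event({"channels": channels}, SYNC_TAG)
    return {"status": "fan-out initiated", "channels": channels}


# Usage: record fired events in a list instead of a real event bus.
fired = []
result = sync_roots(
    {"cluster_id": "c1"},
    lambda data, tag: fired.append((tag, data)),
    roots="file",
)
```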
End-to-end integration coverage on the 3-master isolated-FS cluster.
Complements the unit tests in
tests/pytests/unit/runners/test_cluster_runner.py (event firing, input
validation) with full daemon-driven runs that exercise the
runner -> master event bus -> publish_payload intercept ->
_run_root_sync_to_peers -> peer state-sync chunk install path.
Two scenarios:
* test_isolated_sync_roots_runner_propagates_content
  After the cluster is steady-state, write new content to master_1's
  file_roots and pillar_roots, run salt-run cluster.sync_roots, poll
  master_2 and master_3 until both files appear (or 30s elapses), and
  assert the marker round-trips through encrypted state-sync. Distinct
  from test_isolated_late_joiner_receives_file_and_pillar_roots which
  covers JOIN-time bulk sync.
* test_isolated_sync_roots_runner_file_only
  Pins the channels= filter: roots=file syncs only file_roots; the
  pillar tree on peers remains untouched. Operator escape hatch against
  accidentally fanning out secret pillar data when only an SLS update
  is intended.
Pre-existing trivial change: the commented-out warning log line at the
top of publish_payload (one of those //log it all// debug crutches) is
removed since the function already gets per-event log output via the
cluster-event broadcast path and the targeted publish_payload branches.
Diagnosis note for future debuggers
While bringing these tests up I hit a confusing failure where the
runner fired the event, master_1's daemon broadcast it as
cluster/event/127.0.0.1/cluster/runner/sync_roots, but the intercept
never ran. Root cause: salt-factories had cached
/tmp/stsuite/scripts/cli_salt_master.py pointing at a *different*
worktree (saw a stale CODE_DIR=.../masterbug entry), so the test
daemons loaded an older copy of salt that didn't have the
publish_payload intercept. Clearing /tmp/stsuite/scripts fixed it.
Worth knowing if you see 'feature works in unit test but integration
test silently misses it' -- check the factory cache first.
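The factory-cache check from the diagnosis note could be automated along these lines. This is a rough sketch under stated assumptions: it assumes the cached scripts contain a line of the form CODE_DIR = "<path>", which is inferred only from the 'stale CODE_DIR=.../masterbug entry' mentioned above, and the function name and demo paths are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed line shape: CODE_DIR = "<path>" (quotes optional).
CODE_DIR_RE = re.compile(r"""CODE_DIR\s*=\s*["']?([^"'\n]+)""")


def stale_factory_scripts(scripts_dir, expected_code_dir):
    """List (script name, CODE_DIR) pairs that point at another worktree."""
    expected = Path(expected_code_dir)
    stale = []
    for script in sorted(Path(scripts_dir).glob("*.py")):
        text = script.read_text(errors="replace")
        for match in CODE_DIR_RE.finditer(text):
            found = Path(match.group(1).strip())
            if found != expected:
                stale.append((script.name, found.as_posix()))
    return stale


# Usage against a throwaway directory with made-up worktree paths.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cli_salt_master.py").write_text('CODE_DIR = "/work/masterbug"\n')
    Path(tmp, "cli_salt_run.py").write_text('CODE_DIR = "/work/salt"\n')
    stale = stale_factory_scripts(tmp, "/work/salt")
```

A non-empty result suggests the cached scripts should be cleared before rerunning the integration tests.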
What does this PR do?
Add cluster roots runner to manually sync file/pillar roots.