Skip to content

Cluster roots runner#69179

Open
dwoz wants to merge 4 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2
Open

Cluster roots runner#69179
dwoz wants to merge 4 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 16, 2026

What does this PR do?

Add cluster roots runner to manually sync file/pillar roots.

Daniel A. Wozniak added 2 commits May 16, 2026 01:02
Operator-driven content fan-out for file_roots and pillar_roots.
Run on whichever master holds the canonical content; the cluster
fans out from there to every peer over the encrypted cluster pub
bus.  Same transport, chunk format, and apply path as the join-time
bulk state-sync.

How it works

1. Runner fires 'cluster/runner/sync_roots' to the local event
   bus with the requested channel list.
2. MasterPubServerChannel.publish_payload intercepts that tag --
   instead of broadcasting as a regular cluster event, it spawns
   _run_root_sync_to_peers(channels).
3. For each peer pusher:
   * Allocate a fresh session id.
   * Send a 'cluster/peer/sync-roots-begin' event so the receiver
     pre-registers a state-sync session -- same contract as the
     join-reply flow, but the receiver's on_complete is a no-op
     (this is an ad-hoc push, not a Raft-learner bootstrap).
   * Stream the requested channels in the standard state-sync chunk
     format.
4. Receivers apply chunks via the existing _apply_state_sync_chunk
   path -- no new install code; the channels are file_roots and
   pillar_roots and install_root_chunk handles both.

Live smoke verified on a 5-master cluster: added new content to m1's
srv/salt (including a nested subdirectory) and srv/pillar after the
cluster was up, ran cluster.sync_roots, all 4 peers had the new
files within ~8 seconds.  Nested directories preserved.

What's NOT in this slice

* No completion feedback -- runner returns immediately with
  status='fan-out initiated'.  Operators tail each peer's log for
  the 'state-sync ... installed N items' lines to confirm delivery.
* No re-sync trigger -- operator-driven only.  No filesystem watcher,
  no periodic poll.

Tests

* test_sync_roots_rejects_invalid_roots -- input validation
* test_sync_roots_no_cluster_id_is_skip -- non-cluster master returns
  a structured skip rather than firing a meaningless event
* test_sync_roots_fires_local_event -- happy path: event fires with
  the resolved channel list
* test_sync_roots_file_only_filters_channels -- channel filter
  honoured
End-to-end integration coverage on the 3-master isolated-FS cluster.
Complements the unit tests in
tests/pytests/unit/runners/test_cluster_runner.py (event firing,
input validation) with full daemon-driven runs that exercise the
runner -> master event bus -> publish_payload intercept ->
_run_root_sync_to_peers -> peer state-sync chunk install path.

Two scenarios:

* test_isolated_sync_roots_runner_propagates_content
  After the cluster is steady-state, write new content to master_1's
  file_roots and pillar_roots, run salt-run cluster.sync_roots, poll
  master_2 and master_3 until both files appear (or 30s elapses),
  and assert the marker round-trips through encrypted state-sync.
  Distinct from test_isolated_late_joiner_receives_file_and_pillar_roots
  which covers JOIN-time bulk sync.

* test_isolated_sync_roots_runner_file_only
  Pins the channels= filter: roots=file syncs only file_roots; the
  pillar tree on peers remains untouched.  Operator escape hatch
  against accidentally fanning out secret pillar data when only an
  SLS update is intended.

Pre-existing trivial change: the commented-out warning log line at
the top of publish_payload (one of those //log it all// debug
crutches) is removed since the function already gets per-event log
output via the cluster-event broadcast path and the targeted
publish_payload branches.

Diagnosis note for future debuggers

While bringing these tests up I hit a confusing failure where the
runner fired the event, master_1's daemon broadcast it as
cluster/event/127.0.0.1/cluster/runner/sync_roots, but the
intercept never ran.  Root cause: salt-factories had cached
/tmp/stsuite/scripts/cli_salt_master.py pointing at a *different*
worktree (saw a stale CODE_DIR=.../masterbug entry), so the test
daemons loaded an older copy of salt that didn't have the
publish_payload intercept.  Clearing /tmp/stsuite/scripts fixed it.
Worth knowing if you see 'feature works in unit test but
integration test silently misses it' — check the factory cache
first.
@dwoz dwoz requested a review from a team as a code owner May 16, 2026 08:12
@dwoz dwoz added the test:full Run the full test suite label May 16, 2026
Each ring becomes its own Raft group with independent leader, log, and
voter set. The cluster log carries a RingRegistryStateMachine and a
RoutingStateMachine that name rings and route data types to them; per-ring
membership and policy live in that ring's own log.

Operator runners drive the lifecycle end to end:

  cluster.ring_create / ring_destroy / ring_set
  cluster.route_set / route_clear / routes / rings
  cluster.shed_unowned / shed_unowned_all / shed_status
  cluster.collect_from_peers
  cluster.migrate_jobs_to_cache

shed and collect reuse the existing sync-roots wire format with a
bank: channel prefix so arbitrary cache banks travel over state-sync
chunks. The salt_cache returner replaces the implicit local-jobs
layout with an explicit cache-backed one (jobs/loads, jobs/minions,
jobs/returns/<jid>, jobs/endtimes, jobs/nocache) so the migration has
a well-defined bank set to move.

Gate sites at salt/master.py call ring_membership.owns_for(opts,
data_type, key); a non-member master no-ops the write and rate-limits
a WARN per (data_type, ring, reason) bucket. A delegate-on-miss event
gives operators a safety net for asymmetric topologies where traffic
can't reach a ring member directly.

Per-ring voter-health watchdog with (group_id, peer_id) cooldown keying
keeps each ring's quorum healthy independently. cluster-health.json is
now structured per ring. Sentinel writes (shed-status, collect-status,
cluster-health) go through salt.utils.atomicfile.atomic_open so a
concurrent reader or doubled fan-out never sees partial JSON. The
publish daemon dedups self.pushers by (pull_host, pull_port) to stop
the shed fan-out from being delivered twice.
The five operator-facing runners (cluster.ring_create, ring_destroy,
route_set, route_clear, ring_set) propose entries through the cluster
Raft log, and that only works on the leader.  The runner intercept in
MasterPubServerChannel.publish_payload dispatched locally only — when
the operator invoked the runner on a non-leader master the propose
raised RuntimeError, the exception was swallowed by log.exception, and
no fan-out reached the actual leader.

The intercept now also schedules a cluster_aes-encrypted
cluster/peer/multi-ring-request event to every peer.  Each peer
re-dispatches through _handle_multi_ring_runner_event; only the master
that is the current Raft leader appends an entry, followers log "not
leader" and skip.  No double-commit because Raft has one leader at any
moment.  A new unit-test file pins the wire shape (one publish per
pusher, cluster_aes round-trip, broken-pusher tolerance, no-peers
no-op, missing-secret silent skip).

Also raise per-test timeouts on slow tests that were tripping the
global 90-second pytest-timeout under coverage tracing on loaded GHA
runners.  The workflow's automatic --lf rerun absorbs the resulting
slow-but-passes case so a transient runner-load spike no longer fails
the job:

  * test_check_minions_grain_target_10000 — 240 s (test budget is
    180 s; the global 90 s default fired before the budget assertion
    could evaluate).
  * test_r_state_apply_logical_resource_no_state_module — 180 s
    (three sequential state.apply runs against dummy resources).
  * test_auth_not_starved_with_routing — 180 s (7-daemon
    saturation test).
  * test_orchestration_with_pillar_dot_items — passes _timeout=120
    to salt_run_cli.run() so the 30 s factory default doesn't kill
    the recursive saltutil.runner('pillar.show_pillar') invocation.

Tornado timing flake: test_when_is_timed_out_is_set_before_other_events_are_completed
used gather_timeout=0.1 s and event_timeout=0.15 s.  The 50 ms ordering
window was regularly inverted by GHA scheduler jitter; bumped to 1 s
and 2 s respectively, preserving the test's logical invariant.

Windows assertion drift: test_win_user_present_new_password now
checks changes.get("passwd") == "XXX-REDACTED-XXX" instead of
strict-equal so the new password_changed / lstchg keys surfaced by
the recent shadow.info refactor don't fail the test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:coverage test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant