Skip to content

fix(rbg): use SSA Apply with distinct field managers on RBG writes#309

Open
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbg-ssa-field-manager
Open

fix(rbg): use SSA Apply with distinct field managers on RBG writes#309
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbg-ssa-field-manager

Conversation

@sebest

@sebest sebest commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Three RBG-controller write paths used full-object client.Update or MergeFrom patches on RoleBasedGroup, which the API server records under the default "manager" field manager (the operator binary is /manager). On any chart-value bump that touched a passively co-owned field (containers, env, volumes, tolerations, ...), Helm v4 SSA Apply with ForceConflicts=false aborted with field_conflict — even though the RBG controller never authoritatively writes those sibling fields.

Replace the three writes with narrow SSA Apply calls and assign each its own field manager:

Apply path Field manager Claim
ensureDiscoveryConfigMode (RBG controller) rbg-discovery metadata.annotations.discovery-config-mode
updateRoleReplicas (RBGSA controller) rbg-replicas spec.roles[].{name, replicas}
updateExistingRBGs (RBGSet controller) rbg-set-sync metadata.{labels, annotations}, spec.roles

A shared "rbg" field manager across these three paths would not work. Per SSA claim-release semantics, when the same field manager re-applies with a narrower claim, fields it previously owned but no longer claims are released and (if no other manager owns them) removed by the API server. With three Apply paths claiming disjoint subsets of the same RBG, a shared field manager makes them ping-pong — each Apply silently strips the others' writes, hot-looping the controller. Distinct field managers give each path an independent ownership boundary.

pkg/utils.PatchObjectApplyConfigurationWithFieldManager is the new field-manager-aware variant; PatchObjectApplyConfiguration remains a thin wrapper that defaults to the legacy "rbg" field manager — preserved for callers that don't share the RBG main subresource with other rbg-controller Apply paths (RBG status writes, RBGSA writes, ConfigMap writes — all on different objects/subresources, no ping-pong risk).

The retry.RetryOnConflict wrappers from the previous Update paths are dropped: SSA Apply with Force=true does not 409 on field-manager conflicts, and no resourceVersion is sent in the apply payload, so version-skew conflicts do not arise either.

Stripping pre-existing legacy "manager" Update entries on already-deployed resources is intentionally out of scope; that is the responsibility of the upstream SSA writer (e.g. via csaupgrade.UpgradeManagedFields or a one-off managedFields strip).

Test plan

  • go build ./... clean
  • go vet ./... clean
  • gofmt -l clean
  • go test ./internal/controller/workloads/... ./pkg/utils/... green (incl. new tests for each Apply path: TestUpdateRoleReplicas, TestBuildSyncedMetadata, TestToRoleSpecApplyConfigurations, TestRoleBasedGroupSetReconciler_updateExistingRBGs_SSA, extended TestEnsureDiscoveryConfigMode)
  • End-to-end on kind cluster:
    • Apply RBG via kubectl apply --server-side --field-manager=external-actor with scalingAdapter.enable=true, scale auto-created RBGSA to drive updateRoleReplicas, apply parent RBGSet to drive updateExistingRBGs.
    • Verify generation is stable (would hot-loop at ~20 Hz with a shared field manager).
    • Verify discovery-config-mode annotation persists alongside external-actor's annotations.
    • Verify external-actor retains ownership of containers/env/image/resources/scalingAdapter.
    • Verify kubectl get rbg --show-managed-fields shows three independent rbg-discovery, rbg-replicas, rbg-set-sync managers each owning only its narrow claim.
    • Verify external-actor SSA chart bump of HEDWIG_CLIENT_ID.value (the original failure mode) with --force-conflicts=false applies cleanly.

Follow-up

#310 — RBGSet child RBGs carry a residual manager Update entry from client.Create in newRBGForSet. Predates this PR and doesn't affect the i8s-bridge scenario (which deploys standalone RBGs, not RBGSets), but worth fixing for hygiene.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates the RoleBasedGroup, RoleBasedGroupScalingAdapter, and RoleBasedGroupSet controllers to use Server-Side Apply (SSA) for resource updates, replacing previous update and patch mechanisms that relied on manual retry logic. Key changes include the introduction of a JSON-based conversion helper for complex spec fields and refactored metadata synchronization. Review feedback correctly identifies that the new SSA configurations in the RoleBasedGroup and RoleBasedGroupSet controllers are missing required Kind and APIVersion fields, which would lead to failures when the API server attempts to identify the target resource.

Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread internal/controller/workloads/rolebasedgroupset_controller.go
@sebest sebest force-pushed the fix/rbg-ssa-field-manager branch from 762f04a to e03df62 Compare April 26, 2026 18:10
@sebest sebest marked this pull request as draft April 26, 2026 18:55
@sebest sebest force-pushed the fix/rbg-ssa-field-manager branch from 31795e0 to 7c4d584 Compare April 26, 2026 20:06
@sebest sebest closed this Apr 26, 2026
@sebest sebest reopened this Apr 26, 2026
Three RBG-controller write paths used full-object client.Update or
MergeFrom patches on RoleBasedGroup, which the API server records under
the default "manager" field manager (binary name is /manager). On any
chart-value bump that touched a passively co-owned field (containers,
env, volumes, tolerations, ...), Helm v4 SSA Apply with
ForceConflicts=false aborted with field_conflict — even though the RBG
controller never authoritatively writes those sibling fields.

Replace the three writes with narrow SSA Apply calls and assign each its
own field manager:

  - ensureDiscoveryConfigMode → "rbg-discovery"
    Applies only metadata.annotations.discovery-config-mode.
  - updateRoleReplicas (RBGSA controller) → "rbg-replicas"
    Applies only spec.roles[].{name, replicas}.
  - updateExistingRBGs (RBGSet controller) → "rbg-set-sync"
    Applies metadata.{labels, annotations} and spec.roles.

A single shared "rbg" field manager would not work here. SSA's
claim-release semantics: when the same field manager re-applies with a
narrower claim, fields it previously owned but no longer claims are
released and (if no other manager owns them) removed by the API server.
With three Apply paths claiming disjoint subsets of the same RBG, a
shared field manager makes them ping-pong — each Apply silently strips
the others' writes, hot-looping the controller. Distinct field managers
give each path an independent ownership boundary.

The retry.RetryOnConflict wrappers from the previous Update path are
dropped: SSA Apply with Force=true does not 409 on field-manager
conflicts, and no resourceVersion is sent in the apply payload, so
version-skew conflicts do not arise either.

pkg/utils.PatchObjectApplyConfigurationWithFieldManager is the new
field-manager-aware variant; PatchObjectApplyConfiguration remains a
thin wrapper that defaults to the legacy "rbg" field manager for callers
that don't share the RBG main subresource with other rbg-controller
Apply paths (RBG status writes, RBGSA writes, ConfigMap writes — all on
different objects/subresources, no ping-pong risk).

Stripping pre-existing legacy "manager" Update entries on already-
deployed resources is intentionally out of scope; that is the
responsibility of the upstream SSA writer (e.g. via
csaupgrade.UpgradeManagedFields or a one-off managedFields strip).

Validated end-to-end on kind: applied an RBG via
'kubectl apply --server-side --field-manager=external-actor' with
scalingAdapter.enable=true, scaled the auto-created RBGSA to drive
updateRoleReplicas, and applied a parent RBGSet to drive
updateExistingRBGs. Generation is stable (would hot-loop at ~20 Hz with
a shared field manager), the discovery-config-mode annotation persists
alongside external-actor's annotations, external-actor retains
ownership of containers/env/image/resources/scalingAdapter, and
managedFields shows three independent rbg-* field managers each owning
only its narrow claim.
@sebest sebest force-pushed the fix/rbg-ssa-field-manager branch from 7c4d584 to 3b1ad5a Compare April 26, 2026 21:13
@sebest sebest changed the title fix(rbg): use SSA Apply for spec writes to avoid claiming "manager" field manager fix(rbg): use SSA Apply with distinct field managers on RBG writes Apr 26, 2026
@sebest sebest marked this pull request as ready for review April 26, 2026 21:51

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: SSA Apply with distinct field managers

The overall architecture is sound. Using distinct field managers per logical Apply path is the correct approach to avoid SSA claim-release ping-pong. The commit message is thorough and well-reasoned. A few items need attention before this is merge-ready.


E2E Test Failure (blocking)

The e2e-test job fails on the "rbgset controller exclusive-topology" test case. The debug output shows:

[RBGSet] Found 0 RBGs
[RBGSet] Status: Replicas=0, ReadyReplicas=0

No child RBGs were created at all. While the scaleUp creation path itself was not changed in this PR, the RBGSet reconciler could be erroring out in the updateExistingRBGs or toRoleSpecApplyConfigurations path before reaching the scale-up logic. The flaky-check script reports this is NOT a flaky failure. Please investigate whether this is caused by your changes or is a pre-existing infra issue (e.g., single-node kind cluster + exclusive topology scheduling). If pre-existing, link an existing issue or show a green run on main at the same SHA.


Replicas field ownership overlap (design question)

rbg-set-sync applies the full spec.roles (including replicas) from the RBGSet template, while rbg-replicas applies only {name, replicas} for a specific role. Both use Force: true.

When the RBGSet template is updated (e.g., image bump), toRoleSpecApplyConfigurations round-trips the entire RoleSpec including Replicas. This means rbg-set-sync will reset replicas to the template value, overriding whatever the RBGSA controller previously set. The RBGSA will eventually re-scale, but there is a transient flap window.

Is this intentional? If not, consider either:

  1. Zeroing out Replicas in the apply config within updateExistingRBGs so rbg-set-sync never claims that field, or
  2. Documenting this as expected behavior (template replicas are the "base" and RBGSA overrides are eventually consistent).

Minor observations

  1. toRoleSpecApplyConfigurations JSON round-trip: The approach is pragmatic but inherently fragile if the generated apply-configuration types ever drift from the API types in JSON tags. A comment noting this coupling already exists, which is good. Consider adding a unit test that round-trips a role with every populated field and asserts no data loss (the current tests cover most but not all fields).

  2. Test for updateRoleReplicas: The test comment correctly notes that the fake client cannot verify SSA list-merge. It would strengthen confidence to add an envtest case (or reference an existing one) that exercises the multi-role scenario where only one role's replicas should change.

  3. buildSyncedMetadata is now a package-level function: Good refactoring. The added test case for "System labels cannot be overridden by template labels" is a nice improvement over the old test suite.


Summary

  • Architecture: correct use of distinct field managers; well-motivated
  • Correctness: E2E failure must be resolved; replicas ownership overlap needs clarification
  • Security: no concerns
  • Test coverage: solid unit coverage; envtest/E2E gap on multi-manager concurrency

Verdict: needs-work until the E2E failure is explained/fixed and the replicas ownership question is answered.

@cheyang

cheyang commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

The SSA migration is well-designed: three disjoint write paths now each use their own field manager (rbg-discovery, rbg-replicas, rbg-set-sync), correctly preventing SSA claim-release ping-pong. The commit message thoroughly explains the rationale. JSON round-trip conversion for deeply nested RoleSpec types is pragmatic, and the buildSyncedMetadata refactor removes duplication. Good test coverage for the new paths.

Two items to address: (1) the doc comment on updateExistingRBGs still says "rbg" instead of "rbg-set-sync", and (2) the exclusive-topology e2e test times out on this branch, which could be related to the SSA rewrite of updateExistingRBGs. Please investigate the e2e failure and fix the stale comment.

// updateExistingRBGs updates existing RoleBasedGroup instances to match the current template.
// updateExistingRBGs updates existing RoleBasedGroup instances to match the current template
// using Server-Side Apply. This claims ownership of metadata.labels, metadata.annotations,
// and spec.roles under the "rbg" field manager instead of the default "manager".

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment on updateExistingRBGs says the function claims ownership under the "rbg" field manager, but the code actually uses utils.RBGSetSyncFieldManager which resolves to "rbg-set-sync". Since the whole point of this PR is that distinct field managers matter, the comment should match the code.

Suggested fix:

// ... under the "rbg-set-sync" field manager instead of the default "manager".

WithRoles(
applyconfiguration.RoleSpec().
WithName(targetRoleName).
WithReplicas(*newReplicas),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code assigned role.Replicas = newReplicas (pointer-safe even when nil), but the new code dereferences with *newReplicas on line 508. The current caller guards against nil before reaching this point, so this is not a live bug today, but a defensive nil check (or accepting the pointer directly via WithReplicas) would prevent a panic if a future caller omits the guard.

Non-blocking -- just something to keep in mind.

@@ -498,28 +496,25 @@ func (r *RoleBasedGroupScalingAdapterReconciler) GetTargetRbgFromAdapter(
func (r *RoleBasedGroupScalingAdapterReconciler) updateRoleReplicas(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All calls to updateRoleReplicas share the single rbg-replicas field manager. Under SSA list-map merge semantics, applying only one role releases the manager's claim on any previously-owned roles. If two RBGSAs target different roles of the same RBG, the second apply releases ownership of the first role's replicas field.

Is there a guarantee that only one RBGSA at a time updates replicas for a given RBG? If multiple RBGSAs can exist per RBG, a per-role field manager like rbg-replicas-<roleName> would preserve independent ownership.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants