fix(rbg): use SSA Apply with distinct field managers on RBG writes by sebest · Pull Request #309 · sgl-project/rbg

sebest · 2026-04-26T18:00:41Z

Summary

Three RBG-controller write paths used full-object client.Update or MergeFrom patches on RoleBasedGroup, which the API server records under the default "manager" field manager (the operator binary is /manager). On any chart-value bump that touched a passively co-owned field (containers, env, volumes, tolerations, ...), Helm v4 SSA Apply with ForceConflicts=false aborted with field_conflict — even though the RBG controller never authoritatively writes those sibling fields.

Replace the three writes with narrow SSA Apply calls and assign each its own field manager:

| Apply path | Field manager | Claim |
| --- | --- | --- |
| `ensureDiscoveryConfigMode` (RBG controller) | `rbg-discovery` | `metadata.annotations.discovery-config-mode` |
| `updateRoleReplicas` (RBGSA controller) | `rbg-replicas` | `spec.roles[].{name, replicas}` |
| `updateExistingRBGs` (RBGSet controller) | `rbg-set-sync` | `metadata.{labels, annotations}`, `spec.roles` |

A shared "rbg" field manager across these three paths would not work. Per SSA claim-release semantics, when the same field manager re-applies with a narrower claim, fields it previously owned but no longer claims are released and (if no other manager owns them) removed by the API server. With three Apply paths claiming disjoint subsets of the same RBG, a shared field manager makes them ping-pong — each Apply silently strips the others' writes, hot-looping the controller. Distinct field managers give each path an independent ownership boundary.
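The release behavior described above can be illustrated with a toy ownership model. This is a minimal sketch, not the API server's structured-merge-diff algorithm: the `object` type, the `simulate` helper, the `worker` role name, and the annotation value `example` are all assumptions made for illustration.

```go
package main

import "fmt"

// object is a toy stand-in for one RBG plus its managedFields.
// It is NOT the real structured-merge-diff algorithm.
type object struct {
	fields map[string]string          // field path -> value
	owners map[string]map[string]bool // field path -> set of field managers
}

func newObject() *object {
	return &object{
		fields: map[string]string{},
		owners: map[string]map[string]bool{},
	}
}

// apply mimics a forced Server-Side Apply: the manager claims exactly the
// paths in claim and releases every path it owned before but omits now.
func (o *object) apply(manager string, claim map[string]string) {
	for path, set := range o.owners {
		if _, stillClaimed := claim[path]; stillClaimed || !set[manager] {
			continue
		}
		delete(set, manager)
		if len(set) == 0 { // no co-owner left: the field is removed
			delete(o.owners, path)
			delete(o.fields, path)
		}
	}
	for path, value := range claim {
		if old, ok := o.fields[path]; ok && old != value {
			o.owners[path] = map[string]bool{} // force: take over on conflict
		}
		if o.owners[path] == nil {
			o.owners[path] = map[string]bool{}
		}
		o.owners[path][manager] = true
		o.fields[path] = value
	}
}

const (
	annotationPath = "metadata.annotations.discovery-config-mode"
	replicasPath   = "spec.roles[worker].replicas" // "worker" is a made-up role
)

// simulate runs the discovery write followed by the replicas write.
func simulate(discoveryManager, replicasManager string) map[string]string {
	o := newObject()
	o.apply(discoveryManager, map[string]string{annotationPath: "example"})
	o.apply(replicasManager, map[string]string{replicasPath: "3"})
	return o.fields
}

func main() {
	shared := simulate("rbg", "rbg")
	distinct := simulate("rbg-discovery", "rbg-replicas")
	_, sharedKeeps := shared[annotationPath]
	_, distinctKeeps := distinct[annotationPath]
	fmt.Println("shared manager keeps annotation:", sharedKeeps)
	fmt.Println("distinct managers keep annotation:", distinctKeeps)
}
```

In this model the shared manager drops the annotation as soon as the replicas write lands, which is the first half of the ping-pong; with distinct managers both fields survive.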

pkg/utils.PatchObjectApplyConfigurationWithFieldManager is the new field-manager-aware variant; PatchObjectApplyConfiguration remains a thin wrapper that defaults to the legacy "rbg" field manager — preserved for callers that don't share the RBG main subresource with other rbg-controller Apply paths (RBG status writes, RBGSA writes, ConfigMap writes — all on different objects/subresources, no ping-pong risk).
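The wrapper relationship described above might look roughly like the following. This is a sketch of the pattern only: the `applier` interface, the function names `patchWithFieldManager` and `patch`, and the `recorder` fake are hypothetical and do not reproduce the signatures in pkg/utils.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// legacyFieldManager mirrors the "rbg" default named in the PR description.
const legacyFieldManager = "rbg"

// applier is a hypothetical seam standing in for the real Kubernetes client.
type applier interface {
	Apply(ctx context.Context, obj any, fieldManager string, force bool) error
}

// patchWithFieldManager applies obj under an explicit field manager.
func patchWithFieldManager(ctx context.Context, a applier, obj any, fieldManager string) error {
	if fieldManager == "" {
		return errors.New("field manager must not be empty")
	}
	return a.Apply(ctx, obj, fieldManager, true)
}

// patch keeps the old call shape and defaults to the legacy manager.
func patch(ctx context.Context, a applier, obj any) error {
	return patchWithFieldManager(ctx, a, obj, legacyFieldManager)
}

// recorder is a fake applier that remembers which managers were used.
type recorder struct{ managers []string }

func (r *recorder) Apply(_ context.Context, _ any, fieldManager string, _ bool) error {
	r.managers = append(r.managers, fieldManager)
	return nil
}

// demo issues one legacy call and one call with a dedicated manager.
func demo() ([]string, error) {
	r := &recorder{}
	ctx := context.Background()
	if err := patch(ctx, r, "status write"); err != nil {
		return nil, err
	}
	if err := patchWithFieldManager(ctx, r, "annotation write", "rbg-discovery"); err != nil {
		return nil, err
	}
	return r.managers, nil
}

func main() {
	managers, err := demo()
	fmt.Println(managers, err)
}
```

Keeping the default in a thin wrapper means existing callers compile unchanged, while the three RBG write paths opt in to their own manager names.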

The retry.RetryOnConflict wrappers from the previous Update paths are dropped: SSA Apply with Force=true does not 409 on field-manager conflicts, and no resourceVersion is sent in the apply payload, so version-skew conflicts do not arise either.

Stripping pre-existing legacy "manager" Update entries on already-deployed resources is intentionally out of scope; that is the responsibility of the upstream SSA writer (e.g. via csaupgrade.UpgradeManagedFields or a one-off managedFields strip).

Test plan

Follow-up

#310 — RBGSet child RBGs carry a residual manager Update entry from client.Create in newRBGForSet. Predates this PR and doesn't affect the i8s-bridge scenario (which deploys standalone RBGs, not RBGSets), but worth fixing for hygiene.

gemini-code-assist

Code Review

This pull request migrates the RoleBasedGroup, RoleBasedGroupScalingAdapter, and RoleBasedGroupSet controllers to use Server-Side Apply (SSA) for resource updates, replacing previous update and patch mechanisms that relied on manual retry logic. Key changes include the introduction of a JSON-based conversion helper for complex spec fields and refactored metadata synchronization. Review feedback correctly identifies that the new SSA configurations in the RoleBasedGroup and RoleBasedGroupSet controllers are missing required Kind and APIVersion fields, which would lead to failures when the API server attempts to identify the target resource.

Three RBG-controller write paths used full-object client.Update or MergeFrom patches on RoleBasedGroup, which the API server records under the default "manager" field manager (binary name is /manager). On any chart-value bump that touched a passively co-owned field (containers, env, volumes, tolerations, ...), Helm v4 SSA Apply with ForceConflicts=false aborted with field_conflict, even though the RBG controller never authoritatively writes those sibling fields.

Replace the three writes with narrow SSA Apply calls and assign each its own field manager:

- ensureDiscoveryConfigMode → "rbg-discovery": applies only metadata.annotations.discovery-config-mode.
- updateRoleReplicas (RBGSA controller) → "rbg-replicas": applies only spec.roles[].{name, replicas}.
- updateExistingRBGs (RBGSet controller) → "rbg-set-sync": applies metadata.{labels, annotations} and spec.roles.

A single shared "rbg" field manager would not work here. SSA's claim-release semantics: when the same field manager re-applies with a narrower claim, fields it previously owned but no longer claims are released and (if no other manager owns them) removed by the API server. With three Apply paths claiming disjoint subsets of the same RBG, a shared field manager makes them ping-pong: each Apply silently strips the others' writes, hot-looping the controller. Distinct field managers give each path an independent ownership boundary.

The retry.RetryOnConflict wrappers from the previous Update path are dropped: SSA Apply with Force=true does not 409 on field-manager conflicts, and no resourceVersion is sent in the apply payload, so version-skew conflicts do not arise either.

pkg/utils.PatchObjectApplyConfigurationWithFieldManager is the new field-manager-aware variant; PatchObjectApplyConfiguration remains a thin wrapper that defaults to the legacy "rbg" field manager for callers that don't share the RBG main subresource with other rbg-controller Apply paths (RBG status writes, RBGSA writes, ConfigMap writes; all on different objects/subresources, no ping-pong risk).

Stripping pre-existing legacy "manager" Update entries on already-deployed resources is intentionally out of scope; that is the responsibility of the upstream SSA writer (e.g. via csaupgrade.UpgradeManagedFields or a one-off managedFields strip).

Validated end-to-end on kind: applied an RBG via 'kubectl apply --server-side --field-manager=external-actor' with scalingAdapter.enable=true, scaled the auto-created RBGSA to drive updateRoleReplicas, and applied a parent RBGSet to drive updateExistingRBGs. Generation is stable (would hot-loop at ~20 Hz with a shared field manager), the discovery-config-mode annotation persists alongside external-actor's annotations, external-actor retains ownership of containers/env/image/resources/scalingAdapter, and managedFields shows three independent rbg-* field managers each owning only its narrow claim.

cheyang

Review: SSA Apply with distinct field managers

The overall architecture is sound. Using distinct field managers per logical Apply path is the correct approach to avoid SSA claim-release ping-pong. The commit message is thorough and well-reasoned. A few items need attention before this is merge-ready.

E2E Test Failure (blocking)

The e2e-test job fails on the "rbgset controller exclusive-topology" test case. The debug output shows:

[RBGSet] Found 0 RBGs
[RBGSet] Status: Replicas=0, ReadyReplicas=0

No child RBGs were created at all. While the scaleUp creation path itself was not changed in this PR, the RBGSet reconciler could be erroring out in the updateExistingRBGs or toRoleSpecApplyConfigurations path before reaching the scale-up logic. The flaky-check script reports this is NOT a flaky failure. Please investigate whether this is caused by your changes or is a pre-existing infra issue (e.g., single-node kind cluster + exclusive topology scheduling). If pre-existing, link an existing issue or show a green run on main at the same SHA.

Replicas field ownership overlap (design question)

rbg-set-sync applies the full spec.roles (including replicas) from the RBGSet template, while rbg-replicas applies only {name, replicas} for a specific role. Both use Force: true.

When the RBGSet template is updated (e.g., image bump), toRoleSpecApplyConfigurations round-trips the entire RoleSpec including Replicas. This means rbg-set-sync will reset replicas to the template value, overriding whatever the RBGSA controller previously set. The RBGSA will eventually re-scale, but there is a transient flap window.

Is this intentional? If not, consider either:

Zeroing out Replicas in the apply config within updateExistingRBGs so rbg-set-sync never claims that field, or
Documenting this as expected behavior (template replicas are the "base" and RBGSA overrides are eventually consistent).
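The first option could be sketched as follows. It relies on the fact that pointer fields tagged `omitempty` are left out of the serialized apply payload when nil, so the manager never claims them. The `roleApply` struct is a hypothetical, trimmed-down stand-in for the generated apply configuration (the flat `Image` field and the `worker` role are made up), and `withoutReplicas` is an illustrative helper, not code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roleApply is a hypothetical stand-in for a generated apply configuration.
type roleApply struct {
	Name     *string `json:"name,omitempty"`
	Replicas *int32  `json:"replicas,omitempty"`
	Image    *string `json:"image,omitempty"`
}

// ptr returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// withoutReplicas returns copies of the roles with Replicas unset, so the
// serialized payload never mentions (and never claims) that field.
func withoutReplicas(roles []roleApply) []roleApply {
	out := make([]roleApply, len(roles))
	for i, role := range roles {
		role.Replicas = nil // role is a copy, so the input slice is untouched
		out[i] = role
	}
	return out
}

// payload serializes the roles the way an apply body would carry them.
func payload(roles []roleApply) (string, error) {
	raw, err := json.Marshal(roles)
	return string(raw), err
}

func main() {
	roles := []roleApply{{
		Name:     ptr("worker"),
		Replicas: ptr(int32(2)),
		Image:    ptr("example:v2"),
	}}
	body, err := payload(withoutReplicas(roles))
	fmt.Println(body, err)
}
```

With replicas absent from the payload, only the RBGSA path would own that field, at the cost that a template change to replicas no longer propagates through the sync path.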

Minor observations

toRoleSpecApplyConfigurations JSON round-trip: The approach is pragmatic but inherently fragile if the generated apply-configuration types ever drift from the API types in JSON tags. A comment noting this coupling already exists, which is good. Consider adding a unit test that round-trips a role with every populated field and asserts no data loss (the current tests cover most but not all fields).
Test for updateRoleReplicas: The test comment correctly notes that the fake client cannot verify SSA list-merge. It would strengthen confidence to add an envtest case (or reference an existing one) that exercises the multi-role scenario where only one role's replicas should change.
buildSyncedMetadata is now a package-level function: Good refactoring. The added test case for "System labels cannot be overridden by template labels" is a nice improvement over the old test suite.
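A round-trip check of the kind suggested in the first observation might take this shape. The three struct types are invented for illustration (they are not the project's RoleSpec or its generated apply configuration), and `driftedRole` deliberately carries a mismatched JSON tag to show what the check would catch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// apiRole and applyRole are hypothetical stand-ins for an API type and its
// generated apply configuration.
type apiRole struct {
	Name     string            `json:"name"`
	Replicas *int32            `json:"replicas,omitempty"`
	Labels   map[string]string `json:"labels,omitempty"`
}

type applyRole struct {
	Name     *string           `json:"name,omitempty"`
	Replicas *int32            `json:"replicas,omitempty"`
	Labels   map[string]string `json:"labels,omitempty"`
}

// driftedRole simulates a generated type whose JSON tag no longer matches.
type driftedRole struct {
	Name     *string           `json:"name,omitempty"`
	Replicas *int32            `json:"replicaCount,omitempty"`
	Labels   map[string]string `json:"labels,omitempty"`
}

// lossless converts in to T through JSON and reports whether the JSON
// produced by T carries the same data as the JSON produced by in.
func lossless[T any](in any) (bool, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return false, err
	}
	var converted T
	if err := json.Unmarshal(raw, &converted); err != nil {
		return false, err
	}
	back, err := json.Marshal(converted)
	if err != nil {
		return false, err
	}
	var want, got map[string]any
	if err := json.Unmarshal(raw, &want); err != nil {
		return false, err
	}
	if err := json.Unmarshal(back, &got); err != nil {
		return false, err
	}
	return reflect.DeepEqual(want, got), nil
}

// sample returns a role with every field populated.
func sample() apiRole {
	replicas := int32(3)
	return apiRole{
		Name:     "worker",
		Replicas: &replicas,
		Labels:   map[string]string{"tier": "gpu"},
	}
}

func main() {
	same, _ := lossless[applyRole](sample())
	drifted, _ := lossless[driftedRole](sample())
	fmt.Println("matching tags lossless:", same)
	fmt.Println("drifted tags lossless:", drifted)
}
```

Because `json.Unmarshal` silently ignores unknown keys, a tag drift loses data without returning an error; comparing the two JSON documents is what surfaces it.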

Summary

Architecture: correct use of distinct field managers; well-motivated
Correctness: E2E failure must be resolved; replicas ownership overlap needs clarification
Security: no concerns
Test coverage: solid unit coverage; envtest/E2E gap on multi-manager concurrency

Verdict: needs-work until the E2E failure is explained/fixed and the replicas ownership question is answered.

cheyang · 2026-06-12T10:16:45Z

The SSA migration is well-designed: three disjoint write paths now each use their own field manager (rbg-discovery, rbg-replicas, rbg-set-sync), correctly preventing SSA claim-release ping-pong. The commit message thoroughly explains the rationale. JSON round-trip conversion for deeply nested RoleSpec types is pragmatic, and the buildSyncedMetadata refactor removes duplication. Good test coverage for the new paths.

Two items to address: (1) the doc comment on updateExistingRBGs still says "rbg" instead of "rbg-set-sync", and (2) the exclusive-topology e2e test times out on this branch, which could be related to the SSA rewrite of updateExistingRBGs. Please investigate the e2e failure and fix the stale comment.

cheyang · 2026-06-12T10:16:50Z

-// updateExistingRBGs updates existing RoleBasedGroup instances to match the current template.
+// updateExistingRBGs updates existing RoleBasedGroup instances to match the current template
+// using Server-Side Apply. This claims ownership of metadata.labels, metadata.annotations,
+// and spec.roles under the "rbg" field manager instead of the default "manager".


The doc comment on updateExistingRBGs says the function claims ownership under the "rbg" field manager, but the code actually uses utils.RBGSetSyncFieldManager which resolves to "rbg-set-sync". Since the whole point of this PR is that distinct field managers matter, the comment should match the code.

Suggested fix:

// ... under the "rbg-set-sync" field manager instead of the default "manager".

cheyang · 2026-06-12T10:16:52Z

+				WithRoles(
+					applyconfiguration.RoleSpec().
+						WithName(targetRoleName).
+						WithReplicas(*newReplicas),


The old code assigned role.Replicas = newReplicas (pointer-safe even when nil), but the new code dereferences with *newReplicas on line 508. The current caller guards against nil before reaching this point, so this is not a live bug today, but a defensive nil check (or accepting the pointer directly via WithReplicas) would prevent a panic if a future caller omits the guard.

Non-blocking -- just something to keep in mind.
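A defensive guard of the kind mentioned above could be as small as the sketch below. The `roleApply` struct and `buildRolePatch` helper are hypothetical; they only show the nil check placed ahead of the dereference, not the PR's actual builder chain.

```go
package main

import (
	"errors"
	"fmt"
)

// roleApply is a hypothetical stand-in for the generated apply configuration.
type roleApply struct {
	Name     *string
	Replicas *int32
}

var errNilReplicas = errors.New("newReplicas must not be nil")

// buildRolePatch refuses a nil pointer instead of panicking on dereference.
func buildRolePatch(roleName string, newReplicas *int32) (*roleApply, error) {
	if newReplicas == nil {
		return nil, errNilReplicas
	}
	replicas := *newReplicas // copy so the patch does not alias the caller's value
	return &roleApply{Name: &roleName, Replicas: &replicas}, nil
}

func main() {
	_, err := buildRolePatch("worker", nil)
	fmt.Println("nil input:", err)

	three := int32(3)
	patch, err := buildRolePatch("worker", &three)
	fmt.Println("valid input:", *patch.Name, *patch.Replicas, err)
}
```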

cheyang · 2026-06-12T10:16:54Z

@@ -498,28 +496,25 @@ func (r *RoleBasedGroupScalingAdapterReconciler) GetTargetRbgFromAdapter(
 func (r *RoleBasedGroupScalingAdapterReconciler) updateRoleReplicas(


All calls to updateRoleReplicas share the single rbg-replicas field manager. Under SSA list-map merge semantics, applying only one role releases the manager's claim on any previously-owned roles. If two RBGSAs target different roles of the same RBG, the second apply releases ownership of the first role's replicas field.

Is there a guarantee that only one RBGSA at a time updates replicas for a given RBG? If multiple RBGSAs can exist per RBG, a per-role field manager like rbg-replicas-<roleName> would preserve independent ownership.
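The per-role naming idea can be sketched with a simplified ownership map. Everything here is an assumption for illustration: the model tracks a single owner per role's replicas field (real SSA allows co-ownership), and the role names `prefill` and `decode` are made up.

```go
package main

import "fmt"

const replicasManagerPrefix = "rbg-replicas-"

// replicasFieldManager derives a per-role field manager name.
func replicasFieldManager(roleName string) string {
	return replicasManagerPrefix + roleName
}

// replicaOwners maps role name -> the single manager owning its replicas.
// This is a simplification of real managedFields.
type replicaOwners map[string]string

// apply claims replicas for exactly one role and releases any other role
// the same manager owned before, mimicking a narrower re-apply.
func (o replicaOwners) apply(manager, roleName string) {
	for role, owner := range o {
		if owner == manager && role != roleName {
			delete(o, role)
		}
	}
	o[roleName] = manager
}

// ownedRoles applies once per role and counts roles that are still owned.
func ownedRoles(managerFor func(role string) string, roles ...string) int {
	o := replicaOwners{}
	for _, role := range roles {
		o.apply(managerFor(role), role)
	}
	return len(o)
}

func main() {
	shared := func(string) string { return "rbg-replicas" }
	fmt.Println("shared manager owns roles:", ownedRoles(shared, "prefill", "decode"))
	fmt.Println("per-role managers own roles:", ownedRoles(replicasFieldManager, "prefill", "decode"))
}
```

One design consideration if this route is taken: field manager names have a length limit enforced by the API server, so long role names may need truncation or hashing.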

gemini-code-assist Bot reviewed Apr 26, 2026

Comment thread internal/controller/workloads/rolebasedgroup_controller.go

Comment thread internal/controller/workloads/rolebasedgroupset_controller.go

sebest force-pushed the fix/rbg-ssa-field-manager branch from 762f04a to e03df62 Compare April 26, 2026 18:10

sebest marked this pull request as draft April 26, 2026 18:55

sebest force-pushed the fix/rbg-ssa-field-manager branch from 31795e0 to 7c4d584 Compare April 26, 2026 20:06

sebest mentioned this pull request Apr 26, 2026

RBGSet child RBGs carry stale "manager" field-manager from client.Create #310

Open

sebest closed this Apr 26, 2026

sebest reopened this Apr 26, 2026

sebest force-pushed the fix/rbg-ssa-field-manager branch from 7c4d584 to 3b1ad5a Compare April 26, 2026 21:13

sebest changed the title ~~fix(rbg): use SSA Apply for spec writes to avoid claiming "manager" field manager~~ fix(rbg): use SSA Apply with distinct field managers on RBG writes Apr 26, 2026

sebest marked this pull request as ready for review April 26, 2026 21:51

cheyang requested changes Jun 10, 2026

cheyang reviewed Jun 12, 2026

sebest wants to merge 1 commit into sgl-project:main from sebest:fix/rbg-ssa-field-manager

2 participants
