Skip to content

fix(rbgsa-controller): guard empty selector in LWP follower roles#263

Open
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbgsa-malformed-selector
Open

fix(rbgsa-controller): guard empty selector in LWP follower roles#263
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbgsa-malformed-selector

Conversation

@sebest

@sebest sebest commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Guard against empty status.labelSelector when building the RBGSA selector for RoleInstanceSet + LeaderWorkerPattern roles
  • Without this fix, an empty base selector produces ",rbg.x-k8s.io/component-index=0" (leading comma), which breaks our materializer's /scale subresource parsing and blocks all scaling
  • Add errSelectorNotReady sentinel error so the caller can distinguish the transient "selector not yet populated" case (requeue after 2s) from hard errors (propagate normally)
  • Add Warning event (SelectorNotReady) for observability of the retry loop
  • Clean up extractLabelSelectorDefault: propagate ctx, remove dead variable, use %w wrapping, fix typo

Test plan

  • go test ./internal/controller/workloads/ -run TestExtractLabelSelectorDefault -v — 7 table-driven cases covering populated/empty/missing selector, standalone vs LWP, LWS, and workload-not-found
  • go test ./internal/controller/workloads/ — all existing tests pass (no regressions)
  • go vet ./internal/controller/workloads/ — clean

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@Syspretor

Syspretor commented Apr 13, 2026

Copy link
Copy Markdown
Collaborator

@sebest LGTM, please resolve the conflicts and please also resolve the conflicts of #270 as well.

@Syspretor Syspretor self-requested a review April 13, 2026 03:15
@sebest sebest force-pushed the fix/rbgsa-malformed-selector branch from 5a42c10 to 7f17ae5 Compare April 13, 2026 05:47
@sebest

sebest commented Apr 13, 2026

Copy link
Copy Markdown
Contributor Author

@sebest LGTM, please resolve the conflicts and please also resolve the conflicts of #270 as well.

conflicts are fixed on both PR.

When a RoleInstanceSet's status.labelSelector has not been populated yet
(race between the RIS controller and the RBGSA controller), the code
unconditionally prepended a comma before the component-index label,
producing a malformed selector like ",rbg.x-k8s.io/component-index=0".
This broke the kwota materializer's /scale subresource parsing, blocking
all scaling for affected LeaderWorkerPattern workloads.

Add an empty-selector guard that returns errSelectorNotReady (a sentinel
error), causing the reconciler to requeue after 2s with an Info log and
Warning event instead of writing a malformed selector to status. Hard
errors (client.Get failures, bad apiVersion) still propagate normally.

Also clean up extractLabelSelectorDefault: propagate ctx instead of
context.TODO(), remove dead gvk variable, use %w error wrapping, and
fix a typo in an error message.
@sebest sebest force-pushed the fix/rbgsa-malformed-selector branch from 7f17ae5 to f7ed14a Compare April 22, 2026 05:39
@sebest

sebest commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

@Syspretor I'm not sure the remaining failing test is caused by my PR and I cannot retry it.
let me know, thanks

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The fix is straightforward and correct: guard the empty status.labelSelector for RIS + LWP roles so we never produce a malformed leading-comma selector. The sentinel error + requeue pattern is idiomatic for transient controller conditions, and the test coverage is solid (7 table-driven cases).

Blockers before merge:

  1. Merge conflict — this branch conflicts with main. Please rebase.
  2. E2E failure[v1alpha1] deploy with restartPolicy timed out (150s). The failure (workload generation not equal to observed generation) looks unrelated to your change, but CI must be green.

Code observations (non-blocking):

  • The pkgerrors alias + stdlib errors split is clean.
  • Dead gvk variable removal, context.TODO()ctx, %v%w, and the typo fix are good housekeeping.
  • 2s requeue is reasonable for a condition gated on another controller's reconcile.

Once the conflict is resolved and CI passes, this looks good to go.

@cheyang

cheyang commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the fix -- the empty-selector guard and sentinel error pattern look correct in isolation, and the test coverage is solid (7 table-driven cases covering both transient and hard error paths). However the PR currently has merge conflicts with main (specifically #281 which removed the Workload field in favor of GetWorkloadSpec()), and the e2e-test is failing. Could you rebase onto the latest main, adopt role.GetWorkloadSpec(), and verify the e2e suite passes? Happy to re-review once that is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants