Skip to content

fix: prevent updates of RestartInProgress condition be covered by updating role status#88

Open
ShirleyDing wants to merge 1 commit into
sgl-project:mainfrom
ShirleyDing:fix/updateRoleStatus
Open

fix: prevent updates of RestartInProgress condition be covered by updating role status#88
ShirleyDing wants to merge 1 commit into
sgl-project:mainfrom
ShirleyDing:fix/updateRoleStatus

Conversation

@ShirleyDing

@ShirleyDing ShirleyDing commented Nov 6, 2025

Copy link
Copy Markdown
Contributor

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #87

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

  1. Deploy the corresponding YAML
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: restart-policy
spec:
  roles:
    - name: sts
      restartPolicy: RecreateRBGOnPodRestart
      replicas: 2
      template:
        metadata:
          labels:
            appVersion: v1
        spec:
          containers:
            - name: sts
              image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              ports:
                - containerPort: 80
  1. Manually delete the created Pod
  2. Observe the conditions of RBGStatus
  3. Loop through steps 2 and 3, and observe 'RestartInProgress' status is

VI. Special notes for reviews

Set logLevel=1 and look at "patch content" in pkg/utils/utils.go:30 PatchObjectApplyConfiguration:
After all pod restarted, the condition of RestartInProgress was set to RBGRestartCompleted
image

Then in reconcile cycle of RoleBasedGroup, when updateRoleStatus, it used an outdated condition 'RBGRestart', not 'RBGRestartCompleted'. So the condition of RestartInProgress stayed in 'RBGRestart' forever. When pod restart next time, the restart-policy will not work anymore.
image

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

… covering updates from other reconcile logic (sgl-project#87)

Signed-off-by: dingxy <yingd1206@gmail.com>
Co-authored-by: dingxy <yingd1206@gmail.com>
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Syspretor Syspretor self-requested a review November 7, 2025 05:33
@RongGu RongGu requested a review from Copilot November 7, 2025 05:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a refresh of the RoleBasedGroup object before updating its status to ensure the latest version is being modified, preventing potential stale data issues in status updates.

  • Adds a Get call to refresh the rbg object immediately before status modification
  • Ensures status updates are applied to the most current version of the object
  • Improves reliability of status updates in the RoleBasedGroup reconciliation loop

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #88

Status: needs-work (rebase required)

Merge Conflict

This PR currently has merge conflicts against main. Given it's been open since November 2025, the base branch has diverged significantly. A rebase is needed before this can proceed.

Code Review

The fix itself is reasonable. The race condition is clear: updateRBGStatus operates on a potentially stale rbg object, so when it patches the status back, it can overwrite the RestartInProgress condition that was updated by the restart-policy reconciler in a concurrent reconcile loop. Re-fetching with r.client.Get(...) right before setCondition ensures the conditions slice reflects the latest server state.

One subtlety worth noting: readyCondition is computed before the Get(), based on the roleStatuses argument. If another reconciler modified the RBG between the initial read and this re-fetch, the ready calculation could be momentarily stale. In practice this is fine — the next reconcile loop will correct it — but it's worth being aware of.

Requested Changes

  1. Rebase onto current main and resolve conflicts.
  2. Consider adding a brief inline comment above the Get() call explaining why it's needed (e.g., // Re-fetch to avoid overwriting conditions set by other reconcilers). Future readers will thank you.
  3. The PR description checks the "tests added" box but the diff contains no test changes. If there's a unit test covering this race condition, please include it. If not feasible to unit-test (timing-dependent), that's understandable — just clarify in the description.

Once rebased and CI passes green again, this should be good to merge.


Reviewed at sha 702c81896be1269f0ead81237cfd82d2461572a4

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated Review — PR #88

Status: obsolete — recommend closing

Superseded by upstream changes

On closer inspection, this PR has been fully superseded by subsequent work on main:

  1. PR #324 (merged 2026-05-06) — fix(controller): prevent RestartInProgress condition loss due to stale cache race — provided a more comprehensive fix for the exact same bug, using a non-caching API reader (apiReader) in the PodReconciler and a defense-in-depth check in updateRBGStatus.

  2. PR #340 (merged 2026-05-19) — chore(rbg): deprecate and remove RecreateRBGOnPodRestart restart policy — completely removed the RecreateRBGOnPodRestart feature, the PodReconciler, pod_controller.go, and the RestartInProgress condition type. There are zero references to RestartInProgress on current main.

Since the feature this PR fixes no longer exists, rebasing would not be meaningful. This PR should be closed rather than rebased.

Thank you for the original bug report and fix idea — the race condition analysis in #87 was accurate and valuable. It informed the fix that eventually landed in #324.


Reviewed at sha 702c81896be1269f0ead81237cfd82d2461572a4 against main @ fa3804c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Condition of RestartInProgress maybe covered sometimes

4 participants