Skip to content

chore: add e2e cases for inplace update#285

Open
Syspretor wants to merge 1 commit into
sgl-project:mainfrom
Syspretor:chore/add-e2e-case-for-inplace-update
Open

chore: add e2e cases for inplace update#285
Syspretor wants to merge 1 commit into
sgl-project:mainfrom
Syspretor:chore/add-e2e-case-for-inplace-update

Conversation

@Syspretor

Copy link
Copy Markdown
Collaborator

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of end-to-end tests for in-place updates, covering standalone and LeaderWorker patterns, various update strategies, grace periods, and role template references. The review feedback identifies several areas for improvement: correcting Gomega assertions to prevent potential runtime panics, enabling or documenting disabled test cases, removing redundant sleep statements to optimize test execution, and implementing proper error handling for JSON operations.

Comment on lines +82 to +95
gomega.Eventually(func() bool {
err := f.Client.List(f.Ctx, podList, client.InNamespace(f.Namespace),
client.MatchingLabels{
"rbg.workloads.x-k8s.io/group-name": rbg.Name,
"rbg.workloads.x-k8s.io/role-name": "role-1",
})
if err != nil || len(podList.Items) == 0 {
return false
}
return podList.Items[0].UID == initialPodUID &&
podList.Items[0].Spec.Containers[0].Image == updatedImage
}, testutils.Timeout, testutils.Interval).Should(gomega.BeTrue(),
"Pod UID should remain unchanged for inplace update, initial: %s, got: %s, image changed from %s to %s",
initialPodUID, podList.Items[0].UID, initialImage, updatedImage)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This assertion will likely cause a runtime panic. In Go, arguments to a function call are evaluated before the function is executed. Here, podList.Items[0].UID is evaluated when Should() is called, which happens before Eventually starts polling. Since podList.Items is empty at that point, it will result in an index out of range panic.

A more robust and idiomatic way to write this in Gomega is to use the Eventually(func(g Gomega) { ... }) pattern, which allows making multiple assertions within the polling loop and provides better failure messages without manual formatting.

			gomega.Eventually(func(g gomega.Gomega) {
				err := f.Client.List(f.Ctx, podList, client.InNamespace(f.Namespace),
					client.MatchingLabels{
						"rbg.workloads.x-k8s.io/group-name": rbg.Name,
						"rbg.workloads.x-k8s.io/role-name":  "role-1",
					})
				g.Expect(err).ShouldNot(gomega.HaveOccurred())
				g.Expect(podList.Items).Should(gomega.HaveLen(1))
				g.Expect(podList.Items[0].UID).Should(gomega.Equal(initialPodUID))
				g.Expect(podList.Items[0].Spec.Containers[0].Image).Should(gomega.Equal(updatedImage))
			}, testutils.Timeout, testutils.Interval).Should(gomega.Succeed())

f.ExpectRbgV2Deleted(rbg)
})

ginkgo.XIt("standalone pattern with RecreatePod should always recreate pod even for image-only update", func() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test case is disabled using ginkgo.XIt. If the functionality is implemented and ready for verification, please enable the test by changing it to ginkgo.It. If there are known issues or the feature is incomplete, it's helpful to add a comment or a TODO explaining why it is currently disabled. This applies to other occurrences of XIt in this file as well (lines 209, 262, 475).

Comment on lines +299 to +300
// Wait for some time to allow reconciliation
time.Sleep(5 * time.Second)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The time.Sleep(5 * time.Second) here appears to be redundant. The subsequent gomega.Consistently block (line 303) already polls for 10 seconds to verify that the pod remains unchanged. If the controller were to perform an update, Consistently would detect it regardless of this initial sleep. Removing unnecessary sleeps helps keep the E2E test suite fast and reliable.

type patch struct {
Metadata metadata `json:"metadata"`
}
b, _ := json.Marshal(patch{Metadata: metadata{Labels: labels}})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error from json.Marshal is ignored. While it's unlikely to fail with this specific struct, it's better practice to handle the error to avoid silent failures if the struct definition changes in the future.

	b, err := json.Marshal(patch{Metadata: metadata{Labels: labels}})
	if err != nil {
		return &runtime.RawExtension{}
	}

@Syspretor Syspretor force-pushed the chore/add-e2e-case-for-inplace-update branch from c280a6d to d78abee Compare April 22, 2026 09:13
Signed-off-by: 玖宇 <guotongyu.gty@alibaba-inc.com>
@Syspretor Syspretor force-pushed the chore/add-e2e-case-for-inplace-update branch from d78abee to 50b836e Compare April 22, 2026 09:28

@cheyang cheyang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding comprehensive e2e coverage for in-place update. The test structure is solid and follows existing patterns well.

Two things need attention before this can merge:

1. Merge conflict — the PR currently conflicts with the base branch. Please rebase.

2. CI failure (not flaky) — the leaderWorker pattern with multiple replicas should perform inplace update test times out at 150s (line ~567 in inplace_update.go). The in-place update for a 2-replica LeaderWorkerSet with size=2 (4 pods total) does not converge. This looks like a real bug in the test expectation or controller behavior rather than a timing fluke. Worth investigating whether the controller correctly handles in-place updates across multiple replicas of a LeaderWorkerSet.

A few minor observations below via inline comments.

testutils "sigs.k8s.io/rbgs/test/utils"
wrappersv2 "sigs.k8s.io/rbgs/test/wrappers/v1alpha2"
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment: this says newTestImage returns a different image but it's a const, not a function. Trivial nit — just remove or fix the comment.

}
}
return true
}, testutils.Timeout, testutils.Interval).Should(gomega.BeTrue())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test that times out in CI. With 2 replicas x size 2 = 4 pods, the Eventually block waits for all pods to retain UIDs and update images. Please check whether the controller actually completes the in-place update for all pods in this scenario, or if there's a reconciliation issue that leaves some pods stuck.

b, err := json.Marshal(patch{Metadata: metadata{Labels: labels}})
if err != nil {
return &runtime.RawExtension{}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silently returning an empty RawExtension on marshal failure will produce confusing test results. Consider using gomega.Expect(err).ShouldNot(gomega.HaveOccurred()) or ginkgo.GinkgoT().Fatal(err) to surface the error.

// Get initial pods (2 groups * 2 pods each = 4 pods)
podList := &corev1.PodList{}
gomega.Eventually(func() bool {
err := f.Client.List(f.Ctx, podList, client.InNamespace(f.Namespace),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test timed out in CI after 150s. The CI logs show that the in-place image update succeeds (SuccessfulUpdatePodInPlace), but then the default RecreateInstanceOnPodRestart restart policy triggers pod deletion and recreation with new UIDs. The assertion that all pod UIDs remain unchanged then never passes.

You likely need to either:

  1. Set a restart policy on the LeaderWorker role that does not recreate instances on container restart, or
  2. Adjust the assertion to account for the fact that in-place updates combined with RecreateInstanceOnPodRestart will produce new pod UIDs.

The same issue may affect the single-replica leaderWorker inplace tests that currently pass - they might be passing only due to timing differences with fewer pods.

"rbg.workloads.x-k8s.io/role-name": "role-1",
})
return err == nil && len(podList.Items) == 3
}, testutils.Timeout, testutils.Interval).Should(gomega.BeTrue())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is titled leaderWorker pattern with InPlaceIfPossible should recreate pods when updating leader/worker patches, but the test body only updates the base template image and then asserts that all pod UIDs remain the same (lines 505-514) - i.e., it verifies in-place update, not recreation.

If the intent is to test recreation when patches change, the test should modify LeaderTemplatePatch or WorkerTemplatePatch with non-image fields and assert UIDs change. If the intent is to verify that in-place update still works when leader/worker patches are present, rename the test to match that behavior.

rbg.Spec.Roles[0].StandalonePattern.Template.Spec.Containers[0].Image = updatedImage
rbg.Spec.Roles[0].StandalonePattern.Template.Spec.Containers[0].Env = []corev1.EnvVar{
{Name: "TEST_ENV", Value: "test"},
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both grace period tests (standalone at line 289 and leaderWorker at line 359) set GracePeriodSeconds: 5 but only verify the final state - image updated and pod UID unchanged. They never assert that the update was delayed by the grace period.

Consider adding a Consistently check right after the update to confirm the pod image has NOT changed yet during the grace window, before the Eventually check that confirms the update eventually completes. Without this, these tests are functionally identical to the basic inplace update tests.

gomega.Expect(f.Client.Create(f.Ctx, rbg)).Should(gomega.Succeed())
f.ExpectRbgV2Equal(rbg)

// Get initial pod

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buildRawExtensionPatch here is essentially a duplicate of buildRawPatch in test/wrappers/v1alpha2/role_wrapper.go. Consider reusing the existing helper or extracting both into a shared test utility package.

testutils "sigs.k8s.io/rbgs/test/utils"
wrappersv2 "sigs.k8s.io/rbgs/test/wrappers/v1alpha2"
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment // newTestImage returns a different image for testing updates reads like a function doc comment, but decorates a const declaration. Something like // updatedImage is a different image tag used for in-place update test scenarios would be clearer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants