feat: e2e test framework improvements for local + CI execution #41

rakechill wants to merge 1 commit into cluster-autoscaler-release-1.31.5-aks
Conversation
- Reorganize tests into suites/ structure (Karpenter pattern)
- Extract shared environment into pkg/environment/
- Add local dev workflow: make setup-cluster, deploy-local, e2etests
- Helm deploy is optional (CI sets image flags, local skips via skaffold)
- Add e2e-friendly CAS args to dev deploy template
- Add settings.json configmap for fork dynamic config
- Add .devcontainer with skaffold, ko, az CLI, kubectl, helm
- Add .agents/skills/run-azure-e2e-tests skill

Tested: make e2etests TEST_SUITE=scaleup passes locally (~12 min)
Pull request overview
Improves the Azure CAS e2e testing framework to support both local developer runs and CI (Prow) by reorganizing tests into suites and introducing a shared environment layer with optional Helm-based deployment.
Changes:
- Added a shared `Environment` helper providing K8s + Azure VMSS clients and optional Helm deploy/update for CI.
- Moved the scale-up/down e2e test into `test/suites/` and updated it to use the shared environment.
- Expanded developer tooling: new local Makefile targets, dynamic config ConfigMap + volume mount for the AKS fork, plus a devcontainer and an agent skill doc.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go` | Converts the scale-up suite to the new suites structure and wires it to the shared environment + optional Helm deploy. |
| `cluster-autoscaler/cloudprovider/azure/test/pkg/environment/environment.go` | Introduces shared environment setup (K8s + VMSS) and Helm deployment helpers used by suites. |
| `cluster-autoscaler/cloudprovider/azure/test/e2e/e2e_suite_test.go` | Removes the old monolithic e2e suite implementation. |
| `cluster-autoscaler/cloudprovider/azure/test/Makefile` | Adds local dev targets and updates CI invocation to run suite-based tests. |
| `cluster-autoscaler/cloudprovider/azure/examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl` | Adds e2e-friendly CAS args and mounts the dynamic nodegroup config ConfigMap. |
| `cluster-autoscaler/cloudprovider/azure/examples/dev/aks-dev-deploy.sh` | Creates the autoscaler-settings ConfigMap containing settings.json for dynamic nodegroup config. |
| `.devcontainer/devcontainer.json` | Adds a devcontainer definition for local/devcontainer-based e2e work. |
| `.agents/skills/run-azure-e2e-tests/SKILL.md` | Documents local/CI workflows for running and adding Azure e2e test suites. |
Comments suppressed due to low confidence (1)
cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go:81
- Helm deploy is only enabled when both `-cas-image-repository` and `-cas-image-tag` are non-empty. If only one is provided (misconfigured CI/local run), Helm is silently skipped and tests will run assuming CAS is already deployed, which is hard to diagnose. Consider failing fast when exactly one of these flags is set (or when Helm is enabled but required flags like `-cluster-name`/`-client-id` are empty).
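A minimal sketch of the fail-fast check the comment suggests, assuming the two flag values are available as strings (`checkHelmFlags` is a hypothetical helper name, not code from this PR):

```go
package main

import "fmt"

// checkHelmFlags (hypothetical) errors out when exactly one of the two Helm
// flags is set, instead of silently skipping the Helm deploy. XOR on
// emptiness: valid states are "both set" (deploy) or "neither set" (skip).
func checkHelmFlags(repo, tag string) error {
	if (repo == "") != (tag == "") {
		return fmt.Errorf("set both -cas-image-repository and -cas-image-tag, or neither (got repo=%q, tag=%q)", repo, tag)
	}
	return nil
}

func main() {
	fmt.Println(checkHelmFlags("example.azurecr.io/cas", "")) // error: only repo set
	fmt.Println(checkHelmFlags("", ""))                       // nil: Helm intentionally skipped
}
```

In a Ginkgo suite this would typically run in `BeforeSuite` wrapped in `Expect(err).NotTo(HaveOccurred())` so a misconfigured run fails immediately with an actionable message.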
```make
	--output-dir "$(ARTIFACTS)" --junit-report="junit.e2e_suite.1.xml" \
	./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
	-resource-group="$$(KUBECONFIG= kubectl get managedclusters -n default -o jsonpath='{.items[0].status.nodeResourceGroup}')" \
```
TEST_SUITE defaults to ..., but the computed package path becomes ./suites/.../..., which likely matches no packages. This makes the default (run all suites) behavior break. Consider special-casing TEST_SUITE=... to use ./suites/... (single ellipsis) or building the suite path with a conditional so the default runs all suites.
```make
	$(if $(LABEL_FILTER),--label-filter="$(LABEL_FILTER)",) \
	./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
	-resource-group="$(AZURE_RESOURCE_GROUP)"
```
Same issue as CI target: when TEST_SUITE is left at the default ..., this expands to ./suites/.../... and the command likely won't run any packages. Adjust the suite path construction so the default is exactly ./suites/....
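One way to build that conditional, sketched here as an untested GNU make fragment (the `SUITE_PATH` variable name is an assumption, not from the PR):

```make
# Sketch: special-case the TEST_SUITE=... default so the package path is
# ./suites/... rather than the invalid ./suites/.../...
SUITE_PATH = $(if $(filter ...,$(TEST_SUITE)),./suites/...,./suites/$(shell echo $(TEST_SUITE) | tr A-Z a-z)/...)
```

Both the CI and local targets could then pass `$(SUITE_PATH)` instead of constructing the path inline.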
```go
nodes := &corev1.NodeList{}
g.Expect(env.K8s.List(env.Ctx, nodes)).To(Succeed())
g.Expect(nodes.Items).To(SatisfyAll(
	HaveLen(expectedNodes),
	ContainElements(Satisfy(func(node corev1.Node) bool {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				return true
			}
		}
		return false
	})),
))
```
AllVMSSStable intends to assert that the number of Ready nodes matches total VMSS capacity, but the matcher ContainElements(Satisfy(...NodeReady...)) only requires that at least one node is Ready. This can pass even when most nodes are NotReady. Consider asserting readiness for every node (e.g., HaveEach(Satisfy(...)) or an explicit loop over nodes.Items).
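A stdlib-only Go sketch of the difference, using simplified stand-in types rather than `corev1.Node` or the suite's gomega matchers: `anyReady` mirrors the `ContainElements(Satisfy(...))` semantics (at least one match), while `allReady` mirrors the stricter `HaveEach(Satisfy(...))` semantics the comment recommends.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Node / NodeCondition, for illustration only.
type condition struct{ Type, Status string }
type node struct{ Conditions []condition }

// isReady mirrors the suite's Satisfy predicate: a node is Ready when it has
// a Ready condition with status True.
func isReady(n node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

// anyReady is the ContainElements-style check: passes with one Ready node.
func anyReady(nodes []node) bool {
	for _, n := range nodes {
		if isReady(n) {
			return true
		}
	}
	return false
}

// allReady is the HaveEach-style check: every node must be Ready.
func allReady(nodes []node) bool {
	for _, n := range nodes {
		if !isReady(n) {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{Conditions: []condition{{Type: "Ready", Status: "True"}}},
		{Conditions: []condition{{Type: "Ready", Status: "False"}}},
	}
	fmt.Println(anyReady(nodes)) // true: the current matcher would pass
	fmt.Println(allReady(nodes)) // false: the stricter check catches the NotReady node
}
```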
```go
// NewEnvironment creates a fully initialized Environment.
func NewEnvironment(resourceGroup string, helm *HelmConfig) *Environment {
	env := &Environment{
		Ctx:            context.Background(),
		ResourceGroup:  resourceGroup,
		SubscriptionID: os.Getenv("AZURE_SUBSCRIPTION_ID"),
		TenantID:       os.Getenv("AZURE_TENANT_ID"),
		Helm:           helm,
	}

	azCred, err := azidentity.NewDefaultAzureCredential(nil)
	Expect(err).NotTo(HaveOccurred())

	env.VMSS, err = armcompute.NewVirtualMachineScaleSetsClient(env.SubscriptionID, azCred, nil)
	Expect(err).NotTo(HaveOccurred())

	restConfig, err := config.GetConfig()
	Expect(err).NotTo(HaveOccurred())
	env.K8s, err = client.New(restConfig, client.Options{})
```
NewEnvironment/EnsureHelmRelease rely on resourceGroup, AZURE_SUBSCRIPTION_ID, and (for Helm) AZURE_TENANT_ID being set, but there’s no early validation. If any are empty, the test will tend to fail later (or retry for minutes in Eventually(AllVMSSStable, ...)) with less actionable errors. Consider adding explicit Expect(resourceGroup).NotTo(BeEmpty()), and when Helm is enabled also require subscription/tenant IDs (and helm-required flags like clusterName/clientID) with clear failure messages.
I have PR #9410 upstream to add a devcontainer to CAS; we should make sure we're making compatible changes.
Mine omits some Azure-specific stuff to keep it relevant across all the providers; I was planning to add those in our fork.
This file will assume a resource group (`cluster-autoscaler-test` if not running in a devcontainer), but later steps will fail if you don't have `AZURE_RESOURCE_GROUP` set explicitly.
I suggest modifying this one to require `AZURE_RESOURCE_GROUP` to be defined, and to fail quickly with a good diagnostic if it isn't.
```make
##@ Cluster Setup (one-time)

.PHONY: setup-cluster
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
```
I like to add breadcrumbs to Makefiles (and Taskfiles) so that it's easier to see what step is executing at any one time. This helps with troubleshooting issues as the observer (whether human or AI) can go straight to the failing step.
Suggested change:

```make
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
	@echo "-- creating cluster --"
```
```make
##@ Build & Deploy

.PHONY: deploy-local
deploy-local: ## Build CAS and deploy to cluster via skaffold
```
Suggested change:

```make
deploy-local: ## Build CAS and deploy to cluster via skaffold
	@echo "-- Deploying CAS to cluster --"
```
```make
##@ E2E Testing

.PHONY: e2etests
e2etests: ## Run e2e tests (CAS must already be deployed)
```
Suggested change:

```make
e2etests: ## Run e2e tests (CAS must already be deployed)
	@echo "-- Running E2E test --"
```
What type of PR is this?
/kind test
What this PR does / why we need it:
Reorganizes the Azure CAS e2e test infrastructure to support both local developer execution and CI (Prow), following patterns from karpenter-provider-azure.
- `test/pkg/environment/environment.go` — New: `Environment` struct with K8s + VMSS clients, optional Helm deploy
- `test/suites/scaleup/suite_test.go` — Moved from `test/e2e/`: same scale-up/down test, uses shared environment
- `test/e2e/` — Removed: replaced by `suites/` + `pkg/environment/`
- `test/Makefile` — New local targets (`e2etests`, `setup-cluster`, `deploy-local`), CI target preserved
- `examples/dev/aks-dev-deploy.sh` — Creates the `settings.json` configmap for fork dynamic config
- `examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl` — Adds e2e-friendly CAS args and mounts the dynamic nodegroup config ConfigMap
- `.devcontainer/devcontainer.json` — Adds a devcontainer for local e2e work
- `.agents/skills/run-azure-e2e-tests/SKILL.md` — Documents local/CI workflows for running and adding e2e suites

Design: Helm behind a flag (CI + local compatible)

- CI (`make test-e2e`): passes `-cas-image-repository` and `-cas-image-tag` flags → test deploys CAS via Helm in `BeforeSuite` → backward compatible with existing Prow job
- Local (`make e2etests`): no image flags → Helm is skipped → CAS is managed externally via skaffold → no `AZURE_CLIENT_ID` env var conflict

Special notes for your reviewer:
Local developer flow: