feat: e2e test framework improvements for local + CI execution #41

rakechill wants to merge 1 commit into cluster-autoscaler-release-1.31.5-aks
Conversation
- Reorganize tests into suites/ structure (Karpenter pattern)
- Extract shared environment into pkg/environment/
- Add local dev workflow: make setup-cluster, deploy-local, e2etests
- Helm deploy is optional (CI sets image flags, local skips via skaffold)
- Add e2e-friendly CAS args to dev deploy template
- Add settings.json configmap for fork dynamic config
- Add .devcontainer with skaffold, ko, az CLI, kubectl, helm
- Add .agents/skills/run-azure-e2e-tests skill

Tested: make e2etests TEST_SUITE=scaleup passes locally (~12 min)
Pull request overview
Improves the Azure CAS e2e testing framework to support both local developer runs and CI (Prow) by reorganizing tests into suites and introducing a shared environment layer with optional Helm-based deployment.
Changes:
- Added a shared `Environment` helper providing K8s + Azure VMSS clients and optional Helm deploy/update for CI.
- Moved the scale-up/down e2e test into `test/suites/` and updated it to use the shared environment.
- Expanded developer tooling: new local Makefile targets, dynamic config ConfigMap + volume mount for the AKS fork, plus a devcontainer and an agent skill doc.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go` | Converts the scale-up suite to the new suites structure and wires it to the shared environment + optional Helm deploy. |
| `cluster-autoscaler/cloudprovider/azure/test/pkg/environment/environment.go` | Introduces shared environment setup (K8s + VMSS) and Helm deployment helpers used by suites. |
| `cluster-autoscaler/cloudprovider/azure/test/e2e/e2e_suite_test.go` | Removes the old monolithic e2e suite implementation. |
| `cluster-autoscaler/cloudprovider/azure/test/Makefile` | Adds local dev targets and updates CI invocation to run suite-based tests. |
| `cluster-autoscaler/cloudprovider/azure/examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl` | Adds e2e-friendly CAS args and mounts the dynamic nodegroup config ConfigMap. |
| `cluster-autoscaler/cloudprovider/azure/examples/dev/aks-dev-deploy.sh` | Creates the autoscaler-settings ConfigMap containing settings.json for dynamic nodegroup config. |
| `.devcontainer/devcontainer.json` | Adds a devcontainer definition for local/devcontainer-based e2e work. |
| `.agents/skills/run-azure-e2e-tests/SKILL.md` | Documents local/CI workflows for running and adding Azure e2e test suites. |
Comments suppressed due to low confidence (1)
cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go:81
- Helm deploy is only enabled when both `-cas-image-repository` and `-cas-image-tag` are non-empty. If only one is provided (misconfigured CI/local run), Helm is silently skipped and tests will run assuming CAS is already deployed, which is hard to diagnose. Consider failing fast when exactly one of these flags is set (or when Helm is enabled but required flags like `-cluster-name`/`-client-id` are empty).
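A minimal sketch of the fail-fast check the comment suggests, assuming the two flag values are available as strings (`checkHelmFlags` is a hypothetical helper name, not code from this PR):

```go
package main

import "fmt"

// checkHelmFlags (hypothetical) errors out when exactly one of the two Helm
// flags is set, instead of silently skipping the Helm deploy. XOR on
// emptiness: valid states are "both set" (deploy) or "neither set" (skip).
func checkHelmFlags(repo, tag string) error {
	if (repo == "") != (tag == "") {
		return fmt.Errorf("set both -cas-image-repository and -cas-image-tag, or neither (got repo=%q, tag=%q)", repo, tag)
	}
	return nil
}

func main() {
	fmt.Println(checkHelmFlags("example.azurecr.io/cas", "")) // error: only repo set
	fmt.Println(checkHelmFlags("", ""))                       // nil: Helm intentionally skipped
}
```

In a Ginkgo suite this would typically run in `BeforeSuite` wrapped in `Expect(err).NotTo(HaveOccurred())` so a misconfigured run fails immediately with an actionable message.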
```make
	--output-dir "$(ARTIFACTS)" --junit-report="junit.e2e_suite.1.xml" \
	./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
	-resource-group="$$(KUBECONFIG= kubectl get managedclusters -n default -o jsonpath='{.items[0].status.nodeResourceGroup}')" \
```
TEST_SUITE defaults to ..., but the computed package path becomes ./suites/.../..., which likely matches no packages. This makes the default (run all suites) behavior break. Consider special-casing TEST_SUITE=... to use ./suites/... (single ellipsis) or building the suite path with a conditional so the default runs all suites.
```make
	$(if $(LABEL_FILTER),--label-filter="$(LABEL_FILTER)",) \
	./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
	-resource-group="$(AZURE_RESOURCE_GROUP)"
```
Same issue as CI target: when TEST_SUITE is left at the default ..., this expands to ./suites/.../... and the command likely won't run any packages. Adjust the suite path construction so the default is exactly ./suites/....
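One way to build that conditional, sketched here as an untested GNU make fragment (the `SUITE_PATH` variable name is an assumption, not from the PR):

```make
# Sketch: special-case the TEST_SUITE=... default so the package path is
# ./suites/... rather than the invalid ./suites/.../...
SUITE_PATH = $(if $(filter ...,$(TEST_SUITE)),./suites/...,./suites/$(shell echo $(TEST_SUITE) | tr A-Z a-z)/...)
```

Both the CI and local targets could then pass `$(SUITE_PATH)` instead of constructing the path inline.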
```go
nodes := &corev1.NodeList{}
g.Expect(env.K8s.List(env.Ctx, nodes)).To(Succeed())
g.Expect(nodes.Items).To(SatisfyAll(
	HaveLen(expectedNodes),
	ContainElements(Satisfy(func(node corev1.Node) bool {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				return true
			}
		}
		return false
	})),
))
```
AllVMSSStable intends to assert that the number of Ready nodes matches total VMSS capacity, but the matcher ContainElements(Satisfy(...NodeReady...)) only requires that at least one node is Ready. This can pass even when most nodes are NotReady. Consider asserting readiness for every node (e.g., HaveEach(Satisfy(...)) or an explicit loop over nodes.Items).
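A stdlib-only Go sketch of the difference, using simplified stand-in types rather than `corev1.Node` or the suite's gomega matchers: `anyReady` mirrors the `ContainElements(Satisfy(...))` semantics (at least one match), while `allReady` mirrors the stricter `HaveEach(Satisfy(...))` semantics the comment recommends.

```go
package main

import "fmt"

// Simplified stand-ins for corev1.Node / NodeCondition, for illustration only.
type condition struct{ Type, Status string }
type node struct{ Conditions []condition }

// isReady mirrors the suite's Satisfy predicate: a node is Ready when it has
// a Ready condition with status True.
func isReady(n node) bool {
	for _, c := range n.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

// anyReady is the ContainElements-style check: passes with one Ready node.
func anyReady(nodes []node) bool {
	for _, n := range nodes {
		if isReady(n) {
			return true
		}
	}
	return false
}

// allReady is the HaveEach-style check: every node must be Ready.
func allReady(nodes []node) bool {
	for _, n := range nodes {
		if !isReady(n) {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{Conditions: []condition{{Type: "Ready", Status: "True"}}},
		{Conditions: []condition{{Type: "Ready", Status: "False"}}},
	}
	fmt.Println(anyReady(nodes)) // true: the current matcher would pass
	fmt.Println(allReady(nodes)) // false: the stricter check catches the NotReady node
}
```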
```go
// NewEnvironment creates a fully initialized Environment.
func NewEnvironment(resourceGroup string, helm *HelmConfig) *Environment {
	env := &Environment{
		Ctx:            context.Background(),
		ResourceGroup:  resourceGroup,
		SubscriptionID: os.Getenv("AZURE_SUBSCRIPTION_ID"),
		TenantID:       os.Getenv("AZURE_TENANT_ID"),
		Helm:           helm,
	}

	azCred, err := azidentity.NewDefaultAzureCredential(nil)
	Expect(err).NotTo(HaveOccurred())

	env.VMSS, err = armcompute.NewVirtualMachineScaleSetsClient(env.SubscriptionID, azCred, nil)
	Expect(err).NotTo(HaveOccurred())

	restConfig, err := config.GetConfig()
	Expect(err).NotTo(HaveOccurred())
	env.K8s, err = client.New(restConfig, client.Options{})
```
NewEnvironment/EnsureHelmRelease rely on resourceGroup, AZURE_SUBSCRIPTION_ID, and (for Helm) AZURE_TENANT_ID being set, but there’s no early validation. If any are empty, the test will tend to fail later (or retry for minutes in Eventually(AllVMSSStable, ...)) with less actionable errors. Consider adding explicit Expect(resourceGroup).NotTo(BeEmpty()), and when Helm is enabled also require subscription/tenant IDs (and helm-required flags like clusterName/clientID) with clear failure messages.
I have PR #9410 upstream to add a devcontainer to CAS; we should make sure we're making compatible changes.
Mine omits some Azure-specific stuff to keep it relevant across all the providers; I was planning to add those in our fork.
This file will assume a resource group (`cluster-autoscaler-test` if not running in a devcontainer), but later steps will fail if you don't have `AZURE_RESOURCE_GROUP` set explicitly.
I suggest modifying this one to require `AZURE_RESOURCE_GROUP` to be defined, and to fail quickly with a good diagnostic if it isn't.
```make
##@ Cluster Setup (one-time)

.PHONY: setup-cluster
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
```
I like to add breadcrumbs to Makefiles (and Taskfiles) so that it's easier to see what step is executing at any one time. This helps with troubleshooting issues as the observer (whether human or AI) can go straight to the failing step.
Suggested change:

```make
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
	@echo "-- creating cluster --"
```
```make
##@ Build & Deploy

.PHONY: deploy-local
deploy-local: ## Build CAS and deploy to cluster via skaffold
```
Suggested change:

```make
deploy-local: ## Build CAS and deploy to cluster via skaffold
	@echo "-- Deploying CAS to cluster --"
```
```make
##@ E2E Testing

.PHONY: e2etests
e2etests: ## Run e2e tests (CAS must already be deployed)
```
Suggested change:

```make
e2etests: ## Run e2e tests (CAS must already be deployed)
	@echo "-- Running E2E test --"
```
What type of PR is this?
/kind test
What this PR does / why we need it:
Reorganizes the Azure CAS e2e test infrastructure to support both local developer execution and CI (Prow), following patterns from karpenter-provider-azure.
- `test/pkg/environment/environment.go` — New: `Environment` struct with K8s + VMSS clients, optional Helm deploy
- `test/suites/scaleup/suite_test.go` — Moved from `test/e2e/`: same scale-up/down test, uses shared environment
- `test/e2e/` — Removed: replaced by `suites/` + `pkg/environment/`
- `test/Makefile` — New local targets (`e2etests`, `setup-cluster`, `deploy-local`), CI target preserved
- `examples/dev/aks-dev-deploy.sh` — Creates the `settings.json` configmap for fork dynamic config
- `examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl` — Adds e2e-friendly CAS args and mounts the dynamic nodegroup config ConfigMap
- `.devcontainer/devcontainer.json` — Adds a devcontainer for local e2e work
- `.agents/skills/run-azure-e2e-tests/SKILL.md` — Documents local/CI workflows for running and adding e2e suites

Design: Helm behind a flag (CI + local compatible)

- CI (`make test-e2e`): passes `-cas-image-repository` and `-cas-image-tag` flags → test deploys CAS via Helm in `BeforeSuite` → backward compatible with existing Prow job
- Local (`make e2etests`): no image flags → Helm is skipped → CAS is managed externally via skaffold → no `AZURE_CLIENT_ID` env var conflict

Special notes for your reviewer:
Local developer flow: