Skip to content

feat: e2e test framework improvements for local + CI execution#41

Open
rakechill wants to merge 1 commit intocluster-autoscaler-release-1.31.5-aksfrom
e2e-framework-improvements
Open

feat: e2e test framework improvements for local + CI execution#41
rakechill wants to merge 1 commit intocluster-autoscaler-release-1.31.5-aksfrom
e2e-framework-improvements

Conversation

@rakechill
Copy link
Copy Markdown

What type of PR is this?

/kind test

What this PR does / why we need it:

Reorganizes the Azure CAS e2e test infrastructure to support both local developer execution and CI (Prow), following patterns from karpenter-provider-azure.

File Change
test/pkg/environment/environment.go New — Shared Environment struct with K8s + VMSS clients, optional Helm deploy
test/suites/scaleup/suite_test.go Moved from test/e2e/ — Same scale-up/down test, uses shared environment
test/e2e/ Deleted — Replaced by suites/ + pkg/environment/
test/Makefile Updated — Local targets (e2etests, setup-cluster, deploy-local), CI target preserved
examples/dev/aks-dev-deploy.sh Updated — Creates settings.json configmap for fork dynamic config
examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl Updated — E2e-friendly CAS args + configmap volume mount
.devcontainer/devcontainer.json New — Dev container with skaffold, ko, az CLI, kubectl, helm
.agents/skills/run-azure-e2e-tests/SKILL.md New — Agent skill documenting how to run/add e2e tests

Design: Helm behind a flag (CI + local compatible)

  • CI path (make test-e2e): Passes -cas-image-repository and -cas-image-tag flags → test deploys CAS via Helm in BeforeSuite → backward compatible with existing Prow job
  • Local path (make e2etests): No image flags → Helm is skipped → CAS is managed externally via skaffold → no AZURE_CLIENT_ID env var conflict

Special notes for your reviewer:

Local developer flow:

cd cluster-autoscaler/cloudprovider/azure/test
make setup-cluster      # One-time: creates AKS + ACR + workload identity
make deploy-local       # Builds + deploys CAS via skaffold
make e2etests TEST_SUITE=scaleup   # Runs tests (~12 min)

#### Does this PR introduce a user-facing change?
```release-note
NONE

- Reorganize tests into suites/ structure (Karpenter pattern)
- Extract shared environment into pkg/environment/
- Add local dev workflow: make setup-cluster, deploy-local, e2etests
- Helm deploy is optional (CI sets image flags, local skips via skaffold)
- Add e2e-friendly CAS args to dev deploy template
- Add settings.json configmap for fork dynamic config
- Add .devcontainer with skaffold, ko, az CLI, kubectl, helm
- Add .agents/skills/run-azure-e2e-tests skill

Tested: make e2etests TEST_SUITE=scaleup passes locally (~12 min)
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the Azure CAS e2e testing framework to support both local developer runs and CI (Prow) by reorganizing tests into suites and introducing a shared environment layer with optional Helm-based deployment.

Changes:

  • Added a shared Environment helper providing K8s + Azure VMSS clients and optional Helm deploy/update for CI.
  • Moved the scale-up/down e2e test into test/suites/ and updated it to use the shared environment.
  • Expanded developer tooling: new local Makefile targets, dynamic config ConfigMap + volume mount for the AKS fork, plus a devcontainer and an agent skill doc.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go Converts the scale-up suite to the new suites structure and wires it to the shared environment + optional Helm deploy.
cluster-autoscaler/cloudprovider/azure/test/pkg/environment/environment.go Introduces shared environment setup (K8s + VMSS) and Helm deployment helpers used by suites.
cluster-autoscaler/cloudprovider/azure/test/e2e/e2e_suite_test.go Removes the old monolithic e2e suite implementation.
cluster-autoscaler/cloudprovider/azure/test/Makefile Adds local dev targets and updates CI invocation to run suite-based tests.
cluster-autoscaler/cloudprovider/azure/examples/dev/cluster-autoscaler-vmss-wi-dynamic.yaml.tpl Adds e2e-friendly CAS args and mounts the dynamic nodegroup config ConfigMap.
cluster-autoscaler/cloudprovider/azure/examples/dev/aks-dev-deploy.sh Creates the autoscaler-settings ConfigMap containing settings.json for dynamic nodegroup config.
.devcontainer/devcontainer.json Adds a devcontainer definition for local/devcontainer-based e2e work.
.agents/skills/run-azure-e2e-tests/SKILL.md Documents local/CI workflows for running and adding Azure e2e test suites.
Comments suppressed due to low confidence (1)

cluster-autoscaler/cloudprovider/azure/test/suites/scaleup/suite_test.go:81

  • Helm deploy is only enabled when both -cas-image-repository and -cas-image-tag are non-empty. If only one is provided (misconfigured CI/local run), Helm is silently skipped and tests will run assuming CAS is already deployed, which is hard to diagnose. Consider failing fast when exactly one of these flags is set (or when Helm is enabled but required flags like -cluster-name/-client-id are empty).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +33 to 35
--output-dir "$(ARTIFACTS)" --junit-report="junit.e2e_suite.1.xml" \
./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
-resource-group="$$(KUBECONFIG= kubectl get managedclusters -n default -o jsonpath='{.items[0].status.nodeResourceGroup}')" \
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST_SUITE defaults to ..., but the computed package path becomes ./suites/.../..., which likely matches no packages. This makes the default (run all suites) behavior break. Consider special-casing TEST_SUITE=... to use ./suites/... (single ellipsis) or building the suite path with a conditional so the default runs all suites.

Copilot uses AI. Check for mistakes.
Comment on lines +87 to +89
$(if $(LABEL_FILTER),--label-filter="$(LABEL_FILTER)",) \
./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
-resource-group="$(AZURE_RESOURCE_GROUP)"
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as CI target: when TEST_SUITE is left at the default ..., this expands to ./suites/.../... and the command likely won't run any packages. Adjust the suite path construction so the default is exactly ./suites/....

Copilot uses AI. Check for mistakes.
Comment on lines +193 to +205
nodes := &corev1.NodeList{}
g.Expect(env.K8s.List(env.Ctx, nodes)).To(Succeed())
g.Expect(nodes.Items).To(SatisfyAll(
HaveLen(expectedNodes),
ContainElements(Satisfy(func(node corev1.Node) bool {
for _, cond := range node.Status.Conditions {
if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
return true
}
}
return false
})),
))
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllVMSSStable intends to assert that the number of Ready nodes matches total VMSS capacity, but the matcher ContainElements(Satisfy(...NodeReady...)) only requires that at least one node is Ready. This can pass even when most nodes are NotReady. Consider asserting readiness for every node (e.g., HaveEach(Satisfy(...)) or an explicit loop over nodes.Items).

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +93
// NewEnvironment creates a fully initialized Environment.
func NewEnvironment(resourceGroup string, helm *HelmConfig) *Environment {
env := &Environment{
Ctx: context.Background(),
ResourceGroup: resourceGroup,
SubscriptionID: os.Getenv("AZURE_SUBSCRIPTION_ID"),
TenantID: os.Getenv("AZURE_TENANT_ID"),
Helm: helm,
}

azCred, err := azidentity.NewDefaultAzureCredential(nil)
Expect(err).NotTo(HaveOccurred())

env.VMSS, err = armcompute.NewVirtualMachineScaleSetsClient(env.SubscriptionID, azCred, nil)
Expect(err).NotTo(HaveOccurred())

restConfig, err := config.GetConfig()
Expect(err).NotTo(HaveOccurred())
env.K8s, err = client.New(restConfig, client.Options{})
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewEnvironment/EnsureHelmRelease rely on resourceGroup, AZURE_SUBSCRIPTION_ID, and (for Helm) AZURE_TENANT_ID being set, but there’s no early validation. If any are empty, the test will tend to fail later (or retry for minutes in Eventually(AllVMSSStable, ...)) with less actionable errors. Consider adding explicit Expect(resourceGroup).NotTo(BeEmpty()), and when Helm is enabled also require subscription/tenant IDs (and helm-required flags like clusterName/clientID) with clear failure messages.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have PR#9410 upstream to add a devcontainer to CAS, we should make sure we're making compatible changes.

Mine omits some Azure specific stuff to keep it relevant across all the providers, was planning to add those in our fork.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file will assume a resource group (cluster-autoscaler-test if not running in a devcontainer) but later steps will fail if you don't have AZURE_RESOURCE_GROUP set explicitly.

I suggest modifying this one to require AZURE_RESOURCE_GROUP to be defined, and for it to fail quickly with a good diagnostic if this happens.

##@ Cluster Setup (one-time)

.PHONY: setup-cluster
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
Copy link
Copy Markdown
Member

@theunrepentantgeek theunrepentantgeek Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to add breadcrumbs to Makefiles (and Taskfiles) so that it's easier to see what step is executing at any one time. This helps with troubleshooting issues as the observer (whether human or AI) can go straight to the failing step.

Suggested change
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
@echo "-- creating cluster --"

##@ Build & Deploy

.PHONY: deploy-local
deploy-local: ## Build CAS and deploy to cluster via skaffold
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deploy-local: ## Build CAS and deploy to cluster via skaffold
deploy-local: ## Build CAS and deploy to cluster via skaffold
@echo "-- Deploying CAS to cluster --"

##@ E2E Testing

.PHONY: e2etests
e2etests: ## Run e2e tests (CAS must already be deployed)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
e2etests: ## Run e2e tests (CAS must already be deployed)
e2etests: ## Run e2e tests (CAS must already be deployed)
@echo "-- Running E2E test --"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants