83 changes: 83 additions & 0 deletions .agents/skills/run-azure-e2e-tests/SKILL.md
@@ -0,0 +1,83 @@
---
name: run-azure-e2e-tests
description: 'Run Azure CAS end-to-end tests — per-suite execution with focus filtering, background execution, and local/CI workflows. Use when: running e2e tests, debugging test failures, adding new test suites.'
---

# E2E Tests for Azure CAS

## Test Structure

```
cluster-autoscaler/cloudprovider/azure/test/
├── suites/
│   └── scaleup/              # Scale-up/down test
│       └── suite_test.go
├── pkg/
│   └── environment/          # Shared Environment struct + helpers
│       └── environment.go
├── Makefile                  # Local + CI targets
└── go.mod
```

## Local Developer Workflow

From `cluster-autoscaler/cloudprovider/azure/test/`:

### First-time setup

```bash
az login
make setup-cluster # Creates AKS + ACR + workload identity (~5 min)
make deploy-local # Builds + deploys CAS via skaffold (~1 min)
```

### Running tests

```bash
export AZURE_SUBSCRIPTION_ID="$(az account show --query id -o tsv)"
export AZURE_RESOURCE_GROUP="MC_..." # Node resource group (printed by setup-cluster)

make e2etests # Run all suites
make e2etests TEST_SUITE=scaleup # Run single suite
make e2etests FOCUS="scales up" # Focus filter
```

### After code changes

```bash
make deploy-local # Rebuild + redeploy CAS
make e2etests TEST_SUITE=scaleup
```

### Utility commands

- `make list-suites` — list available test suites
- `make validate-env` — check required env vars
- `make deploy-local-dev` — skaffold watch mode (auto-redeploy on changes)

### Background execution (survives VPN drops)

```bash
nohup make e2etests TEST_SUITE=scaleup > e2e.log 2>&1 &
tail -f e2e.log
```
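A plain `nohup ... &` keeps the log but loses the exit status of the run. As a minimal sketch (the `run_detached` helper and the `e2e.status` file name are illustrative assumptions, not part of this PR), the status can be recorded next to the log:

```shell
# Hypothetical helper (not in the PR): run a command immune to hangups,
# capture its output in a log file, and write its exit status to a file.
run_detached() {
  log_file="$1"
  status_file="$2"
  shift 2
  rm -f "$status_file"
  # Inside the inner shell, $0 is the status file and "$@" is the command.
  nohup sh -c '"$@"; echo "$?" > "$0"' "$status_file" "$@" > "$log_file" 2>&1 &
}

# Illustrative usage:
#   run_detached e2e.log e2e.status make e2etests TEST_SUITE=scaleup
#   tail -f e2e.log
#   cat e2e.status   # exists only once the run has finished
```

The status file appears only after the command exits, so its presence also signals that the run is complete.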

## CI (Prow)

`make test-e2e` builds the CAS image and deploys via Helm (inside BeforeSuite), using cluster info from CAPZ. The Helm deploy is triggered by `-cas-image-repository` and `-cas-image-tag` flags — when absent (local path), Helm is skipped.

## Monitoring

- **Logs**: `tail -f e2e.log`
- **Cluster**: `kubectl get nodes,pods -w`
- **Events**: `kubectl get events -A --field-selector source=cluster-autoscaler --watch`
- **VMSS**: `az vmss list -g $AZURE_RESOURCE_GROUP -o table`
- **CAS logs**: `kubectl logs -n kube-system deploy/cluster-autoscaler -f`

## Adding a New Test Suite

1. Create `test/suites/<name>/suite_test.go`
2. Import `pkg/environment` for shared helpers
3. Register `-resource-group` flag in `init()`
4. Create `Environment` in `BeforeSuite`, call `EnsureHelmRelease(...)` for CI compatibility
5. Run: `make e2etests TEST_SUITE=<name>`
43 changes: 43 additions & 0 deletions .devcontainer/devcontainer.json
Review comment (Member):

I have PR #9410 upstream to add a devcontainer to CAS; we should make sure we're making compatible changes.

Mine omits some Azure-specific stuff to keep it relevant across all the providers; I was planning to add those in our fork.

@@ -0,0 +1,43 @@
{
  "name": "Azure CAS Dev",
  "image": "mcr.microsoft.com/devcontainers/go:1.22",
  "runArgs": ["--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined"],

  "customizations": {
    "vscode": {
      "settings": {
        "go.toolsManagement.checkForUpdates": "local",
        "go.useLanguageServer": true,
        "go.gopath": "/go",
        "chat.useAgentSkills": true
      },
      "extensions": [
        "golang.Go",
        "ms-kubernetes-tools.vscode-kubernetes-tools",
        "ms-kubernetes-tools.vscode-aks-tools",
        "ms-azuretools.vscode-bicep",
        "GitHub.vscode-pull-request-github",
        "GitHub.copilot-chat"
      ]
    }
  },

  "features": {
    "ghcr.io/devcontainers/features/docker-outside-of-docker:1": {},
    "ghcr.io/devcontainers/features/kubectl-helm-minikube:1": {
      "helm": "latest",
      "minikube": "none"
    },
    "ghcr.io/devcontainers/features/azure-cli:1": {},
    "ghcr.io/devcontainers/features/github-cli:1": {},
    "ghcr.io/rio/features/skaffold:2": {}
  },

  "postCreateCommand": {
    "install ko": "go install github.com/google/ko@latest",
    "install yq": "go install github.com/mikefarah/yq/v4@latest",
    "disable skaffold metrics": "skaffold config set --global collect-metrics false"
  },

  "remoteUser": "vscode"
}
Review comment (Member):

This file assumes a resource group (`cluster-autoscaler-test` if not running in a devcontainer), but later steps fail if `AZURE_RESOURCE_GROUP` isn't set explicitly.

I suggest modifying this one to require `AZURE_RESOURCE_GROUP` to be defined, and to fail quickly with a good diagnostic when it isn't.
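
One way the fail-fast check could look, as a minimal sketch: the `require_env` helper and its message wording are illustrative assumptions, not code from this PR. It verifies that each named variable is set and non-empty before any later step runs.

```shell
# Hypothetical helper (not in the PR): report every required environment
# variable that is unset or empty, and return non-zero if any is missing.
require_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "ERROR: required environment variable(s) not set:$missing" >&2
    echo "Set them (for example: export NAME=value) and re-run." >&2
    return 1
  fi
  return 0
}

# Illustrative usage near the top of the script:
#   require_env AZURE_RESOURCE_GROUP || exit 1
```

Placing the call before the first `az` or `kubectl` invocation means a missing variable is reported immediately rather than surfacing as a failure in a later step.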

@@ -69,6 +69,27 @@ yq '(.. | select(tag == "!!str")) |= envsubst(nu)' \
cluster-autoscaler-vmss-wi-dynamic.yaml.tpl > \
cluster-autoscaler-vmss-wi-dynamic.yaml

# create the dynamic node group config (required by the AKS fork)
kubectl apply -f - <<SETTINGS
apiVersion: v1
kind: ConfigMap
metadata:
  name: autoscaler-settings
  namespace: kube-system
data:
  settings.json: |
    {
      "nodeGroups": [
        {
          "name": "${VMSS_NAME}",
          "minSize": 1,
          "maxSize": 10,
          "scaleDownPolicy": "Delete"
        }
      ]
    }
SETTINGS

# skaffold dev/run/debug

exit
@@ -158,6 +158,11 @@ spec:
- --logtostderr=true
- --cloud-provider=azure
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --scale-down-delay-after-add=10s
- --scale-down-unneeded-time=10s
- --scale-down-candidates-pool-ratio=1.0
- --unremovable-node-recheck-timeout=10s
- --nodes=1:10:${VMSS_NAME}
env:
- name: ARM_SUBSCRIPTION_ID
@@ -193,6 +198,9 @@ spec:
- mountPath: /etc/ssl/certs/ca-certificates.crt
name: ssl-certs
readOnly: true
- mountPath: /opt/conf/autoscaler
name: autoscaler-settings
readOnly: true
# use system nodepool only
affinity:
nodeAffinity:
@@ -207,4 +215,7 @@ spec:
- hostPath:
path: /etc/ssl/certs/ca-certificates.crt
type: ""
name: ssl-certs
name: ssl-certs
- configMap:
name: autoscaler-settings
name: autoscaler-settings
89 changes: 85 additions & 4 deletions cluster-autoscaler/cloudprovider/azure/test/Makefile
@@ -1,5 +1,6 @@
REPO_ROOT:=$(shell git rev-parse --show-toplevel)
CAS_ROOT:=$(REPO_ROOT)/cluster-autoscaler
DEV_DIR:=$(CAS_ROOT)/cloudprovider/azure/examples/dev

BUILD_TAGS=azure

@@ -8,20 +9,100 @@ include $(CAS_ROOT)/Makefile
CLUSTER_AUTOSCALER_NAMESPACE?=default
CLUSTER_AUTOSCALER_SERVICEACCOUNT_NAME?=cluster-autoscaler

# TEST_SUITE selects a specific test suite directory (e.g., TEST_SUITE=scaleup).
# Default "..." runs all suites.
TEST_SUITE?=...
TEST_TIMEOUT?=3h
FOCUS?=
LABEL_FILTER?=
ARTIFACTS?=_artifacts

# --------------------------------------------------------------------------
# CI target (Prow — builds CAS image, discovers cluster info from CAPZ)
# --------------------------------------------------------------------------

.PHONY: build-e2e
build-e2e:
	$(MAKE) -C $(CAS_ROOT) build-arch-$(GOARCH) make-image-arch-$(GOARCH) BUILD_TAGS=${BUILD_TAGS}
	docker push $(IMAGE)-$(GOARCH):$(TAG)

ARTIFACTS?=_artifacts

.PHONY: test-e2e
test-e2e: build-e2e
go run github.com/onsi/ginkgo/v2/ginkgo --tags e2e -v --trace --output-dir "$(ARTIFACTS)" --junit-report="junit.e2e_suite.1.xml" e2e -- \
test-e2e: build-e2e ## CI: build image + run tests (Prow/CAPZ)
go run github.com/onsi/ginkgo/v2/ginkgo --tags e2e -v --trace \
--timeout $(TEST_TIMEOUT) \
--output-dir "$(ARTIFACTS)" --junit-report="junit.e2e_suite.1.xml" \
./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
-resource-group="$$(KUBECONFIG= kubectl get managedclusters -n default -o jsonpath='{.items[0].status.nodeResourceGroup}')" \
Review comment (Copilot AI, Mar 25, 2026), on lines +33 to 35:

`TEST_SUITE` defaults to `...`, but the computed package path then becomes `./suites/.../...`, which likely matches no packages, so the default "run all suites" behavior breaks. Consider special-casing `TEST_SUITE=...` to use `./suites/...` (single ellipsis), or building the suite path with a conditional so the default runs all suites.

-cluster-name="$$(KUBECONFIG= kubectl get cluster -n default -o jsonpath='{.items[0].metadata.name}')" \
-client-id="$$(KUBECONFIG= kubectl get userassignedidentities -n default -o jsonpath='{.items[0].status.clientId}')" \
-cas-namespace="$(CLUSTER_AUTOSCALER_NAMESPACE)" \
-cas-serviceaccount-name="$(CLUSTER_AUTOSCALER_SERVICEACCOUNT_NAME)" \
-cas-image-repository="$(IMAGE)-$(GOARCH)" \
-cas-image-tag="$(TAG)"

# --------------------------------------------------------------------------
# Local developer targets
# --------------------------------------------------------------------------
# Prerequisites:
# 1. az login
# 2. make setup-cluster (creates AKS + ACR + identity, one-time)
# 3. make deploy-local (builds + deploys CAS via skaffold)
# 4. make e2etests (runs the tests)
#
# Required env var: AZURE_RESOURCE_GROUP (set automatically by setup-cluster)

.PHONY: help
help: ## Display help
	@awk 'BEGIN {FS = ":.*##"; printf "Usage:\n make \033[36m<target>\033[0m\n"} \
		/^[a-zA-Z_0-9-]+:.*?##/ { printf " \033[36m%-25s\033[0m %s\n", $$1, $$2 } \
		/^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) }' $(MAKEFILE_LIST)

##@ Cluster Setup (one-time)

.PHONY: setup-cluster
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
Review comment (Member, @theunrepentantgeek, Mar 30, 2026):

I like to add breadcrumbs to Makefiles (and Taskfiles) so that it's easier to see what step is executing at any one time. This helps with troubleshooting issues as the observer (whether human or AI) can go straight to the failing step.

Suggested change
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
setup-cluster: ## Create AKS cluster + ACR + workload identity for e2e testing
@echo "-- creating cluster --"

	cd $(DEV_DIR) && bash ./aks-dev-deploy.sh

##@ Build & Deploy

.PHONY: deploy-local
deploy-local: ## Build CAS and deploy to cluster via skaffold
Review comment (Member):

Suggested change
deploy-local: ## Build CAS and deploy to cluster via skaffold
deploy-local: ## Build CAS and deploy to cluster via skaffold
@echo "-- Deploying CAS to cluster --"

	cd $(CAS_ROOT) && skaffold run --filename cloudprovider/azure/examples/dev/skaffold.yaml

.PHONY: deploy-local-dev
deploy-local-dev: ## Build + deploy CAS in watch mode (auto-redeploy on changes)
	cd $(CAS_ROOT) && skaffold dev --filename cloudprovider/azure/examples/dev/skaffold.yaml

##@ E2E Testing

.PHONY: e2etests
e2etests: ## Run e2e tests (CAS must already be deployed)
Review comment (Member):

Suggested change
e2etests: ## Run e2e tests (CAS must already be deployed)
e2etests: ## Run e2e tests (CAS must already be deployed)
@echo "-- Running E2E test --"

	go run github.com/onsi/ginkgo/v2/ginkgo \
		--tags e2e \
		-v --trace \
		--timeout $(TEST_TIMEOUT) \
		--output-dir "$(ARTIFACTS)" \
		--junit-report="junit.e2e_suite.1.xml" \
		$(if $(FOCUS),--focus="$(FOCUS)",) \
		$(if $(LABEL_FILTER),--label-filter="$(LABEL_FILTER)",) \
		./suites/$$(echo $(TEST_SUITE) | tr A-Z a-z)/... -- \
		-resource-group="$(AZURE_RESOURCE_GROUP)"
Review comment (Copilot AI, Mar 25, 2026), on lines +87 to +89:

Same issue as the CI target: when `TEST_SUITE` is left at the default `...`, this expands to `./suites/.../...` and the command likely won't run any packages. Adjust the suite path construction so the default is exactly `./suites/...`.
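
The special-casing suggested in these review comments could be sketched as a small shell function. This is an illustration only: `suite_path` is a hypothetical name and not part of the PR. It lowercases the suite name as the recipe does and maps the default `...` to a single ellipsis.

```shell
# Hypothetical helper (not in the PR): map a TEST_SUITE value to the package
# path passed to ginkgo. The default "..." selects every suite.
suite_path() {
  suite="$(printf '%s' "${1:-...}" | tr 'A-Z' 'a-z')"
  if [ "$suite" = "..." ]; then
    # Default: all suites under ./suites
    printf '%s\n' "./suites/..."
  else
    # Single suite: recurse into that suite's directory only
    printf '%s\n' "./suites/$suite/..."
  fi
}

suite_path "..."      # prints ./suites/...
suite_path "ScaleUp"  # prints ./suites/scaleup/...
```

If logic like this were inlined into a Makefile recipe, each shell `$` would need to be doubled (`$$`) so that Make passes it through to the shell.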

##@ Utilities

.PHONY: list-suites
list-suites: ## List available test suites
	@find suites -mindepth 1 -maxdepth 1 -type d -printf '%f\n' 2>/dev/null || echo "No suites found."

.PHONY: validate-env
validate-env: ## Check required environment variables
	@missing=""; \
	for var in AZURE_SUBSCRIPTION_ID AZURE_RESOURCE_GROUP; do \
		eval val=\$$$$var; \
		if [ -z "$$val" ]; then missing="$$missing $$var"; fi; \
	done; \
	if [ -n "$$missing" ]; then \
		echo "ERROR: Missing required environment variables:$$missing"; \
		exit 1; \
	fi; \
	echo "All required environment variables are set."