feat: install DRA nvidia gpu plugin #8797
Pull request overview
This PR adds support for enabling NVIDIA’s DRA-based GPU driver/plugin flow as part of AgentBaker’s managed GPU experience on Linux nodes. It introduces a new NodeBootstrappingConfiguration flag, plumbs it into the CSE command line, installs the required cached package(s), starts the corresponding systemd service, and adds an e2e scenario to validate service/workload behavior.
Changes:
- Add `EnableManagedGPUDRA` to the node bootstrapping datamodel and expose it to CSE templates via `IsEnableManagedGPUDRA`.
- Update Linux CSE scripts (Ubuntu + Mariner/Azure Linux) to install `dra-driver-nvidia-gpu` from cache and start `dra-driver-nvidia-gpu.service` (alongside existing managed GPU services).
- Add e2e coverage for `dra-driver-nvidia-gpu` being enabled/active and for scheduling a DRA workload.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage for DRA-related service start behavior (currently targets a non-existent function). |
| pkg/agent/datamodel/types.go | Adds EnableManagedGPUDRA to NodeBootstrappingConfiguration. |
| pkg/agent/baker.go | Exposes EnableManagedGPUDRA to templates via IsEnableManagedGPUDRA. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Includes dra-driver-nvidia-gpu in managed GPU cached package installs (flag-gated) and creates kubelet plugin dirs. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Same as Ubuntu for Mariner/Azure Linux installs and kubelet plugin dirs. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Plumbs ENABLE_MANAGED_GPU_DRA into CSE flow and defers managed GPU services start until after kubelet. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Adds new error code ERR_DRA_DRIVER_START_FAIL. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Installs managed GPU cached packages when DRA enabled and starts dra-driver-nvidia-gpu service in startNvidiaManagedExpServices. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Adds ENABLE_MANAGED_GPU_DRA environment variable to the generated CSE command script. |
| e2e/validators.go | Adds validators for DRA service state and for scheduling a DRA workload via DeviceClass/ResourceClaim. |
| e2e/scenario_gpu_managed_experience_test.go | Adds an Ubuntu 24.04 GPU scenario enabling the new DRA flag and running the DRA validators. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh:2031
- startNvidiaManagedExpServices() now has a DRA branch (dra-driver-nvidia-gpu) gated by ENABLE_MANAGED_GPU_EXPERIENCE_DRA, but this spec block only asserts the device-plugin path. Adding an explicit DRA-mode unit test here would prevent regressions in the new service start logic.
BeforeEach 'MIG_NODE="false"; ENABLE_MANAGED_GPU_EXPERIENCE="true"; ENABLE_MANAGED_GPU_EXPERIENCE_DRA="false"'
It 'starts the device-plugin blocking but dcgm and dcgm-exporter off the critical path'
When call startNvidiaManagedExpServices
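A rough sketch of the shape such a DRA-mode check could take, written as plain bash rather than ShellSpec so it runs on its own. `startServicesSketch` and `start_service_stub` are hypothetical stand-ins, not the real `startNvidiaManagedExpServices` or its systemd helper; only the `dra-driver-nvidia-gpu` service name and the two flag names come from this PR, and the device-plugin name is a placeholder.

```shell
#!/usr/bin/env bash
# Records each service the function under test asks to start
# (stub standing in for the real systemd helper).
STARTED_SERVICES=""
start_service_stub() {
    STARTED_SERVICES="${STARTED_SERVICES}${STARTED_SERVICES:+ }$1"
}

# Hypothetical stand-in for startNvidiaManagedExpServices. Only the DRA
# service name is taken from the PR; the other name is a placeholder.
startServicesSketch() {
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        start_service_stub "dra-driver-nvidia-gpu"
    elif [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ]; then
        start_service_stub "device-plugin-placeholder"
    fi
}

# DRA-mode case: expect exactly the DRA service to be started.
testDraModeStartsOnlyDraDriver() {
    STARTED_SERVICES=""
    ENABLE_MANAGED_GPU_EXPERIENCE="false"
    ENABLE_MANAGED_GPU_EXPERIENCE_DRA="true"
    startServicesSketch
    [ "${STARTED_SERVICES}" = "dra-driver-nvidia-gpu" ]
}
```

In the actual spec file the same assertion would presumably be expressed with a second `BeforeEach` that sets `ENABLE_MANAGED_GPU_EXPERIENCE_DRA="true"` and a mock for the service-start helper.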
    })
}

func Test_Ubuntu2404_DraDriverNvidiaGpuRunning(t *testing.T) {
    dcgm-exporter
)

if [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ]; then
Why this change? My understanding was that datacenter-gpu-manager-4-core, datacenter-gpu-manager-4-proprietary, and dcgm-exporter were also part of the Managed Experience. Did that change recently?
OK, I get it now: only one of the two can be enabled at a time.
It would be clearer if this were a switch/case or an if/else, since that would make it obvious that it is one or the other.
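One way to make the "one or the other" relationship explicit is to collapse the two flags into a single mode value up front. This is only a sketch: `resolveManagedGpuMode` and the mode strings are hypothetical names, not something in this PR, and how the `invalid` case should be handled is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): maps the two boolean flags onto one
# mode string so later code can switch on a single value.
resolveManagedGpuMode() {
    local legacy="${ENABLE_MANAGED_GPU_EXPERIENCE:-false}"
    local dra="${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}"
    case "${legacy}:${dra}" in
        true:true) echo "invalid" ;;        # both set: caller decides how to fail
        true:*)    echo "device-plugin" ;;  # existing managed GPU experience
        *:true)    echo "dra" ;;            # DRA driver flow
        *)         echo "none" ;;           # neither flag enabled
    esac
}
```

Callers could then write `case "$(resolveManagedGpuMode)" in dra) ... ;; device-plugin) ... ;; esac`, which reads as a single choice rather than two independent toggles.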
    mkdir -p "$(dirname "${managed_gpu_marker}")"
    touch "${managed_gpu_marker}"
elif [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
    logs_to_events "AKS.CSE.installNvidiaManagedExpPkgFromCache" "installNvidiaManagedExpPkgFromCache" || exit $ERR_NVIDIA_DCGM_INSTALL
This is all code duplicated from the previous `if` section; please clean it up.
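A possible shape for that cleanup, as a sketch only: the shared install call moves under one combined condition, and anything mode-specific stays in its own branch afterwards. `installManagedGpuPkgsIfEnabled` is a hypothetical wrapper name; the `logs_to_events` and `installNvidiaManagedExpPkgFromCache` definitions below are stubs so the snippet runs standalone, and the `ERR_NVIDIA_DCGM_INSTALL` value is a placeholder. The sketch uses `return` where the CSE script uses `exit`.

```shell
#!/usr/bin/env bash
# Stubs so the sketch runs on its own; the real helpers live in the CSE scripts.
INSTALL_CALLS=0
logs_to_events() { "$2"; }   # stub: just invokes the named function
installNvidiaManagedExpPkgFromCache() { INSTALL_CALLS=$((INSTALL_CALLS + 1)); }
ERR_NVIDIA_DCGM_INSTALL=1    # placeholder value, not the real exit code

# Hypothetical wrapper: one shared install call that covers either mode.
installManagedGpuPkgsIfEnabled() {
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ] || \
       [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        logs_to_events "AKS.CSE.installNvidiaManagedExpPkgFromCache" \
            "installNvidiaManagedExpPkgFromCache" || return "${ERR_NVIDIA_DCGM_INSTALL}"
    fi
}
```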

# defer starting DRA driver services after kubelet.
if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
    logs_to_events "AKS.CSE.startNvidiaManagedExpServices" "startNvidiaManagedExpServices" || exit $?
Why is only the _DRA flavor logic present in this file? I would have expected the normal Managed GPU flow to also call startNvidiaManagedExpServices.
# Reload systemd to pick up the override
systemctl daemon-reload
# 2. Start the dra-driver-nvidia-gpu service.
if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
Same here: since ENABLE_MANAGED_GPU_EXPERIENCE and ENABLE_MANAGED_GPU_EXPERIENCE_DRA cannot both be true at the same time, I would make that clear in the code. Within this function, it currently looks like both can be enabled independently.
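For illustration, a fail-fast guard at the top of the function would state the constraint directly in the code. This is a sketch under assumptions: `assertManagedGpuFlagsExclusive` is a hypothetical name, and the return code and message are placeholders rather than an existing AgentBaker error code.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the PR): rejects the combination the review
# comments describe as unsupported, before any service is started.
assertManagedGpuFlagsExclusive() {
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ] && \
       [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        echo "ENABLE_MANAGED_GPU_EXPERIENCE and ENABLE_MANAGED_GPU_EXPERIENCE_DRA are mutually exclusive" >&2
        return 1   # placeholder code; a real change would pick a CSE error code
    fi
    return 0
}
```

Calling this once at the start of `startNvidiaManagedExpServices` would let the rest of the function use a plain if/elif on the two flags without implying they can be combined.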
What this PR does / why we need it:
The unit test doesn't work because it needs the VHD changes in this PR.
Here is my local test result.
go test -run "^Test_Ubuntu2404_DraDriverNvidiaGpuRunning$" -v
Error loading .env file: open .env: no such file or directory
2026/06/30 16:52:38 using E2E environment configuration:
ACR_SECRET_NAME=acr-secret-code2
ACR_TARGET_REPOSITORY=aks-managed-repository/*
BLOB_CONTAINER=abe2e
BLOB_STORAGE_ACCOUNT_PREFIX=abe2e
BUILD_ID=local
DEFAULT_POLL_INTERVAL=1s
DEFAULT_SUBNET_NAME=aks-subnet
DEFAULT_VM_SKU=Standard_D2ds_v5
DISABLE_SCRIPTLESS=false
DISABLE_SCRIPTLESS_COMPILATION=false
E2E_LOCATION=westus3
ENABLE_SECURE_TLS_BOOTSTRAPPING=true
EXTENDED_TESTS=
GALLERY_NAME=PackerSigGalleryEastUS
GALLERY_NAME=PackerSigGalleryEastUS
GALLERY_RESOURCE_GROUP=aksvhdtestbuildrg
GALLERY_RESOURCE_GROUP=aksvhdtestbuildrg
GALLERY_SUBSCRIPTION_ID=c4c3550e-a965-4993-a50c-628fd38cd3e1
GALLERY_SUBSCRIPTION_ID=c4c3550e-a965-4993-a50c-628fd38cd3e1
IGNORE_SCENARIOS_WITH_MISSING_VHD=false
KEEP_VMSS=true
LOGGING_DIR=scenario-logs
NETWORK_ISOLATED_NSG_NAME=abe2e-networkisolated-securityGroup
SIG_VERSION_TAG_NAME=branch
SIG_VERSION_TAG_VALUE=refs/heads/main
SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=false
SUBSCRIPTION_ID=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
SYS_SSH_PRIVATE_KEY_B64=
SYS_SSH_PUBLIC_KEY=
TAGS_TO_RUN=
TAGS_TO_SKIP=
TEST_GALLERY_IMAGE_PREFIX=abe2etest
TEST_GALLERY_NAME_PREFIX=abe2etest
TEST_PRE_PROVISION=false
TEST_TIMEOUT=50m0s
TEST_TIMEOUT_CLUSTER=30m0s
TEST_TIMEOUT_VMSS=17m0s
WINDOWS_ADMIN_PASSWORD=
=== RUN Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== PAUSE Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== CONT Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== RUN Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
=== PAUSE Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
=== CONT Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
azure.go:667: [0.000s] Looking up images for gallery subscription c4c3550e-a965-4993-a50c-628fd38cd3e1 resource group aksvhdtestbuildrg gallery name PackerSigGalleryEastUS image name 2404gen2containerd version 1.1782799017.7457
azure.go:567: [1.014s] Image version /subscriptions/c4c3550e-a965-4993-a50c-628fd38cd3e1/resourceGroups/aksvhdtestbuildrg/providers/Microsoft.Compute/galleries/PackerSigGalleryEastUS/images/2404gen2containerd/versions/1.1782799017.7457 is already in region westus3
vhd.go:339: [1.014s] Got image by version: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/c4c3550e-a965-4993-a50c-628fd38cd3e1/resourceGroups/aksvhdtestbuildrg/providers/Microsoft.Compute/galleries/PackerSigGalleryEastUS/images/aks-ubuntu-containerd-24.04-gen2/versions/1.1782799017.7457/overview
test_helpers.go:411: [1.014s] TAGS {Name:Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default ImageName:2404gen2containerd OS:ubuntu Arch:amd64 NetworkIsolated:false NonAnonymousACR:false GPU:true WASM:false BootstrapTokenFallback:false KubeletCustomConfig:false Scriptless:false VHDCaching:false MockAzureChinaCloud:false VMSeriesCoverageTest:false}
test_helpers.go:222: [7.477s] → running scenario...
cluster.go:79: [7.477s] → preparing cluster...
shared_infra.go:51: [7.477s] → ensuring shared infrastructure...
shared_infra.go:80: [12.396s] ✓ ensuring shared infrastructure done (4.9s)
cluster.go:314: [12.534s] → get or create cluster abe2e-kubenet-v5-98f29...
cluster.go:322: [13.684s] ✓ get or create cluster abe2e-kubenet-v5-98f29 done (1.1s)
cluster.go:709: [13.990s] using shared bastion abe2e-shared-bastion in abe2e-westus3
aks_model.go:302: [13.990s] → adding firewall rules...
aks_model.go:348: [14.375s] Adding route "vnet-local" to AKS route table "aks-agentpool-19541795-routetable"
cluster.go:878: [15.149s] → setting up private DNS for API server...
cluster.go:917: [15.917s] private DNS zone "abe2e-kubenet-v5-t2kmtiej.hcp.westus3.azmk8s.io" already up to date
cluster.go:918: [15.917s] ✓ setting up private DNS for API server done (0.8s)
aks_model.go:348: [16.178s] Adding route "default-route-to-firewall" to AKS route table "aks-agentpool-19541795-routetable"
aks_model.go:359: [16.871s] Successfully added firewall routes to AKS route table "aks-agentpool-19541795-routetable"
aks_model.go:360: [16.871s] ✓ adding firewall rules done (2.9s)
cluster.go:733: [16.871s] → collecting garbage VMSS...
kube.go:387: [16.871s] Creating daemonset debug-mariner-tolerated with image mcr.microsoft.com/cbl-mariner/base/core:2.0
cluster.go:805: [16.987s] → collecting garbage K8s nodes...
cluster.go:846: [17.296s] ✓ collecting garbage K8s nodes done (0.3s)
cluster.go:797: [17.296s] ✓ collecting garbage VMSS done (0.4s)
kube.go:387: [17.411s] Creating daemonset debugnonhost-mariner-tolerated with image mcr.microsoft.com/cbl-mariner/base/core:2.0
kube.go:567: [17.633s] Creating proxy daemonset e2e-proxy with image mcr.microsoft.com/cbl-mariner/base/python:3
cluster.go:197: [18.033s] ✓ preparing cluster done (10.6s)
test_helpers.go:239: [18.033s] using cluster abe2e-kubenet-v5-98f29 in rg=abe2e-westus3 sub=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
test_helpers.go:240: [18.033s] portal: https://portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-98f29/overview
test_helpers.go:272: [18.036s] → preparing AKS node...
vmss.go:531: [28.652s] → creating VMSS p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul...
vmss.go:435: [28.841s] VMSS portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul/overview
vmss.go:441: [28.841s] Managed cluster portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-98f29/overview
2026/06/30 16:53:08 Using VM extension version 1.462 for extension type Compute.AKS.Linux.AKSNode in region westus3
vmss.go:562: [35.519s] VM will be preserved after the test finishes, PLEASE MANUALLY DELETE THE VMSS. Set KEEP_VMSS=false to delete it automatically after the test finishes
vmss.go:568: [35.519s] SSH Instructions: (may take a few minutes for the VM to be ready for SSH)
========================
az network bastion ssh --target-resource-id "/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul/virtualMachines/0" --name "abe2e-shared-bastion" --resource-group abe2e-westus3 --auth-type ssh-key --username azureuser --ssh-key /tmp/private-key-2914816484
--- PASS: Test_Ubuntu2404_DraDriverNvidiaGpuRunning (0.00s)
--- PASS: Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default (475.27s)
PASS
ok github.com/Azure/agentbaker/e2e 475.761s
Which issue(s) this PR fixes:
Fixes #