feat: install DRA nvidia gpu plugin #8797

Open
runzhen wants to merge 14 commits into main from runzhen/dra3
Conversation

runzhen commented Jun 30, 2026

Contributor

What this PR does / why we need it:

The unit test doesn't work because it needs the VHD changes in this PR.
Here is my local test result.

go test -run "^Test_Ubuntu2404_DraDriverNvidiaGpuRunning$" -v
Error loading .env file: open .env: no such file or directory
2026/06/30 16:52:38 using E2E environment configuration:
ACR_SECRET_NAME=acr-secret-code2
ACR_TARGET_REPOSITORY=aks-managed-repository/*
BLOB_CONTAINER=abe2e
BLOB_STORAGE_ACCOUNT_PREFIX=abe2e
BUILD_ID=local
DEFAULT_POLL_INTERVAL=1s
DEFAULT_SUBNET_NAME=aks-subnet
DEFAULT_VM_SKU=Standard_D2ds_v5
DISABLE_SCRIPTLESS=false
DISABLE_SCRIPTLESS_COMPILATION=false
E2E_LOCATION=westus3
ENABLE_SECURE_TLS_BOOTSTRAPPING=true
EXTENDED_TESTS=
GALLERY_NAME=PackerSigGalleryEastUS
GALLERY_NAME=PackerSigGalleryEastUS
GALLERY_RESOURCE_GROUP=aksvhdtestbuildrg
GALLERY_RESOURCE_GROUP=aksvhdtestbuildrg
GALLERY_SUBSCRIPTION_ID=c4c3550e-a965-4993-a50c-628fd38cd3e1
GALLERY_SUBSCRIPTION_ID=c4c3550e-a965-4993-a50c-628fd38cd3e1
IGNORE_SCENARIOS_WITH_MISSING_VHD=false
KEEP_VMSS=true
LOGGING_DIR=scenario-logs
NETWORK_ISOLATED_NSG_NAME=abe2e-networkisolated-securityGroup
SIG_VERSION_TAG_NAME=branch
SIG_VERSION_TAG_VALUE=refs/heads/main
SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE=false
SUBSCRIPTION_ID=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
SYS_SSH_PRIVATE_KEY_B64=
SYS_SSH_PUBLIC_KEY=
TAGS_TO_RUN=
TAGS_TO_SKIP=
TEST_GALLERY_IMAGE_PREFIX=abe2etest
TEST_GALLERY_NAME_PREFIX=abe2etest
TEST_PRE_PROVISION=false
TEST_TIMEOUT=50m0s
TEST_TIMEOUT_CLUSTER=30m0s
TEST_TIMEOUT_VMSS=17m0s
WINDOWS_ADMIN_PASSWORD=
=== RUN Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== PAUSE Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== CONT Test_Ubuntu2404_DraDriverNvidiaGpuRunning
=== RUN Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
=== PAUSE Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
=== CONT Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
azure.go:667: [0.000s] Looking up images for gallery subscription c4c3550e-a965-4993-a50c-628fd38cd3e1 resource group aksvhdtestbuildrg gallery name PackerSigGalleryEastUS image name 2404gen2containerd version 1.1782799017.7457
azure.go:567: [1.014s] Image version /subscriptions/c4c3550e-a965-4993-a50c-628fd38cd3e1/resourceGroups/aksvhdtestbuildrg/providers/Microsoft.Compute/galleries/PackerSigGalleryEastUS/images/2404gen2containerd/versions/1.1782799017.7457 is already in region westus3
vhd.go:339: [1.014s] Got image by version: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/c4c3550e-a965-4993-a50c-628fd38cd3e1/resourceGroups/aksvhdtestbuildrg/providers/Microsoft.Compute/galleries/PackerSigGalleryEastUS/images/aks-ubuntu-containerd-24.04-gen2/versions/1.1782799017.7457/overview
test_helpers.go:411: [1.014s] TAGS {Name:Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default ImageName:2404gen2containerd OS:ubuntu Arch:amd64 NetworkIsolated:false NonAnonymousACR:false GPU:true WASM:false BootstrapTokenFallback:false KubeletCustomConfig:false Scriptless:false VHDCaching:false MockAzureChinaCloud:false VMSeriesCoverageTest:false}
test_helpers.go:222: [7.477s] → running scenario...
cluster.go:79: [7.477s] → preparing cluster...
shared_infra.go:51: [7.477s] → ensuring shared infrastructure...
shared_infra.go:80: [12.396s] ✓ ensuring shared infrastructure done (4.9s)
cluster.go:314: [12.534s] → get or create cluster abe2e-kubenet-v5-98f29...
cluster.go:322: [13.684s] ✓ get or create cluster abe2e-kubenet-v5-98f29 done (1.1s)
cluster.go:709: [13.990s] using shared bastion abe2e-shared-bastion in abe2e-westus3
aks_model.go:302: [13.990s] → adding firewall rules...
aks_model.go:348: [14.375s] Adding route "vnet-local" to AKS route table "aks-agentpool-19541795-routetable"
cluster.go:878: [15.149s] → setting up private DNS for API server...
cluster.go:917: [15.917s] private DNS zone "abe2e-kubenet-v5-t2kmtiej.hcp.westus3.azmk8s.io" already up to date
cluster.go:918: [15.917s] ✓ setting up private DNS for API server done (0.8s)
aks_model.go:348: [16.178s] Adding route "default-route-to-firewall" to AKS route table "aks-agentpool-19541795-routetable"
aks_model.go:359: [16.871s] Successfully added firewall routes to AKS route table "aks-agentpool-19541795-routetable"
aks_model.go:360: [16.871s] ✓ adding firewall rules done (2.9s)
cluster.go:733: [16.871s] → collecting garbage VMSS...
kube.go:387: [16.871s] Creating daemonset debug-mariner-tolerated with image mcr.microsoft.com/cbl-mariner/base/core:2.0
cluster.go:805: [16.987s] → collecting garbage K8s nodes...
cluster.go:846: [17.296s] ✓ collecting garbage K8s nodes done (0.3s)
cluster.go:797: [17.296s] ✓ collecting garbage VMSS done (0.4s)
kube.go:387: [17.411s] Creating daemonset debugnonhost-mariner-tolerated with image mcr.microsoft.com/cbl-mariner/base/core:2.0
kube.go:567: [17.633s] Creating proxy daemonset e2e-proxy with image mcr.microsoft.com/cbl-mariner/base/python:3
cluster.go:197: [18.033s] ✓ preparing cluster done (10.6s)
test_helpers.go:239: [18.033s] using cluster abe2e-kubenet-v5-98f29 in rg=abe2e-westus3 sub=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
test_helpers.go:240: [18.033s] portal: https://portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-98f29/overview
test_helpers.go:272: [18.036s] → preparing AKS node...
vmss.go:531: [28.652s] → creating VMSS p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul...
vmss.go:435: [28.841s] VMSS portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul/overview
vmss.go:441: [28.841s] Managed cluster portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-98f29/overview
2026/06/30 16:53:08 Using VM extension version 1.462 for extension type Compute.AKS.Linux.AKSNode in region westus3
vmss.go:562: [35.519s] VM will be preserved after the test finishes, PLEASE MANUALLY DELETE THE VMSS. Set KEEP_VMSS=false to delete it automatically after the test finishes
vmss.go:568: [35.519s] SSH Instructions: (may take a few minutes for the VM to be ready for SSH)
========================
az network bastion ssh --target-resource-id "/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-98f29_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul/virtualMachines/0" --name "abe2e-shared-bastion" --resource-group abe2e-westus3 --auth-type ssh-key --username azureuser --ssh-key /tmp/private-key-2914816484

bastionssh.go:304: [350.967s] Attempt 1/5 establishing SSH over bastion to 10.114.192.6
vmss.go:618: [352.828s] VM reached running state
vmss.go:588: [352.828s] ✓ creating VMSS p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul done (324.2s)
kube.go:160: [352.828s] → waiting for node p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul to be ready...
kube.go:182: [353.135s] node p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 is ready. Taints: [{"key":"node.kubernetes.io/network-unavailable","effect":"NoSchedule","timeAdded":"2026-06-30T16:57:55Z"}] Conditions: [{"type":"NetworkUnavailable","status":"True","lastHeartbeatTime":"2026-06-30T16:57:55Z","lastTransitionTime":"2026-06-30T16:57:55Z","reason":"NodeInitialization","message":"Waiting for cloud routes"},{"type":"FrequentDockerRestart","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NoFrequentDockerRestart","message":"docker is functioning properly"},{"type":"XIDError","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"XIDErrorIsNotPresent","message":"XID error status is good"},{"type":"UnhealthyNvidiaDevicePlugin","status":"Unknown","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"HealthyNvidiaDevicePlugin","message":"Systemd service nvidia-device-plugin not found (may not be installed)."},{"type":"GPUClockThrottling","status":"True","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"GPUClockThrottlingIsPresent"},{"type":"ContainerRuntimeProblem","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"ContainerRuntimeIsUp","message":"container runtime service is up"},{"type":"FrequentUnregisterNetDevice","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NoFrequentUnregisterNetDevice","message":"node is functioning properly"},{"type":"VMEventScheduled","status":"Unknown","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:58:18Z","reason":"NoVMEventScheduled","message":"Unable to verify scheduled events (IMDS 
unavailable)"},{"type":"KubeletProblem","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"KubeletIsUp","message":"kubelet service is up"},{"type":"KernelDeadlock","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"KernelHasNoDeadlock","message":"kernel has no deadlock"},{"type":"NVIDIAGRIDStatusInvalid","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NVIDIAGRIDStatusValid","message":"NVIDIA Grid Status Valid"},{"type":"GPUMissing","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NoGPUMissing","message":"All GPUs are present"},{"type":"UnhealthyNvidiaDCGMServices","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"HealthyNvidiaDCGMServices","message":"NVIDIA DCGM services are running properly"},{"type":"FilesystemCorruptionProblem","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"FilesystemIsOK","message":"Filesystem is healthy"},{"type":"ReadonlyFilesystem","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"FilesystemIsNotReadOnly","message":"Filesystem is not read-only"},{"type":"FrequentContainerdRestart","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NoFrequentContainerdRestart","message":"containerd is functioning properly"},{"type":"FrequentKubeletRestart","status":"False","lastHeartbeatTime":"2026-06-30T16:58:18Z","lastTransitionTime":"2026-06-30T16:57:54Z","reason":"NoFrequentKubeletRestart","message":"kubelet is functioning 
properly"},{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2026-06-30T16:58:22Z","lastTransitionTime":"2026-06-30T16:57:52Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2026-06-30T16:58:22Z","lastTransitionTime":"2026-06-30T16:57:52Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","status":"False","lastHeartbeatTime":"2026-06-30T16:58:22Z","lastTransitionTime":"2026-06-30T16:57:52Z","reason":"KubeletHasSufficientPID","message":"kubelet has sufficient PID available"},{"type":"Ready","status":"True","lastHeartbeatTime":"2026-06-30T16:58:22Z","lastTransitionTime":"2026-06-30T16:57:52Z","reason":"KubeletReady","message":"kubelet is posting ready status"}]
kube.go:203: [353.135s] ✓ waiting for node p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul to be ready done (0.3s)
test_helpers.go:367: [353.135s] ⚠️ ##vso[task.logissue type=warning;] Node p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul took 5m24.175885788s to be created and 306.512266ms to be ready
test_helpers.go:370: [353.135s] ✓ preparing AKS node done (335.1s)
test_helpers.go:265: [353.135s] Choosing the private ACR "abe2eprivatewestus3" for the vm validation
test_helpers.go:431: [353.135s] → validating VM...
test_helpers.go:991: [356.057s] SSH connectivity to 10.114.192.6 verified successfully
validators.go:2722: [356.057s] truncated pod name "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000-test-pod" to "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
validation.go:148: [356.057s] creating pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
kube.go:108: [356.174s] → waiting for pod  metadata.name=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace...
kube.go:156: [362.254s] ✓ waiting for pod  metadata.name=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace done (6.1s)
validation.go:172: [362.254s] Time for pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" to get ready was 6.196902917s
validation.go:173: [362.254s] node health validation: test pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" is running on node "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
validators.go:58: [362.960s] will validate bootstrapping mode: secure TLS bootstrapping
validators.go:127: [364.649s] will validate linux KSCR enablement
validation.go:47: [371.479s] VM has 6 CPUs, expecting rx buffer size: 2048
validation.go:47: [372.691s] NICs to configure:
    === NICs to Configure ===

validation.go:47: [372.691s] Parsed NICs list: [] (count: 1)
validation.go:47: [372.692s] No PCI devices (NICs) with enP* slot pattern found - skipping network interface config validation
validation.go:48: [374.329s] No critical kernel issues found
validation.go:49: [374.871s] waagent.log validation passed: WALinuxAgent-2.15.0.1 running correctly with no ExtHandler errors
validation.go:53: [375.662s] skip_vhd_node_exporter sentinel file found, validating node-exporter installation
validation.go:53: [378.581s] Validating node-exporter is listening on port 19100 and serving metrics
validation.go:53: [379.006s] node-exporter validation passed
validation.go:83: [380.648s] node "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" has label "kubernetes.azure.com/localdns-exporter" — proceeding with full exporter validation
validation.go:83: [391.492s] localdns exporter metrics validation output:
    === LocalDNS Metrics Exporter Validation ===

    0. Generating DNS load through localdns to prime resource accounting...
       ✓ Sent ~1000 DNS queries through localdns (10 workers × 100 queries)

    1. Checking if port 9353 is listening...
       ✓ Port 9353 is listening on 10.114.192.6:9353
       Drop-in directory contents:
    total 12
    drwxr-xr-x  2 root root 4096 Jun 30 16:57 .
    drwxr-xr-x 37 root root 4096 Jun 30 16:57 ..
    -rw-r--r--  1 root root   54 Jun 30 16:57 10-listen-address.conf
       Drop-in config:
         [Socket]
         ListenStream=
         ListenStream=10.114.192.6:9353
       Effective systemd Listen property: Listen=10.114.192.6:9353 (Stream)

    2. Verifying systemd resource accounting is enabled and working...
       Raw CPUUsageNSec: 1647961000
       Raw MemoryCurrent: 55361536 bytes
       ✓ Both non-zero — systemd resource accounting is working

    3. Checking HTTP status from http://10.114.192.6:9353/metrics...
       ✓ HTTP 200 OK received

    4. Fetching metrics body...
       ✓ Metrics fetched successfully

    5. Validating CPU usage metric...
       Raw metric: localdns_cpu_usage_seconds_total 1.313297000
       Value: 1.313297000
       ✓ localdns_cpu_usage_seconds_total=1.313297000 (valid, non-zero)

    6. Validating memory usage metric...
       Raw metric: localdns_memory_usage_bytes 54718464
       Value: 54718464
       ✓ localdns_memory_usage_bytes=54718464 (valid, non-zero)

    6b. Validating metrics staleness timestamp...
       Raw metric: localdns_metrics_last_update_timestamp_seconds 1782838739
       ✓ localdns_metrics_last_update_timestamp_seconds=1782838739 (recent, within 120s of now)

    7. Validating VnetDNS forward IP metric...
       ✓ VnetDNS forward metric present
    8. Validating KubeDNS forward IP metric...
       ✓ KubeDNS forward metric present

    9. Validating VnetDNS forward IP entries...

    10. Validating KubeDNS forward IP entries...

    === ✓ All LocalDNS Metrics Validation Checks Passed ===
    VnetDNS forwards:
      localdns_vnetdns_forward_info{ip="168.63.129.16",block=".:53",status="ok"} 1
      localdns_vnetdns_forward_info{ip="10.0.0.10",block="cluster.local:53",status="ok"} 1
      localdns_vnetdns_forward_info{ip="10.0.0.10",block="testdomain456.com:53",status="ok"} 1
    KubeDNS forwards:
      localdns_kubedns_forward_info{ip="10.0.0.10",block=".:53",status="ok"} 1
      localdns_kubedns_forward_info{ip="10.0.0.10",block="cluster.local:53",status="ok"} 1
      localdns_kubedns_forward_info{ip="168.63.129.16",block="testdomain567.com:53",status="ok"} 1

    === ✓ All localdns exporter functional validations passed ===
validation.go:112: [391.891s] skip_vhd_ig sentinel file found, validating Inspektor Gadget installation
validation.go:112: [393.616s] Validating imported gadgets tracking file is not empty
validation.go:112: [393.749s] Validating ig image list shows imported gadgets
validation.go:112: [393.969s] ig image list output:
    REPOSITORY TAG DIGEST CREATED
    advise_networkpolicy v0.53.2 sha256:cb73f67468c549275e4dc92be7386b37fc8b74fb13fee08b2378cb5b13bb9d9a 2026-06-17T14:59:57Z
    advise_seccomp v0.53.2 sha256:178eb5c6e50b372ec519aea1164839d70e3d319715d30b7f40ec6f464a2fb9e7 2026-06-17T15:00:04Z
    audit_seccomp v0.53.2 sha256:6c2e2909eb5bcdab43870ae51f9f27820b98fece865493c451bb3ff2ac35b0f7 2026-06-17T15:00:06Z
    bpfstats v0.53.2 sha256:94d4927afff2589a2b91e3a7eac0a4bbfe4fb50bcd51256ef028305f9da3f4e6 2026-06-17T15:00:06Z
    deadlock v0.53.2 sha256:3f49431f2202a16177d37b03fb368de61bdf617eb111e8e07ad8f07a1e0f677c 2026-06-17T15:00:12Z
    fdpass v0.53.2 sha256:f37b666271652c40d02475a394195fff9e6a4c9032a72c6f85466fedc3297a27 2026-06-17T15:00:15Z
    fsnotify v0.53.2 sha256:cacb6b582952ac40aece32b8c45d9ca636d0ff0e6d5b7b481bbc4c75278f387e 2026-06-17T15:00:22Z
    profile_blockio v0.53.2 sha256:407fb96ffd43385614b2570c29c35a58b92e45e3f6fd81d6d0e916cb40a16ff6 2026-06-17T15:00:24Z
    profile_cpu v0.53.2 sha256:d9f5b1e323fcb3f2f257cf6deabe580a2db47ad5fe8262f4ec45ba7be23f8d45 2026-06-17T15:00:26Z
    profile_cuda v0.53.2 sha256:02ff49d15e3d3f8dcda3037a4cf6cd49c99631ca2dcbe348bc7f0cabded69f63 2026-06-17T15:00:28Z
    profile_qdisc_latency v0.53.2 sha256:21c1304ba6650dd86c008ac7df2006c86cb836ddac6c8162a7d553a235eee432 2026-06-17T15:00:30Z
    profile_tcprtt v0.53.2 sha256:f63ed209dabd7ce2d243c9c6ac1adcaca7b236007f4f8c858e83fcc430eed4a7 2026-06-17T15:00:31Z
    snapshot_file v0.53.2 sha256:8bedf4a53f957355344ea49046beb1e103271ba1790da9310bab0d690953247a 2026-06-17T15:00:34Z
    snapshot_process v0.53.2 sha256:0cee955d149ba10ed295e68f79087ed940a1f90122c2d7f6368086a174f07176 2026-06-17T15:00:36Z
    snapshot_socket v0.53.2 sha256:9a21bdb420315cc4d4d2015e3d3f5fe6105dba9a242ea7323b60c03d32973e26 2026-06-17T15:00:38Z
    tcpdump v0.53.2 sha256:b9f38f070e0708873af1fc26fc80b1b283d2018841f96c1fdda7c0b4637ff977 2026-06-17T15:00:41Z
    top_blockio v0.53.2 sha256:ffa238fc1360e49fa7819de1fdbd41c265ff2967c63e4ec925a76c38950ce10d 2026-06-17T15:00:42Z
    top_cpu_throttle v0.53.2 sha256:981aa140e6108155bcbf777692558348f335aa236300755b6be868fe1917b3fc 2026-06-17T15:00:43Z
    top_cuda_memory v0.53.2 sha256:57d876ad88c3cb8c94ff662bf81d20239b54cbd6c180f588b0aeb8b98a8f0831 2026-06-17T15:00:45Z
    top_file v0.53.2 sha256:cfa1d2eec3016547a0fcac17f96488b953caa44845e2b61f5f5dca56ece0cbdf 2026-06-17T15:00:48Z
    top_process v0.53.2 sha256:b048e6a2357affa8629861e7909733dfa186c15d8f7f60f0c40a05375ed7086a 2026-06-17T15:00:48Z
    top_tcp v0.53.2 sha256:168dbb16fc087adc44bd8326d4cfad9a73c8e6c92186c39518846623670ea21b 2026-06-17T15:00:49Z
    trace_bind v0.53.2 sha256:b6b3b7e45f844ec8fa7ec08251ea6388ac996bc2dbfc7c2e9fda96c867aa156b 2026-06-17T15:00:51Z
    trace_capabilities v0.53.2 sha256:2453e7e26ba2044944544edeb2f74dc3bd76dc97076727a2eb1423301a1f49be 2026-06-17T15:00:53Z
    trace_dns v0.53.2 sha256:088e3841f7c6fc1adad1aa5f96d307f8232e136e0ddc1ac448cf6be4e1aa979e 2026-06-17T15:01:00Z
    trace_exec v0.53.2 sha256:4a6b948c78739d8a6956f57aa925f50815f8af31c2bad057a5842a11b93db7bb 2026-06-17T15:01:11Z
    trace_fsslower v0.53.2 sha256:fdce67a6e90170f7e0ffee166ffba282c3189c192a148e66e2b3f436df3c3ea1 2026-06-17T15:01:18Z
    trace_init_module v0.53.2 sha256:63fe72480c9a559b2233663025cf0afc62b56a756d5bfcc72948be1b84fd3c7a 2026-06-17T15:01:20Z
    trace_lsm v0.53.2 sha256:b4f88ef1bb6ac98877703a0ff5153f90c5bf488da133982dbb9d61ee0e61ef62 2026-06-17T15:01:26Z
    trace_malloc v0.53.2 sha256:14948f6e93c1b224f073fbc4bd3006a3468186317b8de5a4282ebcbcd18b8982 2026-06-17T15:01:28Z
    trace_oomkill v0.53.2 sha256:ab644b1099f8e59c6f256ed3aef9551c2422bd093f73c68f56d26d3ddad5c95f 2026-06-17T15:01:30Z
    trace_open v0.53.2 sha256:37e308b7f2d9820679baf49fc1f268a6bdfd3e74008bd377008e702972726bc3 2026-06-17T15:01:32Z
    trace_signal v0.53.2 sha256:e905141c39969b7ff12f040736f0ab4e306e95766e207a972dac7cf51d7ecea1 2026-06-17T15:01:34Z
    trace_sni v0.53.2 sha256:616c6946667df9607df7ebd33e411e4b6e3ec8cbc8ed198e03edd50f7926e341 2026-06-17T15:01:35Z
    trace_ssl v0.53.2 sha256:3a2ee0ac6de9b458568121be16f6310c56674dc2de1b23dff5ffa6a642adc376 2026-06-17T15:01:37Z
    trace_tcp v0.53.2 sha256:7b6a016e84fb2c4523a6a70906cc629e0971a94616e894dbbd90071dc3688c72 2026-06-17T15:01:39Z
    trace_tcpdrop v0.53.2 sha256:de4924f9b635166d75e84e90a53447b8bda87399940355983b626d098a9edf6f 2026-06-17T15:01:41Z
    trace_tcpretrans v0.53.2 sha256:ff245f8f69c44fa438b81a6fc9d7bc698a476783a77e7018f03a2b05f51baafb 2026-06-17T15:01:42Z
    traceloop v0.53.2 sha256:67c9bc474e0983dc8dd8245537a22662f62db13afd5666c407294faa2d37b0ac 2026-06-17T15:01:49Z
    ttysnoop v0.53.2 sha256:7af2ea0e472ccea5cf49ca2b6082fac286341a24567bd15634aa1114e496366d 2026-06-17T15:01:51Z
validation.go:112: [393.969s] Running functional test with trace_exec gadget
validation.go:112: [397.753s] Inspektor Gadget functional validation passed
validation.go:293: [398.153s] → validating wireserver is blocked from unprivileged pods...
kube.go:108: [398.154s] → waiting for pod app=debugnonhost-mariner-tolerated spec.nodeName=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace...
kube.go:156: [398.225s] ✓ waiting for pod app=debugnonhost-mariner-tolerated spec.nodeName=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace done (0.1s)
validation.go:355: [406.977s] ✓ validating wireserver is blocked from unprivileged pods done (8.8s)
scenario_gpu_managed_experience_test.go:826: [422.498s] found moby-containerd 2.3.2-ubuntu24.04u1 in the installed packages
kube.go:108: [422.498s] → waiting for pod app=debugnonhost-mariner-tolerated spec.nodeName=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace...
kube.go:156: [422.571s] ✓ waiting for pod app=debugnonhost-mariner-tolerated spec.nodeName=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace done (0.1s)
scenario_gpu_managed_experience_test.go:827: [423.431s] found moby-runc 1.4.3-ubuntu24.04u1 in the installed packages
scenario_gpu_managed_experience_test.go:829: [423.834s] validating DRA driver NVIDIA GPU systemd service is running
scenario_gpu_managed_experience_test.go:830: [424.239s] validating that DRA workloads can be scheduled
validators.go:2722: [444.385s] truncated pod name "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000-dra-test" to "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
validators.go:2722: [444.385s] truncated pod name "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" to "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
validation.go:148: [444.385s] creating pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
kube.go:108: [444.514s] → waiting for pod  metadata.name=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace...
kube.go:156: [471.589s] ✓ waiting for pod  metadata.name=p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000 in "default" namespace done (27.1s)
validation.go:172: [471.589s] Time for pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" to get ready was 27.203559004s
validation.go:173: [471.589s] node health validation: test pod "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000" is running on node "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul000000"
scenario_gpu_managed_experience_test.go:830: [471.671s] GPU workload is schedulable and runs successfully
test_helpers.go:465: [471.671s] VM validation succeeded
test_helpers.go:467: [471.671s] ✓ validating VM done (118.5s)
test_helpers.go:268: [471.671s] ✓ running scenario done (464.2s)
vmss.go:751: [474.195s] extracted VM logs to scenario-logs/Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default
vmss.go:1029: [474.990s] vmss "p7z8-2026-06-30-ubuntu2404dradrivernvidiagpurunningdefaul" will be retained for debugging purposes, please make sure to manually delete it later

--- PASS: Test_Ubuntu2404_DraDriverNvidiaGpuRunning (0.00s)
--- PASS: Test_Ubuntu2404_DraDriverNvidiaGpuRunning/default (475.27s)
PASS
ok github.com/Azure/agentbaker/e2e 475.761s

Which issue(s) this PR fixes:

Fixes #

Copilot AI left a comment

Contributor

Pull request overview

This PR adds support for enabling NVIDIA’s DRA-based GPU driver/plugin flow as part of AgentBaker’s managed GPU experience on Linux nodes. It introduces a new NodeBootstrappingConfiguration flag, plumbs it into the CSE command line, installs the required cached package(s), starts the corresponding systemd service, and adds an e2e scenario to validate service/workload behavior.

Changes:

  • Add EnableManagedGPUDRA to the node bootstrapping datamodel and expose it to CSE templates via IsEnableManagedGPUDRA.
  • Update Linux CSE scripts (Ubuntu + Mariner/Azure Linux) to install dra-driver-nvidia-gpu from cache and start dra-driver-nvidia-gpu.service (alongside existing managed GPU services).
  • Add e2e coverage for dra-driver-nvidia-gpu being enabled/active and for scheduling a DRA workload.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage for DRA-related service start behavior (currently targets a non-existent function). |
| pkg/agent/datamodel/types.go | Adds EnableManagedGPUDRA to NodeBootstrappingConfiguration. |
| pkg/agent/baker.go | Exposes EnableManagedGPUDRA to templates via IsEnableManagedGPUDRA. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Includes dra-driver-nvidia-gpu in managed GPU cached package installs (flag-gated) and creates kubelet plugin dirs. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Same as Ubuntu for Mariner/Azure Linux installs and kubelet plugin dirs. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Plumbs ENABLE_MANAGED_GPU_DRA into CSE flow and defers managed GPU services start until after kubelet. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Adds new error code ERR_DRA_DRIVER_START_FAIL. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Installs managed GPU cached packages when DRA enabled and starts dra-driver-nvidia-gpu service in startNvidiaManagedExpServices. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Adds ENABLE_MANAGED_GPU_DRA environment variable to the generated CSE command script. |
| e2e/validators.go | Adds validators for DRA service state and for scheduling a DRA workload via DeviceClass/ResourceClaim. |
| e2e/scenario_gpu_managed_experience_test.go | Adds an Ubuntu 24.04 GPU scenario enabling the new DRA flag and running the DRA validators. |

Comment thread spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated
Copilot AI review requested due to automatic review settings June 30, 2026 17:22

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated
Comment thread e2e/validators.go Outdated
Comment thread e2e/validators.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 30, 2026 17:57
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh
Comment thread parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh
Copilot AI review requested due to automatic review settings June 30, 2026 19:51

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_main.sh
Comment thread parts/linux/cloud-init/artifacts/cse_main.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh
runzhen and others added 2 commits June 30, 2026 20:15
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 30, 2026 20:16

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh:2031

  • startNvidiaManagedExpServices() now has a DRA branch (dra-driver-nvidia-gpu) gated by ENABLE_MANAGED_GPU_EXPERIENCE_DRA, but this spec block only asserts the device-plugin path. Adding an explicit DRA-mode unit test here would prevent regressions in the new service start logic.
        BeforeEach 'MIG_NODE="false"; ENABLE_MANAGED_GPU_EXPERIENCE="true"; ENABLE_MANAGED_GPU_EXPERIENCE_DRA="false"'

        It 'starts the device-plugin blocking but dcgm and dcgm-exporter off the critical path'
            When call startNvidiaManagedExpServices

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh
Comment thread parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh
})
}

func Test_Ubuntu2404_DraDriverNvidiaGpuRunning(t *testing.T) {

Collaborator

Is this not supported on AZL?

dcgm-exporter
)

if [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ]; then

Collaborator

Why this change? My understanding was that datacenter-gpu-manager-4-core, datacenter-gpu-manager-4-proprietary, and dcgm-exporter were also part of the Managed Experience. Did that change recently?

Collaborator

OK, I get it now: only one of the two can be enabled at a time.

This would be clearer as a switch/case or an if/else, since that makes it explicit that it is one or the other.
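
A minimal sketch of the case-based structure suggested here, assuming the two flags are mutually exclusive. The function name `managedGpuPackagesForMode` is hypothetical, and the package split per mode is an assumption built only from package names mentioned in this thread; the actual lists in the PR may differ.

```shell
# Sketch only: not the PR's code. The per-mode package lists are assumptions.
managedGpuPackagesForMode() {
    local classic="${ENABLE_MANAGED_GPU_EXPERIENCE:-false}"
    local dra="${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}"

    case "${classic}:${dra}" in
        true:true)
            # Both flags set: fail loudly instead of silently picking one.
            echo "both managed GPU flags are set; expected at most one" >&2
            return 1
            ;;
        true:*)
            echo "datacenter-gpu-manager-4-core datacenter-gpu-manager-4-proprietary dcgm-exporter"
            ;;
        *:true)
            echo "dra-driver-nvidia-gpu"
            ;;
        *)
            # Neither flag set: nothing to install.
            ;;
    esac
}
```

Folding both flags into one `case` expression puts the "one or the other" rule in a single place, and the `true:true` arm turns an invalid combination into an explicit error.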

mkdir -p "$(dirname "${managed_gpu_marker}")"
touch "${managed_gpu_marker}"
elif [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
logs_to_events "AKS.CSE.installNvidiaManagedExpPkgFromCache" "installNvidiaManagedExpPkgFromCache" || exit $ERR_NVIDIA_DCGM_INSTALL

Collaborator

This is all duplicated from the previous if section. Please clean it up.
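
One possible shape for that cleanup, as a sketch: both flags share a single install path. The wrapper name `installManagedGpuPkgsIfEnabled`, the marker path, the error code value, and the stubbed helpers are all assumptions so the snippet runs standalone. Whether the marker file should also be written in DRA mode is likewise assumed, since the diff context above only shows it at the end of the first branch.

```shell
# Stand-ins so the sketch runs outside CSE; the real helpers are defined in the CSE scripts.
ERR_NVIDIA_DCGM_INSTALL=1   # placeholder, not the real exit code
managed_gpu_marker="${TMPDIR:-/tmp}/managed-gpu-sketch/installed"   # hypothetical path
INSTALL_CALLS=0
logs_to_events() { shift; "$@"; }   # stub: drop the event name, run the command
installNvidiaManagedExpPkgFromCache() { INSTALL_CALLS=$((INSTALL_CALLS + 1)); }   # stub

# Hypothetical wrapper: one code path for both flags instead of two copies.
installManagedGpuPkgsIfEnabled() {
    if [ "${ENABLE_MANAGED_GPU_EXPERIENCE:-false}" = "true" ] ||
       [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}" = "true" ]; then
        logs_to_events "AKS.CSE.installNvidiaManagedExpPkgFromCache" "installNvidiaManagedExpPkgFromCache" \
            || return "${ERR_NVIDIA_DCGM_INSTALL}"
        # Assumed to apply to both modes.
        mkdir -p "$(dirname "${managed_gpu_marker}")"
        touch "${managed_gpu_marker}"
    fi
}
```

The sketch returns the error code rather than calling `exit`, so a caller would use something like `installManagedGpuPkgsIfEnabled || exit $?` to keep the existing exit behavior.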


# defer starting DRA driver services after kubelet.
if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then
logs_to_events "AKS.CSE.startNvidiaManagedExpServices" "startNvidiaManagedExpServices" || exit $?

Collaborator

Why is only the _DRA flavor logic present in this file? I would have expected the normal Managed GPU flow to call startNvidiaManagedExpServices as well.

# Reload systemd to pick up the override
systemctl daemon-reload
# 2. Start the dra-driver-nvidia-gpu service.
if [ "${ENABLE_MANAGED_GPU_EXPERIENCE_DRA}" = "true" ]; then

Collaborator

Same here: since ENABLE_MANAGED_GPU_EXPERIENCE and ENABLE_MANAGED_GPU_EXPERIENCE_DRA cannot both be true at the same time, I would make that clear in the code. Within this function, it currently looks like both can be enabled independently.
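
As a rough illustration of how the start function could make that explicit: a guard at the top rejects the invalid combination, then an if/elif picks exactly one service. This is a sketch, not the PR's `startNvidiaManagedExpServices`. The function name, the `start_service` stub, and the error code value are hypothetical, and it leaves out the DCGM services entirely.

```shell
STARTED=""
start_service() { STARTED="${STARTED} $1"; }   # hypothetical stub for the systemctl wrapper
ERR_DRA_DRIVER_START_FAIL=1                    # placeholder, not the real exit code

startManagedGpuServicesSketch() {
    local classic="${ENABLE_MANAGED_GPU_EXPERIENCE:-false}"
    local dra="${ENABLE_MANAGED_GPU_EXPERIENCE_DRA:-false}"

    # Guard: the two modes are assumed to be mutually exclusive.
    if [ "${classic}" = "true" ] && [ "${dra}" = "true" ]; then
        echo "managed GPU: device-plugin and DRA modes cannot both be enabled" >&2
        return 1
    fi

    if [ "${dra}" = "true" ]; then
        start_service "dra-driver-nvidia-gpu" || return "${ERR_DRA_DRIVER_START_FAIL}"
    elif [ "${classic}" = "true" ]; then
        start_service "nvidia-device-plugin" || return 1
    fi
}
```

With the guard in place, a reader of the function no longer has to infer from the callers that only one branch can run.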


3 participants