feat(linux): add AMD MI300X ROCm bootstrap #8824

Draft
wenhug wants to merge 1 commit into main from wenhug/amd-mi300x-rocm-bootstrap

feat(linux): add AMD MI300X ROCm bootstrap

129990c
Azure Pipelines / Agentbaker GPU E2E failed Jul 3, 2026 in 39m 10s

Build #20260702.42 had test failures

Tests

  • Failed: 39 (17.33%)
  • Passed: 186 (82.67%)
  • Other: 0 (0.00%)
  • Total: 225
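
As a quick sanity check on the summary above, the percentages follow directly from the raw counts. A minimal Python sketch; the helper name `pct` is hypothetical and not part of the pipeline:

```python
def pct(count: int, total: int) -> float:
    """Share of `count` in `total` as a percentage, rounded to two decimals."""
    return round(100 * count / total, 2)


# Counts copied from the test summary above.
failed, passed, other = 39, 186, 0
total = failed + passed + other

for label, n in (("Failed", failed), ("Passed", passed), ("Other", other)):
    print(f"{label}: {n} ({pct(n, total):.2f}%)")
```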

Annotations

Check failure on line 4084 in Build log

@azure-pipelines azure-pipelines / Agentbaker GPU E2E

Build log #L4084

Script failed with exit code: 1

Check failure on line 1 in Test_Ubuntu2204_GPUGridDriver/scriptless_nbc

Test_Ubuntu2204_GPUGridDriver/scriptless_nbc

Failed
Raw output
=== RUN   Test_Ubuntu2204_GPUGridDriver/scriptless_nbc
=== PAUSE Test_Ubuntu2204_GPUGridDriver/scriptless_nbc
=== CONT  Test_Ubuntu2204_GPUGridDriver/scriptless_nbc
    test_helpers.go:418: [10.347s] TAGS {Name:Test_Ubuntu2204_GPUGridDriver/scriptless_nbc ImageName:2204gen2containerd OS:ubuntu Arch:amd64 NetworkIsolated:false NonAnonymousACR:false GPU:true WASM:false BootstrapTokenFallback:false KubeletCustomConfig:false Scriptless:false VHDCaching:false MockAzureChinaCloud:false VMSeriesCoverageTest:false}
    test_helpers.go:229: [10.352s] → running scenario...
    test_helpers.go:246: [10.352s] using cluster abe2e-kubenet-v5-150ee in rg=abe2e-westus3 sub=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
    test_helpers.go:247: [10.352s] portal: https://portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    test_helpers.go:279: [10.375s] → preparing AKS node...
    vmss.go:531: [10.375s] → creating VMSS lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc...
    vmss.go:435: [11.260s] VMSS portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc/overview
    vmss.go:441: [11.260s] Managed cluster portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    vmss.go:564: [32.439s] VM will be automatically deleted after the test finishes, to preserve it for debugging purposes set KEEP_VMSS=true or pause the test with a breakpoint before the test finishes or failed
    vmss.go:568: [32.440s] SSH Instructions: (may take a few minutes for the VM to be ready for SSH)
        ========================
        az network bastion ssh --target-resource-id "/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc/virtualMachines/0" --name "abe2e-shared-bastion" --resource-group abe2e-westus3 --auth-type ssh-key --username azureuser --ssh-key /tmp/private-key-2766443897
        
    bastionssh.go:304: [344.472s] Attempt 1/5 establishing SSH over bastion to 10.220.112.51
    vmss.go:618: [345.472s] VM reached running state
    vmss.go:588: [345.473s] ✓ creating VMSS lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc done (335.1s)
    kube.go:160: [345.473s] → waiting for node lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc to be ready...
    kube.go:182: [345.595s] node lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc000000 is ready. Taints: [{"key":"node.kubernetes.io/network-unavailable","effect":"NoSchedule","timeAdded":"2026-07-03T00:06:04Z"}] Conditions: [{"type":"NetworkUnavailable","status":"True","lastHeartbeatTime":"2026-07-03T00:06:04Z","lastTransitionTime":"2026-07-03T00:06:04Z","reason":"NodeInitialization","message":"Waiting for cloud routes"},{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:06:28Z","lastTransitionTime":"2026-07-03T00:05:57Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:06:28Z","lastTransitionTime":"2026-07-03T00:05:57Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","s
... [The stack trace has been truncated as it exceeded the maximum allowed size. Please refer to the complete log available in the Test Run attachments for full details.]
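
For triage, the taints reported in the `kube.go:182` line above can be pulled out of the log text programmatically. The following is a minimal Python sketch, not part of the e2e suite: the function name `parse_taints` is hypothetical, and it assumes the line keeps the `Taints: [...] Conditions:` layout seen in this run. The `Conditions` array is not parsed because the CI output truncates it.

```python
import json
import re

# Abbreviated copy of the kube.go:182 line above; the Conditions array is cut
# short here because the CI output itself is truncated.
LOG_LINE = (
    'node lmfk-2026-07-03-ubuntu2204gpugriddriverscriptlessnbc000000 is ready. '
    'Taints: [{"key":"node.kubernetes.io/network-unavailable",'
    '"effect":"NoSchedule","timeAdded":"2026-07-03T00:06:04Z"}] '
    'Conditions: [{"type":"NetworkUnavailable","status":"True"'
)


def parse_taints(line: str) -> list[dict]:
    """Return the taints JSON array embedded in a log line, or [] if absent."""
    m = re.search(r"Taints: (\[.*?\]) Conditions:", line)
    return json.loads(m.group(1)) if m else []


taints = parse_taints(LOG_LINE)
for t in taints:
    print(t["key"], t["effect"])
```

Run against the line above, this reports a single `NoSchedule` taint, `node.kubernetes.io/network-unavailable`, which matches the `NetworkUnavailable` condition ("Waiting for cloud routes") logged at the moment the node was reported ready.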

Check failure on line 1 in Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default

Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default

Failed
Raw output
=== RUN   Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default
=== PAUSE Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default
=== CONT  Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default
    test_helpers.go:418: [10.344s] TAGS {Name:Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset/default ImageName:2204gen2containerd OS:ubuntu Arch:amd64 NetworkIsolated:false NonAnonymousACR:false GPU:true WASM:false BootstrapTokenFallback:false KubeletCustomConfig:false Scriptless:false VHDCaching:false MockAzureChinaCloud:false VMSeriesCoverageTest:false}
    test_helpers.go:229: [10.347s] → running scenario...
    test_helpers.go:246: [10.347s] using cluster abe2e-kubenet-v5-150ee in rg=abe2e-westus3 sub=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
    test_helpers.go:247: [10.347s] portal: https://portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    test_helpers.go:279: [10.382s] → preparing AKS node...
    vmss.go:531: [10.382s] → creating VMSS d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa...
    vmss.go:435: [11.483s] VMSS portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa/overview
    vmss.go:441: [11.483s] Managed cluster portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    vmss.go:564: [30.367s] VM will be automatically deleted after the test finishes, to preserve it for debugging purposes set KEEP_VMSS=true or pause the test with a breakpoint before the test finishes or failed
    vmss.go:568: [30.367s] SSH Instructions: (may take a few minutes for the VM to be ready for SSH)
        ========================
        az network bastion ssh --target-resource-id "/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa/virtualMachines/0" --name "abe2e-shared-bastion" --resource-group abe2e-westus3 --auth-type ssh-key --username azureuser --ssh-key /tmp/private-key-2766443897
        
    bastionssh.go:304: [402.878s] Attempt 1/5 establishing SSH over bastion to 10.220.112.44
    vmss.go:618: [404.757s] VM reached running state
    vmss.go:588: [404.757s] ✓ creating VMSS d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa done (394.4s)
    kube.go:160: [404.757s] → waiting for node d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa to be ready...
    kube.go:182: [404.886s] node d7ff-2026-07-03-ubuntu2204nvidiadeviceplugindaemonsetdefa000000 is ready. Taints: [{"key":"node.kubernetes.io/network-unavailable","effect":"NoSchedule","timeAdded":"2026-07-03T00:07:12Z"}] Conditions: [{"type":"NetworkUnavailable","status":"True","lastHeartbeatTime":"2026-07-03T00:07:11Z","lastTransitionTime":"2026-07-03T00:07:11Z","reason":"NodeInitialization","message":"Waiting for cloud routes"},{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:07:37Z","lastTransitionTime":"2026-07-03T00:07:06Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:07:37Z","lastTransitionTime":"2026-07-03T00:07:06Z","reason":"KubeletHasNoDiskPressure","mes
... [The stack trace has been truncated as it exceeded the maximum allowed size. Please refer to the complete log available in the Test Run attachments for full details.]

Check failure on line 1 in Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset

Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset

Failed
Raw output
=== RUN   Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset
=== PAUSE Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset
=== CONT  Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset
--- FAIL: Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset (0.00s)

Check failure on line 1 in Test_Ubuntu2204_GPUA10/default

Test_Ubuntu2204_GPUA10/default

Failed
Raw output
=== RUN   Test_Ubuntu2204_GPUA10/default
=== PAUSE Test_Ubuntu2204_GPUA10/default
=== CONT  Test_Ubuntu2204_GPUA10/default
    test_helpers.go:418: [10.346s] TAGS {Name:Test_Ubuntu2204_GPUA10/default ImageName:2204gen2containerd OS:ubuntu Arch:amd64 NetworkIsolated:false NonAnonymousACR:false GPU:true WASM:false BootstrapTokenFallback:false KubeletCustomConfig:false Scriptless:false VHDCaching:false MockAzureChinaCloud:false VMSeriesCoverageTest:false}
    test_helpers.go:229: [10.346s] → running scenario...
    test_helpers.go:246: [10.346s] using cluster abe2e-kubenet-v5-150ee in rg=abe2e-westus3 sub=8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8
    test_helpers.go:247: [10.346s] portal: https://portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/abe2e-westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    test_helpers.go:279: [10.381s] → preparing AKS node...
    vmss.go:531: [10.382s] → creating VMSS ojep-2026-07-03-ubuntu2204gpua10default...
    vmss.go:435: [11.377s] VMSS portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/ojep-2026-07-03-ubuntu2204gpua10default/overview
    vmss.go:441: [11.389s] Managed cluster portal link: https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.ContainerService/managedClusters/abe2e-kubenet-v5-150ee/overview
    vmss.go:564: [30.460s] VM will be automatically deleted after the test finishes, to preserve it for debugging purposes set KEEP_VMSS=true or pause the test with a breakpoint before the test finishes or failed
    vmss.go:568: [30.460s] SSH Instructions: (may take a few minutes for the VM to be ready for SSH)
        ========================
        az network bastion ssh --target-resource-id "/subscriptions/8ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8/resourceGroups/MC_abe2e-westus3_abe2e-kubenet-v5-150ee_westus3/providers/Microsoft.Compute/virtualMachineScaleSets/ojep-2026-07-03-ubuntu2204gpua10default/virtualMachines/0" --name "abe2e-shared-bastion" --resource-group abe2e-westus3 --auth-type ssh-key --username azureuser --ssh-key /tmp/private-key-2766443897
        
    bastionssh.go:304: [403.286s] Attempt 1/5 establishing SSH over bastion to 10.220.112.50
    vmss.go:618: [405.282s] VM reached running state
    vmss.go:588: [405.282s] ✓ creating VMSS ojep-2026-07-03-ubuntu2204gpua10default done (394.9s)
    kube.go:160: [405.283s] → waiting for node ojep-2026-07-03-ubuntu2204gpua10default to be ready...
    kube.go:182: [405.403s] node ojep-2026-07-03-ubuntu2204gpua10default000000 is ready. Taints: [{"key":"node.kubernetes.io/network-unavailable","effect":"NoSchedule","timeAdded":"2026-07-03T00:07:26Z"}] Conditions: [{"type":"NetworkUnavailable","status":"True","lastHeartbeatTime":"2026-07-03T00:07:26Z","lastTransitionTime":"2026-07-03T00:07:26Z","reason":"NodeInitialization","message":"Waiting for cloud routes"},{"type":"MemoryPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:07:19Z","lastTransitionTime":"2026-07-03T00:07:19Z","reason":"KubeletHasSufficientMemory","message":"kubelet has sufficient memory available"},{"type":"DiskPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:07:19Z","lastTransitionTime":"2026-07-03T00:07:19Z","reason":"KubeletHasNoDiskPressure","message":"kubelet has no disk pressure"},{"type":"PIDPressure","status":"False","lastHeartbeatTime":"2026-07-03T00:07:19Z","lastTransitionTime":"2026-07-03T00:07:19Z","reason":"KubeletHasSufficientPI
... [The stack trace has been truncated as it exceeded the maximum allowed size. Please refer to the complete log available in the Test Run attachments for full details.]