feat(gpu): use aks-gpu-cuda-lts (R580 LTS) for the managed CUDA driver #8811
Switch the managed CUDA GPU driver image from aks-gpu-cuda to the R580 LTS variant aks-gpu-cuda-lts (580.159.04-20260629214430).

Why: enabling the CUDA driver prebake (#8786 / #8803) needs an aks-gpu image that supports the `build-only` action (aks-gpu #162). The only build-only-capable aks-gpu-cuda images are on the R595 line, which drops NVIDIA Volta/V100 support -- and ~487 managed-GPU nodes across ~293 subscriptions still run on V100 (NC*_v3, ND40rs_v2). aks-gpu already ships aks-gpu-cuda-lts: the NVIDIA R580 Long Term Support branch (supported through Aug 2028), which keeps V100, whose post-#162 builds have build-only, and which also covers every other managed CUDA SKU (T4/A100/H100/H200). So this keeps V100 working and unblocks the prebake with no aks-gpu change. It is a move within the 580 line (580.126.09 -> 580.159.04), not a driver-branch jump.

Wiring:
- components.json: aks-gpu-cuda -> aks-gpu-cuda-lts (repo, renovateTag, version).
- gpu_components.go: LoadConfig case -> aks-gpu-cuda-lts (drives the CUDA driver version/suffix used by both VHD build and runtime install).
- baker.go GetGPUDriverType: modern CUDA SKUs -> "cuda-lts" (selects the aks-gpu-cuda-lts image via cse_helpers.sh `aks-gpu-${type}`); legacy NCv1 (K80) stays on "cuda" with its pinned R470 driver.
- cse_config.sh logGPUDriverPrebakeReadiness: map the driver-type to the aks-gpu marker's driver_kind (cuda-lts -> cuda, grid-v20 -> grid) so a CUDA-prebaked marker matches a cuda-lts node.
- install-dependencies.sh: VHD-build prebake/caching image selection -> aks-gpu-cuda-lts.
- ACL/Mariner comments updated; their grid-vs-non-grid sysext logic already handles "cuda-lts". ACL/AzureLinux install drivers from OS sysext images (not the aks-gpu container), so this change is effectively Ubuntu-scoped.

Validation: go test, make generate-testdata (no drift), shellcheck, shellspec (751/0), make validate-components all pass.

Supersedes #8810 (which bumped aks-gpu-cuda to R595 and would have dropped V100).

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
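The driver-type to `driver_kind` mapping described for `logGPUDriverPrebakeReadiness` can be sketched roughly as follows. This is an illustration only, not the code in `cse_config.sh`: the function names `driver_kind_for_type` and `marker_matches_driver_type` are hypothetical, and the pass-through behavior for any other driver type is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not the real cse_config.sh code): map an AgentBaker
# driver type to the driver_kind recorded in the aks-gpu prebake marker.
driver_kind_for_type() {
    case "$1" in
        cuda-lts) echo "cuda" ;;  # stated in the commit message: cuda-lts -> cuda
        grid-v20) echo "grid" ;;  # stated in the commit message: grid-v20 -> grid
        *)        echo "$1" ;;    # assumption: every other type passes through unchanged
    esac
}

# Hypothetical check: does a prebaked marker's kind match this node's driver type?
marker_matches_driver_type() {
    marker_kind="$1"
    driver_type="$2"
    [ "$marker_kind" = "$(driver_kind_for_type "$driver_type")" ]
}

# A marker baked as "cuda" should be accepted by a node whose type is "cuda-lts".
if marker_matches_driver_type "cuda" "cuda-lts"; then
    echo "prebake marker matches"
fi
```

Without such a mapping, a direct string comparison of `cuda` against `cuda-lts` would report a mismatch even though the prebaked driver is the right one.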
Pull request overview
This PR updates the managed NVIDIA CUDA driver container image selection so GPU nodes use the R580 LTS–based aks-gpu-cuda-lts image (to retain Volta/V100 support while enabling build-only prebake), and wires the new driver type through AgentBaker’s Go selection logic, VHD build caching, and CSE observability/tests.
Changes:
- Switch GPU container image metadata in `parts/common/components.json` from `aks-gpu-cuda` to `aks-gpu-cuda-lts` (including renovate tag + version).
- Update AgentBaker Go GPU component loading and driver-type selection to use `cuda-lts` for modern CUDA SKUs (while keeping legacy NCv1 on `cuda`).
- Update VHD build pre-pull logic and add ShellSpec coverage for the prebake marker kind mapping.
Package Update Analysis: aks-gpu-cuda-lts
Version change: 580.126.09-20260126030251 → 580.159.04-20260629214430 (minor update within the 580 driver branch)
OS variants affected: Ubuntu VHD build/runtime managed-driver path (container-based install)
OS variants NOT updated: AzureLinux/Mariner/ACL (they install via sysext/RPM paths, not the aks-gpu CUDA container)
Changes between versions: Upstream, version-specific release-note diffs for these exact driver point releases were not found in a reliably citable form during this review. Manual validation (e2e GPU scenarios + targeted V100/NCv3 coverage) is recommended before merge.
Overall Risk: Medium
Justification: Although the driver stays on the R580 line (not a major branch jump), the PR also changes the image repository and introduces a new driver-type string (cuda-lts) that must remain compatible with legacy selection paths (notably NCv1/K80).
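As a quick sanity check of the "same driver branch" claim, the two tags quoted above can be split into driver version and build timestamp and their leading component compared. This is a minimal sketch; the helper names `driver_version` and `driver_branch` are hypothetical and not part of AgentBaker.

```shell
#!/bin/sh
# Tags copied from the version change above: "<driver version>-<build timestamp>".
old_tag="580.126.09-20260126030251"
new_tag="580.159.04-20260629214430"

# Hypothetical helpers: strip the timestamp, then keep the leading branch number.
driver_version() { echo "${1%%-*}"; }
driver_branch()  { v="$(driver_version "$1")"; echo "${v%%.*}"; }

if [ "$(driver_branch "$old_tag")" = "$(driver_branch "$new_tag")" ]; then
    echo "same branch: R$(driver_branch "$new_tag")"   # prints "same branch: R580"
else
    echo "branch jump: R$(driver_branch "$old_tag") -> R$(driver_branch "$new_tag")"
fi
```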
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `parts/common/components.json` | Switch CUDA GPU image metadata to aks-gpu-cuda-lts and bump version. |
| `pkg/agent/datamodel/gpu_components.go` | Load CUDA driver version/suffix from aks-gpu-cuda-lts repo entry. |
| `pkg/agent/datamodel/gpu_components_test.go` | Update repo parsing test to expect aks-gpu-cuda-lts. |
| `pkg/agent/baker.go` | Return cuda-lts for modern CUDA SKUs; keep legacy NCv1 on cuda. |
| `pkg/agent/baker_test.go` | Update expectations for cuda-lts and add explicit NCv1 legacy coverage. |
| `vhdbuilder/packer/install-dependencies.sh` | Pre-pull aks-gpu-cuda-lts during Ubuntu VHD build and update related error text. |
| `parts/linux/cloud-init/artifacts/cse_config.sh` | Map driver-type to marker driver_kind for prebake readiness logging (cuda-lts→cuda). |
| `spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh` | Add ShellSpec test for cuda-lts marker-kind matching behavior. |
| `parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh` | Update comments describing cuda-lts vs legacy cuda behavior (no logic change). |
| `parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh` | Update comments describing cuda-lts vs legacy cuda behavior (no logic change). |
The custom versioning rule that lets Renovate parse the driver image's "<major.minor.patch>-<timestamp>" tag matched "aks/aks-gpu-cuda"; after moving the managed CUDA driver to aks-gpu-cuda-lts, retarget the rule (and groupName) so the LTS repo is version-tracked.

Also flip automerge to false: this is now the V100-critical managed driver, so driver bumps should be reviewed (matching the aks-gpu-grid / grid-v20 rules) rather than auto-merged.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
#8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts (R580 LTS, V100-capable). This keeps the pre-LTS aks-gpu-cuda image tracked in the component manifest during the transition, pinned to the R580 line.

Minimal footprint -- components.json + renovate only; no Go/behavior change:
- components.json: add an aks-gpu-cuda entry pinned to the R580 line (580.126.09), NOT the R595 line that drops Volta/V100. LoadConfig has no case for it, so it touches no driver-version global (avoiding a clobber of the aks-gpu-cuda-lts render values -- they share the NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix globals).
- renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to the V100-dropping R595 line.

Deliberately inert: not baked into the VHD (install-dependencies.sh only pre-pulls aks-gpu-cuda-lts) and not a render target (GetGPUDriverType returns "cuda-lts"). Old-VHD / version-skewed nodes that target aks-gpu-cuda already resolve it at boot via the hardened registry pull (#8821), served by required-MCR egress or the wildcard network-isolated ACR cache. This just keeps it a recognized, V100-safe, renovate-managed reference during the transition.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
…a-lts

#8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts, reusing the NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix globals for the LTS image. This restores a first-class `case "aks-gpu-cuda"` in LoadConfig so the pre-LTS image's version is loaded and available if a SKU is ever routed to the "cuda" image in CSE again -- without disturbing today's render.

- components.json: add an aks-gpu-cuda entry pinned to the R580 line (580.126.09), NOT the R595 line that drops Volta/V100.
- gpu_components.go: aks-gpu-cuda reclaims NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix (its pre-#8811 names); aks-gpu-cuda-lts moves to NvidiaCudaLTSDriverVersion / AKSGPUCudaLTSVersionSuffix. Mirrors the existing base-vs-variant naming (NvidiaGridDriverVersion vs NvidiaGridV20DriverVersion) and avoids clobbering a shared global.
- baker.go: GetGPUDriverVersion / GetAKSGPUImageSHA render the LTS globals for modern CUDA SKUs, so rendered output is byte-identical (verified: zero testdata drift). aks-gpu-cuda is loaded but not the default render target.
- renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to R595.

Still not baked into the VHD (install-dependencies.sh only pre-pulls aks-gpu-cuda-lts). Old-VHD / skewed nodes that target aks-gpu-cuda resolve it at boot via the hardened pull (#8821), served by required-MCR or the wildcard network-isolated ACR cache.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
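To illustrate what the `/^580\./` constraint on aks-gpu-cuda is meant to achieve, here is a rough shell approximation of the filter. It is a sketch, not how Renovate evaluates the rule: `is_allowed_cuda_tag` is a hypothetical name, the glob `580.*` stands in for the regex, and the R595 tag in the loop is a made-up placeholder rather than a real published tag.

```shell
#!/bin/sh
# Hypothetical filter approximating the /^580\./ constraint: accept only tags
# whose driver version starts with "580.".
is_allowed_cuda_tag() {
    case "$1" in
        580.*) return 0 ;;  # stays on the R580 line
        *)     return 1 ;;  # anything else (e.g. an R595 tag) is rejected
    esac
}

# The second tag is an invented placeholder used only to show a rejection.
for tag in "580.126.09-20260126030251" "595.0.0-20260101000000"; do
    if is_allowed_cuda_tag "$tag"; then
        echo "allowed:  $tag"
    else
        echo "rejected: $tag"
    fi
done
```

Anchoring on the trailing dot matters: a prefix check on `580` alone would also accept a hypothetical `5800.x` version.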
What this PR does / why we need it:
Switches the managed CUDA GPU driver image from `aks-gpu-cuda` to the R580 LTS variant `aks-gpu-cuda-lts` (580.159.04-20260629214430).

Why: Enabling the CUDA driver prebake (bake #8786, flag #8803) requires an aks-gpu image that supports the `build-only` action (aks-gpu #162). The only `build-only`-capable `aks-gpu-cuda` builds are on the R595 line, and NVIDIA R595 drops Volta/V100 (R590+ deprecation). Fleet telemetry (AKSprodAgentPoolSnapshot) shows that a significant number of managed-GPU nodes, spanning many customer subscriptions, still run on V100 (`NC*_v3`, `ND40rs_v2`), and nearly all of them rely on the managed driver (only a small fraction opt out via `--gpu-driver None`). So bumping to R595 (the approach in #8810) would break them.

aks-gpu already publishes exactly what we need: `aks-gpu-cuda-lts` = the NVIDIA R580 Long Term Support branch (`driver_config.yml`: "supported through Aug 2028"), which keeps V100 and whose post-#162 builds already have `build-only`. It also covers every other managed CUDA SKU (T4/A100/H100/H200). This keeps V100 working and unblocks the prebake, with no aks-gpu change. It's a move within the 580 line (580.126.09 → 580.159.04), not a driver-branch jump.

Wiring (traced end-to-end):
| File | Change |
|---|---|
| `parts/common/components.json` | `aks-gpu-cuda` → `aks-gpu-cuda-lts` (repo, `renovateTag`, version) |
| `pkg/agent/datamodel/gpu_components.go` | `LoadConfig` case → `aks-gpu-cuda-lts` (sets CUDA driver version/suffix for VHD build and runtime) |
| `pkg/agent/baker.go` | `GetGPUDriverType`: modern CUDA SKUs → `"cuda-lts"` (selects `aks-gpu-cuda-lts` via `cse_helpers.sh` `aks-gpu-${type}`); legacy NCv1/K80 stays `"cuda"` with its pinned R470 driver |
| `parts/.../cse_config.sh` | `logGPUDriverPrebakeReadiness`: maps the driver-type to the marker's `driver_kind` (`cuda-lts`→`cuda`, `grid-v20`→`grid`) so a CUDA-prebaked marker matches a `cuda-lts` node |
| `vhdbuilder/packer/install-dependencies.sh` | VHD-build prebake/caching image selection → `aks-gpu-cuda-lts` |
| `cse_install_*.sh` (ACL/Mariner) | Comments updated; the grid-vs-non-grid sysext logic already handles `cuda-lts` |

Scope note: ACL/AzureLinux install GPU drivers from OS sysext images, not the aks-gpu container, so this is effectively Ubuntu-scoped.
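The `aks-gpu-${type}` selection mentioned in the wiring can be pictured with a small sketch. The function names `aks_gpu_image_repo` and `aks_gpu_image_ref` are hypothetical, the registry prefix is deliberately left out, and the real composition in `cse_helpers.sh` may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper mirroring the aks-gpu-${type} naming: the driver type
# returned by GetGPUDriverType selects the image repository.
aks_gpu_image_repo() {
    driver_type="$1"
    echo "aks-gpu-${driver_type}"
}

# Hypothetical helper: join repository and tag (registry prefix omitted on purpose).
aks_gpu_image_ref() {
    echo "$(aks_gpu_image_repo "$1"):$2"
}

aks_gpu_image_repo "cuda-lts"   # modern CUDA SKUs -> aks-gpu-cuda-lts
aks_gpu_image_repo "cuda"       # legacy NCv1/K80  -> aks-gpu-cuda
aks_gpu_image_ref "cuda-lts" "580.159.04-20260629214430"
```

Because the repository name is derived from the driver type, returning `"cuda-lts"` from the Go side is enough to redirect the pull to the LTS image without touching the shell naming logic.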
Validation:
`go test` ✅ · `make generate-testdata` (no drift) ✅ · `shellcheck` (changed scripts clean) ✅ · `shellspec` 751/0 (incl. new `cuda-lts`→`cuda` match test) ✅ · `make validate-components` ✅

Relationship to other PRs:
- Supersedes #8810 (bumped `aks-gpu-cuda` → R595; would have dropped V100). I'll close it.
- … `build-only` prebake works while V100 stays supported.

Which issue(s) this PR fixes: Fixes #
Requirements
- Azure/AgentBaker
- `make generate-testdata` clean; `go test`, `shellspec`, `validate-components` pass