feat(gpu): bump aks-gpu-cuda to 595.71.05 (build-only-capable; prereq for CUDA prebake)#8810
ganeshkumarashok wants to merge 1 commit
… for CUDA prebake)

Bump the cached CUDA driver image aks-gpu-cuda from 580.126.09-20260126030251 (Jan 2026) to 595.71.05-20260623180420. Prerequisite for the CUDA driver prebake (install-dependencies.sh, merged in #8786; flag enabled in #8803).

The prebake runs the aks-gpu container with `/entrypoint.sh build-only`, an action added in aks-gpu #162 (June 2026). No 580.126.09 image supports build-only (newest 580 build is 2026-04-30, pre-#162); only the 595.71.05 family (May-June builds) does. Without this bump the prebake fails at VHD build: install.sh:4 `source /opt/gpu/config.sh: No such file`. This exact image was validated green via the prebake GPU e2e.

WARNING -- drops V100/Volta support. NVIDIA R590/R595 branches no longer load on Tesla V100 (Volta); R580 is the last branch supporting it. AgentBaker e2e still exercises V100 via Standard_NC6s_v3 (Test_Ubuntu2204_GPUNC, Test_ACL_GPUNC, GPUNoDriver, AzureLinux GPU), so this will break those scenarios and any NC-v3/V100 nodes on the shared image.

Do not merge without V100/NC-v3 EOL sign-off (or an aks-gpu build-only backport to a V100-capable 580.x image). Kept as draft.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Pull request overview
This PR bumps the cached AKS CUDA GPU driver container image tag (aks-gpu-cuda) in parts/common/components.json, intended to unblock the Ubuntu VHD-time CUDA driver prebake path (which requires the newer build-only entrypoint behavior in the aks-gpu image family).
Changes:
- Update `aks-gpu-cuda` image tag from `580.126.09-20260126030251` to `595.71.05-20260623180420`.
Package Update Analysis: aks-gpu-cuda
Version change: 580.126.09 → 595.71.05 (major driver-branch bump)
OS variants affected: Ubuntu 22.04 gen2 x86, Ubuntu 24.04 gen2 x86 (this image is pre-pulled for Ubuntu during VHD build)
OS variants NOT updated: None (single shared tag in components.json)
Changes between 580.126.09 and 595.71.05
| Change | Description | Risk |
|---|---|---|
| Breaking | R595/R590 branches no longer support Tesla V100 (Volta); R580 is the legacy branch for Volta | 🔴 High |
| Feature | build-only / related entrypoint modes needed for VHD prebake workflows | 🟢 Low |
| Driver branch update | Newer datacenter driver branch with additional fixes/features vs R580 | 🟡 Medium |
Overall Risk: 🔴 High
Justification: The repo still has active V100/NCsv3 coverage (e.g., Standard_NC6s_v3 GPU E2E scenarios), and an R595 bump on the shared CUDA driver image risks breaking those nodes/tests without an explicit V100/NCsv3 support/EOL decision or a split/backport strategy.
Recommendation: Hold until V100 disposition is confirmed (EOL vs backport vs split driver selection).
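The branch check behind this risk rating can be sketched in a few lines. This is a minimal illustration, not AgentBaker tooling: `parse_tag`, `drops_volta`, and `LAST_VOLTA_BRANCH` are hypothetical names, the tag layout (`<driver version>-<14-digit build stamp>`) is inferred from the two tags in this PR, and the cutoff of 580 rests on the statement above that R580 is the last branch supporting Volta.

```python
import re
from typing import NamedTuple

# Assumption (from the review above): R580 is the last driver branch
# that still loads on Volta / Tesla V100.
LAST_VOLTA_BRANCH = 580

# Assumed tag layout, inferred from the tags in this PR:
# <major>.<minor>.<patch>-<14-digit build timestamp>
_TAG_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d{14})$")


class GpuImageTag(NamedTuple):
    driver_version: str  # e.g. "595.71.05"
    branch: int          # major driver branch, e.g. 595
    build_stamp: str     # e.g. "20260623180420"


def parse_tag(tag: str) -> GpuImageTag:
    """Split an image tag into driver version, branch, and build stamp."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    major, minor, patch, stamp = match.groups()
    return GpuImageTag(f"{major}.{minor}.{patch}", int(major), stamp)


def drops_volta(old_tag: str, new_tag: str) -> bool:
    """True when the bump crosses from a Volta-capable branch to a newer one."""
    old, new = parse_tag(old_tag), parse_tag(new_tag)
    return old.branch <= LAST_VOLTA_BRANCH < new.branch


if __name__ == "__main__":
    print(drops_volta("580.126.09-20260126030251", "595.71.05-20260623180420"))
```

A check of this shape could gate tag bumps on the shared image until the V100 disposition is decided; whether such a gate belongs in CI is a separate design question.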
      "downloadURL": "mcr.microsoft.com/aks/aks-gpu-cuda:*",
      "gpuVersion": {
        "renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-cuda",
    -   "latestVersion": "580.126.09-20260126030251"
    +   "latestVersion": "595.71.05-20260623180420"
      }
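To make the effect of the changed field concrete, the sketch below resolves the image reference from an entry shaped like the fragment in this diff. It assumes the `*` in `downloadURL` is a placeholder that gets replaced by `gpuVersion.latestVersion`; that substitution rule and the helper name `resolve_image` are illustrative assumptions, not confirmed AgentBaker behavior.

```python
def resolve_image(entry: dict) -> str:
    """Build a full image reference from a components.json-style entry.

    Assumption: the '*' tag in downloadURL stands in for
    gpuVersion.latestVersion.
    """
    url = entry["downloadURL"]
    version = entry["gpuVersion"]["latestVersion"]
    repo, sep, tag = url.rpartition(":")
    if sep != ":" or tag != "*":
        raise ValueError(f"expected a ':*' tag placeholder in {url!r}")
    return f"{repo}:{version}"


if __name__ == "__main__":
    # Entry copied from the fragment changed in this PR.
    entry = {
        "downloadURL": "mcr.microsoft.com/aks/aks-gpu-cuda:*",
        "gpuVersion": {
            "renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-cuda",
            "latestVersion": "595.71.05-20260623180420",
        },
    }
    print(resolve_image(entry))
```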
Superseded by #8811. Rather than bump …
What this PR does / why we need it:
Bumps the cached CUDA driver image `aks-gpu-cuda` in `parts/common/components.json`.

This is the prerequisite that unblocks the CUDA driver prebake (bake merged in #8786, flag enabled in #8803). The prebake runs the aks-gpu container with `/entrypoint.sh build-only`, an action added in aks-gpu #162 (merged June 2026). I verified against MCR that no `580.126.09` image supports `build-only` (newest 580 build is `2026-04-30`, pre-#162); only the `595.71.05` family (May–June builds) does. Without this bump the prebake fails at VHD build (observed in PR-gate build `170469320` for #8803).

`595.71.05-20260623180420` is the exact image validated green in the prebake GPU e2e.

Only `aks-gpu-cuda` is bumped; `aks-gpu-grid` / `aks-gpu-grid-v20` are not baked by the enablement (#8803) and are left unchanged.

🔴 BLOCKER: this drops V100 / Volta support (why it's a draft)

Version change: `aks-gpu-cuda` 580.126.09 → 595.71.05 (major driver-branch bump)

OS variants affected: Ubuntu 22.04 / 24.04 gen2 x86 (shared image used by all CUDA GPU SKUs)

`build-only` entrypoint (aks-gpu #162) → enables VHD-time DKMS prebake

AgentBaker e2e still actively tests V100 via `Standard_NC6s_v3`: `Test_Ubuntu2204_GPUNC`, `Test_ACL_GPUNC`, `Test_Ubuntu2204_GPUNoDriver` (`_Scriptless`), `Test_AzureLinuxV3_GPU`. This bump will break those scenarios and any NC-v3 / V100 customer nodes on the shared image.

This is almost certainly why `components.json` has been held at 580.126.09 despite 595.71.05 being available since May.

Do NOT merge until one of these is resolved

- Backport `build-only` to a V100-capable `580.x` image, which keeps V100 and enables prebake (no driver-branch jump), or …

Overall Risk: 🔴 High

Justification: Major GPU driver-branch bump on the shared image that removes support for a still-tested, still-shipping GPU architecture (V100/NC-v3).

Recommendation: Hold as draft pending the V100 disposition above. Sequenced before #8803.
Which issue(s) this PR fixes: Fixes #
Requirements
- Azure/AgentBaker: `make generate-testdata` clean (GPU image version isn't snapshotted); `make validate-components` passes