feat(gpu): remove dead NCv1/K80 R470 driver support #8816
Open
ganeshkumarashok wants to merge 1 commit into
NCv1 (K80, Kepler) was special-cased to the aks-gpu-cuda R470 driver (cuda-470.82.01), but AKS has never published an aks-gpu-cuda R470 image -- that tag 404s for every suffix (the repo only carries 550.x-595.x builds), so the managed driver install for K80 has always failed on this path. R470 is also the last NVIDIA branch to support Kepler data-center GPUs, so there is no forward-compatible managed driver for the SKU, and the hardware is EOL (only a couple of ancient pre-container-scheme nodes remain in the fleet, all on k8s 1.20-1.25).

Remove the special-casing so NCv1 falls through to the default cuda-lts path like every other compute SKU:

- drop the Nvidia470CudaDriverVersion constant
- drop the isStandardNCv1 helper and its branches in GetGPUDriverType / GetGPUDriverVersion
- refresh the GetGPUDriverType doc comment + ACL/Mariner comments
- update the GPU driver tests to assert the cuda-lts fall-through

Behaviorally neutral for working SKUs. NCv1 changes from a 404 (image not found) to attempting cuda-lts (R580), which cannot drive a K80 either -- but the SKU has no working managed driver regardless.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Pull request overview
Removes dead NCv1/K80-specific GPU driver mapping logic so legacy NCv1 no longer attempts to use a non-existent AKS R470 (aks-gpu-cuda) image, and instead follows the default cuda-lts (R580 LTS) path like other compute GPU SKUs. This simplifies GPU driver selection behavior in the AgentBaker service and aligns provisioning-script comments and unit tests with the actual supported image set.
Changes:
- Removed the unused `Nvidia470CudaDriverVersion` constant and the `isStandardNCv1` special-casing in `GetGPUDriverType` / `GetGPUDriverVersion`.
- Updated `GetGPUDriverType`'s doc comment to explicitly document NCv1/K80's intentional fall-through behavior.
- Updated Go tests and Linux CSE script comments to reflect the new behavior (NCv1 → `cuda-lts`).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/agent/datamodel/gpu_components.go | Removes the unused R470 constant from GPU component metadata. |
| pkg/agent/baker.go | Deletes NCv1 detection and special-casing; updates driver-type documentation to match actual behavior. |
| pkg/agent/baker_test.go | Updates test expectations to assert NCv1 now falls through to cuda-lts behavior. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Updates comments describing NVIDIA_GPU_DRIVER_TYPE mapping for NCv1/non-grid SKUs. |
| parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh | Updates comments describing NVIDIA_GPU_DRIVER_TYPE mapping for NCv1/non-grid SKUs. |
What
Removes the special-casing that mapped legacy NCv1 (K80, Kepler) SKUs to the `aks-gpu-cuda` R470 driver (`cuda-470.82.01`). NCv1 now falls through to the default `cuda-lts` (R580 LTS) path like every other compute GPU SKU.

- Drop the `Nvidia470CudaDriverVersion` constant (`pkg/agent/datamodel/gpu_components.go`)
- Drop the `isStandardNCv1` helper and its branches in `GetGPUDriverType` / `GetGPUDriverVersion` (`pkg/agent/baker.go`)
- Refresh the `GetGPUDriverType` doc comment and the ACL / Mariner `cse_install` comments to match
- Update the `GetGPUDriverType` / `GetGPUDriverVersion` tests to assert the `cuda-lts` fall-through

Why
The NCv1 → R470 path has never worked and cannot be made to work:

- AKS has never published an `aks-gpu-cuda` R470 image. `mcr.microsoft.com/aks/aks-gpu-cuda:cuda-470.82.01-<suffix>` 404s for every suffix (verified for the pre-existing suffix, the current suffix, the bare `cuda-470.82.01`, and `470.82.01`). The repo only carries 550.x-595.x tags (0 of 74 contain `470`), so the managed driver install for K80 has always failed on this path.

The constant, helper, and comments describing a "pinned R470 driver" were therefore dead code / misleading documentation.
Behavior change (intentional)
| SKU | Before | After |
|---|---|---|
| NCv1 (`standard_nc6/nc12/nc24[r]` + `_promo`) | `aks-gpu-cuda:cuda-470.82.01-*` -> 404, CSE fails | `aks-gpu-cuda-lts:580.159.04-*` -> pulls & installs R580 |

R580 cannot drive a K80 either, so neither state yields a working K80; this is not a regression in capability. It trades a hard 404 for the same default path all other CUDA SKUs take, and removes the fiction that AKS ships an R470 driver. All working SKUs (T4, V100, A100, H100, H200, GRID, GRID-v20) are unaffected.
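To make the before/after concrete, here is a hedged sketch of the selection logic for compute (non-GRID) SKUs. It is not the AgentBaker implementation: `isNCv1`, `driverImage`, and the regular expression are hypothetical stand-ins, the SKU pattern is inferred from the `standard_nc6/nc12/nc24[r]` + `_promo` shorthand above, and the image strings omit the tag suffix.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ncv1Pattern is a hypothetical matcher for the NCv1 sizes named in this PR:
// standard_nc6, standard_nc12, standard_nc24, standard_nc24r,
// each with an optional _promo suffix.
var ncv1Pattern = regexp.MustCompile(`^standard_nc(6|12|24r?)(_promo)?$`)

// isNCv1 reports whether vmSize is one of the legacy K80 sizes (case-insensitive).
func isNCv1(vmSize string) bool {
	return ncv1Pattern.MatchString(strings.ToLower(vmSize))
}

// driverImage (hypothetical) returns the repo:version prefix a compute SKU
// would resolve to. specialCaseNCv1=true models the old behavior,
// false models the behavior after this PR. GRID SKUs are out of scope here.
func driverImage(vmSize string, specialCaseNCv1 bool) string {
	if specialCaseNCv1 && isNCv1(vmSize) {
		return "aks-gpu-cuda:cuda-470.82.01" // tag was never published, so pulls 404
	}
	return "aks-gpu-cuda-lts:580.159.04" // default cuda-lts path
}

func main() {
	for _, size := range []string{"Standard_NC6", "Standard_NC24r_Promo", "Standard_NC6s_v3"} {
		fmt.Printf("%-22s before=%s after=%s\n",
			size, driverImage(size, true), driverImage(size, false))
	}
}
```

Only the sizes matched by `isNCv1` differ between the two columns; every other compute size resolves to the same image in both modes, which is the "behaviorally neutral for working SKUs" claim.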
Testing
- `go build ./pkg/agent/...` + `aks-node-controller` (reuses `agent.GetGPUDriverType/Version` via `replace => ../`) - OK
- `go vet ./pkg/agent/` - OK
- `GENERATE_TEST_DATA=true go test ./pkg/agent/...` - pass, zero golden drift (no testdata renders an NCv1 SKU)
- `GetGPUDriver*` Ginkgo specs - pass
- `shellcheck` on the two edited shell files - clean on changed lines (only pre-existing warnings elsewhere)
- No remaining `Nvidia470CudaDriverVersion` / `isStandardNCv1` references in the tree

Risk
Low. Contained to AgentBaker (no RP consumers of these functions). Only NCv1/K80 behavior changes, and that SKU has no working managed driver in either state.