feat(gpu): remove dead NCv1/K80 R470 driver support #8816

Open
ganeshkumarashok wants to merge 1 commit into main from ganesh/gpu-drop-470-ncv1

@ganeshkumarashok

Contributor

What

Removes the special-casing that mapped legacy NCv1 (K80, Kepler) SKUs to the aks-gpu-cuda R470 driver (cuda-470.82.01). NCv1 now falls through to the default cuda-lts (R580 LTS) path like every other compute GPU SKU.

  • Drop the Nvidia470CudaDriverVersion constant (pkg/agent/datamodel/gpu_components.go)
  • Drop the isStandardNCv1 helper and its branches in GetGPUDriverType / GetGPUDriverVersion (pkg/agent/baker.go)
  • Refresh the GetGPUDriverType doc comment and the ACL / Mariner cse_install comments to match
  • Update the GetGPUDriverType / GetGPUDriverVersion tests to assert the cuda-lts fall-through
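
The effect of dropping the NCv1 branch can be sketched as below. This is a minimal model, not the AgentBaker code: `isNCv1`, `driverTypeBefore`, `driverTypeAfter`, and the `"cuda"` label for the old path are hypothetical stand-ins, GRID SKUs are left out, and only the SKU names and the cuda-lts default are taken from this description.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder labels for illustration only; the real driver type
// strings are defined in AgentBaker and may differ.
const (
	driverCudaR470 = "cuda"     // assumption: label for the removed R470 path
	driverCudaLTS  = "cuda-lts" // default path named in the PR description
)

// isNCv1 is a hypothetical re-creation of the removed helper. It matches
// standard_nc6/nc12/nc24/nc24r and their _promo variants, ignoring case.
func isNCv1(size string) bool {
	s := strings.TrimSuffix(strings.ToLower(size), "_promo")
	switch s {
	case "standard_nc6", "standard_nc12", "standard_nc24", "standard_nc24r":
		return true
	}
	return false
}

// driverTypeBefore models the old selection: NCv1 is special-cased.
func driverTypeBefore(size string) string {
	if isNCv1(size) {
		return driverCudaR470
	}
	return driverCudaLTS
}

// driverTypeAfter models the new selection for non-GRID compute SKUs:
// every size, NCv1 included, takes the default path.
func driverTypeAfter(_ string) string {
	return driverCudaLTS
}

func main() {
	for _, size := range []string{"Standard_NC6", "Standard_NC24r_Promo", "Standard_NC6s_v3"} {
		fmt.Printf("%-22s before=%-8s after=%s\n", size, driverTypeBefore(size), driverTypeAfter(size))
	}
}
```

In this model only the NCv1 sizes change their result; every other size returns the same value before and after.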

Why

The NCv1 → R470 path has never worked and cannot be made to work:

  1. No image exists. AKS has never published an aks-gpu-cuda R470 image. mcr.microsoft.com/aks/aks-gpu-cuda:cuda-470.82.01-<suffix> 404s for every suffix (verified for the pre-existing suffix, the current suffix, the bare cuda-470.82.01, and 470.82.01). The repo only carries 550.x-595.x tags (0 of 74 contain 470). So the managed driver install for K80 has always failed on this path.
  2. No forward-compatible driver. Per NVIDIA, R470 is the last branch to support Kepler data-center GPUs, so every newer branch (including R580) lacks Kepler support. There is no newer managed driver that can drive a K80.
  3. EOL & effectively unused. Only a handful of legacy NCv1 nodes remain, all on very old Kubernetes versions and predating the current container-driver scheme; no new NCv1 node provisions a managed driver successfully today.
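
A tag check like the "0 of 74 contain 470" count in item 1 could be scripted roughly as follows. This is a sketch under assumptions: it parses a registry tag-list response body of the form `{"name": ..., "tags": [...]}` (the shape used by the OCI distribution tag-listing API) and leaves the HTTP fetch out; `tagsContaining` is a hypothetical helper, and `sampleBody` is made-up data, not real MCR tags.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// tagList mirrors the JSON body of a registry tag-list response.
type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// tagsContaining returns the tags that contain needle, plus the total
// number of tags in the response body.
func tagsContaining(body []byte, needle string) ([]string, int, error) {
	var tl tagList
	if err := json.Unmarshal(body, &tl); err != nil {
		return nil, 0, err
	}
	var matches []string
	for _, t := range tl.Tags {
		if strings.Contains(t, needle) {
			matches = append(matches, t)
		}
	}
	return matches, len(tl.Tags), nil
}

// sampleBody is invented for illustration; these are NOT real tags.
const sampleBody = `{"name":"aks/aks-gpu-cuda","tags":["550.0.0-example","570.0.0-example","595.0.0-example"]}`

func main() {
	matches, total, err := tagsContaining([]byte(sampleBody), "470")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d of %d tags contain 470\n", len(matches), total)
}
```

A plain substring match mirrors the "contain 470" wording; a stricter check could match on the version prefix instead.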

The constant, helper, and comments describing a "pinned R470 driver" were therefore dead code / misleading documentation.

Behavior change (intentional)

| SKU | Before | After |
| --- | --- | --- |
| NCv1 (standard_nc6/nc12/nc24[r] + _promo) | aks-gpu-cuda:cuda-470.82.01-* -> 404, CSE fails | falls through to aks-gpu-cuda-lts:580.159.04-* -> pulls & installs R580 |

R580 cannot drive a K80 either, so neither state yields a working K80 - this is not a regression in capability. It trades a hard 404 for the same default path all other CUDA SKUs take, and removes the fiction that AKS ships an R470 driver. All working SKUs (T4, V100, A100, H100, H200, GRID, GRID-v20) are unaffected.

Testing

  • go build ./pkg/agent/... + aks-node-controller (reuses agent.GetGPUDriverType/Version via replace => ../) - OK
  • go vet ./pkg/agent/ - OK
  • GENERATE_TEST_DATA=true go test ./pkg/agent/... - pass, zero golden drift (no testdata renders an NCv1 SKU)
  • Focused GetGPUDriver* Ginkgo specs - pass
  • shellcheck on the two edited shell files - clean on changed lines (only pre-existing warnings elsewhere)
  • No remaining Nvidia470CudaDriverVersion / isStandardNCv1 references in the tree
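
The last check (no leftover references) is typically a grep over the tree. For illustration, an equivalent can be sketched in Go against an in-memory file system. `findReferences` and `sampleTree` are hypothetical helpers, and the file contents are invented; only the two identifier names and file paths come from this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// findReferences walks fsys and reports "path: identifier" for every
// file whose contents mention one of the given identifiers.
func findReferences(fsys fs.FS, needles []string) ([]string, error) {
	var hits []string
	err := fs.WalkDir(fsys, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		data, err := fs.ReadFile(fsys, path)
		if err != nil {
			return err
		}
		for _, n := range needles {
			if bytes.Contains(data, []byte(n)) {
				hits = append(hits, path+": "+n)
			}
		}
		return nil
	})
	return hits, err
}

// sampleTree builds an invented in-memory tree. With leftover=true one
// file still mentions the removed helper.
func sampleTree(leftover bool) fs.FS {
	baker := "package agent\n"
	if leftover {
		baker += "// TODO: remove isStandardNCv1\n"
	}
	return fstest.MapFS{
		"pkg/agent/baker.go":                    {Data: []byte(baker)},
		"pkg/agent/datamodel/gpu_components.go": {Data: []byte("package datamodel\n")},
	}
}

func main() {
	needles := []string{"Nvidia470CudaDriverVersion", "isStandardNCv1"}
	hits, err := findReferences(sampleTree(false), needles)
	if err != nil {
		panic(err)
	}
	fmt.Printf("remaining references: %d\n", len(hits))
}
```

Against a real checkout, `os.DirFS` could supply the `fs.FS`, though build output and vendored directories would likely need to be skipped.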

Risk

Low. Contained to AgentBaker (no RP consumers of these functions). Only NCv1/K80 behavior changes, and that SKU has no working managed driver in either state.

NCv1 (K80, Kepler) was special-cased to the aks-gpu-cuda R470 driver
(cuda-470.82.01), but AKS has never published an aks-gpu-cuda R470
image -- that tag 404s for every suffix (the repo only carries
550.x-595.x builds), so the managed driver install for K80 has always
failed on this path. R470 is also the last NVIDIA branch to support
Kepler data-center GPUs, so there is no forward-compatible managed
driver for the SKU, and the hardware is EOL (only a couple of ancient
pre-container-scheme nodes remain in the fleet, all on k8s 1.20-1.25).

Remove the special-casing so NCv1 falls through to the default cuda-lts
path like every other compute SKU:
- drop the Nvidia470CudaDriverVersion constant
- drop the isStandardNCv1 helper and its branches in GetGPUDriverType /
  GetGPUDriverVersion
- refresh the GetGPUDriverType doc comment + ACL/Mariner comments
- update the GPU driver tests to assert the cuda-lts fall-through

Behaviorally neutral for working SKUs. NCv1 changes from a 404 (image
not found) to attempting cuda-lts (R580), which cannot drive a K80
either -- but the SKU has no working managed driver regardless.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>

Copilot AI left a comment

Contributor

Pull request overview

Removes dead NCv1/K80-specific GPU driver mapping logic so legacy NCv1 no longer attempts to use a non-existent AKS R470 (aks-gpu-cuda) image, and instead follows the default cuda-lts (R580 LTS) path like other compute GPU SKUs. This simplifies GPU driver selection behavior in the AgentBaker service and aligns provisioning-script comments and unit tests with the actual supported image set.

Changes:

  • Removed the unused Nvidia470CudaDriverVersion constant and the isStandardNCv1 special-casing in GetGPUDriverType / GetGPUDriverVersion.
  • Updated GetGPUDriverType’s doc comment to explicitly document NCv1/K80’s intentional fall-through behavior.
  • Updated Go tests and Linux CSE script comments to reflect the new behavior (NCv1 → cuda-lts).

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/agent/datamodel/gpu_components.go | Removes the unused R470 constant from GPU component metadata. |
| pkg/agent/baker.go | Deletes NCv1 detection and special-casing; updates driver-type documentation to match actual behavior. |
| pkg/agent/baker_test.go | Updates test expectations to assert NCv1 now falls through to cuda-lts behavior. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Updates comments describing NVIDIA_GPU_DRIVER_TYPE mapping for NCv1/non-grid SKUs. |
| parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh | Updates comments describing NVIDIA_GPU_DRIVER_TYPE mapping for NCv1/non-grid SKUs. |
