feat(gpu): pre-bake CUDA driver into the VHD + tear it down on non-GPU nodes#8786

Merged
ganeshkumarashok merged 3 commits into main from gpu-prebake-bake-teardown
Jun 30, 2026
Conversation

@ganeshkumarashok ganeshkumarashok commented Jun 29, 2026

What

Splits the bake + teardown half of #8661 into its own PR (#8661 stays open for reference). The consume / skip-build half is #8787.

Why bake + teardown are coupled

Baking the driver into the shared Ubuntu VHD installs it: libs on the ld.so path, binaries on PATH (including nvidia-modprobe, installed setuid-root), and a DKMS-registered module. On any node that doesn't install the AKS-managed driver, that is unused dead weight that expands the attack surface. Bake-without-cleanup must therefore not be reachable: the teardown is marker-gated and travels with the bake.

Which nodes tear down

The cleanUpGPUDrivers path runs when GPU_NODE != true or skip_nvidia_driver_install = true. That covers more than non-GPU nodes: GPU VMs that opt out (--gpu-driver None, i.e. EnableNvidia/GPU_NODE=false, or the skip toggle/tag) also tear down. For BYO/GPU-operator setups that is correct: it strips the unused managed driver so the customer's own driver installs onto a clean node.
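
As a rough illustration of the condition above, the gate can be written as a small predicate. This is a sketch only: `should_tear_down_prebaked_driver` is a hypothetical helper name, and the real CSE scripts may read these values from different variables or in a different place.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): returns 0 when the node should take
# the teardown path, i.e. GPU_NODE != true OR skip_nvidia_driver_install = true.
should_tear_down_prebaked_driver() {
    local gpu_node="${1:-false}"
    local skip_install="${2:-false}"
    if [ "${gpu_node}" != "true" ] || [ "${skip_install}" = "true" ]; then
        return 0
    fi
    return 1
}
```

Under this sketch a non-GPU node, a GPU node with the skip toggle set, and a GPU node with `GPU_NODE=false` all tear down; only a GPU node that installs the managed driver keeps the prebaked artifacts.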

Changes

  • vhdbuilder/packer/install-dependencies.sh: opt-in (FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE, default off) build-only bake + libc6-dev toolchain.
  • parts/.../ubuntu/cse_install_ubuntu.sh: cleanUpPrebakedGPUDriver removes the installed driver via a fast deregister (no slow dkms remove --all; ~0.49 s measured on a GPUNoDriver node, ~1.7% of CSE).
  • parts/.../cse_config.sh: observability only (logGPUDriverPrebakeReadiness, see below).
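
The opt-in gate in the first bullet could be checked along these lines. This is a sketch under assumptions: it treats FEATURE_FLAGS as a comma-separated list, and `has_feature_flag` is a hypothetical helper, not a function from install-dependencies.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exact-match lookup of one flag in a comma-separated list.
# The comma-separated format of FEATURE_FLAGS is an assumption of this sketch.
has_feature_flag() {
    local flags="$1"
    local wanted="$2"
    case ",${flags}," in
        *",${wanted},"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Default off: the bake step runs only when the flag is explicitly present.
if has_feature_flag "${FEATURE_FLAGS:-}" "NVIDIA_CUDA_PREBAKE"; then
    echo "prebake enabled"
fi
```

Wrapping the list in commas before matching avoids a substring hit on a longer flag name that merely starts with NVIDIA_CUDA_PREBAKE.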

Stage-1 observability (for the staged rollout)

Greppable AKS_GPU_PREBAKE log lines so the bake+teardown stage can be monitored before enabling stage-2 (skip-build):

  • Teardown: event=teardown status=cleaned|incomplete … — fleet-wide security-coverage signal; incomplete means a setuid nvidia binary / DKMS registration lingered (alert).
  • Managed GPU nodes: event=managed_gpu marker_present=… driver_kind_match=… — confirms CUDA GPU nodes are ready for stage-2 skip-build (go/no-go: ≈100% marker_present=true driver_kind_match=true).
  • Latency/failure spikes are covered by existing telemetry (AKS.CSE.cleanUpGPUDrivers duration + CSE exit codes), baseline ~0.5 s.
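
Because the lines are meant to be greppable, a teardown tally can be pulled from collected logs with plain grep. The sketch below assumes the log text arrives on stdin and uses only the fields named above (the real lines carry further fields that are elided here); `count_teardown_status` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count AKS_GPU_PREBAKE teardown lines with a given status.
# Reads log text from stdin; prints the count (0 when nothing matches).
count_teardown_status() {
    local status="$1"
    grep -F 'AKS_GPU_PREBAKE' \
        | grep -F 'event=teardown' \
        | grep -c -F "status=${status}" || true
}
```

For example, `count_teardown_status incomplete < cse.log` (with `cse.log` standing in for wherever the CSE output is collected) gives the number of nodes that should raise an alert.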

Validation

  • shellspec 739/0, make generate-testdata clean, shellcheck clean.
  • VHD bake proven on the production image (595.71.05-20260623180420) + current main via the real [TEST All VHDs] pipeline; full GPU e2e was green on the equivalent code in build 168658727 (V100 skip-build, GRID, GPUNoDriver teardown).

Reference (full original, kept open): #8661.

Copilot AI left a comment

Pull request overview

This PR adds an opt-in (feature-flagged) mechanism to pre-bake the NVIDIA CUDA driver kernel module into the Ubuntu VHD at build time, and introduces a corresponding teardown path so non-GPU nodes inheriting a prebaked shared VHD remove the installed driver artifacts and DKMS registration.

Changes:

  • Add FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE gated logic in VHD build to run the aks-gpu-cuda container in build-only mode and validate the prebake marker exists.
  • Add cleanUpPrebakedGPUDriver in Ubuntu CSE to remove DKMS state, kernel modules, user-space libs/binaries, ld.so config, and the prebake marker on non-GPU / skip-install nodes.
  • Add ShellSpec coverage for the new Ubuntu teardown behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| vhdbuilder/packer/install-dependencies.sh | Adds opt-in CUDA module prebake during VHD build and records prebake info in VHD logs. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Implements teardown of prebaked NVIDIA driver artifacts on non-GPU nodes and wires it into cleanUpGPUDrivers. |
| spec/parts/linux/cloud-init/artifacts/cse_install_ubuntu_spec.sh | Adds ShellSpec tests validating marker-gated no-op and teardown behavior. |

Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
@ganeshkumarashok ganeshkumarashok force-pushed the gpu-prebake-bake-teardown branch from 871336e to bb7a1f7 Compare June 29, 2026 20:24
Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh
Copilot AI review requested due to automatic review settings June 29, 2026 23:23
@ganeshkumarashok ganeshkumarashok force-pushed the gpu-prebake-bake-teardown branch from bb7a1f7 to 78aff40 Compare June 29, 2026 23:23

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread spec/parts/linux/cloud-init/artifacts/cse_install_ubuntu_spec.sh
Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Outdated
@ganeshkumarashok ganeshkumarashok force-pushed the gpu-prebake-bake-teardown branch from 78aff40 to a09aa53 Compare June 29, 2026 23:38
Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh
…isn't installed

Splits the bake + teardown half of #8661 into its own PR (the consume/skip-build
half ships separately). Bake and teardown are intentionally COUPLED: baking the
driver into the shared Ubuntu VHD installs userspace libs + (setuid) binaries + a
DKMS-registered module, so any node that does NOT install the AKS-managed driver
must tear it down -- non-GPU VMs AND GPU VMs that opt out via --gpu-driver None or
the skip toggle/tag (the cleanUpGPUDrivers path). Marker-gated: no-op on
non-prebaked VHDs, never decoupled from the bake.

- install-dependencies.sh: opt-in (FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE, default off)
  build-only bake of the NVIDIA kernel module + libc6-dev toolchain.
- cse_install_ubuntu.sh: cleanUpPrebakedGPUDriver removes the installed driver
  (libs, setuid nvidia-* binaries, DKMS reg, ld config, marker) via a fast deregister.

Stage-1 observability (greppable AKS_GPU_PREBAKE log lines, for the staged rollout):
- teardown emits event=teardown status=cleaned|incomplete (fleet-wide security-coverage
  signal; incomplete = a setuid nvidia binary / DKMS registration lingered).
- managed GPU nodes emit event=managed_gpu marker_present/driver_kind_match, so the
  rollout can confirm CUDA GPU nodes are ready before enabling stage-2 skip-build.

Validation: shellspec 739/0, generate-testdata clean, shellcheck clean.
Reference (full original, kept open): #8661.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Copilot AI review requested due to automatic review settings June 30, 2026 17:39
@ganeshkumarashok ganeshkumarashok force-pushed the gpu-prebake-bake-teardown branch from a09aa53 to a0a9882 Compare June 30, 2026 17:39

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh Outdated
- teardown status=incomplete now also flags a lingering setuid /usr/bin/nvidia-modprobe
  (not just marker/DKMS state), so the security-coverage signal can't report cleaned while
  the priv-esc surface remains. (Copilot)
- logGPUDriverPrebakeReadiness requires the marker driver_kind AND NVIDIA_GPU_DRIVER_TYPE
  to both be non-empty before reporting driver_kind_match=true (no empty==empty false
  positive). (Copilot)

shellspec 740/0, generate-testdata clean, shellcheck clean.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
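
The empty-value guard described in the commit message above can be sketched as follows. `driver_kind_match` is a hypothetical function name; the actual logGPUDriverPrebakeReadiness implementation is not shown in this conversation and may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "true" only when both values are non-empty AND equal,
# so an empty marker kind compared with an empty requested type is not a match.
driver_kind_match() {
    local marker_kind="$1"
    local requested_kind="$2"
    if [ -n "${marker_kind}" ] && [ -n "${requested_kind}" ] \
        && [ "${marker_kind}" = "${requested_kind}" ]; then
        echo "true"
    else
        echo "false"
    fi
}
```

Without the two `-n` checks, a node with no marker and no NVIDIA_GPU_DRIVER_TYPE would compare empty to empty and wrongly count toward the stage-2 go/no-go figure.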
…r-absent tests

- cleanUpPrebakedGPUDriver drops the marker only after a clean teardown; if the DKMS
  registration or the setuid nvidia-modprobe lingered it KEEPS the marker so the next
  provision re-runs the cleanup (the marker is the "still needs cleanup" flag). (Copilot)
- de-flake the two "marker absent" specs: use a created-then-removed temp path instead of
  a predictable /tmp/...$$ path that could already exist. (Copilot)

shellspec 740/0, generate-testdata clean, shellcheck clean.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
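
Both points in the commit message above can be sketched in a few lines. The function names, arguments, and the reduced log format are assumptions for illustration; the real cleanUpPrebakedGPUDriver detects lingering state itself rather than taking flags.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide the marker's fate after a teardown attempt.
# $1 = marker path, $2 = "true" if the DKMS registration lingered,
# $3 = "true" if a setuid nvidia-modprobe lingered.
finish_teardown() {
    local marker="$1"
    local dkms_lingers="$2"
    local setuid_lingers="$3"
    if [ "${dkms_lingers}" = "true" ] || [ "${setuid_lingers}" = "true" ]; then
        # Keep the marker: it is the "still needs cleanup" flag for the next provision.
        echo "AKS_GPU_PREBAKE event=teardown status=incomplete"
        return 0
    fi
    rm -f "${marker}"
    echo "AKS_GPU_PREBAKE event=teardown status=cleaned"
}

# Hypothetical test helper: a path guaranteed not to exist at the time it is returned,
# obtained by creating a unique temp file and removing it again.
absent_path() {
    local p
    p="$(mktemp)"
    rm -f "${p}"
    echo "${p}"
}
```

Using mktemp to reserve and then release the name avoids the predictable /tmp/...$$ pattern, where a stale file from an earlier run with the same PID could make a "marker absent" spec fail.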
Copilot AI review requested due to automatic review settings June 30, 2026 18:54

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread vhdbuilder/packer/install-dependencies.sh
Comment thread vhdbuilder/packer/install-dependencies.sh
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh