feat(gpu): pre-bake CUDA driver into the VHD + tear it down on non-GPU nodes #8786
Merged
Pull request overview
This PR adds an opt-in (feature-flagged) mechanism to pre-bake the NVIDIA CUDA driver kernel module into the Ubuntu VHD at build time, and introduces a corresponding teardown path so non-GPU nodes inheriting a prebaked shared VHD remove the installed driver artifacts and DKMS registration.
Changes:
- Add `FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`-gated logic in the VHD build to run the `aks-gpu-cuda` container in `build-only` mode and validate that the prebake marker exists.
- Add `cleanUpPrebakedGPUDriver` in Ubuntu CSE to remove DKMS state, kernel modules, user-space libs/binaries, ld.so config, and the prebake marker on non-GPU / skip-install nodes.
- Add ShellSpec coverage for the new Ubuntu teardown behavior.
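The flag-gated bake in the first bullet can be pictured with a minimal sketch. Everything here is hypothetical: the function names, the marker path, and the assumption that `FEATURE_FLAGS` is a comma-separated list are for illustration only, and `run_cuda_build_only` merely stands in for running the `aks-gpu-cuda` container in `build-only` mode.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a feature-flag-gated prebake step. Names and paths
# are illustrative assumptions, not the code in install-dependencies.sh.

PREBAKE_MARKER="${PREBAKE_MARKER:-/tmp/example-gpu-prebake.marker}"  # assumed path

# Succeeds when the (assumed comma-separated) FEATURE_FLAGS contains the flag.
has_feature_flag() {
    local flag="$1"
    case ",${FEATURE_FLAGS:-}," in
        *",${flag},"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stand-in for the container run in build-only mode; here it only drops the
# marker that a successful bake is expected to leave behind.
run_cuda_build_only() {
    touch "$PREBAKE_MARKER"
}

prebake_cuda_driver_sketch() {
    # Default off: without the flag the build is left untouched.
    if ! has_feature_flag "NVIDIA_CUDA_PREBAKE"; then
        echo "prebake skipped"
        return 0
    fi
    run_cuda_build_only || return 1
    # Validate the marker so a silently failed bake fails the build.
    if [ ! -f "$PREBAKE_MARKER" ]; then
        echo "prebake marker missing" >&2
        return 1
    fi
    echo "prebake done"
}
```

Failing the build when the marker is absent keeps the marker trustworthy as the single signal that later teardown logic keys on.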
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `vhdbuilder/packer/install-dependencies.sh` | Adds opt-in CUDA module prebake during VHD build and records prebake info in VHD logs. |
| `parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh` | Implements teardown of prebaked NVIDIA driver artifacts on non-GPU nodes and wires it into `cleanUpGPUDrivers`. |
| `spec/parts/linux/cloud-init/artifacts/cse_install_ubuntu_spec.sh` | Adds ShellSpec tests validating marker-gated no-op and teardown behavior. |
Devinwong
approved these changes
Jun 30, 2026
sulixu
reviewed
Jun 30, 2026
awesomenix
approved these changes
Jun 30, 2026
…isn't installed

Splits the bake + teardown half of #8661 into its own PR (the consume/skip-build half ships separately).

Bake and teardown are intentionally COUPLED: baking the driver into the shared Ubuntu VHD installs userspace libs + (setuid) binaries + a DKMS-registered module, so any node that does NOT install the AKS-managed driver must tear it down -- non-GPU VMs AND GPU VMs that opt out via --gpu-driver None or the skip toggle/tag (the cleanUpGPUDrivers path). Marker-gated: no-op on non-prebaked VHDs, never decoupled from the bake.

- install-dependencies.sh: opt-in (FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE, default off) build-only bake of the NVIDIA kernel module + libc6-dev toolchain.
- cse_install_ubuntu.sh: cleanUpPrebakedGPUDriver removes the installed driver (libs, setuid nvidia-* binaries, DKMS reg, ld config, marker) via a fast deregister.

Stage-1 observability (greppable AKS_GPU_PREBAKE log lines, for the staged rollout):
- teardown emits event=teardown status=cleaned|incomplete (fleet-wide security-coverage signal; incomplete = a setuid nvidia binary / DKMS registration lingered).
- managed GPU nodes emit event=managed_gpu marker_present/driver_kind_match, so the rollout can confirm CUDA GPU nodes are ready before enabling stage-2 skip-build.

Validation: shellspec 739/0, generate-testdata clean, shellcheck clean.

Reference (full original, kept open): #8661.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
- teardown status=incomplete now also flags a lingering setuid /usr/bin/nvidia-modprobe (not just marker/DKMS state), so the security-coverage signal can't report cleaned while the priv-esc surface remains. (Copilot)
- logGPUDriverPrebakeReadiness requires the marker driver_kind AND NVIDIA_GPU_DRIVER_TYPE to both be non-empty before reporting driver_kind_match=true (no empty==empty false positive). (Copilot)

shellspec 740/0, generate-testdata clean, shellcheck clean.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
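The empty==empty false positive mentioned in this commit can be illustrated with a small sketch. The helper name and its two positional arguments are hypothetical; only the rule that both values must be non-empty before a match is reported comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares the driver kind recorded in the prebake marker
# with the node's configured driver type. Not the PR's actual function.
driver_kind_match_sketch() {
    local marker_kind="$1"   # value read from the marker (may be empty)
    local node_kind="$2"     # e.g. the value of NVIDIA_GPU_DRIVER_TYPE (may be empty)

    # A plain string comparison would treat "" = "" as a match, so both
    # sides are required to be non-empty first.
    if [ -n "$marker_kind" ] && [ -n "$node_kind" ] && [ "$marker_kind" = "$node_kind" ]; then
        echo "true"
    else
        echo "false"
    fi
}
```

With this guard, a node where neither value is set reports `false` rather than inflating the readiness signal.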
…r-absent tests

- cleanUpPrebakedGPUDriver drops the marker only after a clean teardown; if the DKMS registration or the setuid nvidia-modprobe lingered it KEEPS the marker so the next provision re-runs the cleanup (the marker is the "still needs cleanup" flag). (Copilot)
- de-flake the two "marker absent" specs: use a created-then-removed temp path instead of a predictable /tmp/...$$ path that could already exist. (Copilot)

shellspec 740/0, generate-testdata clean, shellcheck clean.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
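The "marker is the still-needs-cleanup flag" behavior from this commit could look roughly like the sketch below. The marker path, the function names, and the single setuid check are assumptions for illustration; the real teardown also covers DKMS state, libraries, and ld.so config, which are reduced to a stub here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a marker-gated teardown that keeps the marker when
# cleanup is incomplete. Not the PR's cleanUpPrebakedGPUDriver.

PREBAKE_MARKER="${PREBAKE_MARKER:-/tmp/example-gpu-prebake.marker}"   # assumed path
NVIDIA_MODPROBE="${NVIDIA_MODPROBE:-/usr/bin/nvidia-modprobe}"

# Stub for the actual removal steps (DKMS registration, modules, libs, binaries).
remove_prebaked_artifacts() {
    return 0
}

# Succeeds only when no setuid nvidia-modprobe is left on disk.
teardown_is_complete() {
    [ ! -u "$NVIDIA_MODPROBE" ]
}

clean_up_prebaked_driver_sketch() {
    # No marker: the VHD was not prebaked, so this is a no-op.
    if [ ! -f "$PREBAKE_MARKER" ]; then
        echo "noop"
        return 0
    fi
    remove_prebaked_artifacts
    if teardown_is_complete; then
        # Drop the marker only after a clean teardown.
        rm -f "$PREBAKE_MARKER"
        echo "cleaned"
    else
        # Keep the marker so the next provision retries the cleanup.
        echo "incomplete"
    fi
}
```

Keeping the marker on failure makes the cleanup idempotent and retryable, at the cost of re-running the check on every provision until it succeeds.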
cameronmeissner
approved these changes
Jun 30, 2026
sulixu
approved these changes
Jun 30, 2026
What
Splits the bake + teardown half of #8661 into its own PR (#8661 stays open for reference). The consume / skip-build half is #8787.
Why bake + teardown are coupled
Baking the driver into the shared Ubuntu VHD installs it: libs on the `ld.so` path, binaries on `PATH` (incl. `nvidia-modprobe`, dropped setuid-root), and a DKMS-registered module. On any node that doesn't install the AKS-managed driver, that's unused, attack-surface-expanding dead weight. So bake-without-cleanup must not be reachable: the teardown is marker-gated and travels with the bake.
Which nodes tear down
The `cleanUpGPUDrivers` path (`GPU_NODE != true` or `skip_nvidia_driver_install = true`), i.e. not only non-GPU nodes but also GPU VMs that opt out (`--gpu-driver None` → `EnableNvidia`/`GPU_NODE=false`, or the skip toggle/tag). For BYO/GPU-operator that's correct: it strips the unused managed driver so the customer's own driver installs onto a clean node.
Changes
- `vhdbuilder/packer/install-dependencies.sh`: opt-in (`FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE`, default off) build-only bake + `libc6-dev` toolchain.
- `parts/.../ubuntu/cse_install_ubuntu.sh`: `cleanUpPrebakedGPUDriver` removes the installed driver via a fast deregister (no slow `dkms remove --all`; ~0.49 s measured on a GPUNoDriver node, ~1.7% of CSE).
- `parts/.../cse_config.sh`: observability only (`logGPUDriverPrebakeReadiness`, see below).
Stage-1 observability (for the staged rollout)
Greppable `AKS_GPU_PREBAKE` log lines so the bake+teardown stage can be monitored before enabling stage-2 (skip-build):
- `event=teardown status=cleaned|incomplete …`: fleet-wide security-coverage signal; `incomplete` means a setuid nvidia binary / DKMS registration lingered (alert).
- `event=managed_gpu marker_present=… driver_kind_match=…`: confirms CUDA GPU nodes are ready for stage-2 skip-build (go/no-go: ≈100% `marker_present=true driver_kind_match=true`).
- `AKS.CSE.cleanUpGPUDrivers` duration + CSE exit codes), baseline ~0.5 s.
Validation
- `make generate-testdata` clean, shellcheck clean.
- `595.71.05-20260623180420`) + current main via the real `[TEST All VHDs]` pipeline; full GPU e2e was green on the equivalent code in build 168658727 (V100 skip-build, GRID, GPUNoDriver teardown).
Reference (full original, kept open): #8661.
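As a rough illustration of the stage-2 go/no-go check described under Stage-1 observability, the sketch below computes the share of `event=managed_gpu` lines that report both `marker_present=true` and `driver_kind_match=true`. The function name is hypothetical, and the exact log line layout (one `AKS_GPU_PREBAKE` record per line with space-separated `key=value` fields) is an assumption; the PR only states that the lines are greppable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads log text on stdin and prints the integer
# percentage of managed_gpu records that are ready for skip-build, or
# "no-data" when no such records are found.
prebake_ready_percent() {
    awk '
        /AKS_GPU_PREBAKE/ && /event=managed_gpu/ {
            total++
            if (/marker_present=true/ && /driver_kind_match=true/) ready++
        }
        END {
            if (total == 0) { print "no-data"; exit }
            printf "%d\n", (ready * 100) / total
        }
    '
}
```

A possible use is piping collected CSE logs into the function, for example `cat collected-cse.log | prebake_ready_percent`, where `collected-cse.log` is a placeholder file name.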