feat(gpu): enable CUDA driver prebake on shared Ubuntu gen2 VHD builds #8803
Merged
Turn on the NVIDIA_CUDA_PREBAKE feature flag (added dark in #8786) for the two shared x86 gen2 Ubuntu images that GPU CUDA nodes boot -- 2204gen2containerd and 2404gen2containerd -- in both the release (.vsts-vhd-builder-release.yaml) and PR/test (.vsts-vhd-builder.yaml) VHD pipelines.

With the flag set, install-dependencies.sh pre-builds the NVIDIA CUDA kernel module into the VHD at build time, so GPU nodes skip the ~80-150s DKMS compile at boot. Non-GPU and --gpu-driver None nodes tear the module down during provisioning (cleanUpPrebakedGPUDriver, also from #8786), so the shared image carries no extra attack surface on those nodes.

Scope rationale:
- Only 22.04/24.04 gen2 x86: these are the images GPU CUDA SKUs (A10/A100/H100) boot, confirmed via e2e GPU scenarios that pin VHDUbuntu2204Gen2Containerd / VHDUbuntu2404Gen2Containerd. AzureLinux GPU is out of scope (the bake is Ubuntu-only); gen1/FIPS/TL/arm64 are not used by supported CUDA GPU SKUs.
- The Copy CIS Reports step keys off an exact-match FEATURE_FLAGS allowlist, so NVIDIA_CUDA_PREBAKE is added to that list to preserve report publishing.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Pull request overview
Enables the NVIDIA_CUDA_PREBAKE feature flag on the shared Ubuntu x86 Gen2 VHD build jobs (22.04/24.04) so the VHD build can pre-bake the CUDA driver kernel module, and updates the release template gating so CIS reports are still copied for the new flag value.
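As a rough illustration of how a build script could gate the prebake on this flag, here is a minimal POSIX shell sketch. It is not taken from `install-dependencies.sh`: the helper names `has_feature_flag` and `prebake_decision` are hypothetical, and it assumes `FEATURE_FLAGS` is a comma-separated string, which the real script may parse differently.

```shell
#!/bin/sh
# Hypothetical helper (not from AgentBaker): succeeds when the flag list
# contains the wanted flag as a whole token.
# Assumption: FEATURE_FLAGS is a comma-separated string.
has_feature_flag() {
    flags="$1"
    wanted="$2"
    # Wrap both sides in commas so only whole tokens match.
    case ",${flags}," in
        *",${wanted},"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical build-time decision: prebake only when the flag is present.
prebake_decision() {
    if has_feature_flag "$1" "NVIDIA_CUDA_PREBAKE"; then
        echo "prebake"
    else
        echo "skip"
    fi
}

# With FEATURE_FLAGS unset this falls back to "None", the pre-change value.
prebake_decision "${FEATURE_FLAGS:-None}"
```

Matching whole tokens rather than substrings is a design choice of this sketch; it keeps a shorter value such as `NVIDIA_CUDA` from enabling the bake by accident.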
Changes:
- Flip `FEATURE_FLAGS` to `NVIDIA_CUDA_PREBAKE` for the `build2204gen2containerd` and `build2404gen2containerd` jobs in both PR/test and release VHD pipelines.
- Extend the “Copy CIS Reports” allowlist to include `NVIDIA_CUDA_PREBAKE`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| .pipelines/templates/.builder-release-template.yaml | Adds NVIDIA_CUDA_PREBAKE to the CIS-report copy condition allowlist so these builds keep producing CIS artifacts. |
| .pipelines/.vsts-vhd-builder.yaml | Sets FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE for Ubuntu 22.04/24.04 Gen2 shared image jobs in the PR/test pipeline. |
| .pipelines/.vsts-vhd-builder-release.yaml | Sets FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE for Ubuntu 22.04/24.04 Gen2 shared image jobs in the release pipeline. |
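To see why the allowlist needs the new value, the exact-match behaviour of an `in(value, a, b, …)` style condition can be emulated in a few lines of POSIX shell. This is only a sketch: `in_allowlist` and the `EXAMPLE_FLAG_A` entry are hypothetical stand-ins, the real allowlist contents are not reproduced here, and the comparison is case-sensitive, which may differ from how the pipeline expression engine compares strings.

```shell
#!/bin/sh
# Hypothetical emulation of an exact-match allowlist check.
# Usage: in_allowlist VALUE ALLOWED...
in_allowlist() {
    value="$1"
    shift
    for allowed in "$@"; do
        # Whole-string equality only; substrings never match.
        [ "$value" = "$allowed" ] && return 0
    done
    return 1
}

# EXAMPLE_FLAG_A is a placeholder entry, not a real AgentBaker flag.
for candidate in "NVIDIA_CUDA_PREBAKE" "SOME_UNLISTED_FLAG"; do
    if in_allowlist "$candidate" "EXAMPLE_FLAG_A" "NVIDIA_CUDA_PREBAKE"; then
        echo "$candidate: copy CIS reports"
    else
        echo "$candidate: step skipped"
    fi
done
```

Under these semantics a job whose flag value is missing from the list skips the step without an error, which is why the flag flip and the allowlist change ship together.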
…me path

Address review feedback: the inline rationale said GPU nodes "skip the DKMS compile at boot", but this PR only enables the bake -- the boot-time skip requires the configGPUDrivers skip-build path (PR #8787), which is not yet in main. Reword to "can later skip ... via the configGPUDrivers skip-build path" so pipeline maintainers aren't misled, and note the teardown covers non-GPU and --gpu-driver None nodes.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
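The split between baking at build time and skipping at boot can be pictured as a marker check. The sketch below is an assumption-laden illustration, not the `configGPUDrivers` logic from PR #8787: the function `should_skip_dkms_build` is hypothetical, and the only detail taken from this PR is the marker path `/opt/azure/aks-gpu/dkms-marker` reported in the description. A real skip path would presumably also need to confirm that the baked module matches the running kernel.

```shell
#!/bin/sh
# Hypothetical boot-time decision (not the configGPUDrivers implementation):
# skip the DKMS compile only when the prebake marker file exists.
should_skip_dkms_build() {
    # Default path is the marker named in the PR description.
    marker="${1:-/opt/azure/aks-gpu/dkms-marker}"
    if [ -f "$marker" ]; then
        echo "skip-build"
    else
        echo "dkms-build"
    fi
}

# On a machine without the marker this reports the slow path.
should_skip_dkms_build
```

Until a consumer like this exists, enabling the flag changes what the VHD contains but not how long a GPU node takes to boot.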
sulixu
approved these changes
Jul 1, 2026
This was referenced Jul 1, 2026
Closed
What this PR does / why we need it:
Activates the NVIDIA CUDA driver prebake that was merged dark in #8786. It flips `FEATURE_FLAGS` from `None` to `NVIDIA_CUDA_PREBAKE` on the two shared x86 gen2 Ubuntu images that GPU CUDA nodes boot, in both VHD pipelines:
- `build2204gen2containerd`
- `build2404gen2containerd`

With the flag set, `install-dependencies.sh` pre-builds the NVIDIA CUDA kernel module into the VHD at build time, so GPU nodes can skip the ~80–150s DKMS compile at boot.

Why only these images
The supported CUDA GPU SKUs (A10/A100/H100) boot the gen2 22.04/24.04 Ubuntu images, as confirmed by the e2e GPU scenarios, which pin `VHDUbuntu2204Gen2Containerd` / `VHDUbuntu2404Gen2Containerd`. AzureLinux GPU is out of scope (the bake lives in the Ubuntu path); gen1 / FIPS / TL / arm64 are not used by these SKUs, so enabling the flag there would only add build time and teardown surface for no GPU benefit.

Safety
- Non-GPU and `--gpu-driver None` nodes remove the module during provisioning via `cleanUpPrebakedGPUDriver` (merged in #8786, runs unconditionally). No extra attack surface on those nodes.
- The `FEATURE_FLAGS` consumer is an additive substring `grep` (`cvm` / `NVIDIA_GB` / `kata`) that `NVIDIA_CUDA_PREBAKE` doesn't match; there's no `case`/allowlist that rejects unknown values; `SKU_NAME` is unaffected, so the published image keeps its name.
- The exact-match allowlist (`Copy CIS Reports`, `in(FEATURE_FLAGS, …)`) is extended to include `NVIDIA_CUDA_PREBAKE`.

Rollout / sequencing
- `AKS_GPU_PREBAKE event=managed_gpu marker_present=…` readiness logs let us confirm Stage-2 readiness before enabling consume.
- `None`.

Validation
`2404gen2containerd` build (production base image), which produced `/opt/azure/aks-gpu/dkms-marker`. The PR/test pipeline here will exercise it again through the GPU e2e scenarios.

Which issue(s) this PR fixes:
Fixes #
Requirements
`Azure/AgentBaker` (not a fork)