feat(gpu): enable CUDA driver prebake on shared Ubuntu gen2 VHD builds #8803

Merged
ganeshkumarashok merged 2 commits into main from ganesh/gpu-prebake-enable
Jul 1, 2026

@ganeshkumarashok

What this PR does / why we need it:

Activates the NVIDIA CUDA driver prebake that was merged dark in #8786. It flips FEATURE_FLAGS from None to NVIDIA_CUDA_PREBAKE on the two shared x86 gen2 Ubuntu images that GPU CUDA nodes boot, in both VHD pipelines:

Jobs changed (the same two in the release pipeline and the PR/test pipeline):

  • build2204gen2containerd
  • build2404gen2containerd

With the flag set, install-dependencies.sh pre-builds the NVIDIA CUDA kernel module into the VHD at build time, so GPU nodes can skip the ~80–150s DKMS compile at boot.
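
As a rough illustration of the two halves described above, the shell sketch below shows a build-time gate on FEATURE_FLAGS and a boot-time check for a marker file. Only FEATURE_FLAGS, NVIDIA_CUDA_PREBAKE, and /opt/azure/aks-gpu/dkms-marker come from this PR. The function names, the MARKER_PATH variable, and the idea that boot-time code keys off the marker are assumptions for illustration, not the actual install-dependencies.sh or provisioning logic.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the real install-dependencies.sh logic.

# Marker path taken from the Validation notes; overridable for local testing.
MARKER_PATH="${MARKER_PATH:-/opt/azure/aks-gpu/dkms-marker}"

# Build time: succeed only when FEATURE_FLAGS requests the prebake.
prebake_enabled() {
  case "${FEATURE_FLAGS:-None}" in
    *NVIDIA_CUDA_PREBAKE*) return 0 ;;
    *) return 1 ;;
  esac
}

# Boot time (assumed behaviour): if the bake left its marker, the kernel
# module is already built and the DKMS compile would be redundant.
needs_dkms_compile() {
  [ ! -f "$MARKER_PATH" ]
}
```

With FEATURE_FLAGS=None the gate fails and the image is built as before, which is why the jobs left untouched by this PR are unaffected.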

Why only these images

The supported CUDA GPU SKUs (A10/A100/H100) boot the gen2 22.04/24.04 Ubuntu images — confirmed by the e2e GPU scenarios, which pin VHDUbuntu2204Gen2Containerd / VHDUbuntu2404Gen2Containerd. AzureLinux GPU is out of scope (the bake lives in the Ubuntu path); gen1 / FIPS / TL / arm64 are not used by these SKUs, so enabling there would only add build time + teardown surface for no GPU benefit.

Safety

  • Coupled teardown keeps it safe. The bake lands on a shared image, but non-GPU nodes and nodes created with --gpu-driver None remove the module during provisioning via cleanUpPrebakedGPUDriver (merged in #8786, "feat(gpu): pre-bake CUDA driver into the VHD + tear it down on non-GPU nodes", where it runs unconditionally), so those nodes carry no extra attack surface.
  • No flag collisions. Every other FEATURE_FLAGS consumer is an additive substring grep (cvm/NVIDIA_GB/kata) that NVIDIA_CUDA_PREBAKE doesn't match; there's no case/allowlist that rejects unknown values; SKU_NAME is unaffected so the published image keeps its name.
  • CIS reports preserved. The one exact-match gate (Copy CIS Reports, in(FEATURE_FLAGS, …)) is extended to include NVIDIA_CUDA_PREBAKE.
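
The difference between the two kinds of FEATURE_FLAGS consumer above can be sketched in shell. The helper names are hypothetical, and the allowlist passed to flag_in_allowlist in the usage note is a placeholder, because the real in(FEATURE_FLAGS, …) list is not shown here. The sketch only illustrates why a substring check ignores the new value while a whole-value check has to be extended.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they mirror the two matching styles, not the real scripts.

# Substring-style consumer: true when the needle appears anywhere in the flags.
flag_contains() {
  printf '%s' "$1" | grep -qF -- "$2"
}

# Exact-match gate: true only when the whole value equals one allowed entry.
flag_in_allowlist() {
  local value="$1"
  shift
  local allowed
  for allowed in "$@"; do
    if [ "$value" = "$allowed" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, flag_contains "NVIDIA_CUDA_PREBAKE" "NVIDIA_GB" fails, so a substring consumer looking for NVIDIA_GB is not triggered. By contrast, flag_in_allowlist "NVIDIA_CUDA_PREBAKE" "None" also fails until NVIDIA_CUDA_PREBAKE is added to the allowed values, which is the reason the Copy CIS Reports gate needed a change.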

Rollout / sequencing

Validation

  • All three pipeline YAMLs parse cleanly; only the 4 intended jobs (2 per pipeline) carry the new flag.
  • The bake itself was already proven end-to-end on a 2404gen2containerd build (production base image), which produced /opt/azure/aks-gpu/dkms-marker. The PR/test pipeline here will exercise it again through the GPU e2e scenarios.
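
A check like "only the 4 intended jobs carry the new flag" could be approximated with a small shell helper. This is a sketch under assumptions: the function name is hypothetical, and because the exact assignment syntax in the pipeline files is not shown here, the pattern accepts a few plausible shapes (FEATURE_FLAGS: value, FEATURE_FLAGS=value, or a setvariable-style variable=FEATURE_FLAGS]value) and skips comment-only lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines across the given files that assign
# NVIDIA_CUDA_PREBAKE to FEATURE_FLAGS, ignoring comment-only lines.
count_flag_assignments() {
  cat -- "$@" \
    | grep -vE '^[[:space:]]*#' \
    | grep -cE "FEATURE_FLAGS[]=:][[:space:]]*[\"']?NVIDIA_CUDA_PREBAKE" \
    || true   # grep -c exits non-zero on a count of 0; the count is still printed
}
```

Running it over .pipelines/.vsts-vhd-builder.yaml and .pipelines/.vsts-vhd-builder-release.yaml would be expected to print 4 if the assumed assignment shapes match the real files. Adjust the pattern if they do not.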

Which issue(s) this PR fixes:

Fixes #

Requirements

  • PR title follows conventional commits
  • Commit is DCO signed-off and GPG-verified
  • Branch lives in Azure/AgentBaker (not a fork)

Turn on the NVIDIA_CUDA_PREBAKE feature flag (added dark in #8786) for the
two shared x86 gen2 Ubuntu images that GPU CUDA nodes boot --
2204gen2containerd and 2404gen2containerd -- in both the release
(.vsts-vhd-builder-release.yaml) and PR/test (.vsts-vhd-builder.yaml) VHD
pipelines.

With the flag set, install-dependencies.sh pre-builds the NVIDIA CUDA kernel
module into the VHD at build time, so GPU nodes skip the ~80-150s DKMS compile
at boot. Non-GPU and --gpu-driver None nodes tear the module down during
provisioning (cleanUpPrebakedGPUDriver, also from #8786), so the shared image
carries no extra attack surface on those nodes.

Scope rationale:
- Only 22.04/24.04 gen2 x86: these are the images GPU CUDA SKUs (A10/A100/H100)
  boot, confirmed via e2e GPU scenarios that pin VHDUbuntu2204Gen2Containerd /
  VHDUbuntu2404Gen2Containerd. AzureLinux GPU is out of scope (the bake is
  Ubuntu-only); gen1/FIPS/TL/arm64 are not used by supported CUDA GPU SKUs.
- The Copy CIS Reports step keys off an exact-match FEATURE_FLAGS allowlist, so
  NVIDIA_CUDA_PREBAKE is added to that list to preserve report publishing.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>

Copilot AI left a comment

Pull request overview

Enables the NVIDIA_CUDA_PREBAKE feature flag on the shared Ubuntu x86 Gen2 VHD build jobs (22.04/24.04) so the VHD build can pre-bake the CUDA driver kernel module, and updates the release template gating so CIS reports are still copied for the new flag value.

Changes:

  • Flip FEATURE_FLAGS to NVIDIA_CUDA_PREBAKE for the build2204gen2containerd and build2404gen2containerd jobs in both PR/test and release VHD pipelines.
  • Extend the “Copy CIS Reports” allowlist to include NVIDIA_CUDA_PREBAKE.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| .pipelines/templates/.builder-release-template.yaml | Adds NVIDIA_CUDA_PREBAKE to the CIS-report copy condition allowlist so these builds keep producing CIS artifacts. |
| .pipelines/.vsts-vhd-builder.yaml | Sets FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE for Ubuntu 22.04/24.04 Gen2 shared image jobs in the PR/test pipeline. |
| .pipelines/.vsts-vhd-builder-release.yaml | Sets FEATURE_FLAGS=NVIDIA_CUDA_PREBAKE for Ubuntu 22.04/24.04 Gen2 shared image jobs in the release pipeline. |

Comment thread .pipelines/.vsts-vhd-builder.yaml Outdated
Comment thread .pipelines/.vsts-vhd-builder.yaml Outdated
Comment thread .pipelines/.vsts-vhd-builder-release.yaml Outdated
Comment thread .pipelines/.vsts-vhd-builder-release.yaml Outdated
…me path

Address review feedback: the inline rationale said GPU nodes "skip the DKMS
compile at boot", but this PR only enables the bake -- the boot-time skip
requires the configGPUDrivers skip-build path (PR #8787), which is not yet in
main. Reword to "can later skip ... via the configGPUDrivers skip-build path"
so pipeline maintainers aren't misled, and note the teardown covers non-GPU and
--gpu-driver None nodes.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
@ganeshkumarashok ganeshkumarashok merged commit f6fb1d1 into main Jul 1, 2026
22 of 32 checks passed
@ganeshkumarashok ganeshkumarashok deleted the ganesh/gpu-prebake-enable branch July 1, 2026 00:29