feat(gpu): consume the VHD-prebuilt CUDA kernel module (skip in-CSE recompile) #8787

Open
ganeshkumarashok wants to merge 1 commit into main from gpu-prebake-consume

@ganeshkumarashok

Contributor

What

Splits the consume / skip-build half of #8661 into its own PR (#8661 stays open for reference). The bake + teardown half is #8786 and must land first to produce the marker.

Changes

  • parts/linux/cloud-init/artifacts/cse_config.sh — when the aks-gpu prebake marker is present and its driver_kind matches this node, configGPUDrivers asks the aks-gpu container for install-skip-build (device-dependent steps only), skipping the ~80–150s in-CSE DKMS recompile. aks-gpu independently re-validates the marker (kernel + driver_version + driver_kind) and falls back to a full build on any mismatch.
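The action selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in cse_config.sh: the helper names `read_marker_field` and `select_gpu_install_action`, the marker path passed in by the caller, and the `key=value` marker format are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: helper names and the key=value marker format are
# assumptions for illustration, not taken from the PR.

# read_marker_field MARKER_FILE KEY
# Prints the value of KEY from the marker, or nothing if the key is absent.
read_marker_field() {
    local marker_file="$1" key="$2"
    sed -n "s/^${key}=//p" "$marker_file" 2>/dev/null | head -n 1
}

# select_gpu_install_action MARKER_FILE NODE_DRIVER_KIND
# Prints "install-skip-build" only when the marker exists and its driver_kind
# matches the node; prints "install" in every other case.
select_gpu_install_action() {
    local marker_file="$1" node_driver_kind="$2" baked_kind
    if [ ! -f "$marker_file" ]; then
        echo "install"
        return 0
    fi
    baked_kind="$(read_marker_field "$marker_file" driver_kind)"
    if [ -n "$baked_kind" ] && [ "$baked_kind" = "$node_driver_kind" ]; then
        echo "install-skip-build"
    else
        echo "install"
    fi
}
```

Defaulting to `install` on every unclear case (missing file, missing key, mismatched kind) keeps the skip-build path strictly opt-in.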

Safety

  • No marker (non-prebaked VHD) → install, i.e. today's behavior. Defense-in-depth: the skip-build path is gated by both marker presence and a driver_kind match, and aks-gpu's own re-validation, with its full-build fallback, sits underneath.
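To make the second layer concrete, here is a rough sketch of what a container-side re-validation with fallback could look like. It is not aks-gpu's actual implementation: the function names `marker_field`, `marker_is_valid` and `effective_action`, the argument order, and the `key=value` marker format are assumptions for illustration. Only the three checked fields (kernel, driver_version, driver_kind) come from the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a re-validation layer; names and the marker format
# are assumptions, not aks-gpu's real code.

# marker_field FILE KEY: prints the value of KEY, or nothing if absent.
marker_field() {
    sed -n "s/^$2=//p" "$1" 2>/dev/null | head -n 1
}

# marker_is_valid FILE KERNEL DRIVER_VERSION DRIVER_KIND
# Returns 0 only if the marker exists and all three fields match exactly.
marker_is_valid() {
    local file="$1"
    [ -f "$file" ] || return 1
    [ "$(marker_field "$file" kernel)" = "$2" ] || return 1
    [ "$(marker_field "$file" driver_version)" = "$3" ] || return 1
    [ "$(marker_field "$file" driver_kind)" = "$4" ] || return 1
}

# effective_action REQUESTED FILE KERNEL DRIVER_VERSION DRIVER_KIND
# Honors a skip-build request only when the marker validates; any mismatch
# falls back to a full "install".
effective_action() {
    local requested="$1"
    shift
    if [ "$requested" = "install-skip-build" ] && marker_is_valid "$@"; then
        echo "install-skip-build"
    else
        echo "install"
    fi
}
```

For example, `effective_action install-skip-build /path/to/marker "$(uname -r)" "$DRIVER_VERSION" cuda` would print `install` if the running kernel differs from the one recorded at bake time (the marker path and `DRIVER_VERSION` variable here are placeholders).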

Validation

  • shellspec 736/0, make generate-testdata clean, shellcheck clean.
  • e2e on the real [TEST All VHDs] pipeline (bake + consume together): in progress.

Reference (full original, kept open): #8661.

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Linux CSE GPU driver install flow to consume a VHD-prebaked NVIDIA CUDA DKMS module when a prebake marker is present, by invoking the aks-gpu container with install-skip-build (instead of install) to avoid the expensive in-CSE recompile. It fits into the AgentBaker provisioning scripts under parts/linux/cloud-init/artifacts/ and adds ShellSpec coverage to validate the marker-driven action selection (including the driver_kind guard).

Changes:

  • Extend installGPUDriverImage to accept an install action (install vs install-skip-build) and pass the selected action from configGPUDrivers on Ubuntu.
  • Add marker parsing and a driver_kind match guard to ensure skip-build only happens when the prebaked driver kind matches the node’s NVIDIA_GPU_DRIVER_TYPE.
  • Add ShellSpec tests validating action selection: no marker → install, matching marker → install-skip-build, mismatched kind → install.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Select install vs install-skip-build based on prebake marker presence and driver_kind match, and plumb the action into the aks-gpu container invocation. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Add ShellSpec coverage for marker-driven action selection, including the CUDA-vs-GRID guard behavior. |

End
End

Describe 'configGPUDrivers'

Contributor Author

Good call — renamed this block to Describe 'configGPUDrivers prebake marker action selection' so it's distinct from the existing timing-event configGPUDrivers suite, keeping shellspec output unambiguous (commit 5b72619).


Response generated by GitHub Copilot.

…ecompile)

Splits the consume / skip-build half of #8661 into its own PR (the bake + teardown
half ships separately and must land first to produce the marker). When the aks-gpu
prebake marker is present AND its driver_kind matches this node, configGPUDrivers
asks the aks-gpu container for `install-skip-build` -- running only the
device-dependent steps and skipping the ~80-150s in-CSE DKMS recompile. aks-gpu
independently re-validates the marker (kernel + driver_version + driver_kind) and
falls back to a full build on any mismatch, so this is safe on non-prebaked VHDs
(no marker -> `install`, today's behavior).

Validation: shellspec 736/0, generate-testdata clean, shellcheck clean.
Depends on the bake PR landing first to be exercised. Reference (full original, kept open): #8661.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
ganeshkumarashok added a commit that referenced this pull request Jun 30, 2026
…me path

Address review feedback: the inline rationale said GPU nodes "skip the DKMS
compile at boot", but this PR only enables the bake -- the boot-time skip
requires the configGPUDrivers skip-build path (PR #8787), which is not yet in
main. Reword to "can later skip ... via the configGPUDrivers skip-build path"
so pipeline maintainers aren't misled, and note the teardown covers non-GPU and
--gpu-driver None nodes.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
4 participants