feat(gpu): consume the VHD-prebuilt CUDA kernel module (skip in-CSE recompile)#8787
ganeshkumarashok wants to merge 1 commit
Pull request overview
This PR updates the Linux CSE GPU driver install flow to consume a VHD-prebaked NVIDIA CUDA DKMS module when a prebake marker is present, by invoking the aks-gpu container with install-skip-build (instead of install) to avoid the expensive in-CSE recompile. It fits into the AgentBaker provisioning scripts under parts/linux/cloud-init/artifacts/ and adds ShellSpec coverage to validate the marker-driven action selection (including the driver_kind guard).
Changes:
- Extend `installGPUDriverImage` to accept an install action (`install` vs `install-skip-build`) and pass the selected action from `configGPUDrivers` on Ubuntu.
- Add marker parsing and a `driver_kind` match guard to ensure skip-build only happens when the prebaked driver kind matches the node's `NVIDIA_GPU_DRIVER_TYPE`.
- Add ShellSpec tests validating action selection: no marker → `install`, matching marker → `install-skip-build`, mismatched kind → `install`.
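The selection rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `cse_config.sh`: the function names `read_marker_field` and `select_gpu_install_action`, the marker path passed in by the caller, and the `key=value` marker format are all assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: helper names and the key=value marker format are
# assumptions, not the actual AgentBaker implementation.

# Print the value of `key` from a key=value marker file; print nothing if
# the file or the key is missing.
read_marker_field() {
    local marker_file="$1" key="$2"
    [ -f "$marker_file" ] || return 0
    sed -n "s/^${key}=//p" "$marker_file" | head -n 1
}

# Print the action to request from the aks-gpu container:
# install-skip-build only when the marker exists AND its driver_kind matches
# the node's driver type; install (the full build) in every other case.
select_gpu_install_action() {
    local marker_file="$1" node_driver_kind="$2"
    local marker_kind
    marker_kind="$(read_marker_field "$marker_file" "driver_kind")"
    if [ -n "$marker_kind" ] && [ "$marker_kind" = "$node_driver_kind" ]; then
        echo "install-skip-build"
    else
        echo "install"
    fi
}
```

A caller would pass the marker path and the node's `NVIDIA_GPU_DRIVER_TYPE`, then hand the printed action to the container invocation. Defaulting to `install` on any doubt keeps non-prebaked VHDs on the existing path.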
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_config.sh | Select install vs install-skip-build based on prebake marker presence and driver_kind match, and plumb the action into the aks-gpu container invocation. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Add ShellSpec coverage for marker-driven action selection, including the CUDA-vs-GRID guard behavior. |
`Describe 'configGPUDrivers'`
Good call: renamed this block to `Describe 'configGPUDrivers prebake marker action selection'` so it is distinct from the existing timing-event `configGPUDrivers` suite, which keeps ShellSpec output unambiguous (commit 5b72619).
Response generated by GitHub Copilot.
…ecompile)

Splits the consume / skip-build half of #8661 into its own PR (the bake + teardown half ships separately and must land first to produce the marker).

When the aks-gpu prebake marker is present AND its driver_kind matches this node, configGPUDrivers asks the aks-gpu container for `install-skip-build` -- running only the device-dependent steps and skipping the ~80-150s in-CSE DKMS recompile. aks-gpu independently re-validates the marker (kernel + driver_version + driver_kind) and falls back to a full build on any mismatch, so this is safe on non-prebaked VHDs (no marker -> `install`, today's behavior).

Validation: shellspec 736/0, generate-testdata clean, shellcheck clean. Depends on the bake PR landing first to be exercised.

Reference (full original, kept open): #8661.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
dda3dbc to 5b72619
…me path

Address review feedback: the inline rationale said GPU nodes "skip the DKMS compile at boot", but this PR only enables the bake -- the boot-time skip requires the configGPUDrivers skip-build path (PR #8787), which is not yet in main. Reword to "can later skip ... via the configGPUDrivers skip-build path" so pipeline maintainers aren't misled, and note the teardown covers non-GPU and --gpu-driver None nodes.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
What
Splits the consume / skip-build half of #8661 into its own PR (#8661 stays open for reference). The bake + teardown half is #8786 and must land first to produce the marker.

Changes
- `parts/linux/cloud-init/artifacts/cse_config.sh`: when the aks-gpu prebake marker is present and its `driver_kind` matches this node, `configGPUDrivers` asks the aks-gpu container for `install-skip-build` (device-dependent steps only), skipping the ~80–150s in-CSE DKMS recompile. aks-gpu independently re-validates the marker (kernel + driver_version + driver_kind) and falls back to a full build on any mismatch.

Safety
- No marker → `install`, i.e. today's behavior. Defense-in-depth: gated by marker presence and driver_kind match, with aks-gpu's own re-validation + full-build fallback underneath.

Validation
- `make generate-testdata` clean, shellcheck clean.
- `[TEST All VHDs]` pipeline (bake + consume together): in progress.

Reference (full original, kept open): #8661.
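To illustrate the second layer of the defense-in-depth described under Safety, here is a rough sketch of how a consumer could re-validate the marker against kernel, driver_version and driver_kind and fall back to a full build on any mismatch. It is not taken from aks-gpu: the function names `marker_value`, `marker_matches` and `resolve_build_mode`, the `key=value` marker format, and the `skip-build` / `full-build` labels are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: not the aks-gpu implementation. The marker format
# (key=value lines) and all function names are assumptions.

# Print the value recorded for key $2 in marker file $1 (empty if absent).
marker_value() {
    sed -n "s/^$2=//p" "$1" 2>/dev/null | head -n 1
}

# Return 0 only if the marker exists and all three recorded fields equal
# the values observed on the running node.
marker_matches() {
    local marker_file="$1" kernel="$2" driver_version="$3" driver_kind="$4"
    [ -f "$marker_file" ] || return 1
    [ "$(marker_value "$marker_file" kernel)" = "$kernel" ] || return 1
    [ "$(marker_value "$marker_file" driver_version)" = "$driver_version" ] || return 1
    [ "$(marker_value "$marker_file" driver_kind)" = "$driver_kind" ] || return 1
}

# Decide what to do for a requested action. Skip the build only when the
# caller asked for install-skip-build AND the marker fully matches;
# otherwise fall back to the full build.
resolve_build_mode() {
    local requested_action="$1"
    shift
    if [ "$requested_action" = "install-skip-build" ] && marker_matches "$@"; then
        echo "skip-build"
    else
        echo "full-build"
    fi
}
```

For example, `resolve_build_mode install-skip-build "$marker" "$(uname -r)" "$driver_version" "$driver_kind"` would print `full-build` whenever the running kernel differs from the one recorded at bake time, so a stale marker costs build time but not correctness.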