feat(linux): reduce prefetch optimization runtime within VHD builds and add retry logic to handle AIB rate limits#8802
Pull request overview
This PR changes the Linux VHD prefetch optimization flow in vhdbuilder/prefetch/ to speed up VHD builds by having Azure Image Builder distribute the optimized output as a managed image, then converting that managed image into a VHD blob in the target storage account via a single azcopy operation.
Changes:
- Update the Image Builder template to distribute to a `ManagedImage` instead of a `VHD` blob.
- Add a managed-image → managed-disk → VHD conversion step in `optimize.sh` after the Image Builder run.
- Adjust idempotency logic to treat the target VHD as “complete” only after a metadata marker is written.
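The marker-based idempotency check in the last bullet could be sketched roughly as follows. This is not the code in `optimize.sh`: the marker key `prefetch_optimized`, the function names, the `AZ` override variable, and the use of `--auth-mode login` are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: MARKER_KEY and both function names are hypothetical,
# not taken from optimize.sh.
AZ="${AZ:-az}"                                   # overridable so the logic can be run against a stub
MARKER_KEY="${MARKER_KEY:-prefetch_optimized}"   # hypothetical metadata key

# Succeeds only if the blob's metadata can be read AND the marker equals "true".
# A blob that exists but has no marker (e.g. an interrupted copy) is "not complete".
vhd_is_complete() {
  local account="$1" container="$2" blob="$3" value
  value="$("$AZ" storage blob metadata show \
    --account-name "$account" --container-name "$container" --name "$blob" \
    --auth-mode login --query "$MARKER_KEY" --output tsv 2>/dev/null)" || return 1
  [ "$value" = "true" ]
}

# Intended to be the very last step, after the VHD copy has finished successfully.
# Note: `metadata update` replaces the blob's existing metadata with the given pairs.
mark_vhd_complete() {
  local account="$1" container="$2" blob="$3"
  "$AZ" storage blob metadata update \
    --account-name "$account" --container-name "$container" --name "$blob" \
    --auth-mode login --metadata "${MARKER_KEY}=true"
}
```

A caller would skip the whole optimization when `vhd_is_complete` succeeds, and otherwise rerun it and call `mark_vhd_complete` only once the copy has succeeded, so a partially written blob is never mistaken for a finished one.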
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `vhdbuilder/prefetch/templates/optimize.json` | Switch Image Builder distribution output from VHD blob to managed image output. |
| `vhdbuilder/prefetch/scripts/optimize.sh` | Convert the distributed managed image into a VHD blob and mark completion via blob metadata. |
Failed gate
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=170737959
Failed job/stage/task: $(System.Collections.Hashtable.job) (logId 481).
Detective summary
Known CIS-CAT scanner failure in vhd-scanning: CIS-CAT Pro Assessor v4.57.1 reports
Likely cause / signature
Likely known VHD scan/CIS-CAT gate tooling failure, not this PR.
Signature: AB-GATE-LINUX-VHD-SCAN-CISCAT-EXIT122.
Confidence: High.
Strongest alternative: The prefetch/AIB optimization changes affected image generation; less likely because two unrelated component-update PRs in the same cycle failed with the identical scanner signature across the same VHD legs.
Recommended action
No PR-specific remediation recommended unless future evidence shows a distinct CIS rule failure; continue repair item #38671557.
What this PR does / why we need it:
Reduces prefetch optimization runtime within VHD builds by 40-60% in initial testing. Instead of relying on AIB's own mechanism to distribute the output VHD blob, which appears to be much slower, we copy AIB's output to our destination storage container ourselves via azcopy.
Across every per-SKU execution (main n=257, feature n=59):
| Percentile | main | feature | Reduction |
|---|---|---|---|
| Median | 66.6 min | 26.7 min | −60% (2.49×) |
| P90 | 82.0 min | 33.4 min | −59% |
| P95 | 86.4 min | 35.9 min | −58% |
| P99 | 99.9 min | 42.2 min | −58% |
| Mean | 67.6 min | 27.8 min | −59% |
Per-SKU speedup is consistent across all 26 SKUs (1.74×–3.03×, ~2.4× avg). Crucially, the two full feature builds (all 30 SKUs, same parallel contention as main) show the same ~2.5× gain — so it's a real optimization, not a small-build artifact.
| Percentile | main (n=10) | feature (n=5) |
|---|---|---|
| Median | 130.3 min | 82.6 min |
| P95 | 250.1 min | 94.1 min |
| Max | 290.7 min | 95.2 min |
| Mean | 153.3 min | 81.2 min |
Caveat: 3 of 5 feature runs were partial (5/5/1 SKUs). Fair full-vs-full: feature 89.5 & 95.2 min (median ~92) vs main median ~130 min → ~30–40% lower, and main's long tail (200/291 min) disappears.
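For readers unfamiliar with the approach, the manual copy described above could look roughly like the sketch below. It starts from an existing managed disk (the image-to-disk step is omitted) and is not the actual `optimize.sh` code: the function name, the `AZ`/`AZCOPY` override variables, and the one-hour SAS lifetime are assumptions, and the destination URL is assumed to be one azcopy is already authorized to write to.

```shell
#!/usr/bin/env bash
# Sketch only: function name, overrides, and the SAS duration are hypothetical.
AZ="${AZ:-az}"
AZCOPY="${AZCOPY:-azcopy}"

# Export a managed disk to a page blob (VHD) with a single azcopy transfer.
copy_disk_to_vhd() {
  local rg="$1" disk="$2" dest_url="$3" sas rc=0
  # Request a short-lived read SAS for the disk. The response is assumed to hold
  # a single SAS field, so tsv output yields just the SAS URL.
  sas="$("$AZ" disk grant-access --resource-group "$rg" --name "$disk" \
    --access-level Read --duration-in-seconds 3600 --output tsv)" || return 1
  [ -n "$sas" ] || return 1
  # VHDs must be page blobs, hence the explicit blob type.
  "$AZCOPY" copy "$sas" "$dest_url" --blob-type PageBlob || rc=$?
  # Always revoke the SAS, even when the copy failed.
  "$AZ" disk revoke-access --resource-group "$rg" --name "$disk" >/dev/null
  return "$rc"
}
```

The copy's exit status is what the function returns, so a caller can safely gate any follow-up step (such as writing a completion marker) on its success.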
Which issue(s) this PR fixes:
Fixes #