test(e2e): serialize gallery image-version replication to fix GalleryImageNotFound #8823
Open
ganeshkumarashok wants to merge 1 commit into
…ImageNotFound

Scenarios in different regions resolve the same freshly-built SIG image version in parallel (CachedPrepareVHD is keyed per image+region, so the regions run concurrently). Adding a region is a read-modify-write PUT of the version's TargetRegions, performed with no locking, so a stale PUT dropped a region another goroutine had just added -> the image became unavailable there and VMSS creation failed with GalleryImageNotFound.

Observed on main: 20 failures, all "2404gen2containerd/1.1783016979.17372 is not available in southeastasia", after southeastasia and uaenorth replicated the same version concurrently.

Serialize replication per image-version ID with an in-process mutex and re-read the live TargetRegions inside the lock before each update, so every replication builds on the current region set instead of clobbering concurrent additions. Distinct versions still replicate in parallel; only same-version region additions serialize.
Pull request overview
This PR addresses flaky GPU E2E failures caused by concurrent SIG gallery image-version replication updates clobbering TargetRegions, which can leave a freshly-built image version unavailable in a requested region (GalleryImageNotFound). The change is confined to the E2E Azure helper client and focuses on making region replication updates deterministic under concurrency.
Changes:
- Introduces a per-image-version in-process mutex to serialize `TargetRegions` updates for the same gallery image version.
- Re-reads the live gallery image version inside the lock to ensure each update is based on the current `TargetRegions` state (avoiding lost updates).
- Adds a helper to refresh the image version object from Azure before replication decisions/actions.
Comment on lines +592 to +596

    // different regions resolve the same freshly-built version in parallel, so without
    // serialization a stale PUT drops a region another goroutine just added and the image
    // becomes unavailable there (GalleryImageNotFound). Lock per version ID and re-read the live
    // TargetRegions inside the lock so every update builds on the current region set.
    mu := versionReplicationLock(*version.ID)
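The excerpt calls `versionReplicationLock` but its body is not shown here. A minimal sketch of how such a per-key lock helper might look, assuming a package-level `sync.Map`; the `versionLocks` name, the map-based storage, and the example IDs are assumptions for illustration, not taken from the PR:

```go
package main

import (
	"fmt"
	"sync"
)

// versionLocks maps an image-version ID to its mutex.
// Hypothetical storage; the PR's real helper may be implemented differently.
var versionLocks sync.Map

// versionReplicationLock returns the mutex for versionID, creating it on first use.
// LoadOrStore ensures every caller passing the same ID receives the same *sync.Mutex,
// even when several goroutines ask for it at once.
func versionReplicationLock(versionID string) *sync.Mutex {
	mu, _ := versionLocks.LoadOrStore(versionID, &sync.Mutex{})
	return mu.(*sync.Mutex)
}

func main() {
	// Made-up IDs, used only to show the keying behavior.
	a := versionReplicationLock("gallery/img/versions/1.0.0")
	b := versionReplicationLock("gallery/img/versions/1.0.0")
	c := versionReplicationLock("gallery/img/versions/2.0.0")

	// The same ID shares one lock; a different ID gets its own.
	fmt.Println(a == b, a == c) // prints "true false"
}
```

Keying the lock by version ID is what keeps distinct versions replicating in parallel. The trade-off in this sketch is that entries are never evicted, which is likely acceptable for a short-lived test process.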
Problem

The GPU E2E suite has been failing with `GalleryImageNotFound` (20 failures in the latest `main` run), e.g. "2404gen2containerd/1.1783016979.17372 is not available in southeastasia". All 20 were the same version in the same region (southeastasia).

Root cause: a concurrent lost-update race on `TargetRegions`

`CachedPrepareVHD` memoizes VHD resolution per (image, region), so different regions resolve the same freshly-built SIG image version in parallel. Adding a region is a read-modify-write PUT of the version's `TargetRegions` (`replicateImageVersionToCurrentRegion`), and there was no locking:

1. The southeastasia scenario reads `TargetRegions` (`[eastus]`), appends → PUT `[eastus, southeastasia]`.
2. The uaenorth scenario reads `TargetRegions` (`[eastus]`, before step 1 lands), appends → PUT `[eastus, uaenorth]`.
3. The stale PUT from step 2 drops southeastasia, so the image is unavailable there and VMSS creation fails with `GalleryImageNotFound`.

The run log confirms it: southeastasia and uaenorth replicated version `1.1783016979.17372` concurrently.

Fix

Serialize replication per image-version ID with an in-process mutex, and re-read the live `TargetRegions` inside the lock before each update, so every PUT builds on the current region set instead of clobbering a concurrent addition. Distinct versions still replicate in parallel; only same-version region additions serialize (which Azure requires anyway, since a version resource only accepts one update at a time).

Scope / testing

- E2E helper only (`e2e/config/azure.go`); no provisioning/runtime code touched.
- `go build ./config/` + `go vet ./config/` pass.
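To make the race and the fix concrete, here is a minimal, self-contained sketch that replays both against an in-memory stand-in. Nothing here calls Azure: `fakeVersion`, `lostUpdate`, `replicate`, and `serialized` are hypothetical names invented for illustration, and a single mutex stands in for the per-version lock.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// fakeVersion stands in for a gallery image version; regions plays the role of TargetRegions.
type fakeVersion struct {
	mu      sync.Mutex // protects regions, like the service's own storage would
	regions []string
}

// get returns a copy of the current region list (the "read" of read-modify-write).
func (v *fakeVersion) get() []string {
	v.mu.Lock()
	defer v.mu.Unlock()
	return append([]string(nil), v.regions...)
}

// put overwrites the whole region list (the "write"), as a full PUT would.
func (v *fakeVersion) put(regions []string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.regions = append([]string(nil), regions...)
}

// lostUpdate replays the problematic interleaving deterministically.
func lostUpdate() []string {
	v := &fakeVersion{regions: []string{"eastus"}}
	a := v.get() // southeastasia scenario reads [eastus]
	b := v.get() // uaenorth scenario reads [eastus] before the first PUT lands
	v.put(append(a, "southeastasia"))
	v.put(append(b, "uaenorth")) // stale PUT drops southeastasia
	return v.get()
}

// replicate takes the lock, re-reads the live list, and only then appends and PUTs.
func replicate(lock *sync.Mutex, v *fakeVersion, region string) {
	lock.Lock()
	defer lock.Unlock()
	current := v.get() // re-read inside the lock, never reuse an older snapshot
	for _, r := range current {
		if r == region {
			return // already replicated
		}
	}
	v.put(append(current, region))
}

// serialized runs both region additions concurrently under the shared lock.
func serialized() []string {
	v := &fakeVersion{regions: []string{"eastus"}}
	var lock sync.Mutex
	var wg sync.WaitGroup
	for _, region := range []string{"southeastasia", "uaenorth"} {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			replicate(&lock, v, r)
		}(region)
	}
	wg.Wait()
	got := v.get()
	sort.Strings(got) // goroutine order is not fixed, so sort for stable output
	return got
}

func main() {
	fmt.Println(lostUpdate()) // [eastus uaenorth]
	fmt.Println(serialized()) // [eastus southeastasia uaenorth]
}
```

The lock alone would not be enough if each caller kept using a region list it had read before acquiring the lock, which is why the re-read sits inside the critical section.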