test(e2e): serialize gallery image-version replication to fix GalleryImageNotFound #8823

Open
ganeshkumarashok wants to merge 1 commit into main from ganesh/e2e-fix-gallery-replication-race

@ganeshkumarashok

Contributor

Problem

The GPU E2E suite has been failing with GalleryImageNotFound (20 failures in the latest main run) — e.g.:

The gallery image .../images/2404gen2containerd/versions/1.1783016979.17372
is not available in southeastasia region. Please contact image owner to
replicate to this region, or change your requested region.

All 20 were the same version in the same region (southeastasia).

Root cause — a concurrent lost-update race on TargetRegions

CachedPrepareVHD memoizes VHD resolution per (image, region), so different regions resolve the same freshly-built SIG image version in parallel. Adding a region is a read-modify-write PUT of the version's TargetRegions (replicateImageVersionToCurrentRegion), and there was no locking:

  1. southeastasia reads the version (regions [eastus]), appends → PUT [eastus, southeastasia].
  2. uaenorth reads the version (also [eastus], before step 1 lands), appends → PUT [eastus, uaenorth].
  3. uaenorth's stale PUT drops southeastasia → the image is no longer replicated there → every cached southeastasia scenario fails with GalleryImageNotFound.

The run log confirms it: southeastasia and uaenorth replicated version 1.1783016979.17372 concurrently.

Fix

Serialize replication per image-version ID with an in-process mutex, and re-read the live TargetRegions inside the lock before each update, so every PUT builds on the current region set instead of clobbering a concurrent addition. Distinct versions still replicate in parallel — only same-version region additions serialize (which Azure requires anyway, since a version resource only accepts one update at a time).

Scope / testing

  • e2e-only change (e2e/config/azure.go); no provisioning/runtime code touched.
  • go build ./config/ + go vet ./config/ pass.
  • Trade-off: same-version replications now run sequentially (only on the first run after a new VHD build, for the ~2–3 regions a version is used in) — correctness over a small first-run latency; the tests already block on replication today.

…ImageNotFound

Scenarios in different regions resolve the same freshly-built SIG image
version in parallel (CachedPrepareVHD is keyed per image+region, so the
regions run concurrently). Adding a region is a read-modify-write PUT of
the version's TargetRegions, performed with no locking, so a stale PUT
dropped a region another goroutine had just added -> the image became
unavailable there and VMSS creation failed with GalleryImageNotFound.

Observed on main: 20 failures, all "2404gen2containerd/1.1783016979.17372
is not available in southeastasia", after southeastasia and uaenorth
replicated the same version concurrently.

Serialize replication per image-version ID with an in-process mutex and
re-read the live TargetRegions inside the lock before each update, so
every replication builds on the current region set instead of clobbering
concurrent additions. Distinct versions still replicate in parallel; only
same-version region additions serialize.

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses flaky GPU E2E failures caused by concurrent SIG gallery image-version replication updates clobbering TargetRegions, which can leave a freshly-built image version unavailable in a requested region (GalleryImageNotFound). The change is confined to the E2E Azure helper client and focuses on making region replication updates deterministic under concurrency.

Changes:

  • Introduces a per-image-version in-process mutex to serialize TargetRegions updates for the same gallery image version.
  • Re-reads the live gallery image version inside the lock to ensure each update is based on the current TargetRegions state (avoiding lost updates).
  • Adds a helper to refresh the image version object from Azure before replication decisions/actions.

Comment thread e2e/config/azure.go
Comment on lines +592 to +596
// different regions resolve the same freshly-built version in parallel, so without
// serialization a stale PUT drops a region another goroutine just added and the image
// becomes unavailable there (GalleryImageNotFound). Lock per version ID and re-read the live
// TargetRegions inside the lock so every update builds on the current region set.
mu := versionReplicationLock(*version.ID)