ci(e2e): gate staging e2e on critical staging-instance config drift #8757

Open
jacekradko wants to merge 3 commits into main from jacek/staging-e2e-validate-gate

@jacekradko jacekradko commented Jun 5, 2026

Member

Follow-up to #8756. The validate-staging-instances script already compares prod vs staging /v1/environment and prints a diff, but it always exited 0. A drifted staging mirror (like the missing WhatsApp channel that makes whatsapp-phone-code time out) therefore blocked nothing and stayed invisible until tests timed out 200 tests deep.
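
For readers unfamiliar with the validator, the comparison it performs can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: `diffPaths` and the `{ path, prod, staging }` mismatch shape are hypothetical, and it assumes both /v1/environment responses are plain JSON in which array order is meaningful.

```javascript
// Hypothetical helper: walk two plain-JSON values and collect every
// leaf path (dot-separated) at which they differ.
function diffPaths(prod, staging, prefix = '') {
  const isObject = v => v !== null && typeof v === 'object' && !Array.isArray(v);

  if (isObject(prod) && isObject(staging)) {
    // Visit the union of keys so a key missing on one side is reported too.
    const keys = new Set([...Object.keys(prod), ...Object.keys(staging)]);
    return [...keys]
      .sort()
      .flatMap(key => diffPaths(prod[key], staging[key], prefix ? `${prefix}.${key}` : key));
  }

  // Leaves (including arrays) are compared by their JSON encoding.
  return JSON.stringify(prod) === JSON.stringify(staging) ? [] : [{ path: prefix, prod, staging }];
}
```

With this shape, a staging instance that lacks a channel present in prod yields a single mismatch whose `path` points at the channels list, which is the kind of entry the gate later classifies.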

This gives the script teeth without flipping any behavior yet. It gains a tight CRITICAL_PATHS allowlist (attribute enabled toggles, phone_number.channels, auth factors, social enable/disable, password policy) plus an ACCEPTED_DRIFT escape hatch, so a known and tracked gap doesn't block while new drift does. In strict mode it exits non-zero on a blocking mismatch; fetch failures and cosmetic drift never fail the build.
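
The allowlist plus escape hatch described above could look something like the sketch below. The helper names match those the PR exports (`isCriticalPath`, `isAcceptedDrift`, `classifyMismatches`), but their signatures, the regex patterns, and the `with-whatsapp` instance name are assumptions made for illustration; the real policy lives in scripts/validate-staging-instances.mjs.

```javascript
// Assumed patterns, for illustration only.
const CRITICAL_PATHS = [
  /^user_settings\.attributes\.[^.]+\.enabled$/,
  /^user_settings\.attributes\.phone_number\.channels$/,
  /^user_settings\.social\.[^.]+\.enabled$/,
  /^user_settings\.password_settings\./,
];

// Known, tracked gaps. Each entry is scoped to one (hypothetical) instance
// so it cannot mask the same drift on a different instance.
const ACCEPTED_DRIFT = [
  { instance: 'with-whatsapp', path: /^user_settings\.attributes\.phone_number\.channels$/ },
];

const isCriticalPath = path => CRITICAL_PATHS.some(re => re.test(path));

const isAcceptedDrift = (instance, path) =>
  ACCEPTED_DRIFT.some(entry => entry.instance === instance && entry.path.test(path));

// Split mismatches into those that gate the run and those that are only logged.
function classifyMismatches(instance, mismatches) {
  const blocking = [];
  const informational = [];
  for (const mismatch of mismatches) {
    const gates = isCriticalPath(mismatch.path) && !isAcceptedDrift(instance, mismatch.path);
    (gates ? blocking : informational).push(mismatch);
  }
  return { blocking, informational };
}
```

Keeping accepted drift as data rather than as extra ignore patterns means an accepted entry still shows up in the report as informational, so it stays visible while it is tolerated.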

Strictness is driven by the STAGING_VALIDATE_STRICT repo variable and defaults to report-only, and integration-tests now depends on validate-instances. Nothing changes until someone sets the variable: today the script just logs the blocking drift and the gate it would apply. The piece worth a look is the CRITICAL_PATHS set, since that is the policy for what is worth blocking a run over.
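
As a rough illustration of the report-only default, the strict switch and the resulting exit code might be resolved as below. `resolveStrict` and `exitCodeFor` are hypothetical names, and treating only the literal string "true" as enabling strict mode is an assumption; the PR only states that strictness comes from the STAGING_VALIDATE_STRICT env var or a --strict CLI flag.

```javascript
// Assumption: only "true" (case-insensitive) turns strict mode on, so an
// unset or empty repo variable leaves the validator in report-only mode.
function resolveStrict(env, argv) {
  const fromEnv = String(env.STAGING_VALIDATE_STRICT ?? '').trim().toLowerCase() === 'true';
  return fromEnv || argv.includes('--strict');
}

// Only blocking mismatches in strict mode produce a non-zero exit code.
// Fetch failures and cosmetic drift are deliberately not inputs here.
function exitCodeFor({ strict, blockingCount }) {
  if (blockingCount === 0) return 0;
  return strict ? 1 : 0;
}
```

In a script this would typically be called as `resolveStrict(process.env, process.argv.slice(2))`, with the result passed on to whatever decides the exit code.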

Before enabling strict, run the validator against current staging to confirm the only blocking drift is expected, and add ACCEPTED_DRIFT entries for anything intentionally tolerated. Stacked on #8756.

Update: the branch is rebased onto main (dropping the stale pre-squash copy of #8756 and keeping main's TURBO_FORCE and report-path fixes), and captcha_enabled is now in CRITICAL_PATHS. An enabled captcha blocks every in-browser sign-up in headless CI, which is what kept the staging generic leg red for a week (legal-consent vs Turnstile). The captcha ignore-list removal is included here too so the gate works standalone; it overlaps with #8832 by design and merges cleanly in either order, with a pipeline test pinning that critical paths cannot be swallowed by the ignore filter. Also, the report job now notifies Slack when the strict gate itself fails, since a gate failure skips the test legs rather than failing them and would otherwise be silent. Still report-only until the repo var is set; bring instances to parity first using #8832's report.
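
The reason the Slack notification is needed can be seen by modelling the job conditions. The sketch below is a JavaScript rendering of logic that actually lives in workflow YAML expressions; `shouldRunTests` and `shouldNotify` are hypothetical names, and the result strings are the values GitHub Actions reports for `needs.<job>.result`.

```javascript
// Job results in GitHub Actions: 'success' | 'failure' | 'cancelled' | 'skipped'.

// The test legs run only when the validator succeeded or did not run at all.
const shouldRunTests = validateResult => validateResult === 'success' || validateResult === 'skipped';

// When the gate fails, the test legs end up 'skipped' rather than 'failure',
// so a notification keyed only on test failures would stay silent.
const shouldNotify = ({ validate, tests }) => validate === 'failure' || tests === 'failure';
```

A failed gate gives `{ validate: 'failure', tests: 'skipped' }`, which only the first half of the `shouldNotify` condition catches.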

Summary by CodeRabbit

Release Notes

  • Chores

    • Enhanced E2E staging validation with optional strict mode for improved deployment gating
    • Updated staging workflow with improved job sequencing and longer test artifact retention
    • Added critical configuration drift detection for staging environments
  • Tests

    • Expanded test coverage for staging validation scenarios

changeset-bot Bot commented Jun 5, 2026

🦋 Changeset detected

Latest commit: 9353469

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 0 packages

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 5, 2026

Contributor

Walkthrough

This PR adds critical-path validation with strict gating to staging instance configuration checks, wires this validation into the E2E workflow to skip tests on configuration drift, and provides comprehensive test coverage for the new validation logic.

Changes

Staging E2E Validation Strict Gating

| Layer | File(s) | Summary |
|---|---|---|
| Critical path policy and mismatch classification | scripts/validate-staging-instances.mjs | Defines which configuration paths are critical (factor changes, provider toggles, password settings, captcha gating) and provides isCriticalPath(), isAcceptedDrift(), and classifyMismatches() helpers to distinguish blocking mismatches from informational ones. |
| Strict gating enforcement in validation script | scripts/validate-staging-instances.mjs | main() now accepts a strict parameter (from STAGING_VALIDATE_STRICT env var or --strict CLI flag), accumulates blocking critical mismatches per instance, and exits with code 1 in strict mode when blocking mismatches exist; exports the new classification helpers. |
| Validation script test coverage | scripts/validate-staging-instances.test.mjs | Comprehensive tests assert critical-path detection for authentication factors, social providers, password settings, and captcha toggles; verify blocking vs informational classification with instance-scoped accepted drift and regex allowlists; and confirm strict-mode gating behavior (exit code 1 on critical drift, report-only in non-strict mode). |
| Workflow orchestration and validation gating | .github/workflows/e2e-staging.yml | Adds STAGING_VALIDATE_STRICT env var sourced from repo variables, makes integration-tests conditional on validate-instances succeeding or skipped, adds validate-instances to the final report job dependencies, and extends Slack failure notifications to trigger on validation failures. |
| Changeset documentation | .changeset/staging-e2e-validate-gate.md | Changeset file declaring the staging E2E validation gating feature. |

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • clerk/javascript#8766: Both PRs update the staging E2E workflow's Playwright JSON report artifact handling, including integration/playwright-report/results.json artifact uploads and integration-test execution/reporting steps.

Suggested Reviewers

  • tmilewski

🐰 A rabbit hops through staging gates so fine,
Gating critical paths with each config line,
When drift appears, the strict mode takes its stand,
With blocking checks across the e2e land,
Tests prove the logic works just as planned! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a gate for staging e2e tests based on critical configuration drift detection between production and staging instances. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


vercel Bot commented Jun 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| clerk-js-sandbox | Ready | Preview, Comment | Jun 11, 2026 5:21pm |
| swingset | Ready | Preview, Comment | Jun 11, 2026 5:21pm |

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/validate-staging-instances.mjs (1)

24-32: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Critical captcha drift is filtered out before strict gating.

user_settings.sign_up.captcha_enabled is marked critical (Line 61), but isIgnored still drops *.captcha_enabled (Line 28) before classification (Line 452). That makes this critical path non-blocking in practice.

Suggested fix
 const IGNORED_PATHS = [
   /\.id$/,
   /^auth_config\.id$/,
   /\.logo_url$/,
-  /\.captcha_enabled$/,
-  /\.captcha_widget_type$/,
   /\.enforce_hibp_on_sign_in$/,
   /\.disable_hibp$/,
 ];

Also applies to: 47-62, 452-457
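
The finding above also mentions reordering as an alternative to deleting the ignore patterns. A minimal sketch of that alternative is below, assuming simplified pattern lists; `matchesAny`, `filterThenClassify`, and `guardedClassify` are hypothetical names, and the PR itself took the other route of removing the captcha patterns from the ignore list.

```javascript
// Simplified, assumed pattern lists for illustration.
const IGNORED_PATHS = [/\.id$/, /\.logo_url$/, /\.captcha_enabled$/];
const CRITICAL_PATHS = [/^user_settings\.sign_up\.captcha_enabled$/];

const matchesAny = (patterns, path) => patterns.some(re => re.test(path));

// Problematic order: the ignore filter runs first, so the critical captcha
// path is dropped before the classifier ever sees it.
const filterThenClassify = paths =>
  paths.filter(p => !matchesAny(IGNORED_PATHS, p)).filter(p => matchesAny(CRITICAL_PATHS, p));

// Guarded order: a critical path can never be ignored, whatever
// IGNORED_PATHS happens to contain.
const isIgnored = path => !matchesAny(CRITICAL_PATHS, path) && matchesAny(IGNORED_PATHS, path);

const guardedClassify = paths =>
  paths.filter(p => !isIgnored(p)).filter(p => matchesAny(CRITICAL_PATHS, p));
```

The guard makes the two lists order-independent, at the cost of leaving a redundant ignore pattern in place; removing the pattern is the smaller change.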

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate-staging-instances.mjs` around lines 24 - 32, The
IGNORED_PATHS array contains a /\.captcha_enabled$/ pattern which causes
isIgnored to drop captcha flags before they can be classified as critical
(specifically user_settings.sign_up.captcha_enabled); remove or narrow that
pattern in IGNORED_PATHS (or change the order so classification runs before
isIgnored) so that user_settings.sign_up.captcha_enabled is evaluated by the
existing critical-path logic; locate IGNORED_PATHS and the isIgnored call in
scripts/validate-staging-instances.mjs and ensure *.captcha_enabled is not
globally filtered out prior to the criticality check.
🧹 Nitpick comments (1)
scripts/validate-staging-instances.test.mjs (1)

647-717: ⚡ Quick win

Add a strict-mode main() regression test for captcha_enabled drift.

Current tests validate captcha at classifier level, but not through the full main() pipeline. Add one case where user_settings.sign_up.captcha_enabled differs and main({ strict: true }) must exit with 1.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate-staging-instances.test.mjs` around lines 647 - 717, Add a
new regression test in scripts/validate-staging-instances.test.mjs that mirrors
the existing strict-mode cases but uses a drift in
user_settings.sign_up.captcha_enabled: call setPair(), mockEnvPair() with one
env having user_settings: {...emptyUserSettings(), sign_up: { captcha_enabled:
true }} and the other having sign_up: { captcha_enabled: false }, then await
expect(main({ strict: true })).rejects.toThrow('process.exit(1)') and assert
exitCode === 1 and consoleLogs contains the blocking mismatch message; follow
the pattern used in the other tests (e.g., the "exits non-zero in strict mode
when a critical config path drifts" test) to place and name the new it(...)
block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 91ce88f1-9bd4-4352-b093-5a8ffded5427

📥 Commits

Reviewing files that changed from the base of the PR and between 5bf43a2 and bf10aa7.

📒 Files selected for processing (8)
  • .changeset/staging-e2e-resilience-p0.md
  • .changeset/staging-e2e-validate-gate.md
  • .github/workflows/e2e-staging.yml
  • integration/playwright.config.ts
  • integration/tests/custom-pages.test.ts
  • integration/tests/whatsapp-phone-code.test.ts
  • scripts/validate-staging-instances.mjs
  • scripts/validate-staging-instances.test.mjs

validate-staging-instances.mjs already diffs prod vs staging /v1/environment
but every exit path returned 0, so detected drift blocked nothing and the
job was not a dependency of the test matrix. A drifted staging mirror (e.g. a
missing phone_number WhatsApp channel) therefore surfaced only as opaque test
timeouts 200 tests deep.

Add a tight CRITICAL_PATHS allowlist (attribute enabled toggles,
phone_number.channels, auth factors/strategies, social enable/disable,
password settings) and an ACCEPTED_DRIFT escape hatch so known gaps don't
block while new drift does. In strict mode the script exits non-zero on a
blocking mismatch; fetch failures and cosmetic drift never fail the build.

Wire integration-tests to need validate-instances, and drive strictness from
the STAGING_VALIDATE_STRICT repo variable (default report-only). So this is a
no-op until the team opts in: it logs blocking drift and the proposed gate
without failing anything. Flip the variable to make it enforce.

pkg-pr-new Bot commented Jun 11, 2026

@clerk/astro

npm i https://pkg.pr.new/@clerk/astro@8757

@clerk/backend

npm i https://pkg.pr.new/@clerk/backend@8757

@clerk/chrome-extension

npm i https://pkg.pr.new/@clerk/chrome-extension@8757

@clerk/clerk-js

npm i https://pkg.pr.new/@clerk/clerk-js@8757

@clerk/expo

npm i https://pkg.pr.new/@clerk/expo@8757

@clerk/expo-passkeys

npm i https://pkg.pr.new/@clerk/expo-passkeys@8757

@clerk/express

npm i https://pkg.pr.new/@clerk/express@8757

@clerk/fastify

npm i https://pkg.pr.new/@clerk/fastify@8757

@clerk/hono

npm i https://pkg.pr.new/@clerk/hono@8757

@clerk/localizations

npm i https://pkg.pr.new/@clerk/localizations@8757

@clerk/nextjs

npm i https://pkg.pr.new/@clerk/nextjs@8757

@clerk/nuxt

npm i https://pkg.pr.new/@clerk/nuxt@8757

@clerk/react

npm i https://pkg.pr.new/@clerk/react@8757

@clerk/react-router

npm i https://pkg.pr.new/@clerk/react-router@8757

@clerk/shared

npm i https://pkg.pr.new/@clerk/shared@8757

@clerk/tanstack-react-start

npm i https://pkg.pr.new/@clerk/tanstack-react-start@8757

@clerk/testing

npm i https://pkg.pr.new/@clerk/testing@8757

@clerk/ui

npm i https://pkg.pr.new/@clerk/ui@8757

@clerk/upgrade

npm i https://pkg.pr.new/@clerk/upgrade@8757

@clerk/vue

npm i https://pkg.pr.new/@clerk/vue@8757

commit: 9353469

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/e2e-staging.yml (1)

378-379: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Slack failure text is now misleading for validation-gate failures.

At Line 378, the message always says “Staging E2E tests failed”, but this notification can now also fire when validate-instances fails and test legs are skipped. Please update wording to cover both failure sources.

Suggested update
-                    "text": "*:red_circle: Staging E2E tests failed*\n*Repo:* `${{ github.repository }}`\n*Ref:* `${{ steps.inputs.outputs.ref }}`\n*SDK:* `${{ steps.inputs.outputs.sdk-source }}`\n*clerk_go commit:* `${{ steps.inputs.outputs.clerk-go-commit-sha || 'N/A' }}`\n*Run:* <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View logs>"
+                    "text": "*:red_circle: Staging E2E workflow failed*\n*Failure source:* `${{ needs.validate-instances.result == 'failure' && 'staging instance validation gate' || 'integration tests' }}`\n*Repo:* `${{ github.repository }}`\n*Ref:* `${{ steps.inputs.outputs.ref }}`\n*SDK:* `${{ steps.inputs.outputs.sdk-source }}`\n*clerk_go commit:* `${{ steps.inputs.outputs.clerk-go-commit-sha || 'N/A' }}`\n*Run:* <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View logs>"
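
The `&& ... || ...` expression in the suggested update acts as a ternary. A JavaScript rendering of the same selection, offered only as an aid to reading the YAML, is below; `failureSource` and `slackHeadline` are hypothetical names and the remaining Slack fields are omitted.

```javascript
// Equivalent of the workflow expression
//   needs.validate-instances.result == 'failure' && '<gate>' || '<tests>'
function failureSource(validateResult) {
  return validateResult === 'failure' ? 'staging instance validation gate' : 'integration tests';
}

// Build only the first two lines of the Slack text.
function slackHeadline(validateResult) {
  const source = failureSource(validateResult);
  return `*:red_circle: Staging E2E workflow failed*\n*Failure source:* \`${source}\``;
}
```

The `a && b || c` form behaves like a ternary only while `b` is truthy; here `b` is a non-empty string, so the substitution is safe.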
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/e2e-staging.yml around lines 378 - 379, Update the Slack
notification "text" field so it no longer always reads "Staging E2E tests
failed" and instead covers both failure cases (E2E test failures and
validation-gate failures that skip test legs). Locate the Slack step that sets
the "text" property (the multiline string starting with "*:red_circle: Staging
E2E tests failed*") and change the message to a neutral combined message such as
"*:red_circle: Staging E2E tests or instance validation failed*" (or similar
wording) while preserving the existing repo/ref/SDK/commit/run placeholders and
link formatting; ensure the modified "text" string still interpolates the same
GitHub action output variables (`${{ steps.inputs.outputs.* }}`).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Repository UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: bc91a59f-9b2d-4a4f-af14-9d8b212691ef

📥 Commits

Reviewing files that changed from the base of the PR and between bf10aa7 and 9353469.

📒 Files selected for processing (2)
  • .changeset/staging-e2e-validate-gate.md
  • .github/workflows/e2e-staging.yml
✅ Files skipped from review due to trivial changes (1)
  • .changeset/staging-e2e-validate-gate.md

2 participants