Skip to content

[CLI] Pin file-locking test suite to 3 workers#3521

Merged
mho22 merged 1 commit intotrunkfrom
fix/file-locking-test-worker-count
Apr 23, 2026
Merged

[CLI] Pin file-locking test suite to 3 workers#3521
mho22 merged 1 commit intotrunkfrom
fix/file-locking-test-worker-count

Conversation

@pento
Copy link
Copy Markdown
Member

@pento pento commented Apr 22, 2026

Motivation for the change, related issues

Follow-up to #3504.

test-playground-cli (macos-latest) started failing consistently after #3504 merged. Every failing run showed the same five timeouts in packages/playground/cli/tests/file-locking.spec.ts:

  • PHP flock() > should grant multiple shared locks on a file
  • PHP flock() > should release a shared lock when its associated file descriptor is closed
  • PHP flock() > should release an exclusive lock when its associated file descriptor is closed
  • PHP flock() > should release a shared lock when the owning process exits
  • PHP flock() > should release an exclusive lock when the owning process exits

Root cause

The multi shared locks test at file-locking.spec.ts:1090 fires three concurrent PHP requests with Promise.all. Each script busy-waits on a coordination file that only the next script in the chain can write, so all three must run concurrently.

Before #3504 the CLI hardcoded 6 workers, which was always enough. After #3504 the default dropped to min(6, max(1, cpus - 1)). GitHub's macos-latest runners report 3 CPUs, resolving to 2 workers — so PHP3 is permanently queued and the test deadlocks.

Why four unrelated tests also fail

The runCLI server is created once in beforeAll and shared across every test in the describe. When the multi-shared test times out, its two workers are still spinning inside PHP-level while (file_get_contents(...) \!== ...) loops — vitest can abort the JS await, but there's no hook to unblock PHP. The four two-script tests that follow multi-shared in file order each do Promise.all([fetchScript(php1), fetchScript(php2)]), inherit a pool with zero free workers, and time out at 60 s each.

Evidence this is the right model:

  • The two-script tests that run before multi-shared pass (deny-exclusive-with-shared, deny-shared-with-exclusive). The two-script tests after fail. Failures aren't interleaved among the earlier ones.
  • Every failure is a 60 s timeout, never an assertion error. Independent flock races would surface as expect(lock_acquired).toBe(false)-style failures.
  • A 200-run audit of test-playground-cli (macos-latest) across all branches shows no consistent failure pattern before [CLI] Add --workers=<n|auto> flag to configure worker thread count #3504 merged.

Implementation details

Pass workers: 3 to runCLI() in the suite's beforeAll. Three is the minimum the multi-shared test needs; anything lower re-introduces the cascade. Explicit integer values bypass the min(6, cpus-1) default clamp (confirmed by the honors an explicit --workers=3 test at run-cli.spec.ts:1886), so the suite is now robust to host CPU count.

No production code changes. Only the test file is touched.

Testing Instructions (or ideally a Blueprint)

# Passes on macOS-like hosts with 3 CPUs:
npx nx test-playground-cli playground-cli --testFile=file-locking.spec.ts

# To reproduce the original failure, temporarily change workers: 3 to workers: 2:
# expect 5 timeouts in the PHP flock() describe.

On CI, the test-playground-cli (macos-latest) job should turn green. Ubuntu and Windows runners were already passing (their default worker count was already ≥3) and should remain so.

AI disclosure

Per WordPress AI Guidelines:

AI assistance: Yes
Tool: Claude Code (Claude Opus 4.7)
Used for: Root-cause analysis of the worker-starvation cascade, drafting the fix and this PR description. All code was reviewed and tested by the author.

The `multi shared locks` test at file-locking.spec.ts:1090 fires three
concurrent PHP requests with `Promise.all`, each busy-waiting on a
coordination file that only the next script in the chain can write.
It therefore needs three PHP workers running concurrently.

Before #3504 the CLI hardcoded 6 workers, which was always enough.
After #3504 the default dropped to `min(6, max(1, cpus - 1))`. On
`macos-latest` GitHub runners (3 CPUs) that resolves to 2 workers,
leaving PHP3 permanently queued and deadlocking the test.

The cascade is worse than a single timeout: the `runCLI` server is
shared across the whole describe via `beforeAll`, and when the
multi-shared test times out its two workers are still spinning
inside PHP-level `while (file_get_contents(...) \!== ...)` loops —
vitest can abort the JS `await`, but there's no hook to unblock PHP.
The four two-script tests that follow in file order then inherit a
pool with zero free workers and each times out at 60 s.

Pin the suite to `workers: 3` so it's robust to CI host CPU counts.
Three is the minimum the multi-shared test needs; anything lower
re-introduces the cascade.

Follow-up to #3504.
@pento pento requested review from a team, Copilot and mho22 April 22, 2026 23:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Pins the Playground CLI file-locking test suite’s CLI worker pool size to prevent worker starvation deadlocks on low-CPU CI runners (notably macos-latest).

Changes:

  • Passes an explicit workers: 3 option to the shared runCLI() server created in beforeAll.
  • Documents why fewer than 3 workers leads to deadlock and cascading timeouts in subsequent tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@chubes4
Copy link
Copy Markdown

chubes4 commented Apr 23, 2026

Independently reproduced and verified the fix locally. Came at this from the other direction (looking at the remaining intermittent failure on #3494) and landed on the same root cause and the same fix.

Repro on macOS (M-series), same PR-branch-merged-with-trunk tree CI runs against:

workers result
workers: 2 (GitHub macos-latest default after #3504) same 5 tests hang at 60s each, in the same order as CI
workers: 3 (this PR) all 16 pass in ~9s

The cascade you describe matches what I see locally too — the multi shared locks test is the one that actually deadlocks (needs 3 concurrent workers), and the 4 tests that follow in file order inherit a pool with zero free workers because PHP is still spinning inside while (file_get_contents(...) !== ...) with no way for vitest to interrupt it.

Confirms this is independent of #3494; #3494 fixes a different class of concurrency bug (early worker release + dead-worker eviction). 👍 on shipping this.

@pento
Copy link
Copy Markdown
Member Author

pento commented Apr 23, 2026

While looking at the failing test on this PR (should be able to follow external symlinks in primary and secondary PHP instances), I noticed that test is also only flaky under 2 workers. It does become more reliable if pinned to 3 workers, but that doesn't address the underlying causes, which appears to be the same as #3494.

@mho22 mho22 merged commit 7d08b1f into trunk Apr 23, 2026
91 of 93 checks passed
@mho22 mho22 deleted the fix/file-locking-test-worker-count branch April 23, 2026 07:56
@mho22
Copy link
Copy Markdown
Collaborator

mho22 commented Apr 23, 2026

@pento Thanks again for your contribution. @chubes4 Sorry for the delay, I will focus on your pull request today.

@chubes4
Copy link
Copy Markdown

chubes4 commented Apr 23, 2026

@mho22 No problem! Thanks for your work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants