Fix intermittent 500 errors from worker pool concurrency issues #3494
chubes4 wants to merge 11 commits into WordPress:trunk from fix/pool-proxy-streamed-response-lifecycle
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes intermittent HTTP 500s under concurrent load by aligning worker-pool acquisition/release with the full request lifecycle and evicting crashed workers from the pool.
Changes:
- Updates the server `handleRequest` contract to stream responses via a callback that remains in scope until streaming completes.
- Switches CLI request handling away from early-releasing streaming calls and removes dead worker proxies from the pool on worker exit.
- Adds opt-in debug logging and introduces pool eviction (`__removeInstance`) to prevent routing requests to crashed workers.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/playground/cli/tests/start-server.spec.ts | Updates tests to match the new handleRequest(request, streamResponse) callback signature. |
| packages/playground/cli/src/start-server.ts | Changes request handling to pass a streamResponse callback and adds opt-in HTTP debug logs. |
| packages/playground/cli/src/run-cli.ts | Evicts dead workers from the pool on exit and switches to buffered request() to avoid early worker release. |
| packages/php-wasm/universal/src/lib/single-php-instance-manager.ts | Adds opt-in debug logging (with stack traces) for double-acquire conflicts. |
| packages/php-wasm/universal/src/lib/object-pool-proxy.ts | Adds pool eviction (__removeInstance) and opt-in debug logging around acquire/release/queue behavior. |
```ts
function removeInstance(instance: unknown): void {
	const inst = instance as T;

	// Remove from the canonical list
	const idx = instances.indexOf(inst);
	if (idx !== -1) {
		instances.splice(idx, 1);
	}

	// Remove from the free list if it's currently free
	const freeIdx = freeInstances.indexOf(inst);
	if (freeIdx !== -1) {
		freeInstances.splice(freeIdx, 1);
		log(
			`removeInstance: removed free instance (free=${freeInstances.length})`
		);
	} else {
		// Instance is currently busy — mark it for removal on release
		removedInstances.add(inst);
		log(`removeInstance: marked busy instance for removal`);
	}
}

function acquire(): Promise<T> {
```
removeInstance() can cause queued acquisitions to hang indefinitely when it’s called for a busy instance and that instance is later discarded on release(). In that path (discarded removed instance), the function returns early without waking a queued waiter, so if the removed instance was the last remaining capacity (or removal reduces capacity below demand), callers can deadlock awaiting acquire(). A concrete fix is to redesign the wait queue to support rejection (store {resolve, reject}), and when capacity is removed (especially when it reaches 0) fail queued requests with a clear error; alternatively, on discarding a removed instance, immediately attempt to satisfy the next waiter using any remaining free instance (if present) and/or keep the instance in the canonical count until it’s actually released.
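The reject-capable wait queue this comment proposes can be sketched as follows. This is a minimal illustration under the comment's assumptions, not Playground's actual pool: `TinyPool`, `Waiter`, and the error messages are all invented for the example.

```typescript
// Hypothetical sketch: a pool whose queued acquire() calls can be
// failed with an error when eviction drops capacity to zero, instead
// of hanging forever. Names are illustrative, not Playground's API.
type Waiter<T> = {
	resolve: (instance: T) => void;
	reject: (error: Error) => void;
};

class TinyPool<T> {
	private all: T[];
	private free: T[];
	private waiters: Waiter<T>[] = [];

	constructor(instances: T[]) {
		// Copy so evictions never mutate the caller's array.
		this.all = [...instances];
		this.free = [...instances];
	}

	acquire(): Promise<T> {
		if (this.free.length > 0) {
			return Promise.resolve(this.free.shift()!);
		}
		if (this.all.length === 0) {
			// Fail fast against an already-drained pool.
			return Promise.reject(new Error('Pool is empty'));
		}
		return new Promise((resolve, reject) => {
			this.waiters.push({ resolve, reject });
		});
	}

	release(instance: T): void {
		if (!this.all.includes(instance)) {
			return; // Evicted while busy: discard instead of re-pooling.
		}
		const waiter = this.waiters.shift();
		if (waiter) {
			waiter.resolve(instance);
		} else {
			this.free.push(instance);
		}
	}

	removeInstance(instance: T): void {
		this.all = this.all.filter((i) => i !== instance);
		this.free = this.free.filter((i) => i !== instance);
		// If capacity hit zero, fail queued waiters with a clear error
		// rather than letting them await a release that never comes.
		if (this.all.length === 0) {
			for (const waiter of this.waiters.splice(0)) {
				waiter.reject(new Error('Pool drained: all instances removed'));
			}
		}
	}
}
```

When the evicted instance was busy but capacity remains above zero, queued waiters are left queued on purpose: a surviving instance will eventually be released and wake them.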
```diff
 export function createObjectPoolProxy<T extends object>(
-	instances: T[]
+	instances: T[],
+	debugLabel?: string
 ): Pooled<T> {
```
createObjectPoolProxy() now mutates the caller-provided instances array (via instances.splice(...)) when evicting. That’s a behavioral change from typical pool constructors and can create surprising side effects if the caller retains and uses that array elsewhere. Consider copying into an internal allInstances array and mutating that instead (or explicitly document that the input array is treated as owned and will be mutated).
```ts
const proxyTarget = {} as Pooled<T>;
// Expose __removeInstance directly on the target so it's found
// by the `prop in _target` check before hitting the proxy trap.
(proxyTarget as any).__removeInstance = removeInstance;

return new Proxy(proxyTarget, {
```
Adding a stringly-named control method (__removeInstance) onto a generic Pooled<T> can mask a real __removeInstance property/method on T (if one exists) since the proxy will return the pool control function instead of forwarding to the underlying instance. To avoid potential name collisions, prefer a Symbol-keyed control API (or a separate returned control object) so proxy consumers can’t accidentally shadow/override underlying instance members.
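A Symbol-keyed variant of the control API can be sketched like this. The `poolControls` symbol and `withControls` helper are hypothetical names for illustration, not part of Playground's actual object-pool-proxy module.

```typescript
// Sketch: a unique symbol as the pool-control key, so the control API
// can never shadow a string-named member of the underlying instance.
const poolControls = Symbol('poolControls');

interface PoolControls<T> {
	removeInstance(instance: T): void;
}

type Controlled<T extends object> = T & {
	[poolControls]: PoolControls<T>;
};

function withControls<T extends object>(
	target: T,
	controls: PoolControls<T>
): Controlled<T> {
	return new Proxy(target, {
		get(t, prop, receiver) {
			// A unique symbol cannot collide with a string property,
			// so a real `__removeInstance` method on T stays reachable.
			if (prop === poolControls) {
				return controls;
			}
			return Reflect.get(t, prop, receiver);
		},
	}) as Controlled<T>;
}
```

Callers would then write `pooled[poolControls].removeInstance(worker)`, and only code holding the exported symbol can reach the control surface.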
```diff
 	}

-	return response;
+	await streamResponse(StreamedPHPResponse.fromPHPResponse(response));
```
Switching from requestStreamed() to buffered request() changes runtime behavior in a way that can materially impact production loads: large downloads/responses will be fully buffered in memory and the client won’t receive any bytes until PHP completes, which can increase memory pressure and latency (and can break long-lived streaming semantics like SSE). Since this PR already introduces a callback-based streaming contract in start-server.ts, consider addressing the lifecycle mismatch by keeping streaming but ensuring the worker remains acquired until the stream finishes (e.g., have the pool/proxy hold the instance until handleStreamedResponse completes, or provide a pool method that ties instance release to stream completion) so you retain streaming benefits without reintroducing early release.
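The alternative the comment suggests, releasing only after the caller's whole unit of work completes, can be sketched with a scope-based helper. This is a minimal illustration with invented names (`Pool`, `withInstance`), not Playground's implementation.

```typescript
// Sketch: tie instance release to the completion of a caller-supplied
// callback, so a streamed response keeps its worker until fully drained.
interface Pool<T> {
	acquire(): Promise<T>;
	release(instance: T): void;
}

async function withInstance<T, R>(
	pool: Pool<T>,
	work: (instance: T) => Promise<R>
): Promise<R> {
	const instance = await pool.acquire();
	try {
		// Release happens when `work` settles, not when the first inner
		// promise (e.g. "headers available") resolves.
		return await work(instance);
	} finally {
		pool.release(instance);
	}
}
```

A streamed request would then run `requestStreamed()` and the full stream drain inside the callback, preserving streaming semantics without early release.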
Force-pushed 11a145e to 5bfb7fd.
Vendors tarballs of WordPress/wordpress-playground#3494 under vendor/ and references them via a direct dependency (@wp-playground/cli) and an npm override (@php-wasm/universal). Without this patch the object-pool proxy returns a worker to the free list before its streamed response finishes; concurrent e2e requests then land on a mid-response worker and WordPress returns 500. Released builds (3.1.20) still have the bug. Verified: all 19 e2e tests pass locally with 4 Playwright workers (3.1m). Revisit once the upstream PR lands — drop vendor/ and bump to the next released version.
Only commenting on the functionality: I've tested this PR on my own repo, and I can confirm that it makes e2e test runs (i.e., tests that hammer the server with a heap of simultaneous connections) much more reliable.
I'll leave the code review to a member of @WordPress/playground-maintainers, but 👍🏻 on the functionality of the fix!
@brandonpayton This pull request may fix the recurring test failure in …
Hmm, it looks like this PR doesn't fix that intermittent failure, unfortunately.
@chubes4 It looks like there is no mention of … This seems weird since @pento explicitly merged …
Two related bugs caused intermittent HTTP 500 errors under normal concurrent request loads (e.g. a browser page load firing parallel requests for HTML, CSS, JS, and REST API).

1. Pool proxy released workers before PHP finished: handleRequest called playgroundPool.requestStreamed() through the pool proxy. The proxy released the worker when the promise resolved (streaming started), but the worker's SinglePHPInstanceManager held the PHP instance until the stream finished. A subsequent request routed to that 'free' worker hit isAcquired === true and returned HTTP 500. Fix: Switch to buffered request(), which holds the worker until PHP completes. Change start-server.ts to a callback pattern so the full request-to-response lifecycle stays within the pool proxy scope.

2. Crashed workers were never removed from the pool: When a worker thread exited unexpectedly (e.g. EADDRINUSE), its API proxy remained in the pool. Requests routed to the dead proxy would fail and be released back, creating an infinite failure cycle. Fix: Add __removeInstance() to createObjectPoolProxy so callers can evict dead instances. Wire up the onExit handler in run-cli.ts to remove dead workers from the pool.

Fixes WordPress#3492

## AI disclosure

Per WordPress AI Guidelines (https://make.wordpress.org/ai/handbook/ai-guidelines/):

AI assistance: Yes
Tool: Claude Code (Claude Opus 4.6)
Used for: Root cause analysis of the pool proxy lifecycle mismatch, implementing the fix (buffered request approach, callback pattern, __removeInstance for dead worker eviction), updating tests, and drafting the commit message. All code reviewed, tested, and validated by the author against a running Studio site with concurrent request benchmarks.
…tStreamed()

The handleRequest code was changed to call playgroundPool.request() instead of playgroundPool.requestStreamed(). The test still stubbed .requestStreamed, so the error was never thrown and WordPress returned a 404 instead of 500.
Force-pushed 0411b39 to 9022794.
@mho22 interesting, now a different test is failing.
…er's array

Two related fixes to createObjectPoolProxy:

1. Deadlock when pool drains to empty. If every instance is removed (e.g. all workers crashed) while a caller is awaiting acquire(), the promise would previously hang forever — no instance will ever be released to wake the queue. Now: removeInstance() fails all queued waiters with a clear error when capacity drops to 0, and acquire() fails fast against an already-empty pool instead of enqueueing. Waiter queue entries now carry both resolve and reject to support this.

2. Caller array mutation. removeInstance() previously called instances.splice() on the caller-provided array, producing side effects visible outside the pool. Now the pool copies the input into an internally-owned allInstances array at construction and only ever mutates that copy.

Three new tests cover the new behaviors (mid-flight drain rejects queued waiters; post-drain acquire fails fast; caller's array is left intact after eviction).
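The second fix, defensive copying so eviction never mutates the caller's array, can be shown in a few lines. `createPool` and `evict` are illustrative names for this sketch, not the actual module's API.

```typescript
// Sketch: copy the caller-provided array at construction and only ever
// splice the internally-owned copy, so eviction has no external side effects.
function createPool<T>(instances: T[]) {
	const allInstances = [...instances]; // internally owned copy

	function evict(instance: T): void {
		const idx = allInstances.indexOf(instance);
		if (idx !== -1) {
			allInstances.splice(idx, 1);
		}
	}

	return { evict, size: () => allInstances.length };
}
```

The caller can keep using its original array for unrelated bookkeeping without ever observing pool-internal removals.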
'The PHP instance already acquired' → 'The PHP instance is already acquired'. This message surfaces during concurrency failures so tightening it up makes logs easier to parse.
The 500-under-concurrency fix (a2285ec) switched handleRequest from requestStreamed() to buffered request() because the pool proxy released workers when the outer promise resolved — which, for streaming, was when the response *started* streaming (headers available), not when it finished. Concurrent requests could then be routed to a worker still draining its previous body, producing 'PHP instance already acquired' errors. Buffering fixes concurrency but regresses streaming: large bodies are held entirely in memory and SSE flows become request/response.

Expose the existing internal withInstance helper on the pool proxy as __withInstance, then scope acquire/release around the full stream drain in run-cli.ts:

```ts
await playgroundPool.__withInstance(async (instance) => {
	const response = await instance.requestStreamed(request);
	// ...cookie handling...
	await streamResponse(response);
	// Instance released only after streamResponse fully drains.
});
```

This ties the worker lifecycle to the HTTP response lifecycle: the pool holds the instance for the entire duration of a request, exactly as the buffered fix did, while the body still streams directly to the client without intermediate buffering.

Tests: 4 new spec cases cover callback-held release semantics (pre-release concurrent calls must queue), sync/async error release, and raw-instance identity. Existing 17 tests unchanged.

The Pooled&lt;PlaygroundCliWorker&gt; / Pooled&lt;RemoteAPI&lt;PlaygroundCliWorker&gt;&gt; type divergence was previously papered over by the Promisified&lt;T&gt; method-proxy shape; __withInstance's contravariant callback parameter makes it observable. Localized unknown casts keep public signatures unchanged — a broader refactor of the playgroundPool typing is out of scope for this PR.
The pool is constructed from RemoteAPI<PlaygroundCliWorker> instances
(comlink proxies over MessagePort), but was declared as
Pooled<PlaygroundCliWorker>. Before __withInstance, the two shapes
were structurally compatible because Promisified<T> only exposed method
signatures — and RemoteAPI<T> mirrors T's method surface with
promisified returns. __withInstance's callback parameter puts T in a
contravariant position, which makes the structural difference
observable.
The previous commit resolved this with two localized casts at the
assignment site and inside the __withInstance callback. This commit
replaces both with honest local typing: playgroundPool, the
RunCLIServer.playground public shape, zipSite(), and both
BlueprintsV{1,2}Handler.bootWordPress() signatures all move to
Pooled<RemoteAPI<PlaygroundCliWorker>>. The hot path
(handleRequest + __withInstance callback) is now cast-free, and future
lifecycle-hook additions to Pooled<T> will type-check without per-site
widening.
The casts don't disappear entirely — they migrate outward to the
cross-package boundary with @wp-playground/blueprints and xdebug-bridge,
which expect UniversalPHP (= LimitedPHPApi | Remote<...> | Pooled<...>).
UniversalPHP's Pooled<LimitedPHPApi> arm shares the same underlying
lie (the pool never holds raw LimitedPHPApi, only remote proxies to it),
but widening it is a public-type change across ~100 call sites in
blueprints, wordpress, and other packages — worth a dedicated PR rather
than folding into a streaming-fix. A follow-up issue will track the
UniversalPHP cleanup.
Net: same number of casts, but they now live at the honest
cross-package boundary instead of inside this PR's own additions.
The error-handling test (tests/run-cli.spec.ts: 'should return 500 when the request handler throws an error') stubs the entry point that handleRequest calls into, relying on the stub to throw and the Express error path to surface a 500. The test has now tracked three call sites:

- Original streaming path → stubbed .requestStreamed()
- Buffered fix (a2285ec) → stubbed .request() (commit 9022794)
- Lifecycle-scoped streaming revival (this PR) → now stubs .__withInstance()

handleRequest no longer calls .request() or .requestStreamed() on the pool proxy directly; both are invoked on the raw instance inside a .__withInstance() callback. Stubbing .__withInstance() is the equivalent seam — it intercepts the entire acquire/invoke/release cycle and fails fast, which is what the test wants.

CI signal for this change: test-playground-cli (macos, ubuntu) reported 'expected 404 to be 500' on the previous commit because the stale .request() stub was being ignored.
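The seam described in that commit can be illustrated with plain fakes, independent of any test framework. `FakePool` and the simplified `handleRequest` below are stand-ins for the real CLI types; only the `__withInstance` name comes from the PR.

```typescript
// Sketch: stub __withInstance as the single error seam for the handler.
interface FakePool {
	__withInstance(fn: (instance: unknown) => Promise<void>): Promise<void>;
}

async function handleRequest(
	pool: FakePool,
	streamResponse: (body: string) => Promise<void>
): Promise<number> {
	try {
		// The whole acquire/invoke/release cycle lives behind this call,
		// so stubbing it intercepts the request handler in one place.
		await pool.__withInstance(async () => {
			await streamResponse('ok');
		});
		return 200;
	} catch {
		// Express-style error path: surface a 500.
		return 500;
	}
}
```

A stub whose `__withInstance` rejects exercises the error path directly; a pass-through stub exercises the success path.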
Hi @chubes4, thanks for noticing these issues and working on this PR!
This was a big oversight on my part with the multi-worker changes. The PHP instance streaming a response is busy until it is done streaming, and using a busy instance breaks things. At first glance, I feel like we might be able to address this issue with fewer changes. Maybe not, but I'll try to sketch something in the morning and review this PR in more detail.
It would be good to be able to recover in the event of a worker crash. Since this is a separate issue from the overused PHP workers, let's move these fixes to a separate PR. One thing I'm concerned about here is finding and fixing the reasons for crashed workers. Workers shouldn't be crashing at all. And if workers crash on startup, perhaps we should just fail startup altogether.
When was this happening? AFAIK, the HTTP listener is currently owned by the main thread. Maybe other listeners have been added that I'm unaware of 🤔 Will look... Were there other reasons you were seeing workers exiting unexpectedly?
Strip __removeInstance and the onExit→eviction wiring. The eviction code was defensive against a hypothetical case where a CLI worker thread exits unexpectedly post-startup — but that was never observed in practice, and the EADDRINUSE example in the original PR body was pattern-matched from an unrelated @php-wasm/node networking-proxy leak that doesn't touch the CLI worker pool.

Per maintainer feedback (brandonpayton on WordPress#3494): worker crash recovery is a separate concern worth its own PR backed by real crash evidence, and 'fail startup altogether' may be the better answer for the startup case anyway. This PR now only addresses the observed pool-release-timing bug via __withInstance.

Also removes the pool-drain/capacity-zero bookkeeping that only existed as a consequence of eviction: the removedInstances Set, the release() early-discard path, the acquire() fast-fail on empty pool, and the {resolve, reject} waiter shape (no caller rejects anymore). The internal allInstances copy is also gone — nothing mutates the caller's array now that eviction is out, so the defensive copy is unnecessary.

Removes 3 eviction-related tests from the spec. The 4 __withInstance tests (added in edbfee1) stay — they cover the streaming lifecycle fix directly.
Thanks for the review, @brandonpayton! You're right that the eviction was defensive programming. Admittedly, Claude picked up a TODO: comment in the existing code and ran with it, and it slipped past me. I've updated the PR to drop that, as there were no worker crashes during extensive stress testing. The overall diff is much smaller now and I've hammered the site to ensure the fix still stands. All tests are now passing 💯
…eamed-response-lifecycle
…-lifecycle' into fix/pool-proxy-streamed-response-lifecycle
Summary
Fixes intermittent HTTP 500 errors that occur under multi-client concurrent load — most visibly when a user has several WordPress admin tabs open in Studio and refreshes them simultaneously.
The root cause is a pool proxy lifecycle mismatch: `createObjectPoolProxy`'s `withInstance` releases a worker back to the pool as soon as the proxied method's outer promise resolves. For `requestStreamed()` that's when the `StreamedPHPResponse` object is returned (headers available, body still streaming), not when the body is fully drained. But the worker's `SinglePHPInstanceManager` doesn't release the PHP instance until `response.finished` fires. During that window the pool proxy treats the worker as free and hands it to a concurrent request — which hits `isAcquired === true` and surfaces as HTTP 500.

Approach
Expose a new `__withInstance(fn)` method on `Pooled<T>` that acquires an instance, invokes a caller-supplied callback with it, and releases only after the callback promise resolves (or throws). Callers whose logical unit of work outlives a single proxied method call — streamed responses being the canonical case — use this to bind the worker lifecycle to the full request lifecycle instead of the first promise to resolve.

In `run-cli.ts`, `handleRequest` now wraps `requestStreamed() + streamResponse()` inside a `__withInstance` callback. The worker is held for the entire time the response body is streaming to the HTTP client, eliminating the race window. `start-server.ts`'s `handleRequest` signature changes from `(request) => Promise<StreamedPHPResponse>` to `(request, streamResponse) => Promise<void>` to carry the stream-drain lifecycle back into the handler.

No behavioral changes outside of the CLI request path. Streaming semantics are preserved — large bodies and SSE flows still stream progressively, they're just bound to one worker for their duration.
Changes
- `object-pool-proxy.ts`: Add `__withInstance<R>(fn): Promise<R>` to `Pooled<T>` and `createObjectPoolProxy`. The method is exposed directly on the proxy target so it's found by the `prop in _target` check before hitting the generic proxy trap.
- `run-cli.ts`: Switch `handleRequest` to use `playgroundPool.__withInstance(async (instance) => { ... instance.requestStreamed(request); await streamResponse(response); })`. Also retypes `playgroundPool` from `Pooled<PlaygroundCliWorker>` to `Pooled<RemoteAPI<PlaygroundCliWorker>>` to honestly reflect that the pool holds comlink remote proxies, not raw workers. This moves existing implicit type-laundering to the honest cross-package boundary.
- `start-server.ts`: `handleRequest` signature changes to a streaming-callback pattern so the full request→stream→drain lifecycle stays within the pool proxy's `__withInstance` scope.
- `single-php-instance-manager.ts`: Fix grammar in the "already acquired" error message.
- Tests: new spec cases for `__withInstance` (callback-held release, sync/async throw release, raw-instance identity). Updated the `run-cli.spec.ts` error-handling stub to target `__withInstance`. Updated `start-server.spec.ts` for the new callback signature.
Unit tests
All pass.
Multi-client reproducer (how the bug manifests in practice)
The bug surfaces most reliably when multiple independent clients hit the server simultaneously — e.g. several browser admin tabs refreshing at once. Each "tab" is its own shell process with its own cookie jar; each fires a 10-URL WP admin asset fanout:
Script: `multi-tab-stress.sh`

Tested against a Studio site (http://localhost:8881), Playground built from this branch vs. stock `upstream/trunk`, both installed via self-hosted tarballs and loaded identically by Studio.

Results:
Totals: this PR 0 / 3900 (0%). Stock trunk 227 / 3900 (5.8%).
Same workload against both branches: this PR consistently zero, trunk consistently 5-7% failures.
Additional stress (this PR only, ensures robustness)
PR scope note
An earlier version of this PR also included dead-worker eviction (`__removeInstance` + an `onExit` handler in `run-cli.ts`) in response to an `@TODO: Should we respawn the worker if it exited with an error` comment. Per @brandonpayton's review, that's orthogonal to the pool-release-timing bug and belongs in a separate PR backed by evidence of real worker crashes. I removed the eviction code in 64a8f02 — ~25 minutes of sustained multi-pattern stress showed zero worker exit events, so the eviction path never triggered on this workload. If worker-process crash recovery is wanted later, it should be designed from a real repro (or, per Brandon's suggestion, via "fail startup altogether" for the startup case).

This PR is now scoped to the one observed bug: the pool-release-timing race, fixed with `__withInstance`.
AI assistance
The `__withInstance` callback API in particular came out of re-reading Copilot's note about buffered vs. streamed semantics; AI assistance also covered the multi-client stress harness and drafting this PR description. Every commit reviewed, tested, and validated by the author.