fix(runner_supervisor): wrap blocking pipe/join ops in to_thread to prevent event loop stall #2107
…revent event loop stall
All synchronous cancel-pipe writes and runner_process.join/terminate/kill calls in the async shutdown path were blocking the asyncio event loop, causing 100%+ CPU and API outages when the MLX runner subprocess received SIGHUP under macOS memory pressure.
Changes:
- Add _join_runner / _terminate_runner / _kill_runner async helpers that offload each blocking call to a thread via to_thread.run_sync with abandon_on_cancel=True, so the event loop never stalls.
- Replace the synchronous _cancel_sender.send(CANCEL_ALL_TASKS) in the run() finally block with send_async wrapped in anyio.move_on_after(2.0) so a blocked cancel pipe gives up in ≤2 s rather than hanging forever.
- Add _sigterm_handler that SIGKILLs direct child PIDs before re-raising SIGTERM so orphaned python3 MLX-runner processes do not survive a kickstart.
- Use _runner_exitcode() helper (already present from exo#18 fix) in _check_runner instead of accessing runner_process.exitcode directly.
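The offloading idea in this commit can be illustrated with a small standard-library sketch. It is not the PR's code: the commit uses anyio's to_thread.run_sync(..., abandon_on_cancel=True) and anyio.move_on_after(2.0), whereas this sketch swaps in asyncio.to_thread and asyncio.wait_for as rough analogues so that it runs without extra dependencies. The names blocking_join, count_ticks, run_with and bounded_call are hypothetical, and time.sleep stands in for a call such as runner_process.join(timeout=N).

```python
import asyncio
import time


def blocking_join(seconds: float) -> str:
    """Hypothetical stand-in for runner_process.join(timeout=N)."""
    time.sleep(seconds)  # blocks whichever thread calls it
    return "joined"


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how many times the event loop manages to schedule this task."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run_with(offload: bool) -> int:
    """Run the blocking call inline or in a worker thread; return loop ticks."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker one chance to start
    if offload:
        await asyncio.to_thread(blocking_join, 0.3)  # loop keeps running
    else:
        blocking_join(0.3)  # loop is stalled for the whole call
    stop.set()
    return await ticker


async def bounded_call(seconds: float, timeout: float) -> str:
    """Give up waiting after `timeout` seconds instead of hanging forever."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(blocking_join, seconds), timeout
        )
    except asyncio.TimeoutError:
        # The worker thread is left to finish on its own; only the wait ends.
        return "gave up"


if __name__ == "__main__":
    print("ticks while blocked inline:", asyncio.run(run_with(offload=False)))
    print("ticks while offloaded:", asyncio.run(run_with(offload=True)))
    print(asyncio.run(bounded_call(0.3, timeout=0.05)))
```

The inline variant should report far fewer ticks than the offloaded one, which is the stall the commit describes. Exact counts vary by machine.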
…runner crashes
- Add _shutdown_requested flag to distinguish intentional shutdown from crash
- Modify shutdown() to set the flag before cancelling tasks
- Add _reset_for_restart() to recreate channels and mp.Process for a fresh start while keeping _event_sender alive so the rest of exo stays connected
- Wrap run() body in while-True restart loop with exponential back-off (2s, 4s, 8s … capped at 60s) and MAX_RESTARTS=10 hard limit
- On intentional shutdown the loop breaks immediately after cleanup
Fixes exo#22 part 2.
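The back-off schedule named in this commit can be sketched as a small generator. This is an illustration only: the commit message gives the sequence (2s, 4s, 8s, capped at 60s) and MAX_RESTARTS=10, but not the actual implementation, so plain doubling from a 2-second base is an assumption, and restart_delays, BASE_DELAY_S and MAX_DELAY_S are hypothetical names.

```python
from typing import Iterator

MAX_RESTARTS = 10    # hard limit named in the commit message
BASE_DELAY_S = 2.0   # assumed from the "2s, 4s, 8s" sequence
MAX_DELAY_S = 60.0   # cap named in the commit message


def restart_delays(
    max_restarts: int = MAX_RESTARTS,
    base: float = BASE_DELAY_S,
    cap: float = MAX_DELAY_S,
) -> Iterator[float]:
    """Yield the delay before each restart attempt: base * 2**n, capped."""
    for attempt in range(max_restarts):
        yield min(base * (2 ** attempt), cap)


if __name__ == "__main__":
    print(list(restart_delays()))
```

Under these assumptions the cap is reached on the sixth attempt (2 × 2^5 = 64 > 60), so the remaining attempts all wait 60 seconds.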
…d — use _cancel_tg() so restart loop fires on runner crash
Review: REQUEST CHANGES BLOCKER —
Blocker addressed: fix pushed (5944cb1)
The duplicate
    def _runner_is_alive(self) -> bool:
        try:
            return self.runner_process.is_alive()
        except ValueError:
            return False
Verified:
Blocker resolved. Single
sorry is this a concrete issue you've actually run into?
Problem
When the MLX runner subprocess receives SIGHUP (macOS memory pressure), the RunnerSupervisor hits a 'cancel pipe blocked' condition. At that point it calls _check_runner, which triggers the run() finally block. That block contained several synchronous blocking calls directly in the async coroutine:
- self._cancel_sender.send(CANCEL_ALL_TASKS): synchronous mp.Queue.put(block=True) with no timeout
- self.runner_process.terminate(): synchronous, can block
- self.runner_process.join(timeout=N): multiple calls, each blocking the thread for up to N seconds
- self.runner_process.kill(): same
These calls stall the asyncio event loop, which causes the main python3 process to spin at 100%+ CPU and the API to stop accepting connections.
Fix
Primary (event loop protection):
- Add async helpers (_join_runner, _terminate_runner, _kill_runner) that wrap each blocking mp.Process call in to_thread.run_sync(..., abandon_on_cancel=True). The event loop is never blocked; if the supervisor is being cancelled, the thread is abandoned.
- Replace the synchronous _cancel_sender.send(CANCEL_ALL_TASKS) in the run() finally block with send_async wrapped in anyio.move_on_after(2.0). A blocked cancel pipe now gives up in ≤2 s instead of hanging forever.
- Use the _runner_exitcode() helper (from exo#18) in _check_runner instead of accessing runner_process.exitcode directly.
Secondary (orphan process cleanup):
- Add a _sigterm_handler that SIGKILLs all direct child PIDs (via pgrep -P) before re-raising SIGTERM. Installed at import time via signal.signal(signal.SIGTERM, ...). This ensures orphaned python3 MLX-runner processes do not survive a kickstart restart.
Testing
Syntax verified with python3 -m py_compile on the patched file.
Related
Builds on top of the exo#18 fix (_runner_is_alive / _runner_exitcode guards).
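For context on those guards, the sketch below shows one situation in which multiprocessing.Process.is_alive() raises ValueError: the process object has already been closed. Whether that is the exact case the exo#18 fix targets is an assumption. runner_is_alive mirrors the method quoted in the review thread as a free function, while runner_exitcode is a guess at the companion guard, whose body does not appear in this PR.

```python
import multiprocessing as mp
from typing import Optional


def runner_is_alive(process: mp.Process) -> bool:
    """Free-function version of the guard quoted in the review thread."""
    try:
        return process.is_alive()
    except ValueError:  # raised once the process object has been closed
        return False


def runner_exitcode(process: mp.Process) -> Optional[int]:
    """Hypothetical companion guard; the real helper's body is not shown."""
    try:
        return process.exitcode
    except ValueError:
        return None


if __name__ == "__main__":
    proc = mp.Process(target=print)  # never started, so nothing is spawned
    proc.close()                     # mark the process object as closed
    print(runner_is_alive(proc), runner_exitcode(proc))
```

Treating a closed process object as not alive lets shutdown and restart code query the runner's state without having to track separately whether close() has already run.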