fix(v1,serve): evict rollout trajectories, discard delivered intercepts, trim worker arenas#1611
fix(v1,serve): evict rollout trajectories, discard delivered intercepts, trim worker arenas#1611hubert-marek wants to merge 1 commit into
Conversation
…ts, trim worker arenas Three env-worker memory fixes for long multi-turn rollouts: - Runtime.trajectories (the live-trajectory registry serving sub-runtime trajectory handles) was never evicted; the process-lifetime Runtime retained every completed rollout's full trajectory. Pop it in cleanup_rollout alongside the existing per-trajectory dict evictions. - InterceptionServer.intercepts retained every request's raw body (full message history, incl. base64 images on multimodal envs) until rollout unregister; forward_request now discards each intercept after delivery. The HTTP handler keeps its own reference; discard is idempotent; unregister still sweeps undelivered entries. - Env workers never returned freed arena pages to the OS; malloc_trim(0) at the existing 10s stats cadence (ctypes releases the GIL; gc.collect is deliberately avoided here - full collections on large heaps stall the loop past the worker heartbeat timeout). Measured on browser-env rollouts (45 turns, 4 concurrent per worker): worker peak RSS 1053MB -> 275MB, resting 990MB -> 155MB. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| import ctypes | ||
|
|
||
| libc = ctypes.CDLL("libc.so.6") | ||
| except OSError: | ||
| libc = None | ||
| while True: | ||
| await asyncio.sleep(interval) | ||
|
|
||
| if libc is not None: | ||
| libc.malloc_trim(0) |
There was a problem hiding this comment.
🟡 Medium server/env_worker.py:252
On musl-based systems like Alpine Linux, libc.so.6 can exist but malloc_trim is a glibc-specific function that musl does not export. Calling libc.malloc_trim(0) throws AttributeError when the symbol is missing, which terminates stats_loop and stops all stats collection for the rest of the worker's lifetime. Consider catching AttributeError alongside OSError when loading the symbol, or guard the call with a try/except.
try:
import ctypes
libc = ctypes.CDLL("libc.so.6")
+ libc.malloc_trim(0) # verify symbol exists
+ libc.malloc_trim.argtypes = [ctypes.c_int]
except (OSError, AttributeError):
libc = None
while True:
await asyncio.sleep(interval)
if libc is not None:
- libc.malloc_trim(0)
+ try:
+ libc.malloc_trim(0)
+ except Exception:
+ pass🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/serve/server/env_worker.py around lines 252-261:
On musl-based systems like Alpine Linux, `libc.so.6` can exist but `malloc_trim` is a glibc-specific function that musl does not export. Calling `libc.malloc_trim(0)` throws `AttributeError` when the symbol is missing, which terminates `stats_loop` and stops all stats collection for the rest of the worker's lifetime. Consider catching `AttributeError` alongside `OSError` when loading the symbol, or guard the call with a try/except.
Evidence trail:
verifiers/serve/server/env_worker.py lines 249-278 (REVIEWED_COMMIT): stats_loop method loads libc.so.6 catching only OSError (line 255), calls libc.malloc_trim(0) at line 261 without any exception handling. Line 289: stats_loop launched as asyncio.create_task. Python ctypes CDLL.__getattr__ raises AttributeError when dlsym fails to find a symbol. musl libc does not export malloc_trim (glibc-specific).
ApprovabilityVerdict: Needs human review 1 blocking correctness issue found. An unresolved review comment identifies a potential bug where the malloc_trim call could crash the stats_loop on musl-based systems (Alpine Linux) due to missing AttributeError handling. This substantive issue warrants human review before merging. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2449c2112d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| key = str(state["trajectory_id"]) | ||
| self._model_request_locks.pop(key, None) | ||
| self._inflight_visible_model_requests.pop(key, None) | ||
| self.trajectories.pop(key, None) |
There was a problem hiding this comment.
Delay trajectory eviction until group cleanup
When a grouped eval reaches group scoring, each harness.run(...) has already executed cleanup_rollout, so this pop removes the live trajectory before Env._run_group_states calls harness.score_group(...). Any group-stage handler/reward that starts a child task with state.for_task(..., transcript="append") will now fail in resolve_trajectory with No live trajectory registered, even though the state still retains its runtime handles until group cleanup. Keep trajectories for states with a group_key until cleanup_group, analogous to how model clients are retained across the group stage.
Useful? React with 👍 / 👎.
| else: | ||
| deliver_response(request, response, error) | ||
| finally: | ||
| endpoint.discard_request(request_id) |
There was a problem hiding this comment.
Do not discard undelivered streaming intercepts
For intercepted streaming requests whose protocol is not openai_chat_completions (for example /v1/messages or /v1/responses with stream: true), the branch above raises before anything is put on the stream queue or response future. This new unconditional discard then removes the intercept from InterceptionServer.intercepts, so unregister_rollout() can no longer put the EOF sentinel or cancel the future, leaving the HTTP handler waiting indefinitely instead of being unblocked during rollout cleanup.
Useful? React with 👍 / 👎.
| if libc is not None: | ||
| libc.malloc_trim(0) |
There was a problem hiding this comment.
Run malloc_trim away from the worker event loop
When workers are serving active rollouts, this synchronous ctypes call runs in stats_loop on the same asyncio thread that handles ZMQ requests and responses. Releasing the GIL only helps other threads; it does not let this event loop make progress, so a slow trim on a large fragmented heap delays all request handling and heartbeats every 10 seconds and can make the router restart an otherwise healthy worker. Move the trim to a background thread or otherwise bound/guard it.
Useful? React with 👍 / 👎.
Three env-worker memory fixes for long multi-turn rollouts, ports to
mainof fixes production-validated onfeat/ephemeral-mm-pixels(#1608, #1610) against worldsims browser-env training runs.1. Evict completed rollouts from
Runtime.trajectories(leak)prepare_state → resolve_trajectory → register_trajectorystores every rollout's live trajectory list inRuntime.trajectoriesso handle-borrowing sub-runtime states (e.g. in-sandbox program model calls resolving the owning rollout's trajectory) can look it up — but nothing ever removed the entry, and the Runtime lives for the env-worker process lifetime. Every completed rollout's full trajectory (per-step prompt copies + full-prefix token arrays) stayed referenced forever: ~50–400MB leaked per rollout on long browser envs, OOM-killing env pods within a handful of training steps. Now popped incleanup_rolloutalongside the existing per-trajectory dict evictions (_model_request_locks,_inflight_visible_model_requests);cleanup_rolloutruns in the harnessfinally, so error paths evict too. Borrowers' lifetimes are bounded by the owning rollout, so nothing can resolve the entry after cleanup. (listdoesn't support weakrefs, so theWeakValueDictionarypattern used by the runtime registry wasn't available here — manual eviction is required.)2. Discard delivered intercepts (dominant term: −74% worker peak RSS)
InterceptionServer.interceptsretained every request's raw body — the full message history, including base64 images on multimodal envs — until rollout unregister, so a long rollout held every turn's request simultaneously (renderer-side image offload rewrites a normalized copy, so the intercept's base64 never left).forward_requestnow discards the intercept after delivery viaEndpoint.discard_request: the HTTP handler keeps its own local reference so delivery is unaffected, discard is idempotent, and rollout-unregister still sweeps undelivered entries (matching the manual per-request pop the opencode envs already do).3.
malloc_trim(0)at the worker stats cadence (resting RSS ÷3)Env workers never returned freed arena pages to the OS, so RSS ratcheted to ~3× the live set under rollout churn. Trim every 10s in the existing stats loop via ctypes (the call releases the GIL — it cannot stall the event loop).
gc.collectis deliberately avoided here: full collections on large heaps block past the worker heartbeat timeout and get healthy workers killed.Measurement
Churn simulation (4 concurrent 45-turn browser rollouts per worker, real subprocess RSS, glibc/pymalloc):
Production lineage (worldsims env pin): without fix 1 the env pod OOMed at training step ~5 (~115GB); with fix 1 alone it climbed to ~90GB and still failed; with all of the above the pod holds flat.
py_compile+ruff checkclean; discard behavior unit-checked (local reference intact post-discard; idempotent).🤖 Generated with Claude Code
Note
Medium Risk
Changes rollout cleanup and endpoint request lifecycle timing; low logic risk but affects memory-sensitive training paths where regressions would show as missing trajectories or failed intercept lookups if cleanup ran too early (mitigated by discarding only after delivery).
Overview
Addresses env-worker memory growth during long, concurrent multi-turn rollouts with three targeted changes.
Runtime:
cleanup_rolloutnowpops the rollout’s entry fromRuntime.trajectories, alongside existing per-trajectory lock/inflight cleanup, so completed rollouts no longer keep full trajectory lists alive for the worker process lifetime.Endpoint interception: Adds
Endpoint.discard_requestand calls it fromforward_requestafter stream/non-stream delivery, soInterceptionServer.interceptsno longer retains raw request bodies (e.g. multimodal base64) for every turn until rollout unregister.Env worker: The 10s stats loop optionally invokes
malloc_trim(0)vialibc.so.6(skipped if unavailable) to return freed glibc arena pages and reduce resting RSS under rollout churn.Reviewed by Cursor Bugbot for commit 2449c21. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fix memory leaks by evicting rollout trajectories, discarding delivered intercepts, and trimming worker heap
Runtime.cleanup_rolloutnow removes the trajectory entry fromself.trajectorieswhen a rollout is cleaned up, preventing unbounded trajectory accumulation.forward_requestunconditionally removes the intercept fromEndpoint.server.interceptsafter forwarding, even if response delivery raises.EnvWorker.stats_loopcallslibc.malloc_trim(0)on glibc-based systems each stats interval to release free heap memory back to the OS.📊 Macroscope summarized 2449c21. 3 files reviewed, 0 issues evaluated, 0 issues filtered, 0 comments posted
🗂️ Filtered Issues
No issues evaluated.