Skip to content

fix(v1,serve): evict rollout trajectories, discard delivered intercepts, trim worker arenas#1611

Draft
hubert-marek wants to merge 1 commit into
mainfrom
fix/intercept-discard-worker-trim
Draft

fix(v1,serve): evict rollout trajectories, discard delivered intercepts, trim worker arenas#1611
hubert-marek wants to merge 1 commit into
mainfrom
fix/intercept-discard-worker-trim

Conversation

@hubert-marek

@hubert-marek hubert-marek commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Three env-worker memory fixes for long multi-turn rollouts, ports to main of fixes production-validated on feat/ephemeral-mm-pixels (#1608, #1610) against worldsims browser-env training runs.

1. Evict completed rollouts from Runtime.trajectories (leak)

prepare_state → resolve_trajectory → register_trajectory stores every rollout's live trajectory list in Runtime.trajectories so handle-borrowing sub-runtime states (e.g. in-sandbox program model calls resolving the owning rollout's trajectory) can look it up — but nothing ever removed the entry, and the Runtime lives for the env-worker process lifetime. Every completed rollout's full trajectory (per-step prompt copies + full-prefix token arrays) stayed referenced forever: ~50–400MB leaked per rollout on long browser envs, OOM-killing env pods within a handful of training steps. Now popped in cleanup_rollout alongside the existing per-trajectory dict evictions (_model_request_locks, _inflight_visible_model_requests); cleanup_rollout runs in the harness finally, so error paths evict too. Borrowers' lifetimes are bounded by the owning rollout, so nothing can resolve the entry after cleanup. (list doesn't support weakrefs, so the WeakValueDictionary pattern used by the runtime registry wasn't available here — manual eviction is required.)

2. Discard delivered intercepts (dominant term: −74% worker peak RSS)

InterceptionServer.intercepts retained every request's raw body — the full message history, including base64 images on multimodal envs — until rollout unregister, so a long rollout held every turn's request simultaneously (renderer-side image offload rewrites a normalized copy, so the intercept's base64 never left). forward_request now discards the intercept after delivery via Endpoint.discard_request: the HTTP handler keeps its own local reference so delivery is unaffected, discard is idempotent, and rollout-unregister still sweeps undelivered entries (matching the manual per-request pop the opencode envs already do).

3. malloc_trim(0) at the worker stats cadence (resting RSS ÷3)

Env workers never returned freed arena pages to the OS, so RSS ratcheted to ~3× the live set under rollout churn. Trim every 10s in the existing stats loop via ctypes (the call releases the GIL — it cannot stall the event loop). gc.collect is deliberately avoided here: full collections on large heaps block past the worker heartbeat timeout and get healthy workers killed.

Measurement

Churn simulation (4 concurrent 45-turn browser rollouts per worker, real subprocess RSS, glibc/pymalloc):

scenario worker peak worker resting
baseline 1053 MB 943–988 MB
+ intercept discard 275 MB 179–218 MB
+ worker trim 275 MB 119–155 MB

Production lineage (worldsims env pin): without fix 1 the env pod OOMed at training step ~5 (~115GB); with fix 1 alone it climbed to ~90GB and still failed; with all of the above the pod holds flat. py_compile + ruff check clean; discard behavior unit-checked (local reference intact post-discard; idempotent).

🤖 Generated with Claude Code


Note

Medium Risk
Changes rollout cleanup and endpoint request lifecycle timing; low logic risk but affects memory-sensitive training paths where regressions would show as missing trajectories or failed intercept lookups if cleanup ran too early (mitigated by discarding only after delivery).

Overview
Addresses env-worker memory growth during long, concurrent multi-turn rollouts with three targeted changes.

Runtime: cleanup_rollout now pops the rollout’s entry from Runtime.trajectories, alongside existing per-trajectory lock/inflight cleanup, so completed rollouts no longer keep full trajectory lists alive for the worker process lifetime.

Endpoint interception: Adds Endpoint.discard_request and calls it from forward_request after stream/non-stream delivery, so InterceptionServer.intercepts no longer retains raw request bodies (e.g. multimodal base64) for every turn until rollout unregister.

Env worker: The 10s stats loop optionally invokes malloc_trim(0) via libc.so.6 (skipped if unavailable) to return freed glibc arena pages and reduce resting RSS under rollout churn.

Reviewed by Cursor Bugbot for commit 2449c21. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Fix memory leaks by evicting rollout trajectories, discarding delivered intercepts, and trimming worker heap

  • Runtime.cleanup_rollout now removes the trajectory entry from self.trajectories when a rollout is cleaned up, preventing unbounded trajectory accumulation.
  • forward_request unconditionally removes the intercept from Endpoint.server.intercepts after forwarding, even if response delivery raises.
  • EnvWorker.stats_loop calls libc.malloc_trim(0) on glibc-based systems each stats interval to release free heap memory back to the OS.
📊 Macroscope summarized 2449c21. 3 files reviewed, 0 issues evaluated, 0 issues filtered, 0 comments posted

🗂️ Filtered Issues

No issues evaluated.

…ts, trim worker arenas

Three env-worker memory fixes for long multi-turn rollouts:

- Runtime.trajectories (the live-trajectory registry serving sub-runtime
  trajectory handles) was never evicted; the process-lifetime Runtime
  retained every completed rollout's full trajectory. Pop it in
  cleanup_rollout alongside the existing per-trajectory dict evictions.
- InterceptionServer.intercepts retained every request's raw body (full
  message history, incl. base64 images on multimodal envs) until rollout
  unregister; forward_request now discards each intercept after delivery.
  The HTTP handler keeps its own reference; discard is idempotent;
  unregister still sweeps undelivered entries.
- Env workers never returned freed arena pages to the OS; malloc_trim(0)
  at the existing 10s stats cadence (ctypes releases the GIL; gc.collect
  is deliberately avoided here - full collections on large heaps stall
  the loop past the worker heartbeat timeout).

Measured on browser-env rollouts (45 turns, 4 concurrent per worker):
worker peak RSS 1053MB -> 275MB, resting 990MB -> 155MB.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment on lines +252 to +261
import ctypes

libc = ctypes.CDLL("libc.so.6")
except OSError:
libc = None
while True:
await asyncio.sleep(interval)

if libc is not None:
libc.malloc_trim(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium server/env_worker.py:252

On musl-based systems like Alpine Linux, libc.so.6 can exist but malloc_trim is a glibc-specific function that musl does not export. Calling libc.malloc_trim(0) throws AttributeError when the symbol is missing, which terminates stats_loop and stops all stats collection for the rest of the worker's lifetime. Consider catching AttributeError alongside OSError when loading the symbol, or guard the call with a try/except.

        try:
             import ctypes
 
             libc = ctypes.CDLL("libc.so.6")
+            libc.malloc_trim(0)  # verify symbol exists
+            libc.malloc_trim.argtypes = [ctypes.c_int]
         except (OSError, AttributeError):
             libc = None
         while True:
             await asyncio.sleep(interval)
 
             if libc is not None:
-                libc.malloc_trim(0)
+                try:
+                    libc.malloc_trim(0)
+                except Exception:
+                    pass
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/serve/server/env_worker.py around lines 252-261:

On musl-based systems like Alpine Linux, `libc.so.6` can exist but `malloc_trim` is a glibc-specific function that musl does not export. Calling `libc.malloc_trim(0)` throws `AttributeError` when the symbol is missing, which terminates `stats_loop` and stops all stats collection for the rest of the worker's lifetime. Consider catching `AttributeError` alongside `OSError` when loading the symbol, or guard the call with a try/except.

Evidence trail:
verifiers/serve/server/env_worker.py lines 249-278 (REVIEWED_COMMIT): stats_loop method loads libc.so.6 catching only OSError (line 255), calls libc.malloc_trim(0) at line 261 without any exception handling. Line 289: stats_loop launched as asyncio.create_task. Python ctypes CDLL.__getattr__ raises AttributeError when dlsym fails to find a symbol. musl libc does not export malloc_trim (glibc-specific).

@macroscopeapp

macroscopeapp Bot commented Jun 10, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

1 blocking correctness issue found. An unresolved review comment identifies a potential bug where the malloc_trim call could crash the stats_loop on musl-based systems (Alpine Linux) due to missing AttributeError handling. This substantive issue warrants human review before merging.

You can customize Macroscope's approvability policy. Learn more.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2449c2112d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/runtime.py
key = str(state["trajectory_id"])
self._model_request_locks.pop(key, None)
self._inflight_visible_model_requests.pop(key, None)
self.trajectories.pop(key, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Delay trajectory eviction until group cleanup

When a grouped eval reaches group scoring, each harness.run(...) has already executed cleanup_rollout, so this pop removes the live trajectory before Env._run_group_states calls harness.score_group(...). Any group-stage handler/reward that starts a child task with state.for_task(..., transcript="append") will now fail in resolve_trajectory with No live trajectory registered, even though the state still retains its runtime handles until group cleanup. Keep trajectories for states with a group_key until cleanup_group, analogous to how model clients are retained across the group stage.

Useful? React with 👍 / 👎.

else:
deliver_response(request, response, error)
finally:
endpoint.discard_request(request_id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not discard undelivered streaming intercepts

For intercepted streaming requests whose protocol is not openai_chat_completions (for example /v1/messages or /v1/responses with stream: true), the branch above raises before anything is put on the stream queue or response future. This new unconditional discard then removes the intercept from InterceptionServer.intercepts, so unregister_rollout() can no longer put the EOF sentinel or cancel the future, leaving the HTTP handler waiting indefinitely instead of being unblocked during rollout cleanup.

Useful? React with 👍 / 👎.

Comment on lines +260 to +261
if libc is not None:
libc.malloc_trim(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Run malloc_trim away from the worker event loop

When workers are serving active rollouts, this synchronous ctypes call runs in stats_loop on the same asyncio thread that handles ZMQ requests and responses. Releasing the GIL only helps other threads; it does not let this event loop make progress, so a slow trim on a large fragmented heap delays all request handling and heartbeats every 10 seconds and can make the router restart an otherwise healthy worker. Move the trim to a background thread or otherwise bound/guard it.

Useful? React with 👍 / 👎.

@hubert-marek hubert-marek marked this pull request as draft June 10, 2026 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant