9 changes: 9 additions & 0 deletions verifiers/serve/server/env_worker.py
@@ -248,9 +248,18 @@ async def send_error_response(error: str) -> None:

     async def stats_loop(self, interval: float = 10.0) -> None:
         """Loop to push worker stats to the router."""
+        try:
+            import ctypes
+
+            libc = ctypes.CDLL("libc.so.6")
+        except OSError:
+            libc = None
         while True:
             await asyncio.sleep(interval)
+
+            if libc is not None:
+                libc.malloc_trim(0)
Comment on lines +252 to +261

🟡 Medium server/env_worker.py:252

On musl-based systems like Alpine Linux, libc.so.6 can exist, but malloc_trim is a glibc-specific function that musl does not export. Calling libc.malloc_trim(0) raises AttributeError when the symbol is missing. Nothing in stats_loop catches it, so the task terminates and all stats collection stops for the rest of the worker's lifetime. Consider catching AttributeError alongside OSError when resolving the symbol, or guard the call with a try/except.

        try:
             import ctypes
 
             libc = ctypes.CDLL("libc.so.6")
+            libc.malloc_trim.argtypes = [ctypes.c_size_t]  # attribute lookup verifies the symbol exists
+            libc.malloc_trim.restype = ctypes.c_int
         except (OSError, AttributeError):
             libc = None
         while True:
             await asyncio.sleep(interval)
 
             if libc is not None:
-                libc.malloc_trim(0)
+                try:
+                    libc.malloc_trim(0)
+                except Exception:
+                    pass

Evidence trail:
verifiers/serve/server/env_worker.py lines 249-278 (REVIEWED_COMMIT): stats_loop method loads libc.so.6 catching only OSError (line 255), calls libc.malloc_trim(0) at line 261 without any exception handling. Line 289: stats_loop launched as asyncio.create_task. Python ctypes CDLL.__getattr__ raises AttributeError when dlsym fails to find a symbol. musl libc does not export malloc_trim (glibc-specific).
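
The two failure modes named in the evidence trail (library missing, symbol missing) can be handled separately. The following is a minimal sketch, not the PR's code: `load_optional_symbol` is a hypothetical helper, and the `size_t`/`int` types follow the glibc prototype `int malloc_trim(size_t pad)`.

```python
import ctypes


def load_optional_symbol(library_name, symbol_name):
    """Return the foreign function, or None if the library or symbol is missing.

    Hypothetical helper for illustration; not part of the PR.
    """
    try:
        lib = ctypes.CDLL(library_name)
    except OSError:
        # The shared library itself could not be loaded.
        return None
    try:
        # Attribute access resolves the symbol and raises AttributeError
        # when the library does not export it.
        return getattr(lib, symbol_name)
    except AttributeError:
        return None


malloc_trim = load_optional_symbol("libc.so.6", "malloc_trim")
if malloc_trim is not None:
    # Declare the C signature instead of relying on ctypes defaults.
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
```

With this shape, the loop only needs an `if malloc_trim is not None:` check, and a missing symbol is detected once at startup without invoking the function.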

Comment on lines +260 to +261

P2: Run malloc_trim away from the worker event loop

When workers are serving active rollouts, this synchronous ctypes call runs in stats_loop on the same asyncio thread that handles ZMQ requests and responses. Releasing the GIL only helps other threads; it does not let this event loop make progress, so a slow trim on a large fragmented heap delays all request handling and heartbeats every 10 seconds and can make the router restart an otherwise healthy worker. Move the trim to a background thread or otherwise bound/guard it.
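
One way to read "move the trim to a background thread" is `asyncio.to_thread`, sketched below under stated assumptions: `call_off_loop` and `demo` are hypothetical names, and `slow_trim` is a sleep-based stand-in for a slow `libc.malloc_trim(0)`. The demo counts how often the event loop gets to run while the blocking call is in progress.

```python
import asyncio
import time


async def call_off_loop(blocking_fn, *args):
    """Run a blocking callable in a worker thread so the event loop keeps running."""
    return await asyncio.to_thread(blocking_fn, *args)


async def demo():
    """Count event-loop ticks while a stand-in for a slow trim runs."""
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    def slow_trim(pad):
        time.sleep(0.2)  # stand-in for a slow libc.malloc_trim(pad)
        return 1

    ticker_task = asyncio.create_task(ticker())
    result = await call_off_loop(slow_trim, 0)
    ticker_task.cancel()
    return ticks, result
```

Running `asyncio.run(demo())` shows the ticker advancing during the blocking call. In `stats_loop` the equivalent change would be awaiting the trim through such a helper; whether overlapping trims or a timeout also need guarding is a design choice this sketch leaves open.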


             stats = EnvWorkerStats(
                 worker_id=self.worker_id,
                 timestamp=time.time(),
1 change: 1 addition & 0 deletions verifiers/v1/runtime.py
@@ -1052,6 +1052,7 @@ async def cleanup_rollout(self, task: Task, state: State) -> None:
         key = str(state["trajectory_id"])
         self._model_request_locks.pop(key, None)
         self._inflight_visible_model_requests.pop(key, None)
+        self.trajectories.pop(key, None)

P2: Delay trajectory eviction until group cleanup

When a grouped eval reaches group scoring, each harness.run(...) has already executed cleanup_rollout, so this pop removes the live trajectory before Env._run_group_states calls harness.score_group(...). Any group-stage handler/reward that starts a child task with state.for_task(..., transcript="append") will now fail in resolve_trajectory with No live trajectory registered, even though the state still retains its runtime handles until group cleanup. Keep trajectories for states with a group_key until cleanup_group, analogous to how model clients are retained across the group stage.
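
The retention rule suggested above can be sketched as follows. This is an illustration only: `TrajectoryRegistry` is a hypothetical stand-in for the runtime, states are plain dicts, and the `"group_key"` lookup is an assumption about how grouped states are identified.

```python
class TrajectoryRegistry:
    """Hypothetical stand-in for the runtime's trajectory bookkeeping."""

    def __init__(self):
        self.trajectories = {}

    def cleanup_rollout(self, state):
        key = str(state["trajectory_id"])
        # Evict immediately only when the state is not part of a group;
        # grouped states keep their trajectory for the group scoring stage.
        if state.get("group_key") is None:
            self.trajectories.pop(key, None)

    def cleanup_group(self, states):
        # Group cleanup is the last point at which these trajectories are needed.
        for state in states:
            self.trajectories.pop(str(state["trajectory_id"]), None)
```

The per-rollout path still frees ungrouped trajectories promptly, so the leak the original `pop` addresses stays fixed for that case.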

         self.release_tool_handles(state)

     async def cleanup_group(self, tasks: list[Task], states: list[State]) -> None:
23 changes: 15 additions & 8 deletions verifiers/v1/utils/endpoint_utils.py
@@ -240,6 +240,10 @@ def rollout_queue(self, rollout_key: str) -> asyncio.Queue[str]:
     def get_request(self, request_id: str) -> EndpointInterceptData:
         return cast(EndpointInterceptData, self.server.intercepts[request_id])

+    def discard_request(self, request_id: str) -> None:
+        """Drop a delivered intercept from the server's per-request store."""
+        self.server.intercepts.pop(request_id, None)
+
     def request_context(
         self, request_id: str, request: EndpointInterceptData
     ) -> ModelRequestContext:
@@ -513,14 +517,17 @@ async def forward_request(
         state._set_error(e)
         raise
     finally:
-        if bool(request.get("stream")):
-            if request.get("protocol") != "openai_chat_completions":
-                raise NotImplementedError(
-                    "Streaming interception is currently supported for OpenAI Chat Completions."
-                )
-            await synthesize_stream(request, response, error)
-        else:
-            deliver_response(request, response, error)
+        try:
+            if bool(request.get("stream")):
+                if request.get("protocol") != "openai_chat_completions":
+                    raise NotImplementedError(
+                        "Streaming interception is currently supported for OpenAI Chat Completions."
+                    )
+                await synthesize_stream(request, response, error)
+            else:
+                deliver_response(request, response, error)
+        finally:
+            endpoint.discard_request(request_id)

P2: Do not discard undelivered streaming intercepts

For intercepted streaming requests whose protocol is not openai_chat_completions (for example /v1/messages or /v1/responses with stream: true), the branch above raises before anything is put on the stream queue or response future. This new unconditional discard then removes the intercept from InterceptionServer.intercepts, so unregister_rollout() can no longer put the EOF sentinel or cancel the future, leaving the HTTP handler waiting indefinitely instead of being unblocked during rollout cleanup.
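
A minimal sketch of one possible fix, discarding only after delivery succeeded. The names here are hypothetical: `deliver_then_discard` is not in the PR, `intercepts` is a plain dict standing in for `InterceptionServer.intercepts`, and `deliver` stands in for the streaming or non-streaming delivery branch.

```python
def deliver_then_discard(intercepts, request_id, deliver):
    """Drop the intercept only once delivery has completed.

    If deliver() raises (for example NotImplementedError for an unsupported
    streaming protocol), the entry stays registered so that later cleanup can
    still find it and unblock the waiting handler.
    """
    deliver()
    intercepts.pop(request_id, None)


def unsupported_stream():
    # Stand-in for the branch that rejects non-OpenAI streaming protocols.
    raise NotImplementedError("unsupported streaming protocol")
```

Compared with the unconditional `finally`, the discard here sits on the success path only; the error path leaves the entry for rollout cleanup to handle.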

def normalize_endpoint_prompt(request: EndpointInterceptData) -> Messages: