Skip to content

Refactor V1 timeout handling#1543

Open
xeophon wants to merge 9 commits into
mainfrom
v1-timeout-refactor
Open

Refactor V1 timeout handling#1543
xeophon wants to merge 9 commits into
mainfrom
v1-timeout-refactor

Conversation

@xeophon

@xeophon xeophon commented Jun 4, 2026

Copy link
Copy Markdown
Member

Summary

  • replace the old V1 timeout chain with global_timeout_seconds, rollout_timeout_seconds, and task_timeout_seconds
  • route rollout/task timeout controls through eval config, env runtime state, harnesses, and sandbox-backed programs
  • remove duplicate harness, sandbox, command, scoring, and service-probe timeout knobs; hardcode internal safety caps

Tests

  • uv run ruff check --fix .
  • uv run ruff format
  • uv run pytest tests/test_v1_runtime_lifecycle.py tests/test_v1_rlm_swe.py tests/test_v1_harbor_cli.py tests/test_eval_cli.py tests/test_multiturn_env.py tests/test_v1_nemo_gym_harness.py tests/test_langchain_deep_agents_wikispeedia.py tests/test_v1_openenv_taskset.py

New structure:

flowchart TB
  subgraph Inputs["Configurable eval inputs"]
    CLI["CLI<br/>--global-timeout<br/>--rollout-timeout<br/>--task-timeout"]
    TOML["TOML<br/>global_timeout_seconds<br/>rollout_timeout_seconds<br/>task_timeout_seconds"]
    Legacy["Legacy aliases<br/>--timeout<br/>TOML timeout"]
  end

  CLI --> EvalConfig["EvalConfig"]
  TOML --> EvalConfig
  Legacy -->|"normalize to"| RolloutConfig["rollout_timeout_seconds"]
  RolloutConfig --> EvalConfig

  EvalConfig -->|"global_timeout_seconds"| Global["Global timeout<br/>timeout_after(global_timeout_seconds)"]
  Global -->|"wraps"| EvalRun["run_evaluation body<br/>env server start<br/>vf_env.evaluate(...)<br/>result metadata/save work"]
  EvalRun -->|"finally outside global timeout"| StopServer["Env server stop<br/>wait_for(..., 600s)"]

  EvalConfig -->|"rollout_timeout_seconds"| EnvKwargs["vf_env.set_kwargs(...)"]
  EnvKwargs --> V1Env["V1 Env<br/>runtime_state rollout_timeout_seconds"]
  EnvKwargs -->|"compat bridge"| V0Env["v0 Environment / MultiTurnEnv<br/>timeout_seconds"]

  V1Env -->|"single rollout"| HarnessRollout["Harness.run<br/>timeout_after(rollout_timeout_seconds)"]
  V1Env -->|"group rollout"| GroupRollout["Env._run_group_states<br/>same rollout timeout wraps gather + score_group"]
  GroupRollout --> HarnessRollout

  HarnessRollout --> RolloutIncluded["Inside rollout timeout<br/>setup_state<br/>run_program<br/>is_completed checks<br/>collect_artifacts<br/>update_rollout<br/>record generation timing<br/>score_rollout<br/>set completed"]
  GroupRollout --> GroupScore["Inside same group rollout timeout<br/>score_group uses remaining rollout time"]
  HarnessRollout -->|"outside rollout timeout"| RolloutExcluded["Outside rollout timeout<br/>cleanup_rollout<br/>cleanup_group<br/>finalize / strip handles<br/>env-server shutdown and log finalization"]

  EvalConfig -->|"task_timeout_seconds"| TaskTimeout["Task timeout<br/>active agent / program work"]
  TaskTimeout --> LocalCallable["local callable program"]
  TaskTimeout --> BaseLoop["base model/tool loop"]
  TaskTimeout --> LocalCommand["local command body"]
  TaskTimeout --> SandboxCommand["sandbox command body<br/>run_background_job timeout=None<br/>poll interval 3s"]
  TaskTimeout --> Nemo["NeMo Gym run_once"]
  Wiki["Wikispeedia<br/>sets task_timeout_seconds = 1200s"] --> TaskTimeout

  subgraph Guards["Hardcoded detailed guards"]
    ModelGuard["LLM/API no-response guard<br/>Runtime wait_for model response = 21600s<br/>ClientConfig HTTP timeout = 21600s<br/>connect timeout = 5s"]
    SandboxLifecycle["Sandbox lifecycle<br/>lease TTL = 1440 min<br/>create = 3600s<br/>ready = 3600s<br/>late-create cleanup wait = 3600s<br/>wait attempts = 120"]
    SandboxSetup["Sandbox setup and files<br/>package/setup/program setup command = 900s<br/>file transfer = 900s"]
    TestPhase["Test/scoring phase<br/>Harbor timeout_sec -> test_timeout_seconds<br/>default None<br/>RLM SWE test command default None"]
    RlmSweDetails["RLM SWE detail guards<br/>pycache shell timeout = 30s<br/>move tests = 60s<br/>archive/upload/download/cat = 300s"]
    OpenEnvGuard["OpenEnv guards<br/>startup = 300s<br/>poll = 1s<br/>health request = 10s<br/>schema request = 30s<br/>creation attempts = 120<br/>local probe = 5s"]
    ProcessCleanup["Local process cancel cleanup<br/>terminate wait = 5s"]
    EndpointWaits["Endpoint/tunnel loop waits<br/>1s ticks"]
    SandboxEndpoint["Sandbox endpoint helper HTTP<br/>300s"]
  end

  BaseLoop --> ModelGuard
  SandboxCommand --> SandboxLifecycle
  SandboxCommand --> SandboxSetup
  HarnessRollout --> TestPhase
  TestPhase --> RlmSweDetails
  HarnessRollout --> OpenEnvGuard
  LocalCommand --> ProcessCleanup
  SandboxCommand --> EndpointWaits
  SandboxCommand --> SandboxEndpoint

  subgraph Removed["Removed or no longer user-configurable timeout knobs"]
    RemovedSandbox["SandboxConfig<br/>timeout_minutes<br/>command_timeout<br/>install_timeout<br/>setup_timeout<br/>create_timeout<br/>wait_timeout<br/>poll_interval"]
    RemovedProgram["ProgramConfig<br/>setup_timeout"]
    RemovedHarness["Harness/env-specific knobs<br/>timeout_seconds<br/>agent timeout<br/>exec_timeout<br/>environment timeout<br/>provider_timeout_ms"]
    RemovedTaskset["Taskset/service probe knobs<br/>Harbor verifier default 900s<br/>OpenEnv startup/health/schema config fields"]
  end

  RemovedSandbox -->|"replaced by constants or task timeout"| SandboxLifecycle
  RemovedSandbox -->|"replaced by constants"| SandboxSetup
  RemovedProgram -->|"replaced by constants"| SandboxSetup
  RemovedHarness -->|"mapped to task timeout or removed"| TaskTimeout
  RemovedTaskset -->|"replaced by constants / explicit test_timeout_seconds"| TestPhase
Loading

Note

High Risk
Removes many per-config timeout knobs in favor of fixed internal caps and runtime timeouts, so workloads that depended on custom command_timeout or sandbox minutes may behave differently; changes touch harness, sandbox lifecycle, and eval orchestration.

Overview
V1 timeout handling is refactored around three eval-facing knobs—global_timeout_seconds, rollout_timeout_seconds, and task_timeout_seconds—wired through EvalConfig, env set_kwargs, and runtime state on harness runs. The eval CLI adds --rollout-timeout, --task-timeout, and --global-timeout; legacy --timeout / TOML timeout map to rollout timeout.

Framework enforcement uses a new timeout_after context manager and FrameworkTimeoutError in harness, v1 Env, sandbox commands, NeMo Gym, and full eval runs. Expired limits set timed_out / task_timed_out and stop_condition timeout_reached instead of relying on per-layer asyncio.wait_for in environments.

Per-config timeout fields are removed from SandboxConfig, ProgramConfig, and many environment/harness configs (command_timeout, setup_timeout, timeout_minutes, harness timeout_seconds, etc.). Internal safety limits move to timeout_utils (sandbox create/ready/setup, model no-response, OpenEnv probes, Harbor verifier seconds via test_timeout_seconds). Sandbox lease TTL is fixed at 24h; RLM config renames (rlm_tools, rlm_repo_url, …) and drops RLM_EXEC_TIMEOUT wiring.

Default HTTP client timeouts rise from 1h to 6h. Downstream envs drop explicit sandbox/bash timeouts; Wikispeedia sets env.task_timeout_seconds at load time.

Reviewed by Cursor Bugbot for commit 181e9ef. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Refactor V1 timeout handling to use centralized constants and runtime-state-driven timeouts

  • Replaces scattered per-sandbox and per-config timeout fields (timeout_minutes, command_timeout, setup_timeout, exec_timeout, etc.) with centralized constants defined in timeout_utils.py and runtime-state-propagated values (rollout_timeout_seconds, task_timeout_seconds, global_timeout_seconds).
  • Adds a timeout_after async context manager in async_utils.py used throughout harness, env, and sandbox code to enforce rollout-, task-, and global-level timeouts that raise TimeoutError and mark state as timed_out.
  • Adds --rollout-timeout, --task-timeout, and --global-timeout CLI flags in eval.py; legacy --timeout still works but maps to rollout_timeout_seconds.
  • Removes timeout fields from SandboxConfig, ProgramConfig, HarborTasksetConfig, OpenEnvTasksetConfig, and several environment-specific configs; downstream code now relies on the centralized constants.
  • Increases default ClientConfig and EndpointClientConfig request timeout from 1 hour to 6 hours.
  • Risk: Any code that previously relied on per-config timeout fields (e.g. command_timeout, setup_timeout, timeout_minutes) will no longer have those values honored; behavior now depends on the new centralized constants or runtime state.

Changes since #1543 opened

  • Introduced FrameworkTimeoutError exception class in verifiers.utils.async_utils and modified timeout_after context manager to raise it instead of TimeoutError when framework-managed timeouts expire [adde70a]
  • Modified Harness.run rollout handler to catch verifiers.errors.Error exceptions, record them via record_error, and continue to completion with scoring and state updates [adde70a]
  • Modified create_sandbox in verifiers.v1.utils.sandbox_utils to handle creation timeouts by returning promptly and scheduling asynchronous cleanup via new delete_late_sandbox_create function [adde70a]
  • Restructured Env._run_group_states to apply rollout timeout to both per-task harness run gathering and optional score_group invocation within a single timeout scope [adde70a]
  • Updated exception handling across NeMoGymHarness._run_nemo_gym, Harness.generate_program, Harness.run_tool_program, Harness.command_program, and run_sandbox_command to catch FrameworkTimeoutError instead of TimeoutError for framework-managed timeouts [adde70a]
  • Added test coverage for distinguishing user-raised TimeoutError from framework FrameworkTimeoutError, error rollout scoring and completion, group timeout behavior, and sandbox creation timeout with late cleanup [adde70a]
  • Expanded timeout exception handling in create_sandbox async function to catch both asyncio.TimeoutError and TimeoutError, and changed the exception raising behavior to throw a new TimeoutError with the message "Sandbox creation timed out." while chaining the original exception [8ab88fd]
  • Expanded timeout exception handling in delete_late_sandbox_create async function to catch both asyncio.TimeoutError and TimeoutError during late sandbox creation cleanup [8ab88fd]
  • Removed test double timeout tracking and deleted multiple timeout validation tests from V1 runtime lifecycle test suite [181e9ef]
  • Deleted CLI timeout configuration and flag validation tests [181e9ef]
  • Removed timeout field presence and absence assertions from Harbor and RLM SWE test suites [181e9ef]
  • Removed timeout value assertions from environment configuration tests [181e9ef]

Macroscope summarized 81a4506.

Comment thread verifiers/v1/harness.py Outdated
Comment thread verifiers/v1/env.py
@macroscopeapp

macroscopeapp Bot commented Jun 4, 2026

Copy link
Copy Markdown

Approvability

Verdict: Needs human review

This PR significantly refactors timeout handling across the V1 system, removing many configurable timeout parameters and introducing a new framework-managed timeout architecture. Multiple unresolved P1 review comments identify scenarios where previously bounded operations may now hang indefinitely, requiring careful human assessment of the behavioral changes.

You can customize Macroscope's approvability policy. Learn more.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87149fb579

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/scripts/eval.py
Comment on lines +575 to +579
"--rollout-timeout",
dest="rollout_timeout_seconds",
type=float,
default=None,
help="Per-rollout wall-clock timeout in seconds. Overrides timeout_seconds in --extra-env-kwargs.",
help="Per-rollout wall-clock timeout in seconds.",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep --timeout as a compatibility alias

For users and config files that still use the documented prime eval run ... --timeout 600 / TOML timeout = 600 timeout surface, this rename makes argparse reject the flag before it can be mapped to rollout_timeout_seconds. Since this was the public per-rollout timeout entrypoint for existing eval workflows, removing it turns previously valid runs into CLI failures rather than applying the equivalent rollout timeout; keep --timeout as a deprecated alias to the same destination and normalize legacy TOML timeout as well.

Useful? React with 👍 / 👎.

Comment thread packages/harnesses/harnesses/mini_swe_agent.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be71389db5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 161 to 165
result = await client.run_background_job(
sandbox_id=sandbox_id,
command="bash test.sh",
working_dir="/tests",
timeout=test_timeout,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve Harbor verifier timeouts

For Harbor tasks that set [verifier].timeout_sec, this path now drops the parsed test_timeout and starts bash test.sh without passing any timeout, so a verifier that hangs or intentionally relies on its task-specific budget can block scoring until some unrelated client/global timeout instead of being capped per the task TOML. Keep carrying the parsed verifier timeout into this call or map it to the new timeout surface.

Useful? React with 👍 / 👎.

Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
mikasenghaas
mikasenghaas previously approved these changes Jun 4, 2026
Comment thread packages/tasksets/tasksets/harbor.py
Comment thread environments/rlm_swe_v1/rlm_swe_v1.py
Comment thread verifiers/scripts/eval.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37953fc57f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/env.py Outdated
Comment thread verifiers/v1/harness.py Outdated
Comment thread verifiers/utils/eval_utils.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5fe6754f7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +718 to 721
sandbox = await asyncio.wait_for(create_waiter, SANDBOX_CREATE_TIMEOUT_SECONDS)
except (asyncio.CancelledError, TimeoutError):
try:
sandbox = cast(SandboxRecord, await asyncio.shield(create_task))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't await the timed-out sandbox creation task

When sandbox creation exceeds SANDBOX_CREATE_TIMEOUT_SECONDS, wait_for only cancels the shield wrapper, not create_task; the except TimeoutError path then awaits the same shielded create_task before it can re-raise or clean up. In an environment where the sandbox create RPC hangs, this new internal timeout therefore never returns and can still block rollout/global timeout handling indefinitely; bound or cancel/close the underlying create attempt instead of awaiting it unbounded after the timeout.

Useful? React with 👍 / 👎.

Comment thread verifiers/utils/eval_utils.py
Comment thread verifiers/v1/utils/timeout_utils.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 72f59c6984

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

harbor = task["harbor"]
assert isinstance(harbor, dict)
task_dir = Path(str(harbor["task_dir"]))
test_timeout = test_phase_timeout_seconds(harbor.get("test_timeout_seconds"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore a default Harbor verifier timeout

When a Harbor task omits [verifier].timeout_sec, this now resolves test_timeout to None and passes that through to run_background_job, whereas the previous path always fell back to verifier_timeout_seconds=900. In evals without an explicit rollout/global timeout, a stuck bash test.sh can now block scoring indefinitely for any task that relies on the default verifier budget; keep a finite default when the task TOML does not specify one.

Useful? React with 👍 / 👎.

# Conflicts:
#	packages/harnesses/harnesses/rlm.py
#	tests/test_v1_rlm_swe.py
Comment thread verifiers/v1/harness.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adde70aec2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

fi
cd "$MINI_SWE_AGENT_WORKDIR"
timeout --kill-after=30s "${{AGENT_TIMEOUT_SECONDS:-3600}}" {shlex.quote(DEFAULT_MINI_BINARY)} \\
{shlex.quote(DEFAULT_MINI_BINARY)} \\

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep MiniSWE agent runs bounded

With the default v1 environment settings, task_timeout_seconds is None, so this sandbox command is now sent to run_background_job without any timeout, and the shell-level timeout --kill-after ... ${AGENT_TIMEOUT_SECONDS:-3600} wrapper was removed here. If mini-swe-agent hangs or never exits despite --exit-immediately, evaluations using MiniSWEAgentProgramConfig can now block indefinitely instead of being capped at the previous 3600s default; keep a finite default task timeout or restore a command-level wrapper for this harness.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ab88fdf3c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +61 to +62
if "timeout" in config and "rollout_timeout_seconds" not in config:
config["rollout_timeout_seconds"] = config["timeout"]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve per-eval legacy timeout overrides

When a TOML config sets a global rollout_timeout_seconds and an individual [[eval]] still uses the legacy timeout field, this normalization runs after {**global_defaults, **eval_config} and skips copying timeout because rollout_timeout_seconds is already present. That silently leaves the global rollout timeout in effect instead of the per-eval override, so existing configs cannot override a global timeout for one eval unless they rename the field everywhere; normalize each layer before merging or otherwise give the per-eval legacy field precedence.

Useful? React with 👍 / 👎.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 181e9ef. Configure here.

Comment thread verifiers/v1/harness.py
completed = True
except FrameworkTimeoutError:
state["timed_out"] = True
state.stop("timeout_reached")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rollout timeout skips scoring

Medium Severity

When the rollout-level timeout_after fires, Harness.run sets timed_out and calls stop but never runs score_rollout. Error rollouts still score in the same method, so timed-out rollouts can finish without reward or metrics.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 181e9ef. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 181e9ef40a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"agent_timeout"
if isinstance(exc, TimeoutError)
else "agent_recursion_limit"
result = await agent.ainvoke(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve Wikispeedia timeout metric

When the Wikispeedia agent exceeds the new env.task_timeout_seconds, the timeout is raised and handled by Harness.local_callable_program rather than this try block, so the state keeps the initialized agent_timeout = False. The environment's agent_timeout metric reads that field, so timed-out Wikispeedia rollouts now report agent_timeout as 0 even though the agent hit the timeout; either mark this state on framework timeouts or update the metric to use task_timed_out.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants