Add Codex CLI harness #1568

Open
xeophon wants to merge 1 commit into main from codex-cli-harness

@xeophon xeophon commented Jun 8, 2026

Member

Summary

  • add a packaged Codex CLI command harness
  • wire the harness into package exports and docs
  • add focused unit coverage for command construction and package imports

Testing

  • uv run pytest tests/test_v1_codex_cli.py tests/test_v1_harbor_cli.py::test_packaged_command_harnesses_defer_partial_program_overrides tests/test_imports.py::test_package_tasksets_and_harnesses_are_not_root_exports tests/test_imports.py::test_package_tasksets_and_harnesses_are_not_v1_exports
  • TerminalBench2 smoke on adaptive-rejection-sampler: reward 1.0

Note

Medium Risk
New sandboxed agent path runs Codex with bypass-approvals flags and handles API keys or subscription auth JSON; mistakes could affect credential handling or remote model calls during rollouts.

Overview
Adds a packaged Codex CLI command harness (harnesses.codex_cli) so evals can run the OpenAI Codex agent in sandboxes alongside OpenCode, Pi, and similar harnesses.

CodexCLIProgramConfig.resolve() builds the sandbox program. It installs Codex via the official installer, writes the task and system prompts to files, and runs codex exec with JSON logging and optional artifacts. Two auth modes are supported: api_key (the default, using the runtime model plus the intercepted API key and base URL) and chatgpt (using the CODEX_AUTH_JSON secret). Version strings such as codex@latest or codex@0.137.0 select the release that the installer fetches.
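As a rough illustration of the version-string and command-construction behavior described above, the sketch below splits a spec such as codex@0.137.0 and assembles a codex exec argument list. The helper names split_version_spec and build_exec_args are hypothetical and are not taken from codex_cli.py. The flags are the ones visible in the reviewed diff hunk plus --output-last-message from the Codex review comment, and the fallback to latest for a bare codex spec is an assumption.

```python
from __future__ import annotations

import shlex
from typing import Optional


def split_version_spec(spec: str) -> tuple[str, str]:
    """Split 'codex@0.137.0' into ('codex', '0.137.0').

    Assumption: a spec without '@<version>' falls back to 'latest'.
    """
    name, sep, version = spec.partition("@")
    if not name:
        raise ValueError(f"missing package name in version spec: {spec!r}")
    return name, version if sep and version else "latest"


def build_exec_args(last_message_path: Optional[str] = None) -> list[str]:
    """Assemble a `codex exec` argument list (hypothetical helper).

    Prompt wiring is deliberately left out, since the PR text does not
    show how the prompt files are passed to the CLI.
    """
    args = [
        "codex",
        "exec",
        "--ephemeral",
        "--skip-git-repo-check",
        "--dangerously-bypass-approvals-and-sandbox",
        "--json",
    ]
    if last_message_path is not None:
        # Ask Codex to also write its final message to a file artifact.
        args += ["--output-last-message", last_message_path]
    return args


if __name__ == "__main__":
    print(split_version_spec("codex@0.137.0"))
    print(shlex.join(build_exec_args("/tmp/last_message.txt")))
```

Building the command as a list and joining it with shlex.join only at the end keeps quoting correct if a path ever contains spaces.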

The PR also updates the package exports and the BYO-harness and harnesses README docs (TOML examples, ChatGPT auth). Tests cover program construction, auth helpers, imports, and parity with the other command harnesses in the parametrized deferred program-override test.

Reviewed by Cursor Bugbot for commit e7b4378. Bugbot is set up for automated code reviews on this repo.

Note

Add CodexCLI harness for running Codex CLI in sandboxed eval programs

  • Adds codex_cli.py implementing CodexCLI, CodexCLIConfig, and CodexCLIProgramConfig — a new harness that installs and executes Codex CLI inside a sandboxed environment.
  • Supports two auth modes: api_key (logs in via OPENAI_API_KEY) and chatgpt (writes an auth.json from CODEX_AUTH_JSON env var).
  • Emits JSONL logs and last-message text as artifacts; consumes system/instruction prompts from the verifiers runtime.
  • Exports the new symbols from the harnesses package and adds them to import tests.
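The two auth modes listed above could be handled along the following lines. This is a minimal sketch, not the code in codex_cli.py: prepare_auth is a hypothetical helper, the auth.json location is passed in as codex_home rather than assumed, and returning extra environment variables for the api_key mode is an illustrative design choice.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Mapping


def prepare_auth(mode: str, env: Mapping[str, str], codex_home: Path) -> dict[str, str]:
    """Return extra environment variables for the given auth mode.

    Hypothetical helper: for "chatgpt" it writes auth.json under
    codex_home as a side effect and returns no extra variables.
    """
    if mode == "api_key":
        key = env.get("OPENAI_API_KEY")
        if not key:
            raise ValueError("api_key auth requires OPENAI_API_KEY")
        return {"OPENAI_API_KEY": key}

    if mode == "chatgpt":
        raw = env.get("CODEX_AUTH_JSON")
        if not raw:
            raise ValueError("chatgpt auth requires CODEX_AUTH_JSON")
        json.loads(raw)  # fail early if the secret is not valid JSON
        codex_home.mkdir(parents=True, exist_ok=True)
        auth_path = codex_home / "auth.json"
        auth_path.write_text(raw)
        auth_path.chmod(0o600)  # keep the credential file owner-only
        return {}

    raise ValueError(f"unknown auth mode: {mode!r}")
```

Validating the JSON before writing it means a malformed secret fails at setup time rather than partway through a rollout.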

Macroscope summarized e7b4378.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


__all__ = [
"CodexCLI",
"CodexCLIConfig",
"CodexCLIProgramConfig",

Harness package version not bumped

Medium Severity

This PR adds the public CodexCLI harness and exports it from harnesses, but leaves __version__ at 0.1.2. That leaves package version metadata out of step with the new user-facing behavior, so a post-merge publish may not ship CodexCLI under a new release tag.

Triggered by project rule: BugBot Instructions

macroscopeapp Bot commented Jun 8, 2026

Approvability

Verdict: Needs human review

This PR adds a new CodexCLI harness feature with 172 lines of new implementation, including authentication handling and shell script generation. New features introducing user-facing capabilities warrant human review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7b43784f9

--ephemeral \\
--skip-git-repo-check \\
--dangerously-bypass-approvals-and-sandbox \\
--json \\

[P2] Return Codex's final message as the completion

When this harness is used on tasks that score or display the assistant completion, --json makes Codex print newline-delimited event JSON to stdout rather than the answer text (the Codex CLI docs describe --json this way and --output-last-message as the final-message path: https://developers.openai.com/codex/cli/reference#codex-exec). The v1 sandbox runner records stdout directly into state["completion"] (verifiers/v1/utils/sandbox_utils.py), so these rollouts will expose a JSON event log as the model completion while the actual final message is only an artifact.
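One possible shape for the fix suggested above, sketched under assumptions: prefer the text written via --output-last-message when it is available, and fall back to stdout otherwise. Both select_completion and looks_like_jsonl are hypothetical helpers, not part of verifiers or the harness, and the sketch relies on nothing about Codex's event schema beyond "one JSON object per line".

```python
from __future__ import annotations

import json
from typing import Optional


def looks_like_jsonl(text: str) -> bool:
    """Return True if every non-empty line parses as a JSON object."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    for line in lines:
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            return False
        if not isinstance(value, dict):
            return False
    return True


def select_completion(stdout: str, last_message: Optional[str]) -> str:
    """Pick the text to record as the completion (hypothetical helper).

    last_message is the content of the --output-last-message file,
    or None if the file was not produced.
    """
    if last_message is not None and last_message.strip():
        return last_message.strip()
    # No final-message artifact: fall back to whatever stdout contains.
    return stdout
```

A harness could use looks_like_jsonl as a guard in tests, for example to assert that the recorded completion is not an event log when a final-message file exists.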
