feat(coding_agent_rl): add SWE-bench harness evaluation path #2079

Draft
aoshen02 wants to merge 4 commits into THUDM:main from aoshen02:feat/agentic-rollout-deps

@aoshen02 aoshen02 commented Jun 15, 2026

Contributor

Summary

Add swebench_metadata as a third evaluation route in sandbox.evaluate(), alongside the existing swepro and eval_cmd paths. This allows coding_agent_rl to grade SWE-bench Verified instances directly using the official swebench harness.

Changes

  • docker/Dockerfile: add uni-agent (source install, --no-deps) and swebench Python packages
  • examples/coding_agent_rl/sandbox.py: add a swebench_metadata parameter to evaluate() and add _run_swebench_eval(), which reuses uni_agent.reward.swe_bench._make_eval_script_list for eval script generation and the standard swebench grading API for result parsing
  • examples/coding_agent_rl/generate.py: pass swebench_metadata through _metadata() and the evaluate() call site

Evaluation priority

swepro           → SWEPro custom scripts (existing, unchanged)
swebench_metadata → SWE-bench official harness (new)
eval_cmd         → shell command fallback (existing, unchanged)
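
The priority order above can be sketched as a small dispatch function. This is an illustrative sketch only: the real signature of sandbox.evaluate() is not shown in this PR, so the function name `pick_eval_route`, the argument types, and the "first non-empty input wins" rule are assumptions.

```python
from typing import Any, Optional


def pick_eval_route(
    swepro: Optional[dict[str, Any]] = None,
    swebench_metadata: Optional[dict[str, Any]] = None,
    eval_cmd: Optional[str] = None,
) -> str:
    """Return the name of the first configured route, in priority order.

    Hypothetical helper: argument types and truthiness checks are assumed,
    not taken from sandbox.py.
    """
    if swepro:
        return "swepro"  # SWEPro custom scripts (existing)
    if swebench_metadata:
        return "swebench_metadata"  # SWE-bench official harness (new)
    if eval_cmd:
        return "eval_cmd"  # shell command fallback (existing)
    raise ValueError("no evaluation route configured")
```

Under these assumptions, a sample that carries both swebench_metadata and eval_cmd would be graded by the SWE-bench harness, and eval_cmd would only be used when neither of the other two inputs is present.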

Context

Validated on the 500-instance SWE-bench Verified eval with Qwen3.6-35B-A3B: uniagent mode scores 71.3% (355/498), within 0.3 points of the official 71.6%.

Test plan

  • `docker build` completes
  • `python -c "from uni_agent.reward.swe_bench import _make_eval_script_list; print('ok')"` inside container
  • Run a small SWE-bench eval with `swebench_metadata` in sample metadata

🤖 Generated with Claude Code

aoshen02 and others added 2 commits June 15, 2026 03:10
…louts

Add uni-agent (source install, --no-deps) and swebench to the Docker
image so coding_agent_rl can import uni_agent.reward.swe_bench for
SWE-bench harness evaluation, and swebench for grading constants/parsers.

Previously these were installed at container startup via run scripts,
adding cold-start latency and making the image non-self-contained.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Add `swebench_metadata` as a third evaluation route in `sandbox.evaluate()`,
alongside the existing `swepro` and `eval_cmd` paths. This allows
coding_agent_rl to grade SWE-bench Verified instances directly using
the official swebench harness (constants, log parsers, grading).

Eval script generation reuses `uni_agent.reward.swe_bench._make_eval_script_list`
(installed via the Dockerfile change in this PR) so the eval logic stays
in one place. Result parsing uses the standard swebench grading API.

Changes:
- sandbox.py: add `swebench_metadata` param to `evaluate()`, add
  `_run_swebench_eval()` between `_run_swepro` and `_run_eval_cmd`
- generate.py: pass `swebench_metadata` through `_metadata()` and
  the `evaluate()` call site

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
aoshen02 changed the title from "feat(docker): add uni-agent and swebench dependencies for agentic rollouts" to "feat(coding_agent_rl): add SWE-bench harness evaluation path" on Jun 15, 2026
aoshen02 and others added 2 commits June 15, 2026 03:44
Now that uni_agent.reward.swe_bench exposes make_eval_script() and
parse_eval_output() as standalone functions, _run_swebench_eval is
just 6 lines: build script, execute, parse.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: aoshen <aoshen@inferact.ai>
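
The "build script, execute, parse" flow described in the commit above might look roughly like the sketch below. The actual signatures of make_eval_script() and parse_eval_output() are not shown in this PR, so the three steps are passed in as callables; every name and parameter here is hypothetical, and the stand-in functions exist only to make the example runnable.

```python
from typing import Any, Callable


def run_swebench_eval_sketch(
    metadata: dict[str, Any],
    build_script: Callable[[dict[str, Any]], str],
    execute: Callable[[str], str],
    parse_output: Callable[[str, dict[str, Any]], bool],
) -> bool:
    """Three-step evaluation: build the script, run it, parse the log."""
    script = build_script(metadata)  # step 1: generate the eval script
    output = execute(script)  # step 2: run it in the sandbox
    return parse_output(output, metadata)  # step 3: decide resolved or not


# Stand-ins for illustration only; they do not mirror the uni_agent API.
def fake_build(metadata: dict[str, Any]) -> str:
    return f"run_tests {metadata['instance_id']}"


def fake_execute(script: str) -> str:
    return f"{script}: PASSED"


def fake_parse(output: str, metadata: dict[str, Any]) -> bool:
    return output.endswith("PASSED")
```

Passing the steps in as arguments keeps the sketch independent of the real library and makes each stage easy to replace in a test.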
@jingshenghang

Collaborator

There is an import error with the uni-agent package.

Python 3.12.3 (main, Mar  3 2026, 12:15:18) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import uni_agent
>>> from uni_agent.reward.swe_bench import make_eval_script, parse_eval_output
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/uni_agent/reward/swe_bench.py", line 21, in <module>
    from uni_agent.interaction import AgentEnv
  File "/usr/local/lib/python3.12/dist-packages/uni_agent/interaction/__init__.py", line 1, in <module>
    from .env import AgentEnv, AgentEnvConfig
  File "/usr/local/lib/python3.12/dist-packages/uni_agent/interaction/env.py", line 7, in <module>
    from swerex.exceptions import BashIncorrectSyntaxError, CommandTimeoutError
ModuleNotFoundError: No module named 'swerex'
>>>
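
The traceback shows uni_agent.reward.swe_bench importing uni_agent.interaction, which in turn imports swerex. Because pip's --no-deps flag skips a package's dependencies, swerex is likely absent from the image. A minimal sketch of a check that could run during the image build, assuming only the module names visible in the traceback (the helper name `missing_modules` is hypothetical):

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found.

    Uses importlib.util.find_spec, which returns None for a top-level
    module that is not installed, without importing it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the traceback; extend the list as needed.
REQUIRED = ["uni_agent", "swerex"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        raise SystemExit(f"missing modules: {', '.join(missing)}")
    print("all required modules found")
```

A check like this only confirms that the modules are installed; importing uni_agent.reward.swe_bench itself, as in the test plan, is still the stronger test.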

aoshen02 marked this pull request as draft on June 16, 2026 13:34
@jingshenghang

Collaborator

Hi, our refactoring of the agent framework has been merged into the main branch. Pull requests based on our latest code are welcome.

#2005

@aoshen02

Contributor Author

Great, will try.
