Changes from 50 commits
66 commits
8646a8d
refactor(agent): replace segment-based trajectory with turn-node Traj…
Jun 1, 2026
f10552f
refactor(agent): drop dead code left by trajectory-manager refactor
Jun 2, 2026
ba557ae
refactor(agent): port anthropic + trajectory_manager from trajectory-…
Jun 4, 2026
b8a5cd4
refactor(agent): rewrite openai.py for Codex CLI + TrajectoryManager
Jun 4, 2026
44e7999
refactor(agent): centralize snapshot-threshold default + filter acces…
Jun 5, 2026
5e59c25
feat(agent): add fork-merge rescue for short assistant rewrites
Jun 5, 2026
1d459bd
fix(agent): replace sib.messages on fork-merge rescue
Jun 5, 2026
8d6fdc9
refactor(agent): drop billing-header scrub now cc emits no header
Jun 5, 2026
2b4efc4
refactor(agent): migrate TrajectoryManager and adapters (v4)
Jun 8, 2026
6f95f18
refactor(agent): drop drift fork/merge params, strict exact-prefix li…
Jun 8, 2026
4fcbb24
feat(agent): assistant-rewrite merge to de-dilute reward
Jun 8, 2026
0f1c5aa
feat(agent): TrajectoryManager re-accepts fork_merge_max_response_tokens
Jun 8, 2026
4732702
docs(test): spec for TrajectoryManager e2e test script
Jun 8, 2026
4ea5d5a
test(agent): end-to-end TrajectoryManager test matrix (append_turn/ge…
Jun 8, 2026
cbee0de
test(agent): dump raw append_turn inputs in e2e readable output
Jun 8, 2026
a11c9b4
test(agent): 1.7 now shows token drift's effect on the linearized sample
Jun 8, 2026
f1b1792
test(agent): every case prints [samples] + mask info
Jun 8, 2026
0be5966
test(agent): make reward-split explicit in dump + conservation assert
Jun 8, 2026
fe7692a
test(agent): set every case's input reward to 1.0
Jun 8, 2026
2ee5ee8
test(agent): render whitespace in token labels as visible ␣
Jun 8, 2026
8eed4dd
test(agent): add Group 4 (boundary/defensive/feature) -> 98% coverage
Jun 8, 2026
cfad29d
test(agent): assert full output via golden token+loss strings
Jun 8, 2026
624b927
refactor(agent): drift-tolerant trajectory linearization
Jun 8, 2026
054f89a
chore(agent): untrack e2e test design doc and trajectory_manager tests
Jun 8, 2026
bc3d304
chore(agent): drop comments/docstrings in generate.py and TurnRecord
Jun 8, 2026
ece9007
docs(agent): tighten trajectory_manager comments to why-not-what
Jun 8, 2026
6565fe1
refactor(agent): slim adapters and trajectory_manager, add e2e test
Jun 9, 2026
aceb162
refactor(agent): replace segment-based trajectory with turn-node Traj…
Jun 1, 2026
0193103
refactor(agent): drop dead code left by trajectory-manager refactor
Jun 2, 2026
d84fe47
refactor(agent): port anthropic + trajectory_manager from trajectory-…
Jun 4, 2026
b7550cc
refactor(agent): rewrite openai.py for Codex CLI + TrajectoryManager
Jun 4, 2026
0b2576d
refactor(agent): centralize snapshot-threshold default + filter acces…
Jun 5, 2026
4ea8bb9
feat(agent): add fork-merge rescue for short assistant rewrites
Jun 5, 2026
8c5fe1a
fix(agent): replace sib.messages on fork-merge rescue
Jun 5, 2026
18ea895
refactor(agent): drop billing-header scrub now cc emits no header
Jun 5, 2026
f733bef
fix(agent): use rollout_id after upstream group_id revert
Jun 9, 2026
ad122f4
refactor(agent): assert base_sample in get_trajectory instead of defa…
Jun 9, 2026
b05dfdb
Merge branch 'trajectory-manager-migration-v4' into refactor_trajecto…
Jun 9, 2026
18772b0
fix(agent): mask entire drifted response span in B1 replace
Jun 9, 2026
d61b419
refactor(agent): extract shared adapter pipeline; move TurnRecord + a…
Jun 9, 2026
71f5511
refactor(agent): symmetric adapters + de-scaffold common
Jun 9, 2026
d8a6b78
refactor(agent): drop tools param; dict== mount-point matching
Jun 10, 2026
d73bb4a
refactor(agent): collapse Node turn_* into turn; rename Node->Message…
Jun 10, 2026
fc72355
perf(agent): chunked common-prefix; rename _lcp_len -> _common_prefix…
Jun 10, 2026
54b0846
refactor(agent): trajectory manager cleanup + adapter tweaks
Jun 11, 2026
804bcd1
refactor(agent): fold classify_drift into _SampleBuilder as classify_…
Jun 11, 2026
46d09de
chore(agent): untrack test_trajectory_manager.py
Jun 11, 2026
36fa60e
refactor(agent): gate fork on full output_ids length
Jun 12, 2026
76cf2cb
Refactor coding-agent RL harness: pluggable agent harness layer + SWE…
Jun 12, 2026
0da1b3c
Merge pull request #4 from jingshenghang/refactor_harness
jingshenghang Jun 12, 2026
7e1af92
refactor(agent): unify env vars under SLIME_AGENT_*/ADAPTER_*, tidy h…
Jun 15, 2026
76c9b0a
refactor(agent): make harnesses SingletonMeta-backed, drop module-lev…
Jun 15, 2026
d2897af
refactor(agent): rename debug hook to debug_callback, pass TurnRecord…
Jun 15, 2026
336985c
refactor(agent): tidy adapters, trajectory_manager, harness and SWE e…
Jun 16, 2026
36608fb
refactor(agent): drop optional sandbox metadata file/json, keep image…
Jun 16, 2026
da0a4fd
refactor(agent): reorganize agent tests under tests/test_agent/, tidy…
Jun 16, 2026
414a2ae
refactor(agent): simplify adapters, harness and sandbox
Jun 16, 2026
3cd0e30
docs(swe): move fan-out/sandbox notes from launcher into README
Jun 16, 2026
b40db1d
Merge remote-tracking branch 'origin/main' into refactor_trajectory_m…
Jun 16, 2026
94beeda
fix(agent): lazy-import load_tokenizer so CPU agent test needs no tra…
Jun 16, 2026
4405c1c
Revert "fix(agent): lazy-import load_tokenizer so CPU agent test need…
Jun 16, 2026
942358d
test(agent): stub transformers before importing generate in CPU rollo…
Jun 16, 2026
3772db3
test(agent): shim asyncio.timeout for py3.10 CI in CPU rollout test
Jun 16, 2026
f0f40b4
ci: rename agent-adapter-test job to agent-test (covers more than ada…
Jun 16, 2026
b2d3b38
refactor(agent): rename trajectory_manager.py back to trajectory.py
Jun 16, 2026
35c9222
refactor(agent): drop duplicate trajectory_manager.py left by rename
Jun 16, 2026
27 changes: 14 additions & 13 deletions examples/coding_agent_rl/README.md
@@ -2,11 +2,12 @@

This directory provides an example of running end-to-end **SWE (Software-Engineering) coding-agent RL** with slime: a real coding agent (claude-code CLI) drives `Read/Edit/Grep/Bash/Agent` tools inside a fresh sandbox per sample, the model produces a `git diff`, and the diff is graded against the dataset's test harness in a second clean sandbox (no test-cheating).

-Two example files and one shared adapter implement the loop:
+Two example files, the shared harness package, and one shared adapter implement the loop:

-- `generate.py` — per-sample `generate()` registered via `--custom-generate-function-path`. Boots the sandbox, runs claude-code, captures the diff, scores it, and emits one or more `Sample`s back to slime.
-- `slime.agent.adapters.AnthropicAdapter` — the shared Anthropic Messages adapter. claude-code talks to it as if it were Anthropic; the adapter tokenizes the current message history each turn, records prompt/output token snapshots, preserves model-generated tokens (`loss_mask=1`) only while later prompts stitch onto them, masks template/observation tokens (`0`), and emits **three kinds of segments** per trajectory: `subagent` (completed `Task/Agent` dispatch), `wipe` (chain frozen by auto-compact), `final` (tail of the main chain).
-- `sandbox.py` — coding-agent/SWE helpers built on `slime.agent.sandbox`: install bootstraps, spawn claude-code, capture patches, and run the fresh-sandbox evaluator. The shared sandbox contract lives in `slime.agent.sandbox.Sandbox`.
+- `generate.py` — per-sample `generate()` registered via `--custom-generate-function-path`. Boots the sandbox, prepares the SWE workspace, runs the coding harness (claude-code), captures the diff, scores it, and emits one or more `Sample`s back to slime.
+- `slime.agent.adapters.AnthropicAdapter` — the shared Anthropic Messages adapter. claude-code talks to it as if it were Anthropic; the adapter tokenizes the current message history each turn, records prompt/output token snapshots, preserves model-generated tokens (`loss_mask=1`) only while later prompts stitch onto them, and masks template/observation tokens (`0`). Each turn is routed into a per-session message tree inside `slime.agent.trajectory_manager.TrajectoryManager`; any divergence in the prompt prefix forks a new branch, so sub-agent dispatches and auto-compaction are handled as separate root-to-leaf chains. `get_trajectory` linearizes each leaf chain into one `Sample`.
+- `slime.agent.harness` — harness-agnostic coding-agent lifecycle (install CLI, write config, spawn detached, poll done-marker). `BaseHarness` defines the contract; `CLAUDE_CODE` / `CODEX` are the shipped implementations. Adding a harness is one new file. The shared sandbox contract lives in `slime.agent.sandbox.Sandbox`.
+- `swe.py` — harness-agnostic SWE task layer built on `slime.agent.sandbox`: `prepare_workspace` (pre_commands + PROBLEM_STATEMENT.md), `git_diff` (patch capture), and `evaluate` (fresh-sandbox grading). `SWE_PROMPT` is the task instruction handed to whichever harness runs.
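
To make the tree-and-fork idea in the adapter bullet concrete, here is a minimal toy sketch. It is not the `TrajectoryManager` implementation: `ToyNode`, `ToyTree`, and the string messages are hypothetical, and it matches whole messages by equality rather than token prefixes. It only shows how appending turns that share a prefix extends one chain, while a divergent prefix forks a new branch that later linearizes into its own sample.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """One message in the toy tree (hypothetical, not the slime message class)."""

    message: str
    children: list["ToyNode"] = field(default_factory=list)


class ToyTree:
    """Toy per-session message tree keyed on exact message equality."""

    def __init__(self) -> None:
        self.root = ToyNode(message="<root>")

    def append_turn(self, messages: list[str]) -> None:
        """Walk the shared prefix, then fork at the first divergence."""
        node = self.root
        for msg in messages:
            match = next((c for c in node.children if c.message == msg), None)
            if match is None:
                match = ToyNode(message=msg)
                node.children.append(match)
            node = match

    def chains(self) -> list[list[str]]:
        """Return every root-to-leaf chain, one per would-be Sample."""
        out: list[list[str]] = []

        def walk(node: ToyNode, path: list[str]) -> None:
            if not node.children:
                out.append(path)
                return
            for child in node.children:
                walk(child, path + [child.message])

        walk(self.root, [])
        return out


tree = ToyTree()
# Two turns of the main agent: the second history extends the first.
tree.append_turn(["sys", "user: fix bug", "asst: read file"])
tree.append_turn(["sys", "user: fix bug", "asst: read file", "tool: contents", "asst: edit"])
# A sub-agent starts from a different prompt prefix, so it forks at the root.
tree.append_turn(["sys-subagent", "user: investigate", "asst: grep"])

for chain in tree.chains():
    print(len(chain), chain[-1])
```

In this toy run the main agent's two turns collapse into one chain and the sub-agent forms a second, so two samples would be emitted.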

`generate.py` owns one `AnthropicAdapter` instance. For each sample it calls
`adapter.open_session(...)` before starting claude-code, serves `adapter.app` as
@@ -111,7 +112,7 @@ All set in the launcher; tune per cluster.
| `SWE_HOST_CC_TARBALL` | — | Host path to the Claude Code CLI npm tarball. |
| `SWE_TIME_BUDGET_SEC` | `1800` | Wallclock budget for one agent run. |
| `SWE_EVAL_TIMEOUT_SEC` | `600` | Wallclock cap on the evaluator sandbox. |
-| `SWE_BOOT_CONCURRENCY` | `6` | Cap on simultaneous sandbox boots (eases h2/SSL long-tail). |
+| `SWE_BOOT_CONCURRENCY` | `16` | Cap on simultaneous sandbox boots (eases h2/SSL long-tail). |
| `SWE_CLAUDE_EXTRA_ARGS` | (see launcher) | Extra flags appended to the `claude` CLI invocation — registers the read-only `investigator` sub-agent, disables `WebFetch`/`WebSearch`, disables slash commands. |
| `SWE_CC_PROMPT` | unset | Optional override for the user-turn prompt. Setting this to require sub-agent dispatch is the most reliable way to maximize fan-out. |
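
As an illustration of how knobs like these are typically consumed, the sketch below reads three of the variables with the defaults listed in the table (using the new `16` for `SWE_BOOT_CONCURRENCY`) and caps concurrent boots with an `asyncio.Semaphore`. This is an assumption about the mechanism, not the launcher's actual code; `boot_all` and `fake_boot` are hypothetical names, and the boot itself is simulated with a short sleep.

```python
import asyncio
import os

# Defaults mirror the table above; the real launcher may parse these differently.
TIME_BUDGET_SEC = int(os.environ.get("SWE_TIME_BUDGET_SEC", "1800"))
EVAL_TIMEOUT_SEC = int(os.environ.get("SWE_EVAL_TIMEOUT_SEC", "600"))
BOOT_CONCURRENCY = int(os.environ.get("SWE_BOOT_CONCURRENCY", "16"))


async def boot_all(n_sandboxes: int, limit: int = BOOT_CONCURRENCY) -> int:
    """Simulate n boots under a concurrency cap and return the peak in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_boot(_: int) -> None:
        nonlocal in_flight, peak
        async with sem:  # at most `limit` boots hold the semaphore at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the real sandbox boot
            in_flight -= 1

    await asyncio.gather(*(fake_boot(i) for i in range(n_sandboxes)))
    return peak


peak = asyncio.run(boot_all(n_sandboxes=40, limit=4))
print(f"peak concurrent boots: {peak}")
```

Raising the cap shortens the boot phase when the sandbox backend can absorb the burst; lowering it trades throughput for fewer long-tail connection failures.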

@@ -145,8 +146,8 @@ The Anthropic adapter therefore follows a **string in, token out** contract:

Multi-turn agents still force the adapter to tokenize later message
histories, because tool observations and claude-code's own compacted messages
-arrive as strings. `slime.agent.trajectory.merge_turns` stitches those later
-prompts against the saved token stream:
+arrive as strings. `slime.agent.trajectory_manager.TrajectoryManager` routes
+those later prompts against the saved token stream:

- New prompt suffixes that are tool/user/environment context are appended with
`loss_mask=0`.
@@ -160,15 +161,15 @@ That last case is the important correctness guard. A re-tokenization mismatch
can make a string-level conversation look continuous while token-level
provenance is broken. slime keeps the context needed to continue the agent, but
does not backprop through tokens whose sampled origin can no longer be proven.
-The unit tests in `tests/test_agent_trajectory.py` cover matched prefixes,
-skipped turns, split-output drift, changed token counts, and prompt-base
-restarts.
+The unit tests in `tests/test_agent/test_trajectory_manager.py` cover matched
+prefixes, skipped turns, split-output drift, changed token counts, and
+prompt-base restarts.
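
The provenance guard described above can be sketched at the token level as follows. This is a simplified illustration, not the slime implementation: `common_prefix_len` and `stitch` are hypothetical helpers, token IDs are made up, and the sketch only masks from the divergence point onward (a more conservative variant could mask the entire response span that contains the drift).

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared prefix of two token-ID lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stitch(
    saved_ids: list[int], saved_mask: list[int], new_prompt_ids: list[int]
) -> tuple[list[int], list[int], bool]:
    """Stitch a re-tokenized prompt onto the saved token stream.

    Returns (ids, loss_mask, drifted). Sampled tokens keep loss_mask=1 only
    inside the prefix that the new prompt reproduces exactly; everything after
    that came from re-tokenization or new context, so it gets loss_mask=0.
    """
    keep = common_prefix_len(saved_ids, new_prompt_ids)
    drifted = keep < len(saved_ids)
    mask = saved_mask[:keep] + [0] * (len(new_prompt_ids) - keep)
    return list(new_prompt_ids), mask, drifted


# Saved stream: prompt tokens 1,2,3 (mask 0) then sampled tokens 10,11 (mask 1).
saved_ids, saved_mask = [1, 2, 3, 10, 11], [0, 0, 0, 1, 1]

# Clean continuation: the next prompt reproduces the stream and adds tool output.
ids, mask, drifted = stitch(saved_ids, saved_mask, [1, 2, 3, 10, 11, 20, 21])
print(mask, drifted)

# Drift: re-tokenization turned sampled token 11 into 12.
ids2, mask2, drifted2 = stitch(saved_ids, saved_mask, [1, 2, 3, 10, 12, 20])
print(mask2, drifted2)
```

In the drift case the context is still kept so the agent can continue, but the token whose sampled origin can no longer be proven carries no gradient.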

## Fan-out Semantics

-- `generate()` returns `list[Sample]` — one Sample per trajectory **segment** (`subagent` / `wipe` / `final`).
-- Per-trajectory reward is split as `reward / K` across segments; `rollout_id` is shared so the per-rollout-mean loss reducer still counts the trajectory once.
-- Sub-agent dispatch increases `K` (each completed `Agent` turn block becomes its own segment), so the effective batch after flatten can be much larger than `rollout_batch_size * n_samples_per_prompt`.
+- `generate()` returns `list[Sample]` — one Sample per root-to-leaf chain in the per-session message tree.
+- Per-trajectory reward is split as `reward / K` across chains; `rollout_id` is shared so the per-rollout-mean loss reducer still counts the trajectory once.
+- Sub-agent dispatch and auto-compaction increase `K` (each prompt-prefix divergence forks a new branch), so the effective batch after flatten can be much larger than `rollout_batch_size * n_samples_per_prompt`.
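
The `reward / K` split can be checked with a few lines of arithmetic. The sketch below uses a hypothetical `ToySample` stand-in (the real `Sample` class has more fields) and a hypothetical `reward_per_rollout` helper to show that the per-rollout total is conserved when all `K` chains share one `rollout_id`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToySample:
    """Hypothetical stand-in carrying only the two fields this sketch needs."""

    rollout_id: int
    reward: float


def split_reward(rollout_id: int, reward: float, k: int) -> list[ToySample]:
    """Emit K samples that share a rollout_id and each carry reward / K."""
    if k < 1:
        raise ValueError("a trajectory must produce at least one chain")
    return [ToySample(rollout_id=rollout_id, reward=reward / k) for _ in range(k)]


def reward_per_rollout(samples: list[ToySample]) -> dict[int, float]:
    """Sum rewards by rollout_id, as a per-rollout reducer would group them."""
    totals: dict[int, float] = defaultdict(float)
    for s in samples:
        totals[s.rollout_id] += s.reward
    return dict(totals)


# One trajectory with reward 1.0 that forked into K=4 root-to-leaf chains.
samples = split_reward(rollout_id=7, reward=1.0, k=4)
print(len(samples), reward_per_rollout(samples))
```

The flattened batch grows by a factor of `K` for this trajectory, while its total contribution per rollout stays at the original reward.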

## Porting to a New Sandbox Backend
