Changes from 47 commits
Commits
66 commits
8646a8d
refactor(agent): replace segment-based trajectory with turn-node Traj…
Jun 1, 2026
f10552f
refactor(agent): drop dead code left by trajectory-manager refactor
Jun 2, 2026
ba557ae
refactor(agent): port anthropic + trajectory_manager from trajectory-…
Jun 4, 2026
b8a5cd4
refactor(agent): rewrite openai.py for Codex CLI + TrajectoryManager
Jun 4, 2026
44e7999
refactor(agent): centralize snapshot-threshold default + filter acces…
Jun 5, 2026
5e59c25
feat(agent): add fork-merge rescue for short assistant rewrites
Jun 5, 2026
1d459bd
fix(agent): replace sib.messages on fork-merge rescue
Jun 5, 2026
8d6fdc9
refactor(agent): drop billing-header scrub now cc emits no header
Jun 5, 2026
2b4efc4
refactor(agent): migrate TrajectoryManager and adapters (v4)
Jun 8, 2026
6f95f18
refactor(agent): drop drift fork/merge params, strict exact-prefix li…
Jun 8, 2026
4fcbb24
feat(agent): assistant-rewrite merge to de-dilute reward
Jun 8, 2026
0f1c5aa
feat(agent): TrajectoryManager re-accepts fork_merge_max_response_tokens
Jun 8, 2026
4732702
docs(test): spec for TrajectoryManager e2e test script
Jun 8, 2026
4ea5d5a
test(agent): end-to-end TrajectoryManager test matrix (append_turn/ge…
Jun 8, 2026
cbee0de
test(agent): dump raw append_turn inputs in e2e readable output
Jun 8, 2026
a11c9b4
test(agent): 1.7 now shows token drift's effect on the linearized sample
Jun 8, 2026
f1b1792
test(agent): every case prints [samples] + mask info
Jun 8, 2026
0be5966
test(agent): make reward-split explicit in dump + conservation assert
Jun 8, 2026
fe7692a
test(agent): set every case's input reward to 1.0
Jun 8, 2026
2ee5ee8
test(agent): render whitespace in token labels as visible ␣
Jun 8, 2026
8eed4dd
test(agent): add Group 4 (boundary/defensive/feature) -> 98% coverage
Jun 8, 2026
cfad29d
test(agent): assert full output via golden token+loss strings
Jun 8, 2026
624b927
refactor(agent): drift-tolerant trajectory linearization
Jun 8, 2026
054f89a
chore(agent): untrack e2e test design doc and trajectory_manager tests
Jun 8, 2026
bc3d304
chore(agent): drop comments/docstrings in generate.py and TurnRecord
Jun 8, 2026
ece9007
docs(agent): tighten trajectory_manager comments to why-not-what
Jun 8, 2026
6565fe1
refactor(agent): slim adapters and trajectory_manager, add e2e test
Jun 9, 2026
aceb162
refactor(agent): replace segment-based trajectory with turn-node Traj…
Jun 1, 2026
0193103
refactor(agent): drop dead code left by trajectory-manager refactor
Jun 2, 2026
d84fe47
refactor(agent): port anthropic + trajectory_manager from trajectory-…
Jun 4, 2026
b7550cc
refactor(agent): rewrite openai.py for Codex CLI + TrajectoryManager
Jun 4, 2026
0b2576d
refactor(agent): centralize snapshot-threshold default + filter acces…
Jun 5, 2026
4ea8bb9
feat(agent): add fork-merge rescue for short assistant rewrites
Jun 5, 2026
8c5fe1a
fix(agent): replace sib.messages on fork-merge rescue
Jun 5, 2026
18ea895
refactor(agent): drop billing-header scrub now cc emits no header
Jun 5, 2026
f733bef
fix(agent): use rollout_id after upstream group_id revert
Jun 9, 2026
ad122f4
refactor(agent): assert base_sample in get_trajectory instead of defa…
Jun 9, 2026
b05dfdb
Merge branch 'trajectory-manager-migration-v4' into refactor_trajecto…
Jun 9, 2026
18772b0
fix(agent): mask entire drifted response span in B1 replace
Jun 9, 2026
d61b419
refactor(agent): extract shared adapter pipeline; move TurnRecord + a…
Jun 9, 2026
71f5511
refactor(agent): symmetric adapters + de-scaffold common
Jun 9, 2026
d8a6b78
refactor(agent): drop tools param; dict== mount-point matching
Jun 10, 2026
d73bb4a
refactor(agent): collapse Node turn_* into turn; rename Node->Message…
Jun 10, 2026
fc72355
perf(agent): chunked common-prefix; rename _lcp_len -> _common_prefix…
Jun 10, 2026
54b0846
refactor(agent): trajectory manager cleanup + adapter tweaks
Jun 11, 2026
804bcd1
refactor(agent): fold classify_drift into _SampleBuilder as classify_…
Jun 11, 2026
46d09de
chore(agent): untrack test_trajectory_manager.py
Jun 11, 2026
36fa60e
refactor(agent): gate fork on full output_ids length
Jun 12, 2026
76cf2cb
Refactor coding-agent RL harness: pluggable agent harness layer + SWE…
Jun 12, 2026
0da1b3c
Merge pull request #4 from jingshenghang/refactor_harness
jingshenghang Jun 12, 2026
7e1af92
refactor(agent): unify env vars under SLIME_AGENT_*/ADAPTER_*, tidy h…
Jun 15, 2026
76c9b0a
refactor(agent): make harnesses SingletonMeta-backed, drop module-lev…
Jun 15, 2026
d2897af
refactor(agent): rename debug hook to debug_callback, pass TurnRecord…
Jun 15, 2026
336985c
refactor(agent): tidy adapters, trajectory_manager, harness and SWE e…
Jun 16, 2026
36608fb
refactor(agent): drop optional sandbox metadata file/json, keep image…
Jun 16, 2026
da0a4fd
refactor(agent): reorganize agent tests under tests/test_agent/, tidy…
Jun 16, 2026
414a2ae
refactor(agent): simplify adapters, harness and sandbox
Jun 16, 2026
3cd0e30
docs(swe): move fan-out/sandbox notes from launcher into README
Jun 16, 2026
b40db1d
Merge remote-tracking branch 'origin/main' into refactor_trajectory_m…
Jun 16, 2026
94beeda
fix(agent): lazy-import load_tokenizer so CPU agent test needs no tra…
Jun 16, 2026
4405c1c
Revert "fix(agent): lazy-import load_tokenizer so CPU agent test need…
Jun 16, 2026
942358d
test(agent): stub transformers before importing generate in CPU rollo…
Jun 16, 2026
3772db3
test(agent): shim asyncio.timeout for py3.10 CI in CPU rollout test
Jun 16, 2026
f0f40b4
ci: rename agent-adapter-test job to agent-test (covers more than ada…
Jun 16, 2026
b2d3b38
refactor(agent): rename trajectory_manager.py back to trajectory.py
Jun 16, 2026
35c9222
refactor(agent): drop duplicate trajectory_manager.py left by rename
Jun 16, 2026
18 changes: 9 additions & 9 deletions examples/coding_agent_rl/README.md
@@ -5,7 +5,7 @@ This directory provides an example of running end-to-end **SWE (Software-Enginee
Two example files and one shared adapter implement the loop:

- `generate.py` — per-sample `generate()` registered via `--custom-generate-function-path`. Boots the sandbox, runs claude-code, captures the diff, scores it, and emits one or more `Sample`s back to slime.
- `slime.agent.adapters.AnthropicAdapter` — the shared Anthropic Messages adapter. claude-code talks to it as if it were Anthropic; the adapter tokenizes the current message history each turn, records prompt/output token snapshots, preserves model-generated tokens (`loss_mask=1`) only while later prompts stitch onto them, masks template/observation tokens (`0`), and emits **three kinds of segments** per trajectory: `subagent` (completed `Task/Agent` dispatch), `wipe` (chain frozen by auto-compact), `final` (tail of the main chain).
- `slime.agent.adapters.AnthropicAdapter` — the shared Anthropic Messages adapter. claude-code talks to it as if it were Anthropic; the adapter tokenizes the current message history each turn, records prompt/output token snapshots, preserves model-generated tokens (`loss_mask=1`) only while later prompts stitch onto them, and masks template/observation tokens (`0`). Each turn is routed into a per-session message tree inside `slime.agent.trajectory_manager.TrajectoryManager`; any divergence in the prompt prefix forks a new branch, so sub-agent dispatches and auto-compaction are handled as separate root-to-leaf chains. `get_trajectory` linearizes each leaf chain into one `Sample`.
- `sandbox.py` — coding-agent/SWE helpers built on `slime.agent.sandbox`: install bootstraps, spawn claude-code, capture patches, and run the fresh-sandbox evaluator. The shared sandbox contract lives in `slime.agent.sandbox.Sandbox`.
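
The per-session message tree described above can be pictured with a minimal sketch. Everything here is hypothetical and for illustration only: `Node`, `append_turn`, and `leaf_chains` are not the real `TrajectoryManager` API, and messages are plain strings rather than token snapshots. The sketch only shows the idea that a prompt whose prefix diverges from the stored chain forks a new branch, and that each root-to-leaf chain becomes one unit of output.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One message in a per-session tree (hypothetical structure)."""

    message: str
    children: list["Node"] = field(default_factory=list)


def append_turn(root: Node, messages: list[str]) -> None:
    """Walk the tree along `messages`, forking where the prefix diverges."""
    node = root
    for msg in messages:
        match = next((c for c in node.children if c.message == msg), None)
        if match is None:
            # No child continues this prefix: start a new branch here.
            match = Node(msg)
            node.children.append(match)
        node = match


def leaf_chains(root: Node) -> list[list[str]]:
    """Collect every root-to-leaf chain; each one would map to one Sample."""
    chains: list[list[str]] = []

    def walk(node: Node, path: list[str]) -> None:
        if not node.children:
            if path:
                chains.append(path)
            return
        for child in node.children:
            walk(child, path + [child.message])

    walk(root, [])
    return chains


root = Node("<root>")
# Main chain, turn 1 and turn 2 (turn 2 extends turn 1's prefix exactly).
append_turn(root, ["system", "user: fix bug", "assistant: read files"])
append_turn(
    root,
    ["system", "user: fix bug", "assistant: read files", "tool: output", "assistant: patch"],
)
# A sub-agent dispatch starts from a different prompt, so it forks early.
append_turn(root, ["system", "subagent: investigate", "assistant: findings"])

chains = leaf_chains(root)
```

In this toy run the two main-chain turns collapse into one chain of five messages, while the sub-agent prompt produces a second chain of three.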

`generate.py` owns one `AnthropicAdapter` instance. For each sample it calls
@@ -145,8 +145,8 @@ The Anthropic adapter therefore follows a **string in, token out** contract:

Multi-turn agents still force the adapter to tokenize later message
histories, because tool observations and claude-code's own compacted messages
arrive as strings. `slime.agent.trajectory.merge_turns` stitches those later
prompts against the saved token stream:
arrive as strings. `slime.agent.trajectory_manager.TrajectoryManager` routes
those later prompts against the saved token stream:

- New prompt suffixes that are tool/user/environment context are appended with
`loss_mask=0`.
@@ -160,15 +160,15 @@ That last case is the important correctness guard. A re-tokenization mismatch
can make a string-level conversation look continuous while token-level
provenance is broken. slime keeps the context needed to continue the agent, but
does not backprop through tokens whose sampled origin can no longer be proven.
The unit tests in `tests/test_agent_trajectory.py` cover matched prefixes,
skipped turns, split-output drift, changed token counts, and prompt-base
restarts.
The unit tests in `tests/test_agent/test_trajectory_manager.py` cover matched
prefixes, skipped turns, split-output drift, changed token counts, and
prompt-base restarts.
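
As a rough illustration of the guard described above, the sketch below stitches a re-tokenized prompt onto a saved token stream. It is a simplification under stated assumptions, not the `TrajectoryManager` logic: `common_prefix_len` and `stitch` are hypothetical helpers, `spans` is an assumed list of `(start, end)` index pairs marking sampled response tokens, and the real code may fork a branch where this sketch only rewrites the mask.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stitch(
    tokens: list[int],
    mask: list[int],
    spans: list[tuple[int, int]],
    new_prompt: list[int],
) -> tuple[list[int], list[int]]:
    """Extend the saved stream with `new_prompt`, masking drifted spans."""
    keep = common_prefix_len(tokens, new_prompt)
    # Everything past the matched prefix is context, so it gets mask 0.
    new_mask = mask[:keep] + [0] * (len(new_prompt) - keep)
    if keep < len(tokens):
        # Drift: any sampled span not fully inside the matched prefix can no
        # longer be proven to be what the model emitted, so mask all of it.
        for start, end in spans:
            if end > keep:
                for i in range(start, min(end, keep)):
                    new_mask[i] = 0
    return list(new_prompt), new_mask


saved_tokens = [1, 2, 3, 10, 11, 12]
saved_mask = [0, 0, 0, 1, 1, 1]
response_spans = [(3, 6)]

# Next prompt stitches exactly onto the saved stream: sampled tokens survive.
_, matched_mask = stitch(saved_tokens, saved_mask, response_spans, [1, 2, 3, 10, 11, 12, 20, 21])

# Re-tokenization changed the last sampled token: the whole span is masked.
_, drifted_mask = stitch(saved_tokens, saved_mask, response_spans, [1, 2, 3, 10, 11, 99, 20])
```

The design choice shown is conservative: a partial match inside a response span masks the entire span rather than only the mismatched tail.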

## Fan-out Semantics

- `generate()` returns `list[Sample]` — one Sample per trajectory **segment** (`subagent` / `wipe` / `final`).
- Per-trajectory reward is split as `reward / K` across segments; `rollout_id` is shared so the per-rollout-mean loss reducer still counts the trajectory once.
- Sub-agent dispatch increases `K` (each completed `Agent` turn block becomes its own segment), so the effective batch after flatten can be much larger than `rollout_batch_size * n_samples_per_prompt`.
- `generate()` returns `list[Sample]` — one Sample per root-to-leaf chain in the per-session message tree.
- Per-trajectory reward is split as `reward / K` across chains; `rollout_id` is shared so the per-rollout-mean loss reducer still counts the trajectory once.
- Sub-agent dispatch and auto-compaction increase `K` (each prompt-prefix divergence forks a new branch), so the effective batch after flatten can be much larger than `rollout_batch_size * n_samples_per_prompt`.
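
The `reward / K` split can be checked with a small sketch. The dict-based samples and the `fan_out` and `reward_per_rollout` helpers are assumptions made for illustration; slime's actual `Sample` type and loss reducer are not reproduced here. The point is only that summing the per-chain shares within one `rollout_id` recovers the original trajectory reward, while the flattened batch grows with `K`.

```python
import math
from collections import defaultdict


def fan_out(rollout_id: int, reward: float, num_chains: int) -> list[dict]:
    """Split one trajectory reward evenly across its K chains."""
    share = reward / num_chains
    return [
        {"rollout_id": rollout_id, "chain": i, "reward": share}
        for i in range(num_chains)
    ]


def reward_per_rollout(samples: list[dict]) -> dict[int, float]:
    """Sum chain rewards by rollout_id, i.e. one total per trajectory."""
    totals: dict[int, float] = defaultdict(float)
    for s in samples:
        totals[s["rollout_id"]] += s["reward"]
    return dict(totals)


# Rollout 0 forked into 3 chains (for example, two sub-agent dispatches);
# rollout 1 stayed on a single chain.
batch = fan_out(rollout_id=0, reward=1.0, num_chains=3) + fan_out(
    rollout_id=1, reward=0.0, num_chains=1
)
totals = reward_per_rollout(batch)
```

Two trajectories produce four samples here, which is why the effective batch after flattening can exceed `rollout_batch_size * n_samples_per_prompt`.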

## Porting to a New Sandbox Backend

112 changes: 32 additions & 80 deletions examples/coding_agent_rl/generate.py
@@ -9,8 +9,8 @@
1. ``sandbox.run_claude_code`` prepares the agent sandbox and runs claude-code.
2. ``sandbox.git_diff`` captures the model-produced patch.
3. ``sandbox.evaluate`` scores that patch in a second clean sandbox.
4. ``_merge_samples`` combines reward + adapter ``TokenSegment``s,
delegating segment-to-``Sample`` fan-out to ``slime.agent.trajectory``.
4. ``adapter.finish_session`` drains the session tree into reward-weighted
``Sample`` objects with ``.response`` already decoded; ``generate`` logs.

All sandbox-side details live in ``sandbox.py``; the LLM plumbing
(Anthropic <-> SGLang /generate, token capture, 3-kind segment split) uses
@@ -49,17 +49,15 @@
import secrets
import time
import traceback
from dataclasses import dataclass
from typing import Any

from slime.agent.adapters import AnthropicAdapter
from slime.agent.trajectory import TokenSegment, fan_out_sample_segments
from slime.agent.aiohttp_threaded import FilteredAccessLogger, run_app_in_thread
from slime.utils.misc import SingletonMeta
from slime.utils.processing_utils import load_tokenizer
from slime.utils.types import Sample

from . import sandbox
from .aiohttp_threaded import run_app_in_thread

logger = logging.getLogger(__name__)

@@ -97,11 +95,13 @@ def __init__(self, args) -> None:
"Without it the sandbox cannot dial back and the rollout will "
"silently abort."
)
fork_merge_threshold = int(v) if (v := os.environ.get("SLIME_FORK_MERGE_MAX_RESPONSE_TOKENS")) else None
self.adapter = AnthropicAdapter(
tokenizer=self.tokenizer,
sglang_url=sglang_url,
tool_parser=self.tool_parser,
reasoning_parser=self.reasoning_parser,
fork_threshold_tokens=fork_merge_threshold,
)
# handler_cancellation=True so a client disconnect cancels the handler
# coroutine, arming the fire-and-forget /abort_request inside the
@@ -113,7 +113,10 @@ def __init__(self, args) -> None:
host=SHIM_BIND_HOST,
port=SHIM_PORT,
thread_name="anthropic-adapter",
runner_kwargs={"handler_cancellation": True},
runner_kwargs={
"handler_cancellation": True,
"access_log_class": FilteredAccessLogger,

Review comment (Contributor):

It looks like nothing else uses `access_log_class` anymore?

Reply (Collaborator Author):

The `FilteredAccessLogger` referenced by `"access_log_class": FilteredAccessLogger` is defined in `aiohttp_threaded.py`. It makes the adapter log only abnormal requests (a response other than 200, or a request that takes longer than 120s), so that logs for normal requests do not flood the output.

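The filtering rule from the reply above can be sketched as follows. This is not the code in `aiohttp_threaded.py`: `should_log` and `FilteredLoggerSketch` are hypothetical names, and the sketch takes plain values instead of aiohttp request and response objects so it runs without aiohttp installed. A real aiohttp access logger would subclass `aiohttp.abc.AbstractAccessLogger` and implement `log(request, response, time)`.

```python
import logging

SLOW_REQUEST_SEC = 120.0  # threshold taken from the reply above


def should_log(status: int, elapsed_sec: float) -> bool:
    """Log only abnormal requests: a non-200 status or a slow response."""
    return status != 200 or elapsed_sec > SLOW_REQUEST_SEC


class FilteredLoggerSketch:
    """Duck-typed stand-in for an access logger (hypothetical)."""

    def __init__(self, logger: logging.Logger) -> None:
        self.logger = logger

    def log(self, method: str, path: str, status: int, elapsed_sec: float) -> bool:
        """Emit a warning for abnormal requests; return whether it was logged."""
        if not should_log(status, elapsed_sec):
            return False
        self.logger.warning("%s %s -> %d in %.1fs", method, path, status, elapsed_sec)
        return True


access_log = FilteredLoggerSketch(logging.getLogger("adapter.access"))
fast_ok = access_log.log("POST", "/v1/messages", 200, 3.2)
slow_ok = access_log.log("POST", "/v1/messages", 200, 130.0)
failed = access_log.log("POST", "/v1/messages", 500, 0.4)
```

Only the slow and the failed request produce a log line in this run.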
},
)
self.adapter_url = f"http://{public_host}:{self.app_handle.port}"
logger.info(
@@ -127,18 +130,8 @@ def __init__(self, args) -> None:


# ---------------------------------------------------------------------------
# Trajectory -> Sample conversion
# adapter.finish_session() returns TokenSegments. One trajectory yields >=1
# segments because the agent may compact + reset mid-run; trajectory.py handles
# the mechanical segment -> Sample fan-out.
# Session setup
# ---------------------------------------------------------------------------
@dataclass(frozen=True)
class RewardResult:
reward: float
is_solved: bool
applied_cleanly: bool


def _start_session(
state: _State,
sample: Sample,
@@ -164,55 +157,11 @@ def _start_session(
return session_id


def _merge_samples(
*,
sample: Sample,
state: _State,
segments: list[TokenSegment],
reward_result: RewardResult,
elapsed_sec: float,
instance_id: str,
):
if not segments:
return _abort_result(sample, "adapter_session_empty")

trajectory_metadata = {
**(sample.metadata or {}),
"instance_id": instance_id,
"is_solved": reward_result.is_solved,
"applied_cleanly": reward_result.applied_cleanly,
"elapsed_sec": elapsed_sec,
}

# All K samples share rollout_id so the loss reducer counts this
# trajectory once.
fanned = fan_out_sample_segments(
sample,
segments,
reward_result.reward,
state.tokenizer,
metadata=trajectory_metadata,
)
if not fanned:
raise ValueError("fan-out produced no samples")

logger.info(
"[coding_agent_rl] %s: reward=%.2f solved=%s applied=%s elapsed=%.1fs segments=%d",
instance_id,
reward_result.reward,
reward_result.is_solved,
reward_result.applied_cleanly,
elapsed_sec,
len(fanned),
)
return fanned


# ---------------------------------------------------------------------------
# Main per-sample agent function
#
# The four calls inside the timeout are the high-level rollout recipe:
# run_claude_code -> git_diff -> sandbox.evaluate -> merge_samples.
# run_claude_code -> git_diff -> sandbox.evaluate -> finish_session.
# ---------------------------------------------------------------------------
async def generate(args, sample: Sample, sampling_params: dict[str, Any]):
"""Per-sample agent function with wall-clock guard. See
@@ -249,20 +198,26 @@ async def generate(args, sample: Sample, sampling_params: dict[str, Any]):
pre_commands=md["pre_commands"],
timeout_sec=SWE_EVAL_TIMEOUT_SEC,
)
reward_result = RewardResult(
samples = await state.adapter.finish_session(
session_id,
base_sample=sample,

Review comment (Contributor):

Alternatively, we could store it as `base_sample` everywhere.

Reply (Collaborator Author):

Done, it is now `base_sample` everywhere.

reward=float(reward),
is_solved=bool(is_solved),
applied_cleanly=bool(applied_cleanly),
)
segments = await state.adapter.finish_session(session_id)
return _merge_samples(
sample=sample,
state=state,
segments=segments,
reward_result=reward_result,
elapsed_sec=time.time() - t0,
instance_id=instance_id,
if not samples:
return _abort_result(sample, "adapter_session_empty")

# finish_session already linearized, reward-weighted and decoded
# each segment's .response; here we only log a summary.
logger.info(
"[coding_agent_rl] %s: reward=%.2f solved=%s applied=%s elapsed=%.1fs segments=%d",
instance_id,
float(reward),
bool(is_solved),
bool(applied_cleanly),
time.time() - t0,
len(samples),
)
return samples

except asyncio.TimeoutError:
_log_timeout_diagnostic(t0)
@@ -347,7 +302,9 @@ def _coerce_prompt(prompt) -> str:
return ""


def _abort(sample: Sample, reason: str) -> Sample:
def _abort_result(sample: Sample, reason: str) -> list[Sample]:
"""Mark ``sample`` aborted in place and return it in the list shape this
fan-out generate function always yields."""
sample.tokens = [0, 0]
sample.response = ""
sample.response_length = 1
@@ -356,9 +313,4 @@ def _abort(sample: Sample, reason: str) -> Sample:
sample.status = Sample.Status.ABORTED
sample.metadata = {**(sample.metadata or {}), "abort_reason": reason}
logger.warning("[coding_agent_rl] aborted: %s", reason)
return sample


def _abort_result(sample: Sample, reason: str):
"""Return a uniform list shape for this fan-out generate function."""
return [_abort(sample, reason)]
return [sample]
4 changes: 2 additions & 2 deletions examples/coding_agent_rl/run_qwen36_35b_a3b_swe_8nodes.sh
@@ -1,8 +1,8 @@
#!/usr/bin/env bash
# End-to-end SWE coding-agent RL on 8 nodes.
#
# Same model and training loop as run_qwen36_35b_a3b_swe_8node.sh, with three
# extra layers that actively encourage the rollout to dispatch sub-agents.
# Standard model and training loop, with three extra layers that actively
# encourage the rollout to dispatch sub-agents.
# Trajectory trees produced by this script show real `sibling` branches:
#
# (1) An `investigator` sub-agent is registered via claude-code's --agents
7 changes: 3 additions & 4 deletions examples/coding_agent_rl/sandbox.py
@@ -326,8 +326,8 @@ async def evaluate(

if swepro:
r, s = await _run_swepro(ev, workdir, swepro, timeout_sec)
return r, s, True
r, s = await _run_eval_cmd(ev, workdir, eval_cmd, timeout_sec)
else:
r, s = await _run_eval_cmd(ev, workdir, eval_cmd, timeout_sec)
return r, s, True


@@ -336,8 +336,7 @@ async def _setup_swepro_assets(ev: Sandbox, swepro: dict) -> None:
for k, dst in [("run_script_path", "run_script.sh"), ("parser_script_path", "parser.py")]:
host_p = swepro.get(k)
if host_p:
text = Path(host_p).read_text()
await ev.write_file(f"{_SWEPRO_DIR}/{dst}", text, user="root")
await ev.write_file(f"{_SWEPRO_DIR}/{dst}", Path(host_p), user="root")
await ev.exec(f"chmod 755 {_SWEPRO_DIR}/* && chown -R agent:agent {_SWEPRO_DIR}", user="root", check=True)

