Add agentic-RL research scaffold with Modal sandbox and env adapters#4
Open
junlin-star wants to merge 11 commits into
Introduce async_rl_research components: mini-swe-agent rollout updates, Modal-based sandbox, environment adapters (harbor, swe_gym) with slime conversion helpers, and supporting docs. Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 0db9d43 to 60ec521.
async_rl_research.env shadowed the common .env/env-var association and collided with the repo .gitignore's env/ habits; rename the package to async_rl_research.environment and update every module path reference (ENVS registry strings, imports, oracle-check CLI invocations, docs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
mini-swe-agent prompts previously came from per-task-family builtin YAMLs (swe_gym -> benchmarks/swebench.yaml, harbor -> mini.yaml), so the prompt scaffold differed across task families and pinned task contracts (the swebench patch.txt ritual) that did not match what actually gets graded. Ship one repo-owned config (agent/config/universal.yaml) uploaded into the sandbox as the default for every family; the task-specific deliverable now lives in the instruction text the env writes (swe_gym appends an explicit git-diff deliverable contract; harbor instruction.md files already carry their own). Builtin configs remain reachable via the override ladder: MSWE_CONFIG env > metadata.agent_config > the uploaded universal config. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
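The override ladder above can be sketched as a small resolver. This is only an illustration under stated assumptions: `resolve_agent_config` is a hypothetical name, and `UNIVERSAL_CONFIG` stands in for wherever the uploaded config actually lands inside the sandbox, which this commit message does not specify.

```python
import os
from typing import Mapping, Optional

# Assumption: placeholder for the sandbox path of the uploaded universal config.
UNIVERSAL_CONFIG = "agent/config/universal.yaml"


def resolve_agent_config(
    metadata: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the agent config: MSWE_CONFIG env > metadata.agent_config > universal."""
    env = os.environ if environ is None else environ

    # 1. An explicit environment override wins over everything else.
    from_env = env.get("MSWE_CONFIG")
    if from_env:
        return from_env

    # 2. Otherwise a per-sample override carried in the task metadata.
    from_metadata = metadata.get("agent_config")
    if from_metadata:
        return str(from_metadata)

    # 3. Fall back to the repo-owned config shared by every task family.
    return UNIVERSAL_CONFIG
```

Passing `environ` explicitly keeps the resolver easy to exercise without mutating the real process environment.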
Fixes found while onboarding openthoughts-tblite:
- Grade reward-file-less tasks from test.sh's exit code (0/1 only; other exit codes stay "no verdict" so infra breakage is not scored as a model failure) -- terminal-bench-style tasks end test.sh with a bare pytest run.
- Tolerate float error in is_solved: weighted pytest fractions can sum to 0.999... for a fully-passing task.
- Prefer task subdirectories over a stray template task.toml at the dataset root in the converter's discovery.
- Provision uv via pip when the image ships no curl (python:*-slim).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
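The first two fixes can be illustrated roughly as follows. This is a sketch, not the repository's code: `reward_from_exit_code` is a hypothetical helper name, the `is_solved` signature is assumed, and the tolerance `1e-6` is an assumed value chosen only for the example.

```python
from typing import Optional


def reward_from_exit_code(exit_code: int) -> Optional[float]:
    """Map test.sh's exit code to a reward when no reward file exists.

    Only 0 (pass) and 1 (test failures) count as verdicts. Any other code
    returns None ("no verdict"), so infra breakage is not scored as a
    model failure.
    """
    if exit_code == 0:
        return 1.0
    if exit_code == 1:
        return 0.0
    return None


def is_solved(reward: Optional[float], tol: float = 1e-6) -> bool:
    """Treat a reward within `tol` of 1.0 as solved (tolerance is an assumption)."""
    return reward is not None and reward >= 1.0 - tol


# Ten equally weighted passing tests: the float sum falls just short of 1.0.
weighted_total = sum([0.1] * 10)
print(weighted_total == 1.0)      # False
print(is_solved(weighted_total))  # True
```

Returning `None` rather than `0.0` for unexpected exit codes keeps "the grader could not run" distinct from "the model failed the tests".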
Rollout wall-clock is the training bottleneck (~20 min/sample) and the dumps carried no attribution. Add profiles/profiling.py: a PhaseTimer for env-side phases (work_boot/prep/agent/diff/eval_boot/eval) plus aiohttp middleware on the adapter that counts turns and generation time per session (bearer token == session_id). Both land in one "timing" dict per sample (swe_gym also records diff size metrics) -> sample extra -> rollout dumps, where the offline analyzer can attribute time across phases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
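A minimal sketch of what an env-side phase timer along these lines might look like. Only the class name `PhaseTimer` and the phase names come from the commit message; the `phase` context manager, the `timing` attribute, and the injectable `clock` are assumptions made for illustration, and the aiohttp middleware half is not shown.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase into one dict."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.timing: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raised, so aborted
            # rollouts still carry whatever timing accrued.
            elapsed = self._clock() - start
            self.timing[name] = self.timing.get(name, 0.0) + elapsed


timer = PhaseTimer()
with timer.phase("work_boot"):
    pass  # e.g. boot the work sandbox
with timer.phase("agent"):
    pass  # e.g. run the agent loop
print(sorted(timer.timing))  # ['agent', 'work_boot']
```

Recording in a `finally` block is a deliberate choice here: a phase that fails partway through is still attributed its elapsed time.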
A CancelledError raised inside a rollout (e.g. the Modal SDK's synchronicity bridge cancelling an in-flight .aio call on a client hiccup) is not caught by `except Exception`, so it propagated through generate_and_rm_group's gather, cancelling the whole generate_rollout_async task and crashing the training step. Catch it, abort only that sample, and re-raise only for a genuine external cancel (task left in a cancelling state). Aborted samples also keep whatever per-turn timing accrued, distinguishing 'agent alive but slow' from 'never dialed in'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
evalset.py turns a spec YAML into subsampled per-subset jsonls plus a manifest and a ready --eval-config block, so eval sets are versioned and pinned instead of hand-cut. Making the per-dataset overrides actually take effect required a generate.py fix: _sampling_params built the session defaults from the rollout_* args only, ignoring the sampling_params dict slime passes to generate() -- which on the eval path is what carries the eval_config temperature/top_p/top_k. Layer those overrides on top. Per-turn max_new_tokens deliberately stays adapter-governed. Document the eval flow (builder -> inline eval_config -> eval-only runs) in the README. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
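The layering fix can be sketched as below. The function name `build_sampling_params` and the allow-list are assumptions for illustration; the commit message only states that temperature/top_p/top_k overrides from the `sampling_params` dict are layered on top of the rollout defaults while per-turn `max_new_tokens` stays adapter-governed.

```python
from typing import Any, Dict, Mapping, Optional

# Assumption: only these keys are taken from the caller's sampling_params.
OVERRIDABLE_KEYS = ("temperature", "top_p", "top_k")


def build_sampling_params(
    rollout_defaults: Mapping[str, Any],
    sampling_params: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Start from the rollout_* defaults, then layer per-call overrides on top."""
    params = dict(rollout_defaults)
    for key in OVERRIDABLE_KEYS:
        value = (sampling_params or {}).get(key)
        if value is not None:
            params[key] = value
    # max_new_tokens is deliberately not read from sampling_params here:
    # the per-turn budget is left to the adapter.
    return params


defaults = {"temperature": 1.0, "top_p": 1.0, "top_k": -1}
eval_overrides = {"temperature": 0.0, "top_p": 0.95, "max_new_tokens": 4096}
print(build_sampling_params(defaults, eval_overrides))
# {'temperature': 0.0, 'top_p': 0.95, 'top_k': -1}
```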
Add repo-owned Qwen OpenAI adapter and openthoughts_agent converter, rework agent/environment base classes and convert2slime adapters, and streamline generate.py/modal_sandbox sampling and rollout handling. Remove stale data/ and profiles/ scaffolding. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
- `async_rl_research` agentic-RL scaffold: mini-swe-agent rollout updates and new agent base
- Modal-based sandbox (`modal_sandbox.py`), replacing the old `sandbox.py`
- Environment adapters (`harbor`, `swe_gym`) plus `convert2slime` conversion helpers
- Docs (`README.md`, `data/README.md`) and architecture diagram

Note: the `async_rl_research/dashboard` directory is intentionally excluded from this PR.

Test plan
Made with Cursor