Add agentic-RL research scaffold with Modal sandbox and env adapters by junlin-star · Pull Request #4 · modal-projects/slime

junlin-star · 2026-06-11T02:00:35Z

Summary

- Add async_rl_research agentic-RL scaffold: mini-swe-agent rollout updates and new agent base
- Add Modal-based sandbox (modal_sandbox.py), replacing the old sandbox.py
- Add environment adapters (harbor, swe_gym) plus convert2slime conversion helpers
- Add supporting docs (README.md, data/README.md) and architecture diagram

Note: the async_rl_research/dashboard directory is intentionally excluded from this PR.

Test plan

- Verify env adapters import and convert correctly
- Smoke-test Modal sandbox rollout

Made with Cursor

Introduce async_rl_research components: mini-swe-agent rollout updates, Modal-based sandbox, environment adapters (harbor, swe_gym) with slime conversion helpers, and supporting docs. Co-authored-by: Cursor <cursoragent@cursor.com>

async_rl_research.env shadowed the common .env/env-var association and collided with the repo .gitignore's env/ habits; rename the package to async_rl_research.environment and update every module path reference (ENVS registry strings, imports, oracle-check CLI invocations, docs). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mini-swe-agent prompts previously came from per-task-family builtin YAMLs (swe_gym -> benchmarks/swebench.yaml, harbor -> mini.yaml), so the prompt scaffold differed across task families and pinned task contracts (the swebench patch.txt ritual) that did not match what actually gets graded. Ship one repo-owned config (agent/config/universal.yaml) uploaded into the sandbox as the default for every family; the task-specific deliverable now lives in the instruction text the env writes (swe_gym appends an explicit git-diff deliverable contract; harbor instruction.md files already carry their own). Builtin configs remain reachable via the override ladder: MSWE_CONFIG env > metadata.agent_config > the uploaded universal config. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
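The override ladder described above can be sketched as a small resolver. This is an illustration under stated assumptions, not the code in this PR: the function name `resolve_agent_config` and the sandbox path in `UNIVERSAL_CONFIG` are hypothetical; only the `MSWE_CONFIG` variable, the `agent_config` metadata key, and the precedence order come from the commit message.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical location of the uploaded repo-owned config inside the sandbox.
UNIVERSAL_CONFIG = "/sandbox/agent/config/universal.yaml"


def resolve_agent_config(
    metadata: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
    default: str = UNIVERSAL_CONFIG,
) -> str:
    """Pick the agent config: MSWE_CONFIG env > metadata.agent_config > default."""
    env = os.environ if environ is None else environ

    # 1. An explicit environment override wins over everything else.
    from_env = env.get("MSWE_CONFIG")
    if from_env:
        return from_env

    # 2. A per-task override carried in the sample metadata comes next.
    from_metadata = metadata.get("agent_config")
    if from_metadata:
        return from_metadata

    # 3. Otherwise fall back to the uploaded universal config.
    return default
```

Passing `environ` explicitly keeps the resolver easy to exercise without mutating the process environment.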

Fixes found while onboarding openthoughts-tblite:

- Grade reward-file-less tasks from test.sh's exit code (0/1 only; other exit codes stay "no verdict" so infra breakage is not scored as a model failure) -- terminal-bench-style tasks end test.sh with a bare pytest run.
- Tolerate float error in is_solved: weighted pytest fractions can sum to 0.999... for a fully-passing task.
- Prefer task subdirectories over a stray template task.toml at the dataset root in the converter's discovery.
- Provision uv via pip when the image ships no curl (python:*-slim).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
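The first two fixes can be illustrated with a minimal sketch. The function `reward_from_exit_code`, the signature of `is_solved`, and the tolerance value are assumptions made for illustration; the commit message only states that exit codes 0 and 1 are graded, that other codes yield no verdict, and that `is_solved` tolerates float error.

```python
from typing import Optional


def reward_from_exit_code(exit_code: int) -> Optional[float]:
    """Map a test.sh exit code to a reward, or None for "no verdict"."""
    if exit_code == 0:
        return 1.0  # tests passed
    if exit_code == 1:
        return 0.0  # tests ran and failed
    # Any other code (e.g. 127, command not found) is treated as infra
    # breakage and is not scored as a model failure.
    return None


def is_solved(reward: Optional[float], tol: float = 1e-6) -> bool:
    """Treat a reward within `tol` of 1.0 as solved (tol is an assumed value)."""
    return reward is not None and reward >= 1.0 - tol


# Ten equally weighted passing tests do not sum to exactly 1.0 in floats.
weighted_total = sum([0.1] * 10)
```

Here `weighted_total` compares unequal to `1.0`, which is why a strict equality check would misreport a fully passing task as unsolved.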

Rollout wall-clock is the training bottleneck (~20 min/sample) and the dumps carried no attribution. Add profiles/profiling.py: a PhaseTimer for env-side phases (work_boot/prep/agent/diff/eval_boot/eval) plus aiohttp middleware on the adapter that counts turns and generation time per session (bearer token == session_id). Both land in one "timing" dict per sample (swe_gym also records diff size metrics) -> sample extra -> rollout dumps, where the offline analyzer can attribute time across phases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
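A phase timer of the kind described above might look like the following sketch. It is not the contents of profiles/profiling.py: the `phase` context-manager method, the `timing` attribute, and the injectable `clock` are assumptions, and the aiohttp middleware half is not shown.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase into one dict."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.timing: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raised, so a failed
            # or aborted sample still carries partial attribution.
            elapsed = self._clock() - start
            self.timing[name] = self.timing.get(name, 0.0) + elapsed


timer = PhaseTimer()
with timer.phase("work_boot"):
    pass  # boot the work sandbox here
with timer.phase("agent"):
    pass  # run the agent loop here
```

After the run, `timer.timing` holds one entry per phase and could be merged into the per-sample "timing" dict.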

A CancelledError raised inside a rollout (e.g. the Modal SDK's synchronicity bridge cancelling an in-flight .aio call on a client hiccup) is not caught by `except Exception`, so it propagated through generate_and_rm_group's gather, cancelling the whole generate_rollout_async task and crashing the training step. Catch it, abort only that sample, and re-raise only for a genuine external cancel (task left in a cancelling state). Aborted samples also keep whatever per-turn timing accrued, distinguishing 'agent alive but slow' from 'never dialed in'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

evalset.py turns a spec YAML into subsampled per-subset jsonls plus a manifest and a ready --eval-config block, so eval sets are versioned and pinned instead of hand-cut. Making the per-dataset overrides actually take effect required a generate.py fix: _sampling_params built the session defaults from the rollout_* args only, ignoring the sampling_params dict slime passes to generate() -- which on the eval path is what carries the eval_config temperature/top_p/top_k. Layer those overrides on top. Per-turn max_new_tokens deliberately stays adapter-governed. Document the eval flow (builder -> inline eval_config -> eval-only runs) in the README. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
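The layering fix in generate.py can be illustrated with a short sketch. The function name `build_sampling_params` and the exact attribute names on `args` are assumptions inferred from the "rollout_* args" wording; the real `_sampling_params` may differ.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Optional

# Keys the eval_config is allowed to override per dataset.
OVERRIDABLE = ("temperature", "top_p", "top_k")


def build_sampling_params(
    args: Any, sampling_params: Optional[Mapping[str, Any]] = None
) -> Dict[str, Any]:
    """Start from rollout_* defaults, then layer caller overrides on top."""
    params = {
        "temperature": args.rollout_temperature,
        "top_p": args.rollout_top_p,
        "top_k": args.rollout_top_k,
    }
    overrides = sampling_params or {}
    for key in OVERRIDABLE:
        value = overrides.get(key)
        if value is not None:  # 0.0 is a valid temperature, so test for None
            params[key] = value
    # max_new_tokens is deliberately not copied: per-turn length stays
    # governed by the adapter.
    return params


# Example: the eval path passes a greedy temperature for one dataset.
args = SimpleNamespace(rollout_temperature=1.0, rollout_top_p=1.0, rollout_top_k=-1)
eval_params = build_sampling_params(args, {"temperature": 0.0, "max_new_tokens": 4096})
```

In this example only `temperature` changes; `top_p` and `top_k` keep the rollout defaults and `max_new_tokens` is ignored.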

Add repo-owned Qwen OpenAI adapter and openthoughts_agent converter, rework agent/environment base classes and convert2slime adapters, and streamline generate.py/modal_sandbox sampling and rollout handling. Remove stale data/ and profiles/ scaffolding. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

junlin-star force-pushed the junlin/agentic_rl branch from 0db9d43 to 60ec521 on June 11, 2026 02:09

junlin-star and others added 10 commits June 12, 2026 11:25

Apply pre-commit formatting (ruff, isort, black)

7f92f58

Co-authored-by: Cursor <cursoragent@cursor.com>

Fix isort import grouping in cispo loss test

f8e5e9a

Co-authored-by: Cursor <cursoragent@cursor.com>

Improve agentic RL runtime diagnostics

9bd6661

Co-authored-by: Cursor <cursoragent@cursor.com>

junlin-star wants to merge 11 commits into async-research from junlin/agentic_rl
