Add agentic-RL research scaffold with Modal sandbox and env adapters #4

Open
junlin-star wants to merge 11 commits into async-research from junlin/agentic_rl
@junlin-star

Summary

  • Add async_rl_research agentic-RL scaffold: mini-swe-agent rollout updates and new agent base
  • Add Modal-based sandbox (modal_sandbox.py), replacing the old sandbox.py
  • Add environment adapters (harbor, swe_gym) plus convert2slime conversion helpers
  • Add supporting docs (README.md, data/README.md) and architecture diagram

Note: the async_rl_research/dashboard directory is intentionally excluded from this PR.

Test plan

  • Verify env adapters import and convert correctly
  • Smoke-test Modal sandbox rollout

Made with Cursor

Introduce async_rl_research components: mini-swe-agent rollout updates,
Modal-based sandbox, environment adapters (harbor, swe_gym) with
slime conversion helpers, and supporting docs.

Co-authored-by: Cursor <cursoragent@cursor.com>
junlin-star and others added 10 commits June 12, 2026 11:25
The package name async_rl_research.env was easy to confuse with .env files
and environment variables, and it collided with the env/ patterns in the
repo's .gitignore. Rename the package to async_rl_research.environment and
update every module path reference (ENVS registry strings, imports,
oracle-check CLI invocations, docs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
mini-swe-agent prompts previously came from per-task-family builtin YAMLs
(swe_gym -> benchmarks/swebench.yaml, harbor -> mini.yaml), so the prompt
scaffold differed across task families and pinned task contracts (the
swebench patch.txt ritual) that did not match what actually gets graded.

Ship one repo-owned config (agent/config/universal.yaml) uploaded into the
sandbox as the default for every family; the task-specific deliverable now
lives in the instruction text the env writes (swe_gym appends an explicit
git-diff deliverable contract; harbor instruction.md files already carry
their own). Builtin configs remain reachable via the override ladder:
MSWE_CONFIG env > metadata.agent_config > the uploaded universal config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
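
The override ladder described above can be read as a first-match lookup. The following is a minimal sketch, not the code in this PR: `resolve_agent_config` and the `UNIVERSAL_CONFIG` path are hypothetical names chosen for illustration, and only `MSWE_CONFIG` and `metadata.agent_config` come from the commit message.

```python
import os
from typing import Mapping, Optional

# Hypothetical sandbox location of the uploaded repo-owned config (assumption).
UNIVERSAL_CONFIG = "/sandbox/agent/config/universal.yaml"


def resolve_agent_config(
    metadata: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
    default: str = UNIVERSAL_CONFIG,
) -> str:
    """Pick the agent config: MSWE_CONFIG env > metadata.agent_config > default."""
    environ = os.environ if environ is None else environ

    # 1. An explicit environment override wins over everything else.
    env_value = environ.get("MSWE_CONFIG")
    if env_value:
        return env_value

    # 2. Otherwise a per-task override carried in the sample metadata.
    meta_value = metadata.get("agent_config")
    if meta_value:
        return str(meta_value)

    # 3. Fall back to the uploaded universal config.
    return default
```

Passing `environ` explicitly keeps the lookup testable without mutating the process environment.
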
Fixes found while onboarding openthoughts-tblite:

- Grade reward-file-less tasks from test.sh's exit code (0/1 only; other
  exit codes stay "no verdict" so infra breakage is not scored as a model
  failure) -- terminal-bench-style tasks end test.sh with a bare pytest run.
- Tolerate float error in is_solved: weighted pytest fractions can sum to
  0.999... for a fully-passing task.
- Prefer task subdirectories over a stray template task.toml at the dataset
  root in the converter's discovery.
- Provision uv via pip when the image ships no curl (python:*-slim).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
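
The first two fixes above can be illustrated with a small sketch. It is written under assumptions: `reward_from_exit_code`, the 1.0/0.0 reward values, and the `tol` default are illustrative and not taken from the PR; only the 0/1-only rule and the float tolerance in `is_solved` come from the commit message.

```python
from typing import Optional


def reward_from_exit_code(exit_code: int) -> Optional[float]:
    """Map test.sh's exit code to a reward, or None for 'no verdict'."""
    if exit_code == 0:
        return 1.0  # tests passed
    if exit_code == 1:
        return 0.0  # tests ran and failed: a genuine model failure
    # Any other code (interrupted run, missing binary, ...) is treated as
    # infra breakage and left ungraded rather than scored as a failure.
    return None


def is_solved(reward: Optional[float], tol: float = 1e-6) -> bool:
    """Treat a reward within `tol` of 1.0 as solved (tol is an assumed value)."""
    return reward is not None and reward >= 1.0 - tol


# Ten equally weighted passing tests do not sum to exactly 1.0 in floats.
weighted_total = sum([0.1] * 10)
solved = is_solved(weighted_total)
```

Here `weighted_total` falls just short of 1.0, which a strict `== 1.0` check would reject even though every test passed.
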
Rollout wall-clock time is the training bottleneck (~20 min/sample), but
the dumps did not attribute that time to any phase. Add
profiles/profiling.py: a PhaseTimer for env-side
phases (work_boot/prep/agent/diff/eval_boot/eval) plus aiohttp middleware on
the adapter that counts turns and generation time per session (bearer token
== session_id). Both land in one "timing" dict per sample (swe_gym also
records diff size metrics) -> sample extra -> rollout dumps, where the
offline analyzer can attribute time across phases.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
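
For the env-side half of the profiling described above, a phase timer can be as small as a context manager that accumulates elapsed seconds per phase name. This is a sketch under assumptions, not the contents of profiles/profiling.py: the `phase` method and the `timing` attribute are hypothetical, and the aiohttp middleware half is not shown.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class PhaseTimer:
    """Accumulate wall-clock seconds per named phase (illustrative sketch)."""

    def __init__(self) -> None:
        self.timing: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised, so failed or aborted
            # samples still carry whatever timing accrued.
            elapsed = time.perf_counter() - start
            self.timing[name] = self.timing.get(name, 0.0) + elapsed


timer = PhaseTimer()
with timer.phase("work_boot"):
    time.sleep(0.01)  # stand-in for booting the work sandbox
with timer.phase("agent"):
    time.sleep(0.01)  # stand-in for the agent run

# One "timing" dict per sample, ready to attach to the sample's extras.
sample_extra = {"timing": dict(timer.timing)}
```
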
A CancelledError raised inside a rollout (e.g. the Modal SDK's
synchronicity bridge cancelling an in-flight .aio call on a client hiccup)
is not caught by `except Exception`, because CancelledError has derived
from BaseException since Python 3.8. It therefore propagated through
generate_and_rm_group's gather, cancelling the whole
generate_rollout_async task and crashing the training step. Catch it,
abort only that sample, and re-raise only for a genuine external cancel
(task left in a cancelling state). Aborted samples also keep whatever
per-turn timing accrued, distinguishing 'agent alive but slow' from
'never dialed in'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
evalset.py turns a spec YAML into subsampled per-subset jsonls plus a
manifest and a ready --eval-config block, so eval sets are versioned and
pinned instead of hand-cut.

Making the per-dataset overrides actually take effect required a generate.py
fix: _sampling_params built the session defaults from the rollout_* args
only, ignoring the sampling_params dict slime passes to generate() -- which
on the eval path is what carries the eval_config temperature/top_p/top_k.
Layer those overrides on top. Per-turn max_new_tokens deliberately stays
adapter-governed.

Document the eval flow (builder -> inline eval_config -> eval-only runs) in
the README.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
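
The layering fix above amounts to a dict merge in which caller overrides win except for adapter-governed keys. A minimal sketch, assuming that reading: `layer_sampling_params` and `ADAPTER_GOVERNED` are hypothetical names, and dropping `max_new_tokens` from the overrides is one interpretation of "stays adapter-governed".

```python
from typing import Any, Dict, Mapping, Optional

# Keys the adapter controls per turn; overrides for these are ignored (assumption).
ADAPTER_GOVERNED = frozenset({"max_new_tokens"})


def layer_sampling_params(
    defaults: Mapping[str, Any],
    overrides: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Start from rollout_* defaults, then apply the per-call overrides."""
    merged = dict(defaults)
    for key, value in (overrides or {}).items():
        if key in ADAPTER_GOVERNED:
            continue  # leave per-turn token limits to the adapter
        merged[key] = value
    return merged


# Defaults as they might be built from rollout_* args (illustrative values).
rollout_defaults = {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 4096}
# Overrides as an eval config might supply them (illustrative values).
eval_overrides = {"temperature": 0.0, "top_k": 1, "max_new_tokens": 128}

session_params = layer_sampling_params(rollout_defaults, eval_overrides)
```
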
Add repo-owned Qwen OpenAI adapter and openthoughts_agent converter, rework
agent/environment base classes and convert2slime adapters, and streamline
generate.py/modal_sandbox sampling and rollout handling. Remove stale data/
and profiles/ scaffolding.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>