feat: vf v1 <> nano bridge#2742
Draft
mikasenghaas wants to merge 90 commits into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Points the submodule at the vf-nano EnvServer branch so the orchestrator can build on the env-server abstraction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano EnvServer per env (it never loads an environment), dispatches rollouts by task index, and trains on the returned Trace dicts (branches + renderer tokens).
- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; only the vf-nano reverse-text example; override out the transitive v1 verifiers (pulled by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson/pandas/msgspec (were transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring, dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag monitor vendored; env-server entrypoint repointed; ~14 files retyped off vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er config)
- trace_to_samples stitches each Trace branch's tokens into one TrainingSample
(prompt = branch start, then each turn's new context [masked] + generated
tokens [trained]); drop the RolloutOutput adapter — read the Trace's native
fields directly (reward, error{type,message}, timing generation/scoring,
num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics /
orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's
renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools),
max_steps=20.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
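The stitching described above (branch-start prompt, then each turn's new context masked and its generated tokens trained) can be sketched as follows. This is a minimal illustration, not the actual `trace_to_samples`: the `Turn` class, its field names, and `stitch_branch` are hypothetical stand-ins, and real traces also carry logprobs.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical stand-in for one turn of a trace branch."""
    context_ids: list[int]    # new context tokens added this turn (not sampled)
    generated_ids: list[int]  # tokens sampled by the model this turn


def stitch_branch(prompt_ids: list[int], turns: list[Turn]) -> tuple[list[int], list[bool]]:
    """Concatenate a branch into flat token ids plus a per-token loss mask."""
    token_ids = list(prompt_ids)
    mask = [False] * len(prompt_ids)  # the branch-start prompt is never trained on
    for turn in turns:
        token_ids += turn.context_ids
        mask += [False] * len(turn.context_ids)   # masked: new context
        token_ids += turn.generated_ids
        mask += [True] * len(turn.generated_ids)  # trained: model output
    return token_ids, mask


ids, mask = stitch_branch([1, 2], [Turn([], [3, 4]), Turn([5], [6])])
print(ids)   # [1, 2, 3, 4, 5, 6]
print(mask)  # [False, False, True, True, False, True]
```

The mask stays aligned with the token ids by construction, which is what lets a single sample cover a multi-turn branch.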
…o timeout)
- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a SamplingConfig (built from the env's sampling args) directly: no model_dump, no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).
The env server returns a Trace minus its derived fields; the orchestrator resolves
the env's Task subclass (from config.id) and validates the wire dict into a strict
Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace —
typed task fields included (e.g. task.answer), nothing subscriptable.
- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils
/orchestrator: attribute access on the typed Trace (reward, error{type,message},
branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the
consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator spawns the env server, so request the serve extra (zmq/msgpack) explicitly now that vf-nano keeps them out of core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`from __future__ import annotations` already defers all annotations to strings, so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout` annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which suggested an unparsed dict). Renames the field + every `.raw` access, the `emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the dispatcher cancel-path locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the
example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...}
directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,
total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx),
source dict key) instead of the example/example_id dict carrier; identity comes
off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The env server binds tcp://127.0.0.1:0 and reports its concrete address back over a queue; the orchestrator connects to that. Removes _get_free_port and its TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address, so the untimed info() is enough; only poll wait_for_server_startup for an external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
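The port-0 idea can be illustrated with plain stdlib sockets. This is only a sketch of the principle: the env server itself binds a `tcp://127.0.0.1:0` endpoint and reports the address over a queue, whereas the `bind_ephemeral` helper below is hypothetical and uses a raw TCP socket as a stand-in.

```python
import socket


def bind_ephemeral(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind to port 0 and return the socket plus the port the OS picked."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))          # port 0: the kernel assigns a free port atomically
    sock.listen()
    port = sock.getsockname()[1]  # the concrete port to report back to the parent
    return sock, port


server, port = bind_ephemeral()
# The parent connects to the reported port; there is no window in which another
# process could claim it, unlike "find a free port, close it, then rebind".
client = socket.create_connection(("127.0.0.1", port), timeout=5)
client.close()
server.close()
```

Because the socket is already bound when the port is reported, the check-then-use gap of a free-port probe does not exist.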
The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SFT trains on a teacher served over the chat client, which returns no token ids, so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore backfill: for each tokenless turn, render its prompt + assistant response with the student chat template and split on the longest common prefix to fill TurnTokens (masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills when any turn lacks tokens, before building samples. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
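The "split on the longest common prefix" step can be sketched like this. It is a hedged illustration only: `split_on_common_prefix` is a hypothetical helper, and the integer lists stand in for the token ids produced by rendering the prompt alone and the prompt plus assistant response with the student chat template.

```python
def split_on_common_prefix(
    prompt_ids: list[int], full_ids: list[int]
) -> tuple[list[int], list[int]]:
    """Split full_ids into (prefix shared with prompt_ids, remaining tokens)."""
    n = 0
    limit = min(len(prompt_ids), len(full_ids))
    while n < limit and prompt_ids[n] == full_ids[n]:
        n += 1
    return full_ids[:n], full_ids[n:]


# Prompt-only rendering vs. prompt + response rendering (made-up ids).
prompt_only = [10, 11, 12, 99]
prompt_and_response = [10, 11, 12, 20, 21]
prefix, completion = split_on_common_prefix(prompt_only, prompt_and_response)
print(prefix)      # [10, 11, 12]
print(completion)  # [20, 21]
```

Splitting on the shared prefix rather than on `len(prompt_only)` tolerates templates whose prompt-only rendering ends with tokens that do not appear in the full rendering.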
drop_group's error_rollout_output calls omitted the required task_idx, so an off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx (or -1 when the group is already gone), mirroring handle_completed_rollout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper and inline the synthetic error Trace at each call site using vf.Error's field names (type/message/traceback); the task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nical
- Spawned env servers now route their output (logging + subprocess-runtime output) to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128 (interval=1), matching configs/debug/training_modes/rl.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train, .../logs/envs/eval), so _spawn must drop the file directly under it (<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the train/eval split under logs/envs/<kind>/envs/<name>.log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instead of the orchestrator sidecar-spawning each env server as an mp child, the
rl launcher now spawns one `env-server` process per env (train + eval), each on a
free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same
model as inference/trainer. It sets env.address in the orchestrator config so the
orchestrator attaches (its existing external path) instead of spawning. Envs that
already set address (user-managed external server) are left alone; the orchestrator's
mp sidecar stays as the fallback for running `orchestrator` directly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds base_port + i. Drops the get_free_port dependency in the launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i (stride 1000), so each kind has headroom for many envs without the blocks colliding (was a single running index — train and eval sat adjacent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
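The resulting port layout can be written down as a small function. This is a sketch under stated assumptions: `env_server_port` is a hypothetical helper name, while the default base port (5000) and the stride (1000) are the values given in the commit messages.

```python
ENV_SERVER_KIND_STRIDE = 1000  # gap between the train block and the eval block


def env_server_port(base_port: int, kind: str, index: int) -> int:
    """Port for the index-th launcher-managed env of the given kind."""
    if kind == "train":
        return base_port + index
    if kind == "eval":
        return base_port + ENV_SERVER_KIND_STRIDE + index
    raise ValueError(f"unknown env kind: {kind!r}")


print(env_server_port(5000, "train", 0))  # 5000
print(env_server_port(5000, "train", 3))  # 5003
print(env_server_port(5000, "eval", 0))   # 6000
print(env_server_port(5000, "eval", 2))   # 6002
```

With this layout the two blocks only collide once a run has 1000 or more train envs.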
- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log; previously only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…irectly The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata merge — the on-disk rollout is just the trace. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the
integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig);
env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint
passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump
deps/vf-nano to the renamed branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `envs` extra wired `harnesses` and the individual `*-v1` example tasksets but
never the bundled `tasksets` package, so the integration tasksets it ships
(`harbor-v1`, `textarena-v1`) couldn't be resolved — `import_taskset("harbor-v1")`
raised ModuleNotFoundError ("tried to import 'harbor_v1'").
Add `tasksets` to the `envs` extra + a path source, and bump the verifiers
submodule to the feat/nano-as-v1 tip (#1600), where the bundled tasksets live
under the `tasksets` namespace package (`tasksets.harbor_v1`) and the loader
resolves the namespaced module.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: consume the v1 message-graph trace (graph-walk trace_to_samples)
Walk the new message graph (verifiers feat/trace-message-graph, PR #1606): trace_to_samples builds one TrainingSample per branch by concatenating each branch path's node token_ids / sampled_mask / logprobs (graph.branch_token_sequences), splitting prompt|completion at the first sampled token. This yields identical training tensors to the old per-turn stitching, off a trace that is now linear (not quadratic) in turns. backfill_rollout_tokens is a no-op (training is renderer-only; `trajectory` is now a read-only view over the graph). Bumps the verifiers submodule to the graph-trace branch.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (MessageNode.mask rename)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: wire alphabet-sort-v1 taskset
Add alphabet-sort-v1 to the `envs` extra + `[tool.uv.sources]` so configs/debug/v1/alphabet_sort.toml resolves (it referenced an example taskset that was never wired into prime-rl). Used to verify graph-based training-sample construction on real RL runs: v0 (legacy bridge) and v1 (native renderer path) both train cleanly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consume nodes/branches directly (drop Turn/trajectory readers)
`trace_to_samples` already walks the graph; the remaining readers move off the removed Turn/trajectory API: the gibberish/repetition filters iterate per-node completions, advantage/dispatcher use `trace.num_turns`/`trace.completion_len`, `get_model_completion_len` is dropped (use `trace.completion_len`), and the renderer-only train_sink drops the backfill path (also removing `backfill_rollout_tokens`). Bumps the verifiers submodule.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (merge #1605 multiplex interception)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (dead-code cleanup)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (readme highlight + ruff format)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: enforce renderer, SFT backfill, branch-first-class logging
Training is renderer-only now. RL/OPD roll out through the renderer client (exact sampled token ids + logprobs); SFT rolls out against a chat-completions teacher that returns no tokens and re-renders the conversation to backfill them (`backfill_trace`). A renderer is required for every mode (`renderer=None` rejected): the oai client never produces correct training tokens for the message graph. Drops the MITO no-renderer training path.
Logging consumes `trace.branches` as the first-class unit (`branch.token_ids` / `branch.messages`) instead of the removed `trajectory` field; `trace_to_samples` builds one sample per branch from the same accessors. Sample loggers take the rollout objects so env_name/advantage are available.
Add configs/v1/training_mode (rl/opd/sft + lora/external) mirroring the v0 debug configs. Fix the v0 SFT debug configs + rlm_swe to validate under the renderer requirement.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: flat TrainingSample (token_ids + mask), required renderer
Drop the prompt/completion split from TrainingSample; it doesn't fit a multi-turn/agentic branch, where context and model-sampled spans interleave. A sample now carries the branch's flat `token_ids` plus per-token `mask` (True = trainable), `logprobs`, and `temperatures` (all aligned). `prepare_sample` passes them straight into the MicroBatch (already flat), and the packer validates against `token_ids` length.
Make `orchestrator.renderer` a non-optional type (drop the `enforce_renderer` validator): training is renderer-only, so the type carries the requirement.
Bump the verifiers submodule to feat/nano-as-v1 (merged #1606 + Branch.branches inlined).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: SFT teacher rolls out through the renderer client (drop backfill)
Training is renderer-only across every mode, so the SFT teacher now rolls out through the renderer client too; its rollouts carry tokens directly, the same as RL/OPD. Drops the chat-completions backfill (`backfill_trace` + the SFT path in TrainSink) and the now-unused TrainSink renderer.
This requires a self-hosted teacher that shares the student's tokenizer (the student trains on exactly the ids the renderer feeds the teacher); distilling from an external chat API is no longer supported. Remove the `sft_external` debug configs.
Validated: SFT on reverse-text-v1 trains cleanly (Trainable 128/128, eval reward ~0.1 -> ~0.82 over 20 steps) with the teacher on the renderer client, no backfill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop configs/v1/training_mode README
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consolidate rollout types into Rollout(vf.Trace)
The trace *is* the rollout: replace the FinishedRollout/TrainRollout/EvalRollout wrappers with a prime-rl Rollout(vf.Trace[TaskT]) subclass that carries the orchestration metadata (kind, env_name, group_id, policy_version, off_policy_steps, samples, advantage, is_filtered, filter_results, eval_step) as exclude=True fields, so dumping a Rollout still yields a plain trace (on-disk results.jsonl unchanged). envs.py validates the wire trace into Rollout; the dispatcher stamps the metadata; train vs eval is the `kind` discriminator (replacing the isinstance check). All consumers read rollout.X directly instead of rollout.trace.X.
Drop the monitor's SampleRollout duck-type Protocol: the loggers take the real Rollout (TYPE_CHECKING import) and read branch.token_ids / branch.messages. Also drop the prime monitor's _split_branch_messages and _json helpers: the conversation is the unit (no prompt/completion split, which is meaningless multi-turn).
Fix a latent dispatcher bug surfaced along the way: synthetic error traces used `error=` / `r.error = ` (a read-only computed field); they now use `errors=[...]` / `r.errors.append(...)`.
Rewrite the (long-stale, dict/`raw`-based) advantage + filters unit tests to build real Rollouts; they now exercise the current trace-based code (previously all failing on import/construction).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci: allow verifiers + datasets in the slim-config dep check
The v1 config types (EnvConfig, Task, ...) extend `verifiers.v1`, which is a declared, pure-pydantic dependency of prime-rl-configs (it pulls `datasets` for the taskset/Task types but no GPU/ML deps). Drop `verifiers` and `datasets` from the slim-install forbidden list; keep the real heavy training deps (torch, vllm, transformers, ...).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
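The flat sample shape described in this squash (flat `token_ids` plus aligned `mask`, `logprobs`, and `temperatures`) can be sketched with a plain dataclass. This is an illustration only: `FlatSample` is a hypothetical stand-in, not prime-rl's `TrainingSample`, and the length check here merely mirrors the idea of validating everything against the `token_ids` length.

```python
from dataclasses import dataclass


@dataclass
class FlatSample:
    """Hypothetical flat training sample: one entry per token in every field."""
    token_ids: list[int]
    mask: list[bool]           # True = trainable (model-sampled) token
    logprobs: list[float]
    temperatures: list[float]

    def __post_init__(self) -> None:
        # Every per-token field must line up with token_ids.
        n = len(self.token_ids)
        for name in ("mask", "logprobs", "temperatures"):
            if len(getattr(self, name)) != n:
                raise ValueError(f"{name} has length {len(getattr(self, name))}, expected {n}")

    @property
    def num_trainable(self) -> int:
        return sum(self.mask)


sample = FlatSample(
    token_ids=[1, 2, 3, 4, 5],
    mask=[False, True, True, False, True],  # context and sampled spans interleave
    logprobs=[0.0, -0.1, -0.2, 0.0, -0.3],
    temperatures=[1.0] * 5,
)
print(sample.num_trainable)  # 3
```

Nothing in this shape assumes a single prompt/completion boundary, so interleaved multi-turn branches fit without special cases.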
The SFT teacher rolls out through the renderer client (token-in/out) and must share the student's tokenizer; drop the leftover oai-client / token backfill description removed in the renderer-only refactor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up verifiers#1615: the legacy bridge builds a chat-completions client for v0 eval rollouts (renderer for training), instead of raising on the non-renderer eval client. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- register the scaleswe-v1 taskset (pyproject envs list + uv source)
- point the existing rlm-swe config (configs/rlm_swe/qwen35_4b.toml) at the scaleswe taskset (task_type="scaleswe", train + eval)
- add configs/debug/v1/scaleswe.toml, a per-env v1 port of that config using the scaleswe-v1 taskset via the rlm harness on the prime runtime
Companion to verifiers feat/scaleswe-v1 (scaleswe-v1 taskset + setup/workdir hooks). Needs the deps/verifiers submodule bumped to that branch once it lands.
…ord-v1 Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through. The wandb sample logger now renders the task as a Table-safe JSON string with image data elided — an image-bearing instruction crashed wandb's Table type inference on the nested content list. Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support. Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256 — mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or-codeword-v1" This reverts commit 85b27cf.
…rd-v1 (#2766)
* feat(v1): multimodal training through the message graph + color-codeword-v1
Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through.
The wandb sample logger now renders the task as a Table-safe JSON string with image data elided: an image-bearing instruction crashed wandb's Table type inference on the nested content list.
Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support.
Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256; mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin — multimodal review-pass cleanups
Picks up the verifiers feat/v1-multimodal head: the multimodal review-pass (capability-flag docstrings, trimmed mm comments, color-codeword-v1 config validator + module constants) and the merged malloc_trim worker-RSS fix (#1621).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (content-part mm attribution) + config/test sync
- Bump deps/verifiers to the content-part multimodal attribution (drops the unused placeholder offset machinery).
- Drop max_turns/seed from the color-codeword-v1 taskset args in the config: the taskset hard-codes them as module constants now, and passing them is rejected.
- Update the mm egress unit test to assert mm_items order (the new attribution), not placeholder offsets.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(trainer): slice mm_kwargs on truncation so tokens match embeddings
When a sample exceeds seq_len, prepare_sample truncated input_ids and mm_token_type_ids but passed mm_kwargs through whole, leaving more image embeddings than surviving image placeholders. Now truncation cuts to a whole-image boundary (never splitting an image's placeholder block) and slices mm_kwargs (pixel_values + image_grid_thw) to the images that fully survive, so image-placeholder count == image-embedding count.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to ruff-formatted graph.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): remove test_trajectories_mm.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
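The whole-image truncation rule from the trainer fix can be sketched on plain lists. This is a simplified illustration, not `prepare_sample`: `truncate_at_image_boundary` is hypothetical, image placeholder blocks are assumed to be given as sorted, non-overlapping `(start, end)` spans, and a list of per-image items stands in for the `pixel_values` / `image_grid_thw` tensors.

```python
def truncate_at_image_boundary(
    token_ids: list[int],
    image_spans: list[tuple[int, int]],  # [start, end) placeholder block per image
    image_items: list[str],              # one entry per image (stand-in for tensors)
    seq_len: int,
) -> tuple[list[int], list[str]]:
    """Truncate to at most seq_len tokens without splitting an image block."""
    if len(token_ids) <= seq_len:
        return token_ids, image_items
    cut = seq_len
    for start, end in image_spans:
        if start < cut < end:  # the cut would land inside this image's block
            cut = start        # back off to just before the image
            break
    # Keep only the images whose whole placeholder block survives the cut.
    kept = sum(1 for _, end in image_spans if end <= cut)
    return token_ids[:cut], image_items[:kept]


IMG = -1  # made-up placeholder token id
tokens = [1, IMG, IMG, IMG, 2, IMG, IMG, IMG, 3, 4]
spans = [(1, 4), (5, 8)]
out_tokens, out_items = truncate_at_image_boundary(tokens, spans, ["img0", "img1"], seq_len=6)
print(out_tokens)  # [1, -1, -1, -1, 2]
print(out_items)   # ['img0']
```

After truncation the number of placeholder tokens (3) matches the placeholders of the one surviving image, which is the invariant the fix restores.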
* feat(v1): thread num_workers to the env-server worker pool
Wire the verifiers env-server worker pool into prime-rl: the orchestrator's spawned env server (envs.py) and the `env-server` CLI now serve via verifiers' serve_env with num_workers, so requests fan out across N worker processes instead of one event loop. num_workers was already a config field but dropped on the floor; it's now passed through and defaults to 4. Companion to verifiers feat/v1-env-workers; needs deps/verifiers bumped to it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): default num_workers to 4
Make the worker pool the default: num_workers defaults to 4 (was "auto"->1) across the per-env, train, and eval configs, so training/eval env servers fan rollouts across 4 worker processes out of the box. "auto" stays a valid value (scales per concurrency); set num_workers=1 for the old single-process server.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): keep num_workers="auto" default on the orchestrator
Revert the orchestrator's per-env / train / eval num_workers defaults back to "auto" (was 4) so they keep scaling 1 worker per 256 concurrent rollouts out of the box. The standalone env server can't scale (no concurrency context: it's driven by external clients), so its resolver collapses "auto" to a fixed 4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through the v1 loader, matching the other -v1 tasksets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env at the r2e-gym-v1 taskset, and drop the eval block
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env servers spawn their worker pool as fresh `spawn` processes with no logging
handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed
warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers'
`serve_env` as `log_setup`; it runs in the broker and in every worker. A worker
inherits the broker's redirected stdout/stderr, so its logs land in the same
`envs/{train,eval}/<name>.log` as before — no new files or paths.
Bumps deps/verifiers to the worker-logging fix.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich dashboard's token counts fall back to provider usage when the endpoint returns no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626 variant; 955b6cdf already contains the equivalent #1626 (env-server worker logging) plus #1627. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no longer print a spurious KeyboardInterrupt traceback into the env logs on shutdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2774)
* feat(v1): elastic env-server pool (inherit pool config from verifiers)
Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so each env inherits the `pool` discriminated union (static{num_workers=4} | elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env servers scale workers on demand instead of pre-spawning a fixed `auto` count.
- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution (ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.
Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic", multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug configs are migrated.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): back-compat shim mapping legacy num_workers -> pool
EnvConfig forbids extra fields, so configs still setting the removed `num_workers` would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`: an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit `pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)
The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was redundant; just remove the legacy `num_workers` and inherit the default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
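The back-compat mapping can be sketched as a plain function over the raw config dict, which is the shape of input a pydantic `model_validator(mode="before")` receives. This is a dependency-free illustration, not the actual validator: `map_legacy_num_workers` is a hypothetical name, and representing the default elastic pool as `{"type": "elastic"}` (leaving the remaining fields to their defaults) is an assumption.

```python
def map_legacy_num_workers(data: dict) -> dict:
    """Translate a legacy `num_workers` key into the newer `pool` setting."""
    data = dict(data)  # do not mutate the caller's config
    legacy = data.pop("num_workers", None)
    if legacy is None or "pool" in data:
        return data    # nothing to map, or an explicit pool wins
    if legacy == "auto":
        data["pool"] = {"type": "elastic"}  # defaults fill in the rest
    else:
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data


print(map_legacy_num_workers({"num_workers": 8}))
# {'pool': {'type': 'static', 'num_workers': 8}}
print(map_legacy_num_workers({"num_workers": "auto"}))
# {'pool': {'type': 'elastic'}}
print(map_legacy_num_workers({"num_workers": 2, "pool": {"type": "elastic", "multiplex": 64}}))
# {'pool': {'type': 'elastic', 'multiplex': 64}}
```

Removing the legacy key before validation is what keeps a model that forbids extra fields from rejecting older configs.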
…_wire validation) Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra inputs are not permitted' that crashed every rollout (#1636 drops computed durations from to_wire). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1638 (add --resume for evals: re-run a previous run's missing/errored rollouts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…[WireTask]) (#2781)
* chore(v1): stop importing env modules in the orchestrator
The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and vf.task_type imports the env package just to read its Task subclass for typing the wire trace. Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask (extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the orchestrator never imports an env package: the env's type and runtime both live only in the server process.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE
It no longer varies per env, so it doesn't belong as a per-instance attribute set in Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion PR to PrimeIntellect-ai/verifiers#1576 for verifiers v1 training integration.