feat: vf v1 <> nano bridge by mikasenghaas · Pull Request #2742 · PrimeIntellect-ai/prime-rl

mikasenghaas · 2026-06-09T17:52:37Z

Companion PR to PrimeIntellect-ai/verifiers#1576 for verifiers v1 training integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Points the submodule at the vf-nano EnvServer branch so the orchestrator can build on the env-server abstraction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano EnvServer per env (it never loads an environment), dispatches rollouts by task index, and trains on the returned Trace dicts (branches + renderer tokens).
- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; keep only the vf-nano reverse-text example; override out the transitive v1 verifiers (pulled by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson/pandas/msgspec (previously transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring, dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag monitor vendored; env-server entrypoint repointed; ~14 files retyped off vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er config)
- trace_to_samples stitches each Trace branch's tokens into one TrainingSample (prompt = branch start, then each turn's new context [masked] + generated tokens [trained]); drop the RolloutOutput adapter and read the Trace's native fields directly (reward, error{type,message}, timing generation/scoring, num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics / orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools), max_steps=20.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…o timeout)
- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a SamplingConfig (built from the env's sampling args) directly: no model_dump, no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).

The env server returns a Trace minus its derived fields; the orchestrator resolves the env's Task subclass (from config.id) and validates the wire dict into a strict Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace: typed task fields included (e.g. task.answer), nothing subscriptable.
- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils/orchestrator: attribute access on the typed Trace (reward, error{type,message}, branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel); no extra=ignore anywhere.
- bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The orchestrator spawns the env server, so request the serve extra (zmq/msgpack) explicitly now that vf-nano keeps them out of core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


`from __future__ import annotations` already defers all annotations to strings, so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout` annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which suggested an unparsed dict). Renames the field + every `.raw` access, the `emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the dispatcher cancel-path locals. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...} directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx), source dict key) instead of the example/example_id dict carrier; identity comes off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- The env server binds tcp://127.0.0.1:0 and reports its concrete address back over a queue; the orchestrator connects to that. Removes _get_free_port and its TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address, so the untimed info() is enough; only poll wait_for_server_startup for an external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
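The bind-then-report handshake above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: it assumes a plain TCP socket and a thread in place of the ZMQ-based EnvServer and the mp child, and the names `serve_once` and `connect_to_spawned_server` are hypothetical.

```python
import queue
import socket
import threading


def serve_once(address_queue: "queue.Queue[tuple[str, int]]") -> None:
    """Bind to an OS-assigned port, report it, then answer one connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Port 0 asks the OS to pick a free port at bind time, so there is no
        # window between "find a free port" and "bind it" for another process
        # to take it (the TOCTOU race of a separate free-port lookup).
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        # Report the concrete address only after the socket is listening.
        address_queue.put(server.getsockname())
        conn, _ = server.accept()
        with conn:
            conn.sendall(b"ready")


def connect_to_spawned_server() -> bytes:
    """Start the server, wait for its reported address, and connect to it."""
    address_queue: "queue.Queue[tuple[str, int]]" = queue.Queue()
    worker = threading.Thread(target=serve_once, args=(address_queue,), daemon=True)
    worker.start()
    # The address arrives only once the server is bound, so no startup polling
    # is needed before connecting.
    host, port = address_queue.get(timeout=5)
    with socket.create_connection((host, port), timeout=5) as client:
        reply = client.recv(16)
    worker.join(timeout=5)
    return reply
```

Because the address is reported after the bind succeeds, the client side never has to guess or retry, which is the property the commit relies on when it skips wait_for_server_startup for spawned servers.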

The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump deps/vf-nano. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

SFT trains on a teacher served over the chat client, which returns no token ids, so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore backfill: for each tokenless turn, render its prompt + assistant response with the student chat template and split on the longest common prefix to fill TurnTokens (masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills when any turn lacks tokens, before building samples. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
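A minimal sketch of the longest-common-prefix split described above, assuming the prompt and the prompt + response have already been rendered to token id lists by the student chat template. The function names are hypothetical, and the real TurnTokens structure is not reproduced here. Later commits in this PR remove the backfill path entirely.

```python
def longest_common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading positions at which the two sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def split_turn_tokens(
    prompt_ids: list[int], full_ids: list[int]
) -> tuple[list[int], list[int]]:
    """Split a rendered prompt+response into (prompt part, completion part).

    The cut is placed at the longest common prefix rather than at
    len(prompt_ids), so it stays valid when the template renders the
    prompt alone slightly differently from the prompt inside the full
    conversation.
    """
    cut = longest_common_prefix_len(prompt_ids, full_ids)
    return full_ids[:cut], full_ids[cut:]
```

For example, with `prompt_ids = [1, 2, 3, 9]` and `full_ids = [1, 2, 3, 7, 8]` the renderings diverge at index 3, so the prompt part is `[1, 2, 3]` and the completion part is `[7, 8]`.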

drop_group's error_rollout_output calls omitted the required task_idx, so an off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx (or -1 when the group is already gone), mirroring handle_completed_rollout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper and inline the synthetic error Trace at each call site using vf.Error's field names (type/message/traceback); the task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nical
- Spawned env servers now route their output (logging + subprocess-runtime output) to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128 (interval=1), matching configs/debug/training_modes/rl.toml.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train, .../logs/envs/eval), so _spawn must drop the file directly under it (<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the train/eval split under logs/envs/<kind>/envs/<name>.log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Instead of the orchestrator sidecar-spawning each env server as an mp child, the rl launcher now spawns one `env-server` process per env (train + eval), each on a free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same model as inference/trainer. It sets env.address in the orchestrator config so the orchestrator attaches (its existing external path) instead of spawning. Envs that already set address (user-managed external server) are left alone; the orchestrator's mp sidecar stays as the fallback for running `orchestrator` directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds base_port + i. Drops the get_free_port dependency in the launcher. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i (stride 1000), so each kind has headroom for many envs without the blocks colliding (was a single running index — train and eval sat adjacent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
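The port layout above amounts to a small piece of arithmetic. The sketch below uses the values stated in these commits (default base port 5000, stride 1000); the function name `env_server_port` and the bounds check on the index are assumptions added for illustration.

```python
ENV_SERVER_KIND_STRIDE = 1000  # gap between the train block and the eval block
DEFAULT_BASE_PORT = 5000       # default of RLConfig.env_server_base_port


def env_server_port(kind: str, index: int, base_port: int = DEFAULT_BASE_PORT) -> int:
    """Port for the index-th launcher-managed env of the given kind."""
    if kind not in ("train", "eval"):
        raise ValueError(f"unknown env kind: {kind!r}")
    # Assumption: reject indices that would spill into the other kind's block.
    if not 0 <= index < ENV_SERVER_KIND_STRIDE:
        raise ValueError(f"index {index} does not fit in a block of {ENV_SERVER_KIND_STRIDE}")
    offset = ENV_SERVER_KIND_STRIDE if kind == "eval" else 0
    return base_port + offset + index
```

With the defaults, the third train env (index 2) lands on 5002 and the third eval env on 6002, so the two kinds no longer sit on adjacent ports.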

- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log; previously only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


…irectly The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata merge — the on-disk rollout is just the trace. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig); env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump deps/vf-nano to the renamed branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The `envs` extra wired `harnesses` and the individual `*-v1` example tasksets but never the bundled `tasksets` package, so the integration tasksets it ships (`harbor-v1`, `textarena-v1`) couldn't be resolved — `import_taskset("harbor-v1")` raised ModuleNotFoundError ("tried to import 'harbor_v1'"). Add `tasksets` to the `envs` extra + a path source, and bump the verifiers submodule to the feat/nano-as-v1 tip (#1600), where the bundled tasksets live under the `tasksets` namespace package (`tasksets.harbor_v1`) and the loader resolves the namespaced module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: consume the v1 message-graph trace (graph-walk trace_to_samples)
  Walk the new message graph (verifiers feat/trace-message-graph, PR #1606): trace_to_samples builds one TrainingSample per branch by concatenating each branch path's node token_ids / sampled_mask / logprobs (graph.branch_token_sequences), splitting prompt|completion at the first sampled token; identical training tensors to the old per-turn stitching, off a trace that is now linear (not quadratic) in turns. backfill_rollout_tokens is a no-op (training is renderer-only; `trajectory` is now a read-only view over the graph). Bumps the verifiers submodule to the graph-trace branch.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (MessageNode.mask rename)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: wire alphabet-sort-v1 taskset
  Add alphabet-sort-v1 to the `envs` extra + `[tool.uv.sources]` so configs/debug/v1/alphabet_sort.toml resolves (it referenced an example taskset that was never wired into prime-rl). Used to verify graph-based training-sample construction on real RL runs: v0 (legacy bridge) and v1 (native renderer path) both train cleanly.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consume nodes/branches directly (drop Turn/trajectory readers)
  `trace_to_samples` already walks the graph; the remaining readers move off the removed Turn/trajectory API: the gibberish/repetition filters iterate per-node completions, advantage/dispatcher use `trace.num_turns`/`trace.completion_len`, `get_model_completion_len` is dropped (use `trace.completion_len`), and the renderer-only train_sink drops the backfill path (also removing `backfill_rollout_tokens`). Bumps the verifiers submodule.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (merge #1605 multiplex interception)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump verifiers submodule (dead-code cleanup)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers (readme highlight + ruff format)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: enforce renderer, SFT backfill, branch-first-class logging
  Training is renderer-only now. RL/OPD roll out through the renderer client (exact sampled token ids + logprobs); SFT rolls out against a chat-completions teacher that returns no tokens and re-renders the conversation to backfill them (`backfill_trace`). A renderer is required for every mode (`renderer=None` rejected); the oai client never produces correct training tokens for the message graph. Drops the MITO no-renderer training path.
  Logging consumes `trace.branches` as the first-class unit (`branch.token_ids` / `branch.messages`) instead of the removed `trajectory` field; `trace_to_samples` builds one sample per branch from the same accessors. Sample loggers take the rollout objects so env_name/advantage are available.
  Add configs/v1/training_mode (rl/opd/sft + lora/external) mirroring the v0 debug configs. Fix the v0 SFT debug configs + rlm_swe to validate under the renderer requirement.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: flat TrainingSample (token_ids + mask), required renderer
  Drop the prompt/completion split from TrainingSample: it doesn't fit a multi-turn/agentic branch, where context and model-sampled spans interleave. A sample now carries the branch's flat `token_ids` plus per-token `mask` (True = trainable), `logprobs`, and `temperatures` (all aligned). `prepare_sample` passes them straight into the MicroBatch (already flat), and the packer validates against `token_ids` length.
  Make `orchestrator.renderer` a non-optional type (drop the `enforce_renderer` validator): training is renderer-only, so the type carries the requirement. Bump the verifiers submodule to feat/nano-as-v1 (merged #1606 + Branch.branches inlined).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: SFT teacher rolls out through the renderer client (drop backfill)
  Training is renderer-only across every mode, so the SFT teacher now rolls out through the renderer client too; its rollouts carry tokens directly, the same as RL/OPD. Drops the chat-completions backfill (`backfill_trace` + the SFT path in TrainSink) and the now-unused TrainSink renderer. This requires a self-hosted teacher that shares the student's tokenizer (the student trains on exactly the ids the renderer feeds the teacher); distilling from an external chat API is no longer supported. Remove the `sft_external` debug configs.
  Validated: SFT on reverse-text-v1 trains cleanly (Trainable 128/128, eval reward ~0.1 -> ~0.82 over 20 steps) with the teacher on the renderer client, no backfill.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: drop configs/v1/training_mode README
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor: consolidate rollout types into Rollout(vf.Trace)
  The trace *is* the rollout: replace the FinishedRollout/TrainRollout/EvalRollout wrappers with a prime-rl Rollout(vf.Trace[TaskT]) subclass that carries the orchestration metadata (kind, env_name, group_id, policy_version, off_policy_steps, samples, advantage, is_filtered, filter_results, eval_step) as exclude=True fields, so dumping a Rollout still yields a plain trace (on-disk results.jsonl unchanged). envs.py validates the wire trace into Rollout; the dispatcher stamps the metadata; train vs eval is the `kind` discriminator (replacing the isinstance check). All consumers read rollout.X directly instead of rollout.trace.X.
  Drop the monitor's SampleRollout duck-type Protocol; the loggers take the real Rollout (TYPE_CHECKING import) and read branch.token_ids / branch.messages. Also drop the prime monitor's _split_branch_messages and _json helpers: the conversation is the unit (no prompt/completion split; meaningless multi-turn).
  Fix a latent dispatcher bug surfaced along the way: synthetic error traces used `error=` / `r.error = ` (a read-only computed field); now `errors=[...]` / `r.errors.append(...)`.
  Rewrite the (long-stale, dict/`raw`-based) advantage + filters unit tests to build real Rollouts; they now exercise the current trace-based code (previously all failing on import/construction).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci: allow verifiers + datasets in the slim-config dep check
  The v1 config types (EnvConfig, Task, ...) extend `verifiers.v1`, which is a declared, pure-pydantic dependency of prime-rl-configs (it pulls `datasets` for the taskset/Task types but no GPU/ML deps). Drop `verifiers` and `datasets` from the slim-install forbidden list; keep the real heavy training deps (torch, vllm, transformers, ...).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
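The flat sample described in this squash (one sample per branch, built by concatenating each node's token_ids / sampled_mask / logprobs) can be sketched as follows. The `Node` and `FlatSample` dataclasses are hypothetical stand-ins for the verifiers message-graph node and prime-rl's TrainingSample, whose real definitions are not shown here; `temperatures` is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for one message-graph node on a branch path."""
    token_ids: list[int]
    sampled_mask: list[bool]  # True where the model sampled the token
    logprobs: list[float]     # aligned with token_ids


@dataclass
class FlatSample:
    """Hypothetical flat sample: aligned per-token lists, no prompt/completion split."""
    token_ids: list[int]
    mask: list[bool]          # True = trainable
    logprobs: list[float]


def branch_to_sample(path: list[Node]) -> FlatSample:
    """Concatenate the nodes along one branch path into a single flat sample."""
    token_ids: list[int] = []
    mask: list[bool] = []
    logprobs: list[float] = []
    for node in path:
        n = len(node.token_ids)
        if len(node.sampled_mask) != n or len(node.logprobs) != n:
            raise ValueError("node fields must be aligned per token")
        token_ids.extend(node.token_ids)
        mask.extend(node.sampled_mask)
        logprobs.extend(node.logprobs)
    return FlatSample(token_ids, mask, logprobs)


def first_sampled_index(sample: FlatSample) -> int:
    """Where the older prompt|completion split fell: the first trainable token."""
    return sample.mask.index(True) if True in sample.mask else len(sample.mask)
```

On a multi-turn branch the mask interleaves (context, sampled, context, sampled), which is why a single prompt|completion cut stops being a faithful description and the later commit keeps only the flat per-token mask.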

Picks up the v1 end-to-end eval test suite (#1609) and the v0 legacy env-server group-scoring fix (#1612). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up the v0 legacy-bridge fixes: guard against non-renderer training clients (#1613) and serve the eval split for eval-only v0 envs (#1614). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The SFT teacher rolls out through the renderer client (token-in/out) and must share the student's tokenizer; drop the leftover oai-client / token backfill description removed in the renderer-only refactor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up verifiers#1615: the legacy bridge builds a chat-completions client for v0 eval rollouts (renderer for training), instead of raising on the non-renderer eval client. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- register the scaleswe-v1 taskset (pyproject envs list + uv source)
- point the existing rlm-swe config (configs/rlm_swe/qwen35_4b.toml) at the scaleswe taskset (task_type="scaleswe", train + eval)
- add configs/debug/v1/scaleswe.toml: a per-env v1 port of that config using the scaleswe-v1 taskset via the rlm harness on the prime runtime
Companion to verifiers feat/scaleswe-v1 (scaleswe-v1 taskset + setup/workdir hooks). Needs the deps/verifiers submodule bumped to that branch once it lands.

…ord-v1 Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through. The wandb sample logger now renders the task as a Table-safe JSON string with image data elided — an image-bearing instruction crashed wandb's Table type inference on the nested content list. Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support. Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256 — mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…or-codeword-v1" This reverts commit 85b27cf.

…rd-v1 (#2766)
* feat(v1): multimodal training through the message graph + color-codeword-v1
  Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch, `mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's `mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the mapping through.
  The wandb sample logger now renders the task as a Table-safe JSON string with image data elided; an image-bearing instruction crashed wandb's Table type inference on the nested content list.
  Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1, 2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule for the multimodal message-graph support.
  Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78, Trainable 256/256; mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword` eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin - multimodal review-pass cleanups
  Picks up the verifiers feat/v1-multimodal head: the multimodal review-pass (capability-flag docstrings, trimmed mm comments, color-codeword-v1 config validator + module constants) and the merged malloc_trim worker-RSS fix (#1621).
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin (content-part mm attribution) + config/test sync
  - Bump deps/verifiers to the content-part multimodal attribution (drops the unused placeholder offset machinery).
  - Drop max_turns/seed from the color-codeword-v1 taskset args in the config; the taskset hard-codes them as module constants now, and passing them is rejected.
  - Update the mm egress unit test to assert mm_items order (the new attribution), not placeholder offsets.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(trainer): slice mm_kwargs on truncation so tokens match embeddings
  When a sample exceeds seq_len, prepare_sample truncated input_ids and mm_token_type_ids but passed mm_kwargs through whole, leaving more image embeddings than surviving image placeholders. Now truncation cuts to a whole-image boundary (never splitting an image's placeholder block) and slices mm_kwargs (pixel_values + image_grid_thw) to the images that fully survive, so image-placeholder count == image-embedding count.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): bump verifiers pin to ruff-formatted graph.py
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): remove test_trajectories_mm.py
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
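The whole-image truncation rule from the fix(trainer) commit can be illustrated with plain lists. This is a sketch under stated assumptions, not prepare_sample itself: image placeholder blocks are given as sorted, non-overlapping `(start, end)` token spans (end exclusive), and `pixel_values` is assumed to hold `t * h * w` rows per image as listed in `image_grid_thw`. Both function names are hypothetical.

```python
def truncate_at_image_boundary(
    num_tokens: int, image_spans: list[tuple[int, int]], seq_len: int
) -> tuple[int, int]:
    """Return (tokens to keep, images to keep) without splitting any image block."""
    cut = min(num_tokens, seq_len)
    kept_images = 0
    for start, end in image_spans:
        if end <= cut:
            # The whole placeholder block survives.
            kept_images += 1
        elif start < cut:
            # The cut falls inside this block: back off to the block's start.
            cut = start
            break
        else:
            # This block and all later ones lie entirely past the cut.
            break
    return cut, kept_images


def slice_mm_kwargs(
    pixel_values: list, image_grid_thw: list[tuple[int, int, int]], kept_images: int
) -> tuple[list, list[tuple[int, int, int]]]:
    """Keep only the rows and grid entries of the images that fully survive."""
    kept_grid = image_grid_thw[:kept_images]
    # Assumption: each image contributes t * h * w consecutive rows.
    kept_rows = sum(t * h * w for t, h, w in kept_grid)
    return pixel_values[:kept_rows], kept_grid
```

For a 10-token sample with image blocks at `(2, 5)` and `(6, 9)` and `seq_len = 8`, the cut would land inside the second block, so it backs off to token 6 and one image is kept; slicing the multimodal inputs to that one image keeps placeholder and embedding counts equal.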

* feat(v1): thread num_workers to the env-server worker pool
  Wire the verifiers env-server worker pool into prime-rl: the orchestrator's spawned env server (envs.py) and the `env-server` CLI now serve via verifiers' serve_env with num_workers, so requests fan out across N worker processes instead of one event loop. num_workers was already a config field but dropped on the floor; it's now passed through and defaults to 4. Companion to verifiers feat/v1-env-workers; needs deps/verifiers bumped to it.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): default num_workers to 4
  Make the worker pool the default: num_workers defaults to 4 (was "auto"->1) across the per-env, train, and eval configs, so training/eval env servers fan rollouts across 4 worker processes out of the box. "auto" stays a valid value (scales per concurrency); set num_workers=1 for the old single-process server.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): keep num_workers="auto" default on the orchestrator
  Revert the orchestrator's per-env / train / eval num_workers defaults back to "auto" (was 4) so they keep scaling 1 worker per 256 concurrent rollouts out of the box. The standalone env server can't scale (no concurrency context, since it's driven by external clients), so its resolver collapses "auto" to a fixed 4.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
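The end state of these three commits, as described, is that "auto" scales with concurrency on the orchestrator and collapses to a fixed 4 on the standalone server. A minimal sketch of such a resolver follows; the function name, the `max_inflight` parameter, and the floor of one worker are assumptions, while the 256-per-worker ratio and the fixed 4 come from the commit text.

```python
import math
from typing import Optional, Union

ROLLOUTS_PER_WORKER = 256     # "1 worker per 256 concurrent rollouts"
STANDALONE_AUTO_WORKERS = 4   # standalone server has no concurrency context


def resolve_num_workers(
    num_workers: Union[int, str], max_inflight: Optional[int] = None
) -> int:
    """Resolve a configured num_workers value to a concrete worker count."""
    if num_workers != "auto":
        if not isinstance(num_workers, int) or num_workers < 1:
            raise ValueError(f"num_workers must be a positive int or 'auto', got {num_workers!r}")
        return num_workers
    if max_inflight is None:
        # Standalone env server: driven by external clients, so nothing to scale on.
        return STANDALONE_AUTO_WORKERS
    # Orchestrator: scale with the number of concurrent rollouts (at least one worker).
    return max(1, math.ceil(max_inflight / ROLLOUTS_PER_WORKER))
```

An explicit integer always wins, so `num_workers=1` still gives the old single-process server.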

Align the pin with #1623 (env-server worker pool: router + N workers), which the just-merged #2768 (thread num_workers to the pool) requires; the pin had lagged at the pre-#1623 multimodal tip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through the v1 loader, matching the other -v1 tasksets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env at the r2e-gym-v1 taskset, and drop the eval block
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Env servers spawn their worker pool as fresh `spawn` processes with no logging handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers' `serve_env` as `log_setup`; it runs in the broker and in every worker. A worker inherits the broker's redirected stdout/stderr, so its logs land in the same `envs/{train,eval}/<name>.log` as before — no new files or paths. Bumps deps/verifiers to the worker-logging fix. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich dashboard's token counts fall back to provider usage when the endpoint returns no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626 variant; 955b6cdf already contains the equivalent #1626 (env-server worker logging) plus #1627. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no longer print a spurious KeyboardInterrupt traceback into the env logs on shutdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up the verifiers floor bump so the renderers offset-tokenizer fix (dev40, PRs #72/#75) can't be undercut by a pre-fix PyPI resolution. Re-locks uv.lock to the dev40 specifier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…2774)
* feat(v1): elastic env-server pool (inherit pool config from verifiers)
  Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so each env inherits the `pool` discriminated union (static{num_workers=4} | elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env servers scale workers on demand instead of pre-spawning a fixed `auto` count.
  - Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution (ceil(max_inflight/256)); the elastic pool self-sizes from load.
  - envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
  - Bump deps/verifiers to the elastic-pool branch.
  Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic", multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug configs are migrated.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(v1): back-compat shim mapping legacy num_workers -> pool
  EnvConfig forbids extra fields, so configs still setting the removed `num_workers` would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`: an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit `pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)
  The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was redundant; just remove the legacy `num_workers` and inherit the default.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
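The back-compat mapping in the shim commit can be sketched as a plain function over the raw config dict, which is the shape of input a "before" validator would see. This is an illustration only: the function name is hypothetical, pydantic is deliberately left out to keep the sketch self-contained, and representing the default elastic pool as `{"type": "elastic"}` (leaving the remaining fields to their defaults) is an assumption.

```python
from typing import Any


def map_legacy_num_workers(data: dict[str, Any]) -> dict[str, Any]:
    """Translate a legacy `num_workers` key into the newer `pool` key."""
    data = dict(data)  # work on a copy; leave the caller's dict untouched
    if "num_workers" not in data:
        return data
    legacy = data.pop("num_workers")  # always removed: extra fields are forbidden
    if "pool" in data:
        return data  # an explicit pool always wins
    if legacy == "auto":
        data["pool"] = {"type": "elastic"}
    else:
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data
```

A config that still says `num_workers = 8` would therefore parse as a static pool of 8 workers, while one that sets both keys keeps its explicit `pool` and silently loses the legacy field.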

Realign the pin onto origin/feat/nano-as-v1: the prior pin d0c5bc98 was the unsquashed #1629 feature branch, now squash-merged as f404e97f (content-identical). Picks up #1629 (static/elastic env-server pool config). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632 (per-call model + runtime retries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…_wire validation) Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra inputs are not permitted' that crashed every rollout (#1636 drops computed durations from to_wire). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Picks up #1638 (add --resume for evals: re-run a previous run's missing/errored rollouts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


…[WireTask]) (#2781)
* chore(v1): stop importing env modules in the orchestrator
  The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and vf.task_type imports the env package just to read its Task subclass for typing the wire trace. Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask (extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the orchestrator never imports an env package: the env's type and runtime both live only in the server process.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE
  It no longer varies per env, so it doesn't belong as a per-instance attribute set in Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas and others added 30 commits June 8, 2026 17:05

feat: add vf-nano as submodule

43714c2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump deps/vf-nano to feat/env-server (EnvServer)

4c6b282

Points the submodule at the vf-nano EnvServer branch so the orchestrator can build on the env-server abstraction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: depend on vf-nano[serve]; bump submodule

a71dfff

The orchestrator spawns the env server, so request the serve extra (zmq/msgpack) explicitly now that vf-nano keeps them out of core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (client docstring cleanup)

525ced0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (Error.traceback str | None)

a1e9d5e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (to_wire ordering)

5bb5d2e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (BaseRequest marker, no request_type field)

3f15d8c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: env client uses client= (was client_config=); bump vf-nano

5859660

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (drop renderers dep comment)

7a9651a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump vf-nano (configs/ + cli/ split, serve/ runtime-only)

614ab80

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas and others added 30 commits June 10, 2026 04:20

chore(v1): bump verifiers submodule to feat/nano-as-v1 tip

d6522de

Picks up the v1 end-to-end eval test suite (#1609) and the v0 legacy env-server group-scoring fix (#1612). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers submodule to feat/nano-as-v1 tip

9b9f002

Picks up the v0 legacy-bridge fixes: guard against non-renderer training clients (#1613) and serve the eval split for eval-only v0 envs (#1614). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert "feat(v1): multimodal training through the message graph + col…

e4800fe

…or-codeword-v1" This reverts commit 85b27cf.

chore(v1): bump verifiers to db82b38a (reap subprocess tree on cancel)

77c75a9

Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to 88e9bedd

9ae4ee3

Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632 (per-call model + runtime retries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to 5dc084f5

123ebfb

Picks up #1638 (add --resume for evals: re-run a previous run's missing/errored rollouts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to 472622ba

cf0c966

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to 7270e69b

e52471c

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to 66c87d5b

aa70aba

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore(v1): bump verifiers to ef45f720

0fa5eb4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: vf v1 <> nano bridge #2742
mikasenghaas wants to merge 90 commits into main from feat/nano-as-v1

mikasenghaas commented Jun 9, 2026 (edited)

2 participants
