
feat: vf v1 <> nano bridge #2742

Draft
mikasenghaas wants to merge 90 commits into main from feat/nano-as-v1

mikasenghaas commented Jun 9, 2026


Companion PR to PrimeIntellect-ai/verifiers#1576 for verifiers v1 training integration.

mikasenghaas and others added 30 commits June 8, 2026 17:05
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Points the submodule at the vf-nano EnvServer branch so the orchestrator can
build on the env-server abstraction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Switch prime-rl's env path to vf-nano: the orchestrator spawns a vf-nano
EnvServer per env (it never loads an environment), dispatches rollouts by task
index, and trains on the returned Trace dicts (branches + renderer tokens).

- pyproject: dep verifiers -> vf-nano; drop v1/research env packages; only the
  vf-nano reverse-text example; override out the transitive v1 verifiers (pulled
  by the prime CLI) so it can't shadow vf-nano's `verifiers` package; add orjson
  /pandas/msgspec (were transitive via verifiers).
- EnvConfig inherits vf-nano's swappable agent/runtime (+ max_turns).
- envs.py: spawn EnvServer child + EnvClient, info() for num_tasks/group-scoring,
  dispatch by task_idx, adapt Trace -> RolloutOutput-shaped dict.
- trajectories.py: trace_to_samples (one sample per Trace branch) + trace_to_output.
- train_source: index sampling; client pool builds vf-nano ClientConfig; lag
  monitor vendored; env-server entrypoint repointed; ~14 files retyped off
  vf.RolloutOutput / vf.ClientConfig.
- configs/debug/vf_nano_reverse_text.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er config)

- trace_to_samples stitches each Trace branch's tokens into one TrainingSample
  (prompt = branch start, then each turn's new context [masked] + generated
  tokens [trained]); drop the RolloutOutput adapter — read the Trace's native
  fields directly (reward, error{type,message}, timing generation/scoring,
  num_turns, branches).
- envs returns the raw Trace; eval_sink / train_sink / dispatcher / metrics /
  orchestrator read native Trace fields (no token_usage/completion/timing.total).
- client pool forwards the shared renderers.RendererConfig to the env server's
  renderer client (so it uses qwen3, not the tool-less default fallback).
- debug config: tool_call_parser=hermes (vLLM accepts the agent's tools),
  max_steps=20.
- bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
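
The per-branch stitching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual `trace_to_samples`: the `Turn`, `Sample`, and `stitch_branch` names and the field layout are hypothetical, and it assumes each turn exposes its new context tokens and its generated tokens separately.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    # Hypothetical stand-in for one turn of a Trace branch.
    context_ids: list[int]    # new context tokens that precede this turn's generation
    generated_ids: list[int]  # tokens sampled by the model in this turn


@dataclass
class Sample:
    # Hypothetical, simplified training sample.
    prompt_ids: list[int]
    completion_ids: list[int] = field(default_factory=list)
    completion_mask: list[bool] = field(default_factory=list)  # True = trained


def stitch_branch(turns: list[Turn]) -> Sample:
    """Stitch one branch's turns into a single sample."""
    if not turns:
        raise ValueError("a branch needs at least one turn")
    first, rest = turns[0], turns[1:]
    # The prompt is the branch start; the first generation is trained.
    sample = Sample(prompt_ids=list(first.context_ids))
    sample.completion_ids += first.generated_ids
    sample.completion_mask += [True] * len(first.generated_ids)
    for turn in rest:
        # New context (tool output, user message, ...) is kept but masked out.
        sample.completion_ids += turn.context_ids
        sample.completion_mask += [False] * len(turn.context_ids)
        # Generated tokens are trained on.
        sample.completion_ids += turn.generated_ids
        sample.completion_mask += [True] * len(turn.generated_ids)
    return sample
```

For a two-turn branch, the second turn's context tokens appear in the completion with a `False` mask, so they contribute to the sequence but not to the loss.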
…o timeout)

- Env.run_rollout/run_group pass the vf-nano ClientConfig object and a
  SamplingConfig (built from the env's sampling args) directly — no model_dump,
  no per-rollout timeout forwarded to the server.
- debug config: max_steps=20.
- bump deps/vf-nano (typed env-server RPC).
The env server returns a Trace minus its derived fields; the orchestrator resolves
the env's Task subclass (from config.id) and validates the wire dict into a strict
Trace[EnvTask], so the whole orchestrator works with a real, typed vf.Trace —
typed task fields included (e.g. task.answer), nothing subscriptable.

- envs.py: resolve_task_type(env_id); run_rollout/run_group validate -> Trace[EnvTask].
- trajectories/types/dispatcher/train_sink/eval_sink/metrics/filters/advantage/utils
  /orchestrator: attribute access on the typed Trace (reward, error{type,message},
  branches, timing.<span>.duration, num_turns, ...); derived fields recompute on the
  consumer.
- Task/Trace/TimeSpan stay strict (StrictBaseModel) — no extra=ignore anywhere.
- bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator spawns the env server, so request the serve extra
(zmq/msgpack) explicitly now that vf-nano keeps them out of core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`from __future__ import annotations` already defers all annotations to strings,
so the quotes + `# noqa: F821` on the TYPE_CHECKING-only `vf.Trace` / `TrainRollout`
annotations are unnecessary (no import cycle — verifiers.nano never imports prime_rl).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The field holds a typed vf.Trace, so `trace` reads truer than `raw` (which
suggested an unparsed dict). Renames the field + every `.raw` access, the
`emit_rollout(trace=...)` param/kwarg, the to_dict field filter, and the
dispatcher cancel-path locals.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Drop the FinishedRollout proxy properties (error/reward/is_truncated and the
  example_id field); consumers now read r.trace.{reward,is_truncated,task.idx,...}
  directly. The trace is the single source of truth.
- Use vf.Trace.has_error for existence checks instead of `.error is not None`.
- Replace the prime-rl trace_* token-length utils with vf.Trace.{completion_len,
  total_tokens,has_response} (now on the trace); keep trace_to_samples.
- Carry task_idx end-to-end (GroupState.task_idx, env.run_rollout/run_group(task_idx),
  source dict key) instead of the example/example_id dict carrier; identity comes
  off trace.task.idx.
- Mark the local-package env arrangement as a temporary/experimental TODO.
- Move the debug config to configs/debug/nano/reverse_text.toml.
- Bump deps/vf-nano (Trace/Turn accessors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- The env server binds tcp://127.0.0.1:0 and reports its concrete address back
  over a queue; the orchestrator connects to that. Removes _get_free_port and its
  TOCTOU race (the OS assigns the port atomically).
- A spawned server has already bound + loaded by the time it reports its address,
  so the untimed info() is enough — only poll wait_for_server_startup for an
  external (config.address) server, which has no spawn handshake.
- Bump deps/vf-nano (port report + Trace/Branch token-length accessors).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
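
The port-0 handshake can be illustrated with the standard library alone. This sketch swaps in a plain TCP socket and a thread for the real ZMQ server and mp child, so `serve` and `connect_to_reported_address` are hypothetical names and the transport is an assumption; only the pattern (bind to port 0, then report the concrete address over a queue) follows the commit.

```python
import queue
import socket
import threading


def serve(address_queue: queue.Queue) -> None:
    # Port 0 asks the OS to pick a free port at bind time, so there is no gap
    # between finding a free port and binding it.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    host, port = server.getsockname()
    # Report the concrete address only after the bind has succeeded.
    address_queue.put(f"tcp://{host}:{port}")
    conn, _ = server.accept()
    conn.sendall(b"ready")
    conn.close()
    server.close()


def connect_to_reported_address() -> bytes:
    address_queue: queue.Queue = queue.Queue()
    thread = threading.Thread(target=serve, args=(address_queue,), daemon=True)
    thread.start()
    # Block until the server reports where it actually bound.
    address = address_queue.get(timeout=5)
    host, port = address.removeprefix("tcp://").rsplit(":", 1)
    with socket.create_connection((host, int(port)), timeout=5) as client:
        data = client.recv(16)
    thread.join(timeout=5)
    return data
```

Because the address is reported only after a successful bind, the client never has to guess or probe a port.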
The Task-subclass introspection now lives in vf-nano (vf.task_type); drop the
prime-rl copy and build the typed Trace via vf.Trace[vf.task_type(env_id)]. Bump
deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SFT trains on a teacher served over the chat client, which returns no token ids,
so the trace's turns have tokens=None and trace_to_samples yields nothing. Restore
backfill: for each tokenless turn, render its prompt + assistant response with the
student chat template and split on the longest common prefix to fill TurnTokens
(masks/logprobs come from trace_to_samples). train_sink.process_rollout backfills
when any turn lacks tokens, before building samples.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
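
The longest-common-prefix split used by the backfill can be sketched as below. The function name and the list-of-ints representation are assumptions for illustration; the real backfill works on rendered chat-template output and fills `TurnTokens`.

```python
def split_on_common_prefix(
    prompt_ids: list[int], full_ids: list[int]
) -> tuple[list[int], list[int]]:
    """Split `full_ids` (prompt + response rendering) at its longest common
    prefix with `prompt_ids` (prompt-only rendering)."""
    n = 0
    for a, b in zip(prompt_ids, full_ids):
        if a != b:
            break
        n += 1
    # Everything shared with the prompt-only rendering counts as prompt;
    # the remainder is attributed to the assistant response.
    return full_ids[:n], full_ids[n:]
```

Splitting on the shared prefix rather than on `len(prompt_ids)` keeps the split well-defined even when the two renderings diverge before the end of the prompt-only rendering.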
drop_group's error_rollout_output calls omitted the required task_idx, so an
off-policy cancel (on_new_version) raised TypeError. Use the group's task_idx
(or -1 when the group is already gone), mirroring handle_completed_rollout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- envs.py: EnvClient now returns Trace[WireTask]; upgrade to this env's real Task
  subclass via self.trace_type.model_validate(wire.to_wire()).
- dispatcher.py: drop the error_rollout_output helper — inline the synthetic error
  Trace at each call site using vf.Error's field names (type/message/traceback); the
  task-exception path carries a real traceback, cancels/empty-trajectory carry none.
- Bump deps/vf-nano.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nical

- Spawned env servers now route their output (logging + subprocess-runtime output)
  to <output_dir>/logs/envs/<name>.log via a _run_env_server wrapper that redirects
  stdout/stderr and sets up logging in the child. Previously the orchestrator-spawned
  server logged nowhere.
- Debug config: batch_size 16->128, group_size 8->16, eval num_examples 8->128
  (interval=1), matching configs/debug/training_modes/rl.toml.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The orchestrator already passes a train/eval-split log_dir (.../logs/envs/train,
.../logs/envs/eval), so _spawn must drop the file directly under it
(<log_dir>/<name>.log) rather than re-adding an envs/ subdir — which had buried the
train/eval split under logs/envs/<kind>/envs/<name>.log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instead of the orchestrator sidecar-spawning each env server as an mp child, the
rl launcher now spawns one `env-server` process per env (train + eval), each on a
free port, with output to logs/envs/{kind}/{name}.log and a crash monitor — same
model as inference/trainer. It sets env.address in the orchestrator config so the
orchestrator attaches (its existing external path) instead of spawning. Envs that
already set address (user-managed external server) are left alone; the orchestrator's
mp sidecar stays as the fallback for running `orchestrator` directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add RLConfig.env_server_base_port (default 5000); the i-th launcher-managed env binds
base_port + i. Drops the get_free_port dependency in the launcher.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Train envs bind base_port + i; eval envs bind base_port + ENV_SERVER_KIND_STRIDE + i
(stride 1000), so each kind has headroom for many envs without the blocks colliding
(was a single running index — train and eval sat adjacent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
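
The resulting port layout is simple arithmetic. In this sketch `ENV_SERVER_KIND_STRIDE` and the default base port of 5000 come from the commits above; the `env_server_port` helper and its bounds check are hypothetical.

```python
ENV_SERVER_KIND_STRIDE = 1000  # stride between the train and eval port blocks


def env_server_port(base_port: int, kind: str, index: int) -> int:
    """Port for the index-th launcher-managed env server of a given kind."""
    if kind == "train":
        offset = 0
    elif kind == "eval":
        offset = ENV_SERVER_KIND_STRIDE
    else:
        raise ValueError(f"unknown env kind: {kind!r}")
    # Assumption: an index outside the stride would spill into the next block.
    if not 0 <= index < ENV_SERVER_KIND_STRIDE:
        raise ValueError(f"index {index} does not fit in a block of {ENV_SERVER_KIND_STRIDE}")
    return base_port + offset + index
```

With the default base port, train envs occupy 5000, 5001, ... and eval envs occupy 6000, 6001, ..., so adding a train env does not shift any eval port.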
- env_server entrypoint: intercept vf-nano stdlib logging so the server's own logs
  (EnvServer up, request failures) land in logs/envs/<kind>/<name>.log — previously
  only loguru output was captured, swallowing them.
- envs.py: close the address-handoff mp.Queue after use (no resource_tracker
  leaked-semaphore warning on the sidecar path).
- configs/debug/nano/reverse_text.toml: drop the eval block, mirroring
  examples/reverse_text/rl.toml (train-only smoke; eval path validated separately).
- bump deps/vf-nano (serve/types docstring trim).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…irectly

The I/O boundary (save_rollouts + monitor sample tables) now dumps the typed
vf.Trace itself (r.trace.model_dump(mode="json")) instead of a Trace+metadata
merge — the on-disk rollout is just the trace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vf-nano renamed its rollout-driver abstraction Agent -> Harness. Update the
integration: EnvConfig.agent -> harness (HarnessConfig/DefaultHarnessConfig);
env.run_rollout/run_group spawn forwards harness_config; the env-server entrypoint
passes harness_config/harness_timeout; debug config uses `harness = {...}`. Bump
deps/vf-nano to the renamed branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mikasenghaas and others added 30 commits June 10, 2026 04:20
The `envs` extra wired `harnesses` and the individual `*-v1` example tasksets but
never the bundled `tasksets` package, so the integration tasksets it ships
(`harbor-v1`, `textarena-v1`) couldn't be resolved — `import_taskset("harbor-v1")`
raised ModuleNotFoundError ("tried to import 'harbor_v1'").

Add `tasksets` to the `envs` extra + a path source, and bump the verifiers
submodule to the feat/nano-as-v1 tip (#1600), where the bundled tasksets live
under the `tasksets` namespace package (`tasksets.harbor_v1`) and the loader
resolves the namespaced module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat: consume the v1 message-graph trace (graph-walk trace_to_samples)

Walk the new message graph (verifiers feat/trace-message-graph, PR #1606): trace_to_samples
builds one TrainingSample per branch by concatenating each branch path's node token_ids /
sampled_mask / logprobs (graph.branch_token_sequences), splitting prompt|completion at the
first sampled token — identical training tensors to the old per-turn stitching, off a trace
that is now linear (not quadratic) in turns. backfill_rollout_tokens is a no-op (training is
renderer-only; `trajectory` is now a read-only view over the graph). Bumps the verifiers
submodule to the graph-trace branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
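
A rough sketch of the graph walk for a single branch path follows. The `Node` class and `branch_to_prompt_completion` function are hypothetical simplifications (logprobs omitted); only the per-node `token_ids` / `sampled_mask` fields and the "split at the first sampled token" rule come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical, simplified message-graph node.
    token_ids: list[int]
    sampled_mask: list[bool]  # True where the token was sampled by the model


def branch_to_prompt_completion(
    path: list[Node],
) -> tuple[list[int], list[int], list[bool]]:
    """Concatenate a branch path's nodes and split at the first sampled token."""
    token_ids = [t for node in path for t in node.token_ids]
    mask = [m for node in path for m in node.sampled_mask]
    # If nothing was sampled, the whole sequence is prompt.
    first = mask.index(True) if True in mask else len(mask)
    return token_ids[:first], token_ids[first:], mask[first:]
```

Each node's tokens are stored once and concatenated per path, which is what keeps the trace linear in the number of turns.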

* chore: bump verifiers submodule (MessageNode.mask rename)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: wire alphabet-sort-v1 taskset

Add alphabet-sort-v1 to the `envs` extra + `[tool.uv.sources]` so
configs/debug/v1/alphabet_sort.toml resolves (it referenced an example taskset that was
never wired into prime-rl). Used to verify graph-based training-sample construction on real
RL runs — v0 (legacy bridge) and v1 (native renderer path) both train cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: consume nodes/branches directly (drop Turn/trajectory readers)

`trace_to_samples` already walks the graph; the remaining readers move off the removed
Turn/trajectory API: the gibberish/repetition filters iterate per-node completions,
advantage/dispatcher use `trace.num_turns`/`trace.completion_len`, `get_model_completion_len`
is dropped (use `trace.completion_len`), and the renderer-only train_sink drops the backfill
path (also removing `backfill_rollout_tokens`). Bumps the verifiers submodule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump verifiers submodule (merge #1605 multiplex interception)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump verifiers submodule (dead-code cleanup)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers (readme highlight + ruff format)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: enforce renderer, SFT backfill, branch-first-class logging

Training is renderer-only now. RL/OPD roll out through the renderer client
(exact sampled token ids + logprobs); SFT rolls out against a chat-completions
teacher that returns no tokens and re-renders the conversation to backfill them
(`backfill_trace`). A renderer is required for every mode (`renderer=None`
rejected) — the oai client never produces correct training tokens for the
message graph. Drops the MITO no-renderer training path.

Logging consumes `trace.branches` as the first-class unit (`branch.token_ids` /
`branch.messages`) instead of the removed `trajectory` field; `trace_to_samples`
builds one sample per branch from the same accessors. Sample loggers take the
rollout objects so env_name/advantage are available.

Add configs/v1/training_mode (rl/opd/sft + lora/external) mirroring the v0
debug configs. Fix the v0 SFT debug configs + rlm_swe to validate under the
renderer requirement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: flat TrainingSample (token_ids + mask), required renderer

Drop the prompt/completion split from TrainingSample — it doesn't fit a
multi-turn/agentic branch, where context and model-sampled spans interleave. A
sample now carries the branch's flat `token_ids` plus per-token `mask` (True =
trainable), `logprobs`, and `temperatures` (all aligned). `prepare_sample`
passes them straight into the MicroBatch (already flat), and the packer
validates against `token_ids` length.

Make `orchestrator.renderer` a non-optional type (drop the `enforce_renderer`
validator) — training is renderer-only, so the type carries the requirement.

Bump the verifiers submodule to feat/nano-as-v1 (merged #1606 + Branch.branches
inlined).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
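
The flat sample shape can be pictured as a small dataclass. The four field names are taken from the commit message; the alignment check and the `num_trainable` helper are assumptions added for illustration, not the prime-rl definition.

```python
from dataclasses import dataclass


@dataclass
class TrainingSample:
    token_ids: list[int]        # the branch's flat token sequence
    mask: list[bool]            # True = trainable (model-sampled) token
    logprobs: list[float]       # per-token logprobs, aligned with token_ids
    temperatures: list[float]   # per-token sampling temperatures, aligned

    def __post_init__(self) -> None:
        # All per-token fields must line up with token_ids.
        n = len(self.token_ids)
        for name in ("mask", "logprobs", "temperatures"):
            if len(getattr(self, name)) != n:
                raise ValueError(f"{name} has length {len(getattr(self, name))}, expected {n}")

    def num_trainable(self) -> int:
        # Hypothetical helper: count tokens that contribute to the loss.
        return sum(self.mask)
```

Interleaved context and sampled spans need no special handling here: context tokens simply carry `mask=False` wherever they occur.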

* refactor: SFT teacher rolls out through the renderer client (drop backfill)

Training is renderer-only across every mode, so the SFT teacher now rolls out
through the renderer client too — its rollouts carry tokens directly, the same
as RL/OPD. Drops the chat-completions backfill (`backfill_trace` + the SFT path
in TrainSink) and the now-unused TrainSink renderer.

This requires a self-hosted teacher that shares the student's tokenizer (the
student trains on exactly the ids the renderer feeds the teacher); distilling
from an external chat API is no longer supported. Remove the `sft_external`
debug configs.

Validated: SFT on reverse-text-v1 trains cleanly (Trainable 128/128, eval reward
~0.1 -> ~0.82 over 20 steps) with the teacher on the renderer client, no backfill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: drop configs/v1/training_mode README

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor: consolidate rollout types into Rollout(vf.Trace)

The trace *is* the rollout: replace the FinishedRollout/TrainRollout/EvalRollout
wrappers with a prime-rl Rollout(vf.Trace[TaskT]) subclass that carries the
orchestration metadata (kind, env_name, group_id, policy_version,
off_policy_steps, samples, advantage, is_filtered, filter_results, eval_step) as
exclude=True fields — so dumping a Rollout still yields a plain trace (on-disk
results.jsonl unchanged). envs.py validates the wire trace into Rollout; the
dispatcher stamps the metadata; train vs eval is the `kind` discriminator
(replacing the isinstance check). All consumers read rollout.X directly instead
of rollout.trace.X.

Drop the monitor's SampleRollout duck-type Protocol — the loggers take the real
Rollout (TYPE_CHECKING import) and read branch.token_ids / branch.messages. Also
drop the prime monitor's _split_branch_messages and _json helpers: the
conversation is the unit (a prompt/completion split is meaningless for multi-turn conversations).

Fix a latent dispatcher bug surfaced along the way: synthetic error traces used
`error=` / `r.error = ` (a read-only computed field) — now `errors=[...]` /
`r.errors.append(...)`.

Rewrite the (long-stale, dict/`raw`-based) advantage + filters unit tests to
build real Rollouts — they now exercise the current trace-based code (previously
all failing on import/construction).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: allow verifiers + datasets in the slim-config dep check

The v1 config types (EnvConfig, Task, ...) extend `verifiers.v1`, which is a
declared, pure-pydantic dependency of prime-rl-configs (it pulls `datasets` for
the taskset/Task types but no GPU/ML deps). Drop `verifiers` and `datasets` from
the slim-install forbidden list — keep the real heavy training deps (torch,
vllm, transformers, ...).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the v1 end-to-end eval test suite (#1609) and the v0 legacy
env-server group-scoring fix (#1612).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the v0 legacy-bridge fixes: guard against non-renderer training
clients (#1613) and serve the eval split for eval-only v0 envs (#1614).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The SFT teacher rolls out through the renderer client (token-in/out) and
must share the student's tokenizer; drop the leftover oai-client / token
backfill description removed in the renderer-only refactor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up verifiers#1615: the legacy bridge builds a chat-completions client
for v0 eval rollouts (renderer for training), instead of raising on the
non-renderer eval client.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- register the scaleswe-v1 taskset (pyproject envs list + uv source)
- point the existing rlm-swe config (configs/rlm_swe/qwen35_4b.toml) at the
  scaleswe taskset (task_type="scaleswe", train + eval)
- add configs/debug/v1/scaleswe.toml — a per-env v1 port of that config using
  the scaleswe-v1 taskset via the rlm harness on the prime runtime

Companion to verifiers feat/scaleswe-v1 (scaleswe-v1 taskset + setup/workdir hooks).
Needs the deps/verifiers submodule bumped to that branch once it lands.
…ord-v1

Consume the v1 trace's multimodal sidecar. `trace_to_samples` builds, per branch,
`mm_kwargs` (the branch's per-image renderer items concatenated on dim 0 and
EncodedTensor-encoded) and `mm_token_type_ids` (the renderer's
`mm_token_type_id_map` applied to the branch tokens); `TrainSink` threads the
mapping through. The wandb sample logger now renders the task as a Table-safe JSON
string with image data elided — an image-bearing instruction crashed wandb's
Table type inference on the nested content list.

Adds `configs/v1/multimodal_color_codeword.toml` (Qwen3-VL-4B on color-codeword-v1,
2-GPU) and registers the `color-codeword-v1` taskset; bumps the verifiers submodule
for the multimodal message-graph support.

Verified end-to-end: the VLM trains through the mm path (eval 0.69 -> 0.78,
Trainable 256/256 — mm_kwargs reach the Qwen3-VL forward); v0 `color-codeword`
eval 0.625 ~= v1 `color-codeword-v1` eval 0.69 (faithful port).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rd-v1 (#2766)

* feat(v1): multimodal training through the message graph + color-codeword-v1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin — multimodal review-pass cleanups

Picks up the verifiers feat/v1-multimodal head: the multimodal review-pass
(capability-flag docstrings, trimmed mm comments, color-codeword-v1 config
validator + module constants) and the merged malloc_trim worker-RSS fix (#1621).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): bump verifiers pin (content-part mm attribution) + config/test sync

- Bump deps/verifiers to the content-part multimodal attribution (drops the unused
  placeholder offset machinery).
- Drop max_turns/seed from the color-codeword-v1 taskset args in the config — the
  taskset hard-codes them as module constants now, and passing them is rejected.
- Update the mm egress unit test to assert mm_items order (the new attribution),
  not placeholder offsets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(trainer): slice mm_kwargs on truncation so tokens match embeddings

When a sample exceeds seq_len, prepare_sample truncated input_ids and
mm_token_type_ids but passed mm_kwargs through whole — leaving more image
embeddings than surviving image placeholders. Now truncation cuts to a
whole-image boundary (never splitting an image's placeholder block) and slices
mm_kwargs (pixel_values + image_grid_thw) to the images that fully survive, so
image-placeholder count == image-embedding count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
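
The whole-image truncation rule can be sketched on plain lists. This is an illustration under stated assumptions, not the `prepare_sample` code: `truncate_at_image_boundary` is a hypothetical name, and each image's placeholder block is assumed to be given as an ordered, non-overlapping `(start, end)` span with `end` exclusive.

```python
def truncate_at_image_boundary(
    token_ids: list[int],
    image_spans: list[tuple[int, int]],
    seq_len: int,
) -> tuple[list[int], int]:
    """Truncate to at most seq_len tokens without splitting an image block.

    Returns the truncated tokens and the number of images that fully survive.
    """
    cut = min(seq_len, len(token_ids))
    kept_images = 0
    for start, end in image_spans:
        if end <= cut:
            # The whole placeholder block fits before the cut.
            kept_images += 1
        elif start < cut:
            # The cut would land inside this block: move it back to the block start.
            cut = start
            break
        else:
            # This image (and all later ones) starts at or after the cut.
            break
    return token_ids[:cut], kept_images
```

The caller would then keep only the first `kept_images` entries of the per-image multimodal inputs, so the number of surviving placeholder blocks matches the number of image embeddings.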

* chore(v1): bump verifiers pin to ruff-formatted graph.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): remove test_trajectories_mm.py

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(v1): thread num_workers to the env-server worker pool

Wire the verifiers env-server worker pool into prime-rl: the orchestrator's
spawned env server (envs.py) and the `env-server` CLI now serve via
verifiers' serve_env with num_workers, so requests fan out across N worker
processes instead of one event loop. num_workers was already a config field but
dropped on the floor; it's now passed through and defaults to 4.

Companion to verifiers feat/v1-env-workers; needs deps/verifiers bumped to it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(v1): default num_workers to 4

Make the worker pool the default: num_workers defaults to 4 (was "auto"->1)
across the per-env, train, and eval configs, so training/eval env servers fan
rollouts across 4 worker processes out of the box. "auto" stays a valid value
(scales per concurrency); set num_workers=1 for the old single-process server.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): keep num_workers="auto" default on the orchestrator

Revert the orchestrator's per-env / train / eval num_workers defaults back to
"auto" (was 4) so they keep scaling 1 worker per 256 concurrent rollouts out of
the box. The standalone env server can't scale (no concurrency context — it's
driven by external clients), so its resolver collapses "auto" to a fixed 4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
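
The two "auto" behaviours can be sketched as one resolver. The function name and signature are hypothetical; the 256-rollouts-per-worker ratio and the fixed fallback of 4 for a standalone server come from the commit message, and the floor of one worker is an assumption.

```python
import math


def resolve_num_workers(
    num_workers: "int | str",
    max_inflight: "int | None" = None,
    rollouts_per_worker: int = 256,
    standalone_default: int = 4,
) -> int:
    """Resolve a num_workers setting to a concrete worker count."""
    if num_workers != "auto":
        return int(num_workers)
    if max_inflight is None:
        # Standalone env server: no concurrency context, so use a fixed count.
        return standalone_default
    # Orchestrator: one worker per `rollouts_per_worker` concurrent rollouts
    # (assumption: never fewer than one worker).
    return max(1, math.ceil(max_inflight / rollouts_per_worker))
```

For example, an orchestrator with up to 1000 concurrent rollouts resolves "auto" to 4 workers, and one with 256 resolves it to a single worker.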

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Align the pin with #1623 (env-server worker pool: router + N workers),
which the just-merged #2768 (thread num_workers to the pool) requires;
the pin had lagged at the pre-#1623 multimodal tip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add r2e-gym-v1 to the base v1 taskset deps + uv sources (editable from
deps/verifiers/examples/tasksets/r2e_gym_v1) so the id resolves through
the v1 loader, matching the other -v1 tasksets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- v0 configs/rlm_swe/qwen35_4b.toml: restore the train env to r2e and the
  eval env to swebench-verified-quick (as on main), reverting the scaleswe switch
- v1: rename configs/debug/v1/scaleswe.toml -> r2e_gym.toml, point the train env
  at the r2e-gym-v1 taskset, and drop the eval block

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Apply the edits the prior rename commit missed:
- v0 rlm_swe/qwen35_4b.toml: train -> r2e, eval -> swebench-verified-quick (as on main)
- v1 debug/v1/r2e_gym.toml: taskset -> r2e-gym-v1, eval block removed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Env servers spawn their worker pool as fresh `spawn` processes with no logging
handlers (verifiers#1626), so per-rollout logs (rollout start/done, context-exceed
warnings) were silently dropped. Pass `setup_env_server_logging` to verifiers'
`serve_env` as `log_setup`; it runs in the broker and in every worker. A worker
inherits the broker's redirected stdout/stderr, so its logs land in the same
`envs/{train,eval}/<name>.log` as before — no new files or paths.

Bumps deps/verifiers to the worker-logging fix.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Realign the pin onto origin/feat/nano-as-v1 and pick up #1627: the --rich
dashboard's token counts fall back to provider usage when the endpoint returns
no token ids (no more 0/0). The prior pin 3df34ba5 was a pre-rebase #1626
variant; 955b6cdf already contains the equivalent #1626 (env-server worker
logging) plus #1627.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the serve_env SIGTERM-teardown fix: pool/in-process env servers no
longer print a spurious KeyboardInterrupt traceback into the env logs on
shutdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up the verifiers floor bump so the renderers offset-tokenizer fix (dev40,
PRs #72/#75) can't be undercut by a pre-fix PyPI resolution. Re-locks uv.lock to
the dev40 specifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1628 (reap the whole subprocess tree when a runtime run is cancelled).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2774)

* feat(v1): elastic env-server pool (inherit pool config from verifiers)

Companion to verifiers#1629. prime-rl's EnvConfig now extends vf.EnvServerConfig, so
each env inherits the `pool` discriminated union (static{num_workers=4} |
elastic{max_workers=None, multiplex=128}, default elastic) and the orchestrator's env
servers scale workers on demand instead of pre-spawning a fixed `auto` count.

- Drop the per-env / train-group / eval-group `num_workers` fields + the auto-resolution
  (ceil(max_inflight/256)); the elastic pool self-sizes from load.
- envs.py / env_server.py pass `vf.pool_serve_kwargs(env.pool)` to serve_env.
- Bump deps/verifiers to the elastic-pool branch.

Breaking: `num_workers` is replaced by `pool`. Configs set `pool = { type = "elastic",
multiplex = N }` or `{ type = "static", num_workers = N }`; the rlm_swe + r2e debug
configs are migrated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(v1): back-compat shim mapping legacy num_workers -> pool

EnvConfig forbids extra fields, so configs still setting the removed `num_workers`
would hard-fail. Add a `model_validator(mode="before")` that maps it onto `pool`:
an int -> a fixed `static` pool, `"auto"` -> the default `elastic` pool; an explicit
`pool` always wins. Keeps existing (incl. out-of-tree) configs parsing without edits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
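
The mapping performed by the shim can be sketched on a plain dict, independent of pydantic. `map_legacy_num_workers` is a hypothetical name; in the PR this logic lives in a `model_validator(mode="before")` on `EnvConfig`, and the exact shape of the default elastic pool is an assumption here.

```python
def map_legacy_num_workers(config: dict) -> dict:
    """Map a legacy `num_workers` key onto `pool` without mutating the input."""
    data = dict(config)
    legacy = data.pop("num_workers", None)
    # An explicit `pool` always wins; nothing to do if the legacy key is absent.
    if "pool" in data or legacy is None:
        return data
    if legacy == "auto":
        # "auto" maps to the default elastic pool (other fields left to defaults).
        data["pool"] = {"type": "elastic"}
    else:
        # An integer maps to a fixed-size static pool.
        data["pool"] = {"type": "static", "num_workers": int(legacy)}
    return data
```

Popping `num_workers` before validation matters because the config forbids extra fields, so leaving the key in place would still fail.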

* chore(v1): drop num_workers from rlm_swe + r2e configs (use default elastic pool)

The default `pool` is already elastic (multiplex 128), so an explicit `pool` here was
redundant — just remove the legacy `num_workers` and inherit the default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Realign the pin onto origin/feat/nano-as-v1: the prior pin d0c5bc98 was the
unsquashed #1629 feature branch, now squash-merged as f404e97f
(content-identical). Picks up #1629 (static/elastic env-server pool config).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1631 (per-rollout setup timing as a distinct phase) and #1632
(per-call model + runtime retries).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_wire validation)

Fixes RunRolloutResponse ValidationError 'trace.timing.setup.duration: Extra
inputs are not permitted' that crashed every rollout (#1636 drops computed
durations from to_wire).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Picks up #1638 (add --resume for evals: re-run a previous run's
missing/errored rollouts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…[WireTask]) (#2781)

* chore(v1): stop importing env modules in the orchestrator

The orchestrator built its per-env trace_type as Rollout[vf.task_type(env_id)] for v1 envs, and
vf.task_type imports the env package just to read its Task subclass for typing the wire trace.
Nothing reads typed env task fields - only task.idx and a full task.model_dump - and WireTask
(extra="allow") preserves those fields (incl. on disk). Always use Rollout[vf.WireTask], so the
orchestrator never imports an env package: the env's type and runtime both live only in the
server process.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(v1): hoist the constant Rollout[WireTask] to a module-level ROLLOUT_TYPE

It no longer varies per env, so it doesn't belong as a per-instance attribute set in
Env.__init__ - lift it to a module constant used directly in run_rollout/run_group.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 participants