feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo) by hallerite · Pull Request #2746 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-09T19:17:56Z

What

Makes prime-rl's training algorithm a first-class, hackable abstraction — removes model roles from the pipeline, unifies every training signal under one concept (the advantage: credit assignment and loss routing, fused), and makes each algorithm a named runtime class that owns its methods. There is no registry and no separate "token scorer": there is the live policy — the only model prime-rl ever hosts — and per-env algorithms whose model references are either "policy" or an inline externally-hosted frozen endpoint.

An algorithm is a bundle of two components, configured under [orchestrator.algo] (per-env: algo = {...} or the advantage = {...} shorthand on the env) and resolved per env. The advantage type names the algorithm — grpo, max_rl, opd, opsd, sft, echo, reward, custom — and each type's class defaults are its vetted setting; there is no separate preset layer:

Sampling — how train rollouts are produced: which model generates them, "policy" or an inline frozen endpoint. At runtime this builds the env's Sampler (orchestrator/sampler.py) — the pool rollouts come from, the liveness consequences (logprobs, prefix-cache salting, staleness), and the home of future sampling strategies (replay buffers, branching).
Advantage — the per-token training signal: one mapping from a finalized rollout to per-token (loss component, weight) pairs. Credit assignment computes the magnitude, loss routing picks the component — two coordinates of the same output, so they are one config component and one runtime object. Group-relative strategies compute scalars on the orchestrator and ship numbers; reference-KL strategies query a reference model at batch-ship time and ship its prefill logprobs for the trainer to evaluate against the live policy. The strategy determines the action-token loss component (rl / ce / ref_kl) and what happens to env-provided observation tokens (masked out by default; echo trains on them with weighted CE — selected by message role via the renderer's per-token attribution, each role at its own alpha, tool-response bodies at λ=0.1 by default, optionally narrowed by a user-supplied per-rollout token filter).
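
The echo routing described above can be pictured as a small per-token stamping function. This is only a sketch under stated assumptions: the function name `stamp_echo_streams`, the `is_action` / `roles` inputs, and a default role table containing only `tool` at 0.1 are illustrative stand-ins, not prime-rl's actual code.

```python
from typing import Optional

# Assumed default role table: only tool-response bodies train, at alpha = 0.1.
DEFAULT_ROLE_ALPHA = {"tool": 0.1}


def stamp_echo_streams(
    is_action: list[bool],
    roles: list[Optional[str]],
    role_alpha: Optional[dict[str, float]] = None,
) -> tuple[list[float], list[float]]:
    """Return (rl_weights, ce_weights) for one rollout, one entry per token."""
    if len(is_action) != len(roles):
        raise ValueError("is_action and roles must be token-aligned")
    table = DEFAULT_ROLE_ALPHA if role_alpha is None else role_alpha
    rl_weights: list[float] = []
    ce_weights: list[float] = []
    for action, role in zip(is_action, roles):
        if action:
            # Policy-generated action tokens go to the rl component.
            rl_weights.append(1.0)
            ce_weights.append(0.0)
        else:
            # Env-provided tokens train with weighted CE only if their role
            # has an alpha; every other role stays masked out (weight 0.0).
            rl_weights.append(0.0)
            ce_weights.append(table.get(role, 0.0))
    return rl_weights, ce_weights


rl, ce = stamp_echo_streams(
    is_action=[False, False, True, True, False],
    roles=["system", "user", "assistant", "assistant", "tool"],
)
```

Passing a custom `role_alpha` replaces the whole table in this sketch, which mirrors the "setting any role replaces the whole table" behavior described for the config.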

The training loss is a sum of three components — rl (DPPO+KL or custom), ce (masked NLL), ref_kl (reverse KL to a reference as the PG signal) — each normalized by its own global token count:

$$\mathcal{L} = \tfrac{\sum \mathcal{L}_{\text{rl}}}{N_{\text{rl}}} + \tfrac{\sum \mathcal{L}_{\text{ce}}}{N_{\text{ce}}} + \tfrac{\sum \mathcal{L}_{\text{ref\_kl}}}{N_{\text{ref\_kl}}}$$

The orchestrator stamps per-token component weight streams (rl_weights / ce_weights / ref_kl_weights); a weight scales that component's per-token loss, 0.0 removes the token from the component's mask and denominator, and components may overlap on the same token (gradients sum). Per-component normalization means components never dilute each other: echo's observation tokens no longer shrink the rl term's effective per-token learning rate (previously both shared one global denominator, so the rl gradient scaled with the batch's obs/action ratio), and a supervised env packed next to a GRPO env doesn't soften its gradient.
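
To make the normalization concrete, here is a minimal pure-Python sketch of the component sum. It assumes per-token losses are already computed per component and ignores the `loss_mask` and the cross-rank all-reduce; `component_sum_loss` is a hypothetical helper, not the trainer's `compute_loss`.

```python
def component_sum_loss(
    per_token_loss: dict[str, list[float]],
    weights: dict[str, list[float]],
) -> float:
    """Sum of per-component means; each mean uses its own token count."""
    total = 0.0
    for name, stream in weights.items():
        losses = per_token_loss[name]
        if len(stream) != len(losses):
            raise ValueError(f"{name}: weights and losses must be aligned")
        # A weight of 0.0 removes the token from the mask AND the denominator.
        active = [(w, loss) for w, loss in zip(stream, losses) if w != 0.0]
        if not active:
            continue  # an empty component contributes nothing
        total += sum(w * loss for w, loss in active) / len(active)
    return total


losses = {"rl": [2.0, 4.0, 0.0, 0.0], "ce": [0.0, 0.0, 1.0, 3.0]}
streams = {"rl": [1.0, 1.0, 0.0, 0.0], "ce": [0.0, 0.0, 0.1, 0.1]}
mixed = component_sum_loss(losses, streams)        # rl mean 3.0 + ce mean 0.2
rl_only = component_sum_loss(losses, {"rl": streams["rl"]})
```

In this example the rl term is 3.0 whether or not the ce tokens are present, which is the "components never dilute each other" property: under a single shared denominator the same rl tokens would have been divided by 4 instead of 2.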

[orchestrator.algo.advantage]
type = "opd"

[orchestrator.algo.teacher]   # alias for `model`; folds into advantage.model
name = "Qwen/Qwen3-32B"
base_url = ["http://localhost:8001/v1"]

The trainer is algorithm-blind (component weight streams ship per token on the wire; the trainer executes the three fixed components), and the orchestrator pipeline is too: dispatcher, train sink, and orchestrator call hooks on each env's Sampler + Algorithm objects and never branch on algorithm config or model roles.

The algorithm classes

Each algorithm is a named runtime class — the algorithm object is the algorithm. Dispatch is keyed on advantage.type — it names the algorithm, and each config class's defaults are its vetted parameterization:

| `advantage.type` | Class | Component | `assign` (group time) | `score` (ship time) |
| --- | --- | --- | --- | --- |
| `grpo` | `GRPOAlgorithm` | `rl` | group-norm scalars (optional length penalty) | - |
| `max_rl` | `MaxRLAlgorithm` | `rl` | mean-normalized group scalars (arXiv:2602.02710: unbiased for the order-`group_size` truncation of the max-likelihood objective) | - |
| `echo` | `EchoAlgorithm` | `rl` + `ce` | group-norm scalars; env-provided tokens selected by message role (`roles.<role>.alpha`, tool bodies @ 0.1 default), optional user token filter | - |
| `reward` | `RewardAlgorithm` | `rl` | raw reward | - |
| `opd` | `OPDAlgorithm` | `ref_kl` | none (`advantage=None`, filters skip) | own-context prefill under the teacher |
| `opsd` | `OPSDAlgorithm` | `ref_kl` | none (`advantage=None`, filters skip) | demo-conditioned prefill under the teacher (default `"policy"`) |
| `sft` | `SFTDistillAlgorithm` | `ce` | group-norm (feeds reward-based filtering) | - |
| `custom` | `CustomAlgorithm` | `rl` | your function: scalar per rollout, optionally per-token (`AdvantageOutputs.token_advantages`, completion-aligned) | - |
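
The difference between the `grpo` and `max_rl` rows comes down to the normalizer. The sketch below is illustrative only: it assumes `grpo` divides the centered reward by the population standard deviation plus a small epsilon (the PR only says "group-norm"), and it assumes non-negative rewards for `max_rl`; the function names are hypothetical.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Centered group reward divided by the group standard deviation."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def max_rl_advantages(rewards: list[float]) -> list[float]:
    """Centered group reward divided by the group MEAN (MaxRL-style)."""
    mean = fmean(rewards)
    if mean == 0.0:
        # No-success convention: the whole group carries zero advantage.
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


hard_group = [1.0, 0.0, 0.0, 0.0]   # pass rate p = 0.25
print(max_rl_advantages(hard_group))
print(grpo_advantages(hard_group))
```

With a pass rate of 0.25 the single success gets a mean-normalized advantage of 3.0, compared with roughly 1.73 under standard-deviation normalization, so low-pass-rate groups are upweighted more strongly.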

Reading a class top to bottom reads the algorithm; writing your own is subclassing Algorithm and overriding the same two methods. Duplication of orchestration between similar algorithms (OPD vs OPSD) is accepted so each class stays self-contained; shared math (group normalization, prefill alignment, length penalties) lives as plain functions in algo/advantage.py. Algorithms take exactly two runtime resources — Algorithm(config, policy_pool, renderer); text → token ids always goes through the renderer, the same path the policy's own prompts take (opsd and echo require one, validated at config time — demo-conditioned scoring or role-attributed selection under MITO would diverge from the policy's own rendering, so they're rejected rather than approximated).
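
As a rough picture of what "subclass and override the same two methods" could look like, here is a toy stand-in. The `Rollout` dataclass, the no-op base methods, and `CenteredRewardAlgorithm` are all hypothetical; only the constructor shape `Algorithm(config, policy_pool, renderer)` and the `assign` / async `score` split come from this PR.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Rollout:
    """Toy stand-in for a finalized rollout (hypothetical fields)."""
    reward: float
    advantage: Optional[float] = None


class Algorithm:
    """Toy base class mirroring the two execution points named in this PR."""

    def __init__(self, config: Any, policy_pool: Any, renderer: Any) -> None:
        self.config = config
        self.policy_pool = policy_pool
        self.renderer = renderer

    def assign(self, rollouts: list[Rollout]) -> None:
        """Group-finalization hook: stamp scalar advantages (synchronous)."""

    async def score(self, rollouts: list[Rollout]) -> None:
        """Batch-ship hook: query a reference model (no-op by default)."""


class CenteredRewardAlgorithm(Algorithm):
    """Hypothetical example: advantage = reward minus the group mean."""

    def assign(self, rollouts: list[Rollout]) -> None:
        mean = sum(r.reward for r in rollouts) / len(rollouts)
        for rollout in rollouts:
            rollout.advantage = rollout.reward - mean


group = [Rollout(1.0), Rollout(0.0), Rollout(0.0), Rollout(1.0)]
algo = CenteredRewardAlgorithm(config=None, policy_pool=None, renderer=None)
algo.assign(group)              # group time
asyncio.run(algo.score(group))  # ship time; nothing to score here
```

A group-relative algorithm like this one only needs `assign`; a reference-KL algorithm would leave `assign` empty and do its work in `score`.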

One deliberate expressiveness trade: loss routing is not a free config axis (there is no algo.loss; you can't write opd + observation-CE in TOML) — routing variation is algorithm variation, expressed as a class. ECHO is a proper advantage type, not a flag on GRPO. Its selection surface is configurable where it matters: per-role alpha (roles.system/user/assistant/tool — setting any role replaces the whole table) and an optional filter hook (import_path + kwargs, called once per rollout as filter_fn(rollout, **kwargs) -> list[list[bool]], one keep-mask per trajectory step spanning that step's prompt_ids + completion_ids) — the raw rollout exposes message text and sampling logprobs, so warning filters and low-probability filters are user code, no framework surface.
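
A user-side filter matching the `filter_fn(rollout, **kwargs) -> list[list[bool]]` contract might look like the sketch below. The rollout layout is an assumption made for illustration: a dict with a `trajectory` list whose steps expose `prompt_ids`, `completion_ids`, and `completion_logprobs`. The real rollout object may be shaped differently.

```python
def low_probability_filter(rollout: dict, min_logprob: float = -5.0) -> list[list[bool]]:
    """Keep-mask per trajectory step over prompt_ids + completion_ids.

    Prompt tokens are always kept; completion tokens are dropped when their
    sampling logprob falls below `min_logprob`.
    """
    masks: list[list[bool]] = []
    for step in rollout["trajectory"]:
        logprobs = step["completion_logprobs"]
        if len(logprobs) != len(step["completion_ids"]):
            raise ValueError("completion_logprobs must align with completion_ids")
        prompt_mask = [True] * len(step["prompt_ids"])
        completion_mask = [lp >= min_logprob for lp in logprobs]
        masks.append(prompt_mask + completion_mask)
    return masks


example_rollout = {
    "trajectory": [
        {
            "prompt_ids": [1, 2],
            "completion_ids": [3, 4, 5],
            "completion_logprobs": [-0.1, -9.0, -1.0],
        }
    ]
}
mask = low_probability_filter(example_rollout, min_logprob=-5.0)
```

Each inner list has exactly `len(prompt_ids) + len(completion_ids)` entries, which is the shape requirement the PR states for the keep-mask.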

The algorithms

| `advantage.type` | Sampling | Loss |
| --- | --- | --- |
| `grpo` (default) | policy | `rl` |
| `max_rl` (arXiv:2602.02710) | policy | `rl` |
| `opd` | policy | `ref_kl` (needs `teacher`) |
| `sft` | frozen model (`teacher` folds into `sampling.source`) | `ce` |
| `opsd` (SDFT, arXiv:2601.19897) | policy | `ref_kl` vs `"policy"` by default |
| `echo` (ECHO) | policy | `rl` on actions + per-role α·`ce` on env-provided tokens |

There is no preset layer: the type IS the algorithm, its class defaults ARE the vetted setting, and every key beyond type is visibly the user's own assembly (an earlier iteration had named atomic presets; with type-plus-defaults equal to the preset for every algorithm, the layer had nothing left to do and was deleted). Per-env algorithms compose in one run. Because a reference can be the literal "policy", opsd runs the SDFT paper's setting with zero extra deployments — that's its default.

Model references — no registry; roles are algorithm-local

prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:

ModelReference = "policy" | FrozenModelConfig, where FrozenModelConfig is just the existing ClientConfig plus the served model's name — no new declaration scheme. base_url (or an elastic deployment) is required: a frozen reference with no endpoint fails at parse time.
algo.model is shorthand that folds into the slot the advantage type declares for its reference (model_role → advantage.model for opd/opsd, source_role → sampling.source for sft); redundant-but-consistent explicit settings are accepted, contradictions rejected.
Algorithms declare what they need. The distillation algorithms declare their reference's role as "teacher" (model_role / source_role ClassVars), which makes [orchestrator.algo.teacher] a parse-time alias for the model shorthand and puts the same word in validation errors ("advantage 'opd' needs a teacher — set 'teacher' on the algorithm ..."). Roles stay strictly algorithm-local: no role ever reaches flow code or the wire.
Liveness is a property of the reference, not a role: policy-sourced rollouts get version-salted prefix caches, carry sampling logprobs, and age off-policy; frozen-sourced rollouts get a stable prefix cache, skip the logprobs knob, and are never off-policy-cancelled. The pipeline branches on liveness alone.
Config-time validation: opd pointed at "policy" is rejected as degenerate (KL ≡ 0), frozen sampling can't feed an rl-type strategy (no policy sampling logprobs for importance ratios), sft without a frozen source is rejected (CE on the policy's own tokens is not a distillation target), opsd and echo require a renderer, and group-relative advantage with group_size=1 warns loudly.
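
The reference rules above can be approximated with plain dataclasses. This is a simplified sketch, not the pydantic models in `prime_rl.configs.algorithm`: it ignores the elastic-deployment option, and `validate_opd_reference` and `samples_from_live_policy` are hypothetical free functions used only to show the checks.

```python
from dataclasses import dataclass, field
from typing import Literal, Union


@dataclass
class FrozenModelConfig:
    """Externally hosted frozen endpoint: served name plus client endpoint(s)."""
    name: str
    base_url: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A frozen reference with no endpoint fails at construction time.
        if not self.base_url:
            raise ValueError(f"frozen model {self.name!r} needs a base_url")


ModelReference = Union[Literal["policy"], FrozenModelConfig]


def samples_from_live_policy(ref: ModelReference) -> bool:
    """Liveness is a property of the reference, not of a role."""
    return ref == "policy"


def validate_opd_reference(ref: ModelReference) -> None:
    """opd against the live policy is degenerate: the KL is identically 0."""
    if ref == "policy":
        raise ValueError("advantage 'opd' needs a teacher, not 'policy'")


teacher = FrozenModelConfig("Qwen/Qwen3-32B", ["http://localhost:8001/v1"])
validate_opd_reference(teacher)  # passes: frozen endpoint
```

Keeping liveness as a predicate on the reference is what lets the pipeline branch on it without ever seeing a role name.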

How

Configs (prime_rl.configs.algorithm): one advantage discriminated union absorbs the former token scorers (logprobs/demo_logprobs → opd/opsd with a ModelReference) and loss routing (EchoAdvantageConfig(GRPOAdvantageConfig) carries the role table + filter); the action-token loss component is an action_loss_type class property of the strategy, never configured; algo.model (alias teacher) folds fill-or-agree into the slot the type's ClassVars declare (model_role / source_role).
No preset machinery at all: the advantage shorthands ([orchestrator.advantage], per-env advantage = {...}) fold into algo.advantage on raw input; an env-level shorthand assembles the env's own algorithm rather than copy-modifying the inherited one, and env algorithm inheritance is a one-loop after-validator. Every AlgorithmConfig is built exactly once with everything in place: no preset tables, no name field, no merge machinery, no __pydantic_fields_set__ surgery.
Runtime (the orchestrator/algo/ package + orchestrator/sampler.py):
- Algorithm — base class with the two execution points: assign(rollouts) at group finalization (scalar advantages, synchronous) and async score(rollouts) at batch-ship time (reference prefill logprobs, bounded concurrency); finalize_group stamps the wire fields (advantage spreading + component weight streams). setup() connects an InferencePool to the algorithm's inline frozen reference (connect_frozen_pool — client-side only: prime-rl connects and waits, never launches) and tracks connected_pools for shutdown. The named classes above own their assign/score bodies outright; build_algorithm dispatches on advantage.type.
- Sampler — one per env, owns the rollout source: the generating pool (policy, or a frozen pool it connects in its own setup()), samples_from_live_policy, and sampling_args() (strips the logprobs knob for frozen endpoints). The dispatcher reads env.sampler for pool/liveness; the algorithm never sees sampling.
- The pipeline has zero algorithm conditionals: the dispatcher derives cache-salt/aging from the sampler's liveness; the sink calls finalize_group; orchestrator setup is one asyncio.gather over both objects' setup() and finalize_train_batch does one unconditional per-env score_batch gather (time/scoring).
Wire: unchanged from before this PR for the GRPO path — TrainingSample/MicroBatch carry optional per-token rl_weights / ce_weights / ref_kl_weights / token_advantages and ref_logprobs; absent streams mean rl weight 1.0 on every trainable token, so plain GRPO ships nothing extra (one nil byte per trailing field — array_like structs encode positionally; omit_defaults doesn't trim them). Membership is per token, so samples of different algorithms pack freely into one micro batch (no mode-segregated bins). Reference-KL strategies must ship reference logprobs (not precomputed advantages) because the trainer evaluates the KL against live policy logprobs each microbatch.
Trainer: compute_loss runs each component over its weight stream (rl mask = loss_mask & rl_weights != 0; ce/ref_kl mask = weights != 0) and sums the per-component means; the three global denominators come from one batched all-reduce ([N_rl, N_ce, N_ref_kl]), so every rank issues the same collective. The no-stream hot path is byte-identical to the old single-loss_scale math with zero extra device syncs. Mixed-algorithm packing fix: bins mixing ref-bearing (e.g. OPD) and ref-less (e.g. GRPO) samples now keep ref_logprobs position-aligned with input_ids (0.0 placeholders, backfill-or-pad like the sibling per-token arrays), with a regression test and a post-pack invariant assert that every per-token array on every micro batch matches len(input_ids).
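
The mixed-algorithm packing fix can be illustrated with a small sketch of the backfill-or-pad rule and the post-pack invariant. The dict-based sample layout and the name `pack_bin` are assumptions for illustration; the real packer operates on `TrainingSample` / `MicroBatch` and handles more per-token arrays.

```python
from typing import Optional

REF_FILL = 0.0  # placeholder for positions that have no reference logprob


def pack_bin(samples: list[dict]) -> dict:
    """Concatenate samples, keeping ref_logprobs aligned with input_ids."""
    input_ids: list[int] = []
    ref_logprobs: Optional[list[float]] = None
    for sample in samples:
        ids = sample["input_ids"]
        ref = sample.get("ref_logprobs")
        if ref is not None:
            if len(ref) != len(ids):
                raise ValueError("ref_logprobs must align with input_ids")
            if ref_logprobs is None:
                # Backfill: earlier ref-less samples get placeholders.
                ref_logprobs = [REF_FILL] * len(input_ids)
            ref_logprobs.extend(ref)
        elif ref_logprobs is not None:
            # Pad: a ref-less sample following ref-bearing ones.
            ref_logprobs.extend([REF_FILL] * len(ids))
        input_ids.extend(ids)
    # Post-pack invariant: per-token arrays match len(input_ids).
    assert ref_logprobs is None or len(ref_logprobs) == len(input_ids)
    return {"input_ids": input_ids, "ref_logprobs": ref_logprobs}


grpo_sample = {"input_ids": [1, 2, 3]}
opd_sample = {"input_ids": [4, 5], "ref_logprobs": [-0.5, -0.25]}
grpo_first = pack_bin([grpo_sample, opd_sample])
opd_first = pack_bin([opd_sample, grpo_sample])
```

Checking both pack orders matters because the bug class described above only appears when a ref-less sample is extended without a placeholder, and which branch handles that depends on the order.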

Breaking changes (intentionally, no deprecation aliases)

orchestrator.training_mode is deleted (was rl/opd/sft) → [orchestrator.algo.advantage] type = ....
The advantage type names the algorithm: group_norm → grpo, ref_kl → opd, demo_ref_kl → opsd, supervised → sft (config classes renamed to match). The preset name field is deleted — type-plus-defaults is the algorithm.
Echo selection is a role table: observations / observation_weight are replaced by roles.<role>.alpha (tool bodies @ 0.1 default; setting any role replaces the table) plus an optional filter hook. Echo always requires orchestrator.renderer (the no-attribution "all" mode is gone).
sft requires a frozen sampling source — CE on the policy's own tokens is rejected at validation (previously silently assemblable).
[orchestrator.teacher] is deleted → inline [orchestrator.algo.model] (alias: [orchestrator.algo.teacher]) with name + base_url.
The policy sub-config is [orchestrator.model], flat: name / lora / vlm / trust_remote_code directly on it, the deployment under [orchestrator.model.client] — no more model.model stutter, in configs or resolved dumps (HostedModelConfig is now ModelConfig + a client field, mirroring FrozenModelConfig). The [orchestrator.policy] / [orchestrator.student] aliases, the [orchestrator.client] shorthand, and the flat-key re-nesting shim are deleted.
algorithm.token_scorer is deleted: scorers are advantage strategies now (logprobs → opd, demo_logprobs → opsd).
algo.loss does not exist: the action-token loss component derives from the advantage strategy, and observation-token routing is the echo advantage type (roles.<role>.alpha).
Advantage types: default → grpo; none is deleted (the opd/opsd strategies carry its no-scalar semantics; nothing else used it).
Wire/trainer: TrainingSample.teacher_logprobs → ref_logprobs; the loss-type partition (loss_type / token_loss_types / token_loss_weights / loss_type_ids, LOSS_CORE_*) is replaced by the component weight streams rl_weights / ce_weights / ref_kl_weights (the partition was the degenerate disjoint case); metric time/teacher_logprobs → time/scoring.
configs/debug/training_modes/ → configs/debug/algorithms/ (incl. new self_distill.toml running SDFT against the live policy); CI integration configs migrated.

Configs are extra="forbid", so stale configs fail loudly at parse time with the exact unknown key.

Merged with main including #2720 (teacherless SFT). Its spelling is deliberately not carried forward: sft means distillation from a frozen teacher, and pointing it at "policy" is rejected at validation — CE on the policy's own rollouts is a different algorithm, and if wanted it gets its own advantage type/class rather than a degenerate spelling of sft.

Also merged main including #2641 (multiplexed trainer token export): the export machinery composes with the component weight streams — each exported record carries the per-token rl/ce/ref_kl arrays, and micro-batches carry run_id/run_step alongside the streams (training_mode stays deleted).

Note: this PR absorbed #2764 (named algorithm classes → two-component model → Sampler split), #2778 (MaxRL), and #2782 (echo per-role selection + filter hook); those branches were merged into this one and GitHub marked their PRs merged.

Validation

Pre-review polish round: a full-branch review pass (parsimony / stale-reference / docs-accuracy sweep) landed two follow-ups — 48ac7ffb9 scrubs leftover preset vocabulary and stale type names from docs and debug configs (ref_kl → opd in the advantage tables, missing max_rl/reward/custom rows, one wandb project for the debug folder), and f5be6f4b3 fixes a degenerate-batch crash: a micro batch whose components are all empty (e.g. a distillation sample whose prompt alone exceeds trainer seq_len, so truncation strips every nonzero ce/ref_kl token while the stamped all-zero rl stream skips the rl branch) returned a Python float from compute_loss and crashed backward(). The loss is now seeded with a graph-attached zero, so the batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync — pinned by test_empty_components_keep_backward_valid, which fails on the prior code. A follow-up (0606f1272) namespaces ref_kl_loss_fn's trust-region metrics as ref_kl/* (mixed batches previously averaged the rl and ref_kl trust-region definitions into one wandb series), turns a missing advantage on an rl-member sample into a loud ValueError instead of silent zero-gradient training, pads weight streams with 0.0 so padded pure-ce batches read as rl-empty in token export, and pins the pack-boundary STREAM_FILL backfill with positional asserts in both pack orders. Verified by the full trainer+orchestrator unit suites (188 passed) and a 5-step opsd GPU smoke with the renamed keys confirmed in the wandb run.

On the type-rename + preset-layer deletion (full CPU unit suite 485 green excl. the pre-existing tilelang test_qwen3_5_moe* box failures; ruff clean; --dry-run parse of every flipped debug TOML; 5-step grpo GPU smoke via type = "grpo", reward 0.10→0.41, 128/128 trainable, exit 0).

On the echo selection surface (20-step GPU run on multi-turn alphabet-sort with an assembled user-role table, exit 0) — verified at the wire: all 32 samples of a shipped batch carry ce_weights with 5,240 nonzero tokens at exactly the configured α=0.1, and the orchestrator-internal completion_obs_weights field leaked into zero shipped samples. The filter hook is covered by shape-violation and composition unit tests.

On the final two-component shape (full CPU unit suite green — 526 passed, pre-existing test_qwen3_5_moe* tilelang failures on this box excluded; ruff clean; 5-step GPU smokes, 2-GPU, all exit 0):

echo on multi-turn alphabet-sort (3.1–3.6 turns, 32/32 trainable every step) — EchoAlgorithm + Sampler split end-to-end: observation tokens tagged, ce-weighted, and trained while staying out of the rl denominator.
self_distill — OPSDAlgorithm.score() against the live policy pool, demo-conditioned prefix rendered via renderer.render_ids (the policy's own prompt-rendering path).
grpo — the no-stream hot path (eval 0.16→0.40 in 5 steps).

On the preset-resolution refactor: full unit suite incl. the config-load sweep over every shipped TOML, resolved-config round-trip (model_dump → re-validate), legacy [[env]] layout, env-shorthand-on-inherited-preset and conflict tests, plus a 5-step grpo smoke (eval 0.13→0.55, 128/128 trainable).

On the component-sum shape (5-step smokes, all exit 0): grpo eval 0.13→0.53 (no-stream hot path — math identical to the old single-denominator path for single-component runs); echo with the ce stream carrying observation tokens (loss ≈ 0.03 while GRPO advantages are ~0, i.e. the gradient is the CE component); opd eval 0.067→0.684 with the frozen reference on :8001 (ref_kl stream on action tokens, rl stream zeroed, frozen pool initialized from the inline config).

Earlier 50-step runs validated the same orchestrator machinery on the loss-partition iteration of this branch (per-component normalization is numerically identical for these single-component runs):

grpo: eval reward 0.139→0.847 — default path end-to-end, wire unchanged.
opd: 0.106→0.828 with the frozen Qwen3-0.6B-Reverse-Text-RL server on :8001 declared inline — algorithm-owned pool wiring + bounded ship-time scoring + ref_kl routing.
sft_distill: 0.113→0.835 — frozen-sourced sampling (stable prefix cache, no off-policy aging) + supervised advantage + CE.
mixed grpo+opd (new configs/debug/algorithms/mixed_grpo_opd.toml): two envs with different algorithms in one run, 0.091→0.835 — every batch mixes ref-bearing (opd, 30–60% per step) and ref-less (grpo) samples, exercising heterogeneous packing end-to-end with the post-pack alignment assert live.
custom per-token advantages: 5-step run with a custom strategy emitting alternating scalar / scalar·0.5 per-token advantages (eval 0.099→0.681), verified via trainer token export: all 128 step-0 sequences carry the exact alternating pattern over the loss-masked region — AdvantageOutputs.token_advantages → TrainingSample.token_advantages → trainer, prompt positions padded out.
self_distill vs policy: wiring healthy (no frozen pool created, demo_ref_kl scores against the live policy, no-scalar rollouts stay trainable). Known issue (not this PR's machinery): over 50 steps it degrades (eval 0.078→0.0, truncation →99%) — a verbosity spiral plausibly driven by the debug config's 128-token cap (~80% truncation at step 0) + the self-referential reference. Tracked for follow-up (truncation-aware demo scoring / larger budget / EMA reference).
atomic presets + echo tool-mode (full unit suite 483 green; 5-step grpo smoke on the refactored preset path, eval 0.10→0.53, exit 0). A second 50-step round of all presets on the merged branch reproduced round-1 results (grpo 0.845, opd 0.837, sft_distill 0.822, self_distill collapse).
opd without scalars (50-step A/B vs the group-norm baseline): group-norm scalars on OPD rollouts were dead weight — ref_kl_loss_fn zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter, which wrongly dropped uniform-reward OPD groups carrying full teacher-KL signal. Removing them ties the baseline (0.825 vs 0.828); OPD now ships advantage=None like OPSD and ref_kl_loss_fn reads no scalars (its trust region is the low side explicitly — bit-identical math).

Deferred (next PRs)

The two-component model is built so each of these lands as its own PR without re-touching the wire:

Sampling strategies behind the Sampler — today the Sampler answers one question (which pool, and is it live); next it absorbs within-env example iteration and group production, plus a sink→sampler observe() feedback edge at group finalization. In increasing order of machinery that unlocks: difficulty-pool curricula (per-example selection state fed back from group rewards), static dataset sources (supervised training from demonstrations instead of a frozen endpoint), and replay buffers / offline experience (a store between stamping and batch assembly — advantages and weight streams are already stamped-then-frozen at group finalization, which is the prerequisite).
Trainer collapse — the three fixed components reduce to two cores (policy-gradient × supervised) composed with per-token factor streams. Dissolves the rl-slot-only custom-loss wart and lets the stability guards (importance correction, trust region) compose around custom losses instead of being replaced by them.
Remaining ECHO surface — per-role α, arbitrary roles, and content/sampling-logprob filters shipped in this PR; still open: tool_names filtering (non-breaking optional field on the tool role; message_tool_names already rides in the attribution) and θ-dependent low-probability filters, which need a trainer-side per-component knob since the denominator collective happens pre-forward.
Smaller: config knobs for component sums (per-env α/β/γ folded into the streams orchestrator-side — the wire and trainer already support overlap); opt-in weight-sum normalization (Σw denominators — a globally-folded λ cancels under it, so it pairs with a trainer-side coefficient); per-slot loss-fn configs for ce/ref_kl (the ref_kl trust region — one-sided today — becomes a deliberate choice there); barrier-blind sink (the group wait becomes a dependency declared by the algorithm; finalize_group → finalize); EMA/lagged reference endpoints; exposing frozen endpoints to envs as judge clients.

🤖 Generated with Claude Code

Note

High Risk
Large breaking config and orchestrator/trainer contract change (wire fields, validation, multi-algorithm packing); incorrect migration or edge cases in mixed batches could affect training correctness.

Overview
Replaces the rl / opd / sft training-mode switch with a first-class [orchestrator.algo] bundle: sampling (policy vs inline frozen endpoint) and advantage (grpo, max_rl, opd, opsd, sft, echo, reward, custom) that jointly pick credit assignment and which loss component trains each token.

Config & hosting: Drops orchestrator.teacher and nested student in favor of flat [orchestrator.model] (trainable policy only) plus [orchestrator.algo.model] / teacher for external frozen servers. CI and debug TOMLs move from training_modes/ to configs/debug/algorithms/ with new recipes (MaxRL, ECHO, self-distill, mixed GRPO+OPD).

Runtime: Per-env Sampler + named Algorithm classes (assign at group time, score at batch ship for reference prefill). Dispatcher/sink/orchestrator branch on liveness (policy vs frozen), not modes. Wire ships ref_logprobs and optional rl_weights / ce_weights / ref_kl_weights; trainer sums rl + ce + ref_kl with per-component token normalization. Metrics rename time/teacher_logprobs → time/scoring.

Docs (algorithms.md, training.md, scaling/inference paths) and skills/configs/SKILL.md are rewritten around the new abstraction.

Reviewed by Cursor Bugbot for commit 0606f12.

…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction: a preset bundle of (1) sampling source, (2) scoring (group advantage + async token scorer), and (3) per-token loss routing. The trainer becomes algorithm-blind: routing ships per token on the wire and the trainer executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets, component-level overrides, compatibility validation (incl. the group-relative-advantage-with-group_size=1 footgun warning); training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool per env (no mode branches); OPD teacher logprobs moved out of finalize_train_batch into a bounded-concurrency token scorer; demo-conditioned teacher scorer for SDFT; interleave_rollout can tag env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token cores/weights/advantages (omit_defaults: plain GRPO wire unchanged); packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke runs for grpo (reverse_text), opd (teacher pool + alias path), and echo (multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…hm strategy object

"Teacher" is no longer a concept anywhere in the system. There is the live policy (reserved registry key "policy") and named frozen hosted models under [orchestrator.models.<key>]; algorithm components hold references into that registry. The same entry can serve any number of envs' algorithms, and self_distill can point its demo scorer at "policy" itself (the SDFT paper's setting, zero extra deployments).

- configs: scorer types logprobs/demo_logprobs with required model refs; sampling.source is a registry key; algorithm.model shorthand folds into the unresolved component; orchestrator.teacher and training_mode deleted; student renamed policy; registry validation (refs resolve, entries used, "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and never branch on algorithm config; liveness drives cache salting, sampling logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl, time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands are now write-only / excluded from dumps) and apply_chat_template returning BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml running SDFT against the live policy); docs/skills updated

Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry 0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32 trainable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…gy ontology Every training signal is an advantage — varying in granularity (group-scalar vs per-token) and evaluation site (orchestrator vs trainer). The advantage union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs -> demo_ref_kl), the action-token loss core derives from the strategy instead of being configured (loss.action deleted), and runtime AdvantageStrategy objects own both execution points: group-time assign() and ship-time score(). Also fixes a shorthand-folding regression: resolve_preset's component assignment polluted model_fields_set, so any [orchestrator.advantage] shorthand differing from the preset raised a bogus conflict error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ction # Conflicts: # packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py # src/prime_rl/orchestrator/dispatcher.py # src/prime_rl/orchestrator/orchestrator.py

A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones (grpo/echo) extended ref_logprobs without backfilling or padding, shifting it out of alignment with input_ids. Mirror the rewards/loss_core_ids pattern with 0.0 placeholders (already the outside-the-mask filler used by the demo scorer and pad_micro_batch). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Misaligned parallel arrays (the ref_logprobs packing bug class) now fail loudly at pack time instead of corrupting training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mikasenghaas · 2026-06-10T00:28:47Z

can put into orch configs? i think rn we have one config per entrypoint

mikasenghaas · 2026-06-10T00:33:34Z

-Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields.
+Set `[trainer.loss] type = "default"` and configure via the knobs above. The `ce` and `ref_kl` cores are fixed and unaffected by `[trainer.loss]`.

 ### Custom Loss


isnt this obsolete?

we could make it obsolete by making the RL loss more configurable in the algo config I guess

The config surface key is now [orchestrator.algo] (per-env: algo = {...}); the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL, TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids, advantage.action_loss_type). Also scrubs stale token-scorer mentions from the ref_kl error message and the configs skill. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and `[orchestrator.student]` fold in as aliases, with the canonical key winning at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys still re-nest ([orchestrator.model] name = ...). Shared-field propagation checks all spellings for conflicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two envs with different algorithms in one run — exercises heterogeneous train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…r references Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… registry prime-rl now assumes it only ever hosts the trainable policy. Frozen models are external endpoints declared inline on the algorithm component that uses them (FrozenModelConfig: model.name + required client.base_url) — no more [orchestrator.models] namespace or runtime ModelRegistry. Each env's Algorithm builds and readies its own frozen pools in async setup(); the dispatcher reads algorithm.sampling_pool and gets the policy pool directly. References are "policy" | inline config; demo_ref_kl now defaults to "policy" (the SDFT setting needs zero config). The algo.model shorthand folds with fill-or-agree semantics, which also fixes the two Bugbot findings (redundant-but-consistent model rejected; advantage shorthand clearing a folded model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
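One plausible reading of "fill-or-agree" is sketched below, using plain dicts rather than the real config classes. `fold_shorthand` is a hypothetical helper and the endpoint URL in the usage note is a placeholder; the actual validator may differ.

```python
from typing import Any


def fold_shorthand(component: dict[str, Any], key: str, shorthand: Any) -> dict[str, Any]:
    """Fold a shorthand value into a component: fill it if unset, accept it if equal, else fail."""
    folded = dict(component)  # never mutate the caller's config
    if folded.get(key) is None:
        folded[key] = shorthand  # fill: the component did not declare the key
    elif folded[key] != shorthand:
        raise ValueError(f"shorthand {key!r} conflicts with the explicit value {folded[key]!r}")
    # agree: an explicit value equal to the shorthand is redundant but accepted
    return folded
```

Under this reading, `fold_shorthand({"type": "opd"}, "model", {"name": "teacher", "base_url": "http://localhost:8000/v1"})` fills the model, repeating the same model explicitly is accepted, and only a differing explicit model is rejected.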

A frozen model reference is the client config we already have plus the one request-level datum it lacks: the served model's name. Drops the nested {model, client} shape — TOML reads `[orchestrator.algo.model]` with name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning, which still read the deleted [orchestrator.models] dict. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ction # Conflicts: # skills/configs/SKILL.md # src/prime_rl/trainer/rl/data.py # tests/unit/test_configs.py

MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the implicit success probability instead of pass@1: the policy gradient averaged over successful rollouts only is unbiased for the order-group_size truncation of the ML objective's pass@k expansion. In estimator form that is one change to GRPO — normalize the centered group reward by the group MEAN instead of the standard deviation, upweighting low-pass-rate examples like 1/p. group_size becomes the truncation order (REINFORCE at 1, exact ML in the limit). New 'max_rl' advantage type + preset: MaxRLAdvantageConfig, max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs rows, and a unit test for the estimator. Groups with zero mean reward carry zero advantages (the paper's no-success convention — the zero-advantage filter drops them). Everything else rides the existing GRPO path: policy sampling (enforced by the rl-component guard), rl loss component, group barrier. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
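The one-line difference between the two estimators can be sketched as below. This is an illustration under stated assumptions, not the `max_rl_advantage_fn` from this PR: the function names, the `eps` term, and the use of the population standard deviation are choices made here, and rewards are assumed non-negative.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Std-normalized form for contrast: center by the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def max_rl_advantages(rewards: list[float]) -> list[float]:
    """MaxRL form: center by the group mean, divide by the group mean."""
    mu = mean(rewards)
    if mu == 0:
        # No-success convention from the commit message: the group carries zero advantages.
        return [0.0] * len(rewards)
    return [(r - mu) / mu for r in rewards]
```

For binary rewards with pass rate p, a successful rollout receives (1 - p) / p and a failed one receives -1, which is the 1/p upweighting of low-pass-rate examples. With one success in four rollouts, `max_rl_advantages([1.0, 0.0, 0.0, 0.0])` gives `[3.0, -1.0, -1.0, -1.0]`.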

…self A preset name with explicit advantage/sampling keys is now a parse-time error instead of a merge: a modified preset is not the preset, so the config must state what it actually runs. Only the model/teacher shorthand may accompany a name (the distillation presets are incomplete without an endpoint by design). Assembly stays cheap — presets are thin deltas, so a variant costs one explicit 'type' key. Deletes the merge machinery: _merge_preset_delta and the discriminator-aware typeless override (advantage = { max_concurrent } under opd silently inheriting ref_kl) are gone; the preset validator inserts components, never merges. 'name' becomes write-only input sugar (excluded from dumps, like 'model') so resolved configs round-trip as plain component assemblies; the orchestrator startup log now reports advantage types instead of preset labels. The advantage shorthand gets a preset-aware error instead of silently relabeling an inherited preset. echo.toml's lambda override becomes the assembled spelling, and the debug configs spell algo as an [orchestrator.algo] section instead of an inline table. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The echo preset now means the vetted ECHO setting: weighted CE on tool/terminal response bodies (lambda = 0.1), selected via the renderer's per-token attribution (message_indices / message_roles / is_content from RendererClient rollouts). EchoAdvantageConfig grows observations: 'tool' (default; requires the renderer, validated at config time) | 'all' (every env-provided token — the previous behavior). interleave_rollout takes the mode instead of a bool and tags tool spans token-exactly: response bodies when is_content is available, whole tool messages otherwise; MITO rollouts raise loudly. Env-shorthand assembly fix: an env advantage shorthand now assembles the env's own algorithm instead of copy-modifying the inherited preset (atomicity); self_distill.toml spells its demo_key variant assembled; echo.toml (alphabet-sort, user-role feedback) assembles with observations = 'all'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e, no aliases HostedModelConfig was a pairing wrapper ({model, client}) that made the canonical path stutter (orchestrator.model.model.name) and needed a before-validator apologizing for it (re-nesting flat keys, folding the policy/student aliases and the [orchestrator.client] shorthand). The flat spelling was what everyone wrote anyway — make it the schema: HostedModelConfig is now ModelConfig + a client field, mirroring FrozenModelConfig (name + endpoint, no new declaration scheme), and fold_policy_shortcuts is deleted along with the policy/student aliases and the client shorthand. Resolved dumps stop stuttering too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Echo's selection surface generalizes from the observations="tool"|"all" binary to a role table: each env-provided message role (system / user / assistant / tool) trains at its own alpha, selected via the renderer's per-token attribution. An optional filter hook (import_path + kwargs, matching the custom advantage/loss precedent) narrows the selection per rollout with one keep-mask per trajectory step.
- completion_obs_mask (bool) -> completion_obs_weights (float): the per-token weight carries its role's alpha, so stamping folds it into ce_weights directly and stamp_loss_routing drops the scalar observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1. Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs attribution). The blanket "all" mode is gone: assemble the roles you want instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
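The role table can be pictured as a per-token lookup, sketched below. Everything here is hypothetical: `observation_weights`, its arguments, and the per-token `keep` mask are simplifications (the commit describes one keep-mask per trajectory step), and only the default of tool-response bodies at 0.1 comes from the source.

```python
from typing import Optional, Sequence

DEFAULT_ROLES = {"tool": 0.1}  # echo default per the commit message: tool-response bodies at 0.1


def observation_weights(
    token_roles: Sequence[Optional[str]],  # role of the env-provided message, None for policy tokens
    is_content: Sequence[bool],            # True where the token belongs to a message body
    roles: Optional[dict[str, float]] = None,
    keep: Optional[Sequence[bool]] = None,  # optional filter output, simplified to one flag per token
) -> list[float]:
    """Map each token to its role's alpha; unselected tokens get weight 0.0."""
    table = DEFAULT_ROLES if roles is None else roles  # any explicit table replaces the default
    weights: list[float] = []
    for i, role in enumerate(token_roles):
        alpha = table.get(role, 0.0) if role is not None else 0.0
        if not is_content[i]:
            alpha = 0.0  # only message bodies are selected in this sketch
        if keep is not None and not keep[i]:
            alpha = 0.0  # the filter can only narrow the selection
        weights.append(alpha)
    return weights
```

Because each weight already carries its role's alpha, a consumer can add these values into a CE weight stream directly, without a separate scalar observation weight.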

…eted The advantage type now names the algorithm — group_norm -> grpo, ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes renamed to match) — and each type's class defaults are its vetted setting, so 'type = "opd"' with a teacher IS on-policy distillation. With type-plus-defaults equal to the preset for every algorithm, the preset layer had nothing left to do: AlgorithmName, _PRESETS, the name field, and the atomicity guard are deleted. The model/teacher shorthand survives and now folds by the type's own declarations (model_role -> advantage.model for opd/opsd; source_role -> sampling.source for sft). sampling.source loses its None state (it existed only for preset resolution); sft without a frozen source is rejected at validation — CE on the policy's own tokens was never the vetted meaning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8e7e51f.

- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the missing max_rl/reward/custom rows, fix the frozen-model base_url and custom-Algorithm wording, make the length_penalty example self-sufficient, drop the Per-Env Advantage section (duplicate of Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type names, comments lose leftover preset vocabulary, one wandb project for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and a union example that parses on its own

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A fully truncated distillation sample (prompt >= trainer seq_len) loses all its nonzero ce/ref_kl tokens to prepare_sample's truncation while its stamped all-zero rl_weights suppress the rl branch; with every component empty, compute_loss returned the Python float 0.0 and loss.backward() crashed. Seed the rl accumulator with a graph-attached zero so the degenerate batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- ref_kl_loss_fn emitted the same trust-region metric keys as the rl loss fn into one shared dict, so mixed batches (per-env algorithms) averaged two different trust-region definitions into one wandb series. Namespaced as ref_kl/*; the wandb noise filter gets matching prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now raises instead of silently training with advantage 0.0. The orchestrator always stamps a scalar, so a missing one is a producer bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead of the pack-boundary defaults; padding is loss-masked so this is training-equivalent, and padded pure-ce batches now read as rl-empty in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks replaced with exact positional asserts in both pack orders, pinning STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hallerite force-pushed the feat/algorithm-abstraction branch from fb0da20 to f35e3a8 Compare June 9, 2026 19:59

hallerite changed the title ~~feat: Algorithm abstraction — sampling/scoring/loss presets (grpo, opd, sft_distill, self_distill, echo)~~ feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026

hallerite changed the title ~~feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo)~~ feat: algorithm abstraction — model registry + unified advantage strategies (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026

hallerite marked this pull request as ready for review June 9, 2026 22:50

cursor Bot reviewed Jun 9, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py

Merge remote-tracking branch 'origin/main' into feat/algorithm-abstra…

3c3c3d1

cursor Bot reviewed Jun 10, 2026

Comment thread src/prime_rl/trainer/batch.py

cursor Bot reviewed Jun 10, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated

feat(trainer): assert per-token array alignment after packing

fc79651

mikasenghaas reviewed Jun 10, 2026

hallerite and others added 2 commits June 10, 2026 00:57