feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo) by hallerite · Pull Request #2746 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-09T19:17:56Z

What

Makes prime-rl's training algorithm a first-class, hackable abstraction — removes model roles from the pipeline, unifies every training signal under one concept (the advantage: credit assignment and loss routing, fused), and makes each algorithm a named runtime class that owns its methods. There is no registry and no separate "token scorer": there is the live policy — the only model prime-rl ever hosts — and per-env algorithms whose model references are either "policy" or an inline externally-hosted frozen endpoint.

An algorithm is a bundle of two components, configured under [orchestrator.algo] (per-env: algo = {...} or the advantage = {...} shorthand on the env) and resolved per env. The advantage type names the algorithm — grpo, max_rl, opd, opsd, sft, echo, reward, custom — and each type's class defaults are its vetted setting; there is no separate preset layer:

Sampling: how train rollouts are produced, i.e. which model generates them ("policy" or an inline frozen endpoint). At runtime this builds the env's Sampler (orchestrator/sampler.py): the pool rollouts come from, the liveness consequences (logprobs, prefix-cache salting, staleness), and the home of future sampling strategies (replay buffers, branching).
Advantage: the per-token training signal, one mapping from a finalized rollout to per-token (loss component, weight) pairs. Credit assignment computes the magnitude and loss routing picks the component; they are two coordinates of the same output, so they are one config component and one runtime object. Group-relative strategies compute scalars on the orchestrator and ship numbers; reference-KL strategies query a reference model at batch-ship time and ship its prefill logprobs for the trainer to evaluate against the live policy. The strategy determines the action-token loss component (rl / ce / ref_kl) and what happens to env-provided observation tokens (masked out by default; echo trains on them with weighted CE, selected by message role via the renderer's per-token attribution, each role at its own alpha, tool-response bodies at α=0.1 by default, optionally narrowed by a user-supplied per-rollout token filter).

The training loss is a sum of three components — rl (DPPO+KL or custom), ce (masked NLL), ref_kl (reverse KL to a reference as the PG signal) — each normalized by its own global token count:

$$\mathcal{L} = \tfrac{\sum \mathcal{L}_{\mathrm{rl}}}{N_{\mathrm{rl}}} + \tfrac{\sum \mathcal{L}_{\mathrm{ce}}}{N_{\mathrm{ce}}} + \tfrac{\sum \mathcal{L}_{\mathrm{ref\_kl}}}{N_{\mathrm{ref\_kl}}}$$

The orchestrator stamps per-token component weight streams (rl_weights / ce_weights / ref_kl_weights) plus the per-token advantage stream (advantages — there is no scalar advantage anywhere; uniform group credit is broadcast over completion tokens at assignment); a weight scales that component's per-token loss, 0.0 removes the token from the component's mask and denominator, and components may overlap on the same token (gradients sum). Per-component normalization means components never dilute each other: echo's observation tokens no longer shrink the rl term's effective per-token learning rate (previously both shared one global denominator, so the rl gradient scaled with the batch's obs/action ratio), and a supervised env packed next to a GRPO env doesn't soften its gradient.
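The dilution argument can be made concrete with a small single-process sketch. It is not the trainer's `compute_loss`: the function names and the toy numbers are invented here, and the real denominators are global counts obtained by an all-reduce. It only contrasts per-component token-count normalization with one shared denominator.

```python
COMPONENTS = ("rl", "ce", "ref_kl")


def component_mean(losses: list[float], weights: list[float]) -> float:
    """Weighted loss sum over member tokens, divided by the member-token count."""
    n_members = sum(1 for w in weights if w != 0.0)  # weight 0.0 = not a member
    if n_members == 0:
        return 0.0  # an empty component contributes nothing
    return sum(w * loss for w, loss in zip(weights, losses)) / n_members


def component_sum_loss(losses: dict, weights: dict) -> float:
    """Sum of per-component means: each component has its own denominator."""
    return sum(component_mean(losses[c], weights[c]) for c in COMPONENTS)


def shared_denominator_loss(losses: dict, weights: dict) -> float:
    """Old-style variant for comparison: one denominator over all member tokens."""
    n_tokens = len(weights["rl"])
    members = sum(
        1 for t in range(n_tokens) if any(weights[c][t] != 0.0 for c in COMPONENTS)
    )
    total = sum(
        weights[c][t] * losses[c][t] for c in COMPONENTS for t in range(n_tokens)
    )
    return total / members


# Toy sample: two action tokens (rl) followed by two observation tokens (ce at 0.1).
losses = {
    "rl": [1.0, 3.0, 0.0, 0.0],
    "ce": [0.0, 0.0, 2.0, 4.0],
    "ref_kl": [0.0, 0.0, 0.0, 0.0],
}
weights = {
    "rl": [1.0, 1.0, 0.0, 0.0],
    "ce": [0.0, 0.0, 0.1, 0.1],
    "ref_kl": [0.0, 0.0, 0.0, 0.0],
}

rl_term = component_mean(losses["rl"], weights["rl"])  # (1 + 3) / 2
per_component = component_sum_loss(losses, weights)
shared = shared_denominator_loss(losses, weights)
```

In this toy sample the rl term stays at (1 + 3) / 2 no matter how many observation tokens are appended, whereas under the shared denominator the same two action tokens are divided by four.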

[orchestrator.algo.advantage]
type = "opd"

[orchestrator.algo.teacher]   # alias for `model`; folds into advantage.model
name = "Qwen/Qwen3-32B"
base_url = ["http://localhost:8001/v1"]

The trainer is algorithm-blind (component weight streams ship per token on the wire; the trainer executes the three fixed components), and the orchestrator pipeline is too: dispatcher, train sink, and orchestrator call hooks on each env's Sampler + Algorithm objects and never branch on algorithm config or model roles.

The algorithm classes

Each algorithm is a named runtime class — the algorithm object is the algorithm. Dispatch is keyed on advantage.type — it names the algorithm, and each config class's defaults are its vetted parameterization:

| `advantage.type` | Class | Component | `assign` (group time) | `score` (ship time) |
| --- | --- | --- | --- | --- |
| `grpo` | `GRPOAlgorithm` | `rl` | group-norm scalars (optional length penalty) | none |
| `max_rl` | `MaxRLAlgorithm` | `rl` | mean-normalized group scalars (arXiv:2602.02710: unbiased for the order-`group_size` truncation of the max-likelihood objective) | none |
| `echo` | `EchoAlgorithm` | `rl` + `ce` | group-norm scalars; env-provided tokens selected by message role (`roles.<role>.alpha`, tool bodies @ 0.1 default), optional user token filter | none |
| `reward` | `RewardAlgorithm` | `rl` | raw reward | none |
| `opd` | `OPDAlgorithm` | `ref_kl` | none (`advantages=None`, filters skip) | own-context prefill under the teacher |
| `opsd` | `OPSDAlgorithm` | `ref_kl` | none (`advantages=None`, filters skip) | demo-conditioned prefill under the teacher (default `"policy"`) |
| `sft` | `SFTDistillAlgorithm` | `ce` | group-norm (feeds reward-based filtering) | none |
| `custom` | `CustomAlgorithm` | `rl` | your function: per-token advantages, one list per rollout aligned to `inputs.completion_lengths` (`inputs.broadcast(...)` spreads uniform group credit) | none |

Reading a class top to bottom reads the algorithm — one module per algorithm under orchestrator/algo/ (grpo.py, opd.py, …, with the base class and pipeline hooks in base.py and dispatch in the package __init__); writing your own is subclassing Algorithm and overriding the same hooks — a third hook, observation_weights(output), decides per-token ce weights for env-provided observation tokens at sample-construction time (default None = masked; echo overrides it with role selection + the user filter). The hooks are stages of one compilation the base class drives: the pipeline hands the algorithm its rollout / group / batch (build_samples / finalize_group / score) and never composes algorithm internals itself. Duplication of orchestration between similar algorithms (OPD vs OPSD) is accepted so each class stays self-contained; shared math (group normalization, prefill alignment, length penalties) lives as plain functions in algo/advantage.py. Algorithms take exactly two runtime resources — Algorithm(config, policy_pool, renderer); text → token ids always goes through the renderer, the same path the policy's own prompts take (opsd and echo require one, validated at config time — demo-conditioned scoring or role-attributed selection under MITO would diverge from the policy's own rendering, so they're rejected rather than approximated).
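As a rough illustration of the per-token contract the `custom` row describes, here is a sketch of a group-relative advantage function. `ToyAdvantageInputs` is a hypothetical stand-in written for this example, not the real `AdvantageInputs`; only the `completion_lengths` field and the idea of a `broadcast` helper come from the description above, and the helper's signature here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyAdvantageInputs:
    """Hypothetical stand-in for AdvantageInputs with only the fields used here."""

    rewards: list[float]
    completion_lengths: list[int]

    def broadcast(self, scalars: list[float]) -> list[list[float]]:
        """Spread one scalar per rollout over that rollout's completion tokens."""
        return [[s] * n for s, n in zip(scalars, self.completion_lengths)]


def reward_minus_mean(inputs: ToyAdvantageInputs) -> list[list[float]]:
    """Uniform group credit: reward minus the group mean, one value per token."""
    mean = sum(inputs.rewards) / len(inputs.rewards)
    return inputs.broadcast([r - mean for r in inputs.rewards])


def is_zero_stream(stream: list[float]) -> bool:
    """A zero-advantage filter on streams: every token's advantage is 0.0."""
    return all(a == 0.0 for a in stream)


def rollout_mean(stream: list[float]) -> float:
    """Per-rollout mean, the quantity a logged advantage distribution would use."""
    return sum(stream) / len(stream)


inputs = ToyAdvantageInputs(
    rewards=[1.0, 0.0, 0.5, 0.5],
    completion_lengths=[3, 2, 1, 4],
)
streams = reward_minus_mean(inputs)
```

Each returned list has exactly as many entries as the matching rollout has completion tokens, which is the alignment the wire format relies on.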

One deliberate expressiveness trade: loss routing is not a free config axis (there is no algo.loss; you can't write opd + observation-CE in TOML) — routing variation is algorithm variation, expressed as a class. ECHO is a proper advantage type, not a flag on GRPO. Its selection surface is configurable where it matters: per-role alpha (roles.system/user/assistant/tool — setting any role replaces the whole table) and an optional filter hook (import_path + kwargs, called once per rollout as filter_fn(rollout, **kwargs) -> list[list[bool]], one keep-mask per trajectory step spanning that step's prompt_ids + completion_ids) — the raw rollout exposes message text and sampling logprobs, so warning filters and low-probability filters are user code, no framework surface.
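A filter hook of the shape described above might look like the following sketch. The rollout layout (a dict with a `trajectory` list whose steps hold `prompt_ids` and `completion_ids`) and the `blocked_ids` keyword are assumptions made for illustration; only the call shape `filter_fn(rollout, **kwargs) -> list[list[bool]]` and the one-mask-per-step contract come from the text.

```python
from typing import Any


def drop_blocked_tokens(
    rollout: dict[str, Any], *, blocked_ids: list[int]
) -> list[list[bool]]:
    """Return one keep-mask per trajectory step.

    Each mask spans that step's prompt_ids followed by its completion_ids;
    True keeps the token as a selection candidate, False drops it.
    """
    blocked = set(blocked_ids)
    masks: list[list[bool]] = []
    for step in rollout["trajectory"]:  # assumed layout, see lead-in
        ids = list(step["prompt_ids"]) + list(step["completion_ids"])
        masks.append([token not in blocked for token in ids])
    return masks


# Toy rollout with two steps; token id 9 stands in for an unwanted token.
toy_rollout = {
    "trajectory": [
        {"prompt_ids": [1, 2], "completion_ids": [3]},
        {"prompt_ids": [1, 2, 3, 9], "completion_ids": [9, 5]},
    ]
}
masks = drop_blocked_tokens(toy_rollout, blocked_ids=[9])
```

Because the hook only narrows the role-based selection, a mask of all `True` would leave the role table's choice untouched.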

The algorithms

| `advantage.type` | Sampling | Loss |
| --- | --- | --- |
| `grpo` (default) | policy | `rl` |
| `max_rl` (arXiv:2602.02710) | policy | `rl` |
| `opd` | policy | `ref_kl` (needs `teacher`) |
| `sft` | frozen model (`teacher` folds into `sampling.source`) | `ce` |
| `opsd` (SDFT, arXiv:2601.19897) | policy | `ref_kl` vs `"policy"` by default |
| `echo` (ECHO) | policy | `rl` on actions + per-role α·`ce` on env-provided tokens |

There is no preset layer: the type IS the algorithm, its class defaults ARE the vetted setting, and every key beyond type is visibly the user's own assembly (an earlier iteration had named atomic presets; with type-plus-defaults equal to the preset for every algorithm, the layer had nothing left to do and was deleted). Per-env algorithms compose in one run. Because a reference can be the literal "policy", opsd runs the SDFT paper's setting with zero extra deployments — that's its default.

Model references — no registry; roles are algorithm-local

prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:

ModelReference = "policy" | FrozenModelConfig, where FrozenModelConfig is just the existing ClientConfig plus the served model's name — no new declaration scheme. base_url (or an elastic deployment) is required: a frozen reference with no endpoint fails at parse time.
algo.model is shorthand that folds into the slot the advantage type declares for its reference (model_role → advantage.model for opd/opsd, source_role → sampling.source for sft); redundant-but-consistent explicit settings are accepted, contradictions rejected.
Algorithms declare what they need. The distillation algorithms declare their reference's role as "teacher" (model_role / source_role ClassVars), which makes [orchestrator.algo.teacher] a parse-time alias for the model shorthand and puts the same word in validation errors ("advantage 'opd' needs a teacher — set 'teacher' on the algorithm ..."). Roles stay strictly algorithm-local: no role ever reaches flow code or the wire.
Liveness is a property of the reference, not a role: policy-sourced rollouts get version-salted prefix caches, carry sampling logprobs, and age off-policy; frozen-sourced rollouts get a stable prefix cache, skip the logprobs knob, and are never off-policy-cancelled. The pipeline branches on liveness alone.
Config-time validation: opd pointed at "policy" is rejected as degenerate (KL ≡ 0), frozen sampling can't feed an rl-type strategy (no policy sampling logprobs for importance ratios), sft without a frozen source is rejected (CE on the policy's own tokens is not a distillation target), opsd and echo require a renderer, and group-relative advantage with group_size=1 warns loudly.
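The validation rules listed above can be sketched as a plain function. This is an illustration only: the real checks live in the config classes and raise at parse time, while `FrozenRef`, `check_algorithm`, the `RL_TYPES` set, and the error strings below are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FrozenRef:
    """Stand-in for a frozen endpoint: served model name plus its base URLs."""

    name: str
    base_url: tuple[str, ...]


ModelRef = Union[str, FrozenRef]  # a str must be the literal "policy"

# Assumption: the types whose action tokens train through the rl component.
RL_TYPES = {"grpo", "max_rl", "echo", "reward", "custom"}


def check_algorithm(
    advantage_type: str,
    sampling_source: ModelRef = "policy",
    advantage_model: Optional[ModelRef] = None,
    has_renderer: bool = True,
) -> list[str]:
    """Return the rule violations for one env's algorithm (empty list = valid)."""
    errors: list[str] = []
    for ref in (sampling_source, advantage_model):
        if isinstance(ref, FrozenRef) and not ref.base_url:
            errors.append(f"frozen reference '{ref.name}' has no endpoint")
    frozen_sampling = isinstance(sampling_source, FrozenRef)
    if advantage_type == "opd":
        if advantage_model is None:
            errors.append("advantage 'opd' needs a teacher")
        elif advantage_model == "policy":
            errors.append("opd against 'policy' is degenerate (KL is identically 0)")
    if advantage_type in RL_TYPES and frozen_sampling:
        # No policy sampling logprobs are available for importance ratios.
        errors.append("frozen sampling cannot feed an rl-type strategy")
    if advantage_type == "sft" and not frozen_sampling:
        errors.append("sft needs a frozen sampling source")
    if advantage_type in {"opsd", "echo"} and not has_renderer:
        errors.append(f"{advantage_type} requires a renderer")
    return errors
```

For example, `check_algorithm("opsd", advantage_model="policy")` returns an empty list, matching the statement that self-distillation against the live policy is a legal default while `opd` against `"policy"` is not.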

How

Configs (prime_rl.configs.algorithm): one advantage discriminated union absorbs the former token scorers (logprobs/demo_logprobs → opd/opsd with a ModelReference) and loss routing (EchoAdvantageConfig(GRPOAdvantageConfig) carries the role table + filter); the action-token loss component is an action_loss_type class property of the strategy, never configured; algo.model (alias teacher) folds fill-or-agree into the slot the type's ClassVars declare (model_role / source_role).
No preset machinery at all: the advantage shorthands ([orchestrator.advantage], per-env advantage = {...}) fold into algo.advantage on raw input; an env-level shorthand assembles the env's own algorithm rather than copy-modifying the inherited one, and env algorithm inheritance is a one-loop after-validator. Every AlgorithmConfig is built exactly once with everything in place: no preset tables, no name field, no merge machinery, no __pydantic_fields_set__ surgery.
Runtime (the orchestrator/algo/ package + orchestrator/sampler.py):
- Algorithm — base class with the two execution points: assign(rollouts) at group finalization (per-token advantage streams, synchronous) and async score(rollouts) at batch-ship time (reference prefill logprobs, bounded concurrency); finalize_group stamps the wire fields (the advantage stream — prompt-padded and sliced across samples — plus component weight streams). setup() connects an InferencePool to the algorithm's inline frozen reference (connect_frozen_pool — client-side only: prime-rl connects and waits, never launches) and tracks connected_pools for shutdown. The named classes above own their assign/score bodies outright; build_algorithm dispatches on advantage.type.
- Sampler — one per env, owns the rollout source: the generating pool (policy, or a frozen pool it connects in its own setup()), samples_from_live_policy, and sampling_args() (strips the logprobs knob for frozen endpoints). The dispatcher reads env.sampler for pool/liveness; the algorithm never sees sampling.
- The pipeline has zero algorithm conditionals: the dispatcher derives cache-salt/aging from the sampler's liveness; the sink calls finalize_group; orchestrator setup is one asyncio.gather over both objects' setup() and finalize_train_batch does one unconditional per-env score_batch gather (time/scoring).
Wire: TrainingSample/MicroBatch carry optional per-token rl_weights / ce_weights / ref_kl_weights, the per-token advantages stream, and ref_logprobs. Advantage is per-token from end to end: the scalar advantage field and the optional token_advantages are collapsed into one advantages: list[float] | None (None = no rl credit assigned — opd/opsd; legal only for samples without live rl member tokens, the trainer raises otherwise). Absent weight streams mean rl weight 1.0 on every trainable token, so plain GRPO ships one per-token stream (advantages — same order as completion_logprobs) and nothing else extra. Membership is per token, so samples of different algorithms pack freely into one micro batch (no mode-segregated bins). Reference-KL strategies must ship reference logprobs (not precomputed advantages) because the trainer evaluates the KL against live policy logprobs each microbatch.
Trainer: compute_loss runs each component over its weight stream (rl mask = loss_mask & rl_weights != 0; ce/ref_kl mask = weights != 0) and sums the per-component means; the three global denominators come from one batched all-reduce ([N_rl, N_ce, N_ref_kl]), so every rank issues the same collective. The no-stream hot path is byte-identical to the old single-loss_scale math with zero extra device syncs. Mixed-algorithm packing fix: bins mixing ref-bearing (e.g. OPD) and ref-less (e.g. GRPO) samples now keep ref_logprobs position-aligned with input_ids (0.0 placeholders, backfill-or-pad like the sibling per-token arrays), with a regression test and a post-pack invariant assert that every per-token array on every micro batch matches len(input_ids).
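The mixed-algorithm packing fix can be illustrated with a minimal sketch of backfill-or-pad. The dict-based sample layout and the name `pack_bin` are assumptions for this example; the real packer handles more per-token arrays, but the alignment rule is the same: `ref_logprobs` must stay position-aligned with `input_ids`, using 0.0 placeholders.

```python
from typing import Optional


def pack_bin(samples: list[dict]) -> dict:
    """Concatenate samples into one micro batch, keeping ref_logprobs aligned."""
    input_ids: list[int] = []
    ref_logprobs: Optional[list[float]] = None  # stays None if no sample has refs
    for sample in samples:
        ref = sample.get("ref_logprobs")
        if ref is not None and ref_logprobs is None:
            # First ref-bearing sample: backfill 0.0 for everything packed so far.
            ref_logprobs = [0.0] * len(input_ids)
        if ref_logprobs is not None:
            # Ref-less sample in a ref-bearing bin: pad its positions with 0.0.
            ref_logprobs.extend(
                ref if ref is not None else [0.0] * len(sample["input_ids"])
            )
        input_ids.extend(sample["input_ids"])
    packed = {"input_ids": input_ids, "ref_logprobs": ref_logprobs}
    # Post-pack invariant: every per-token array matches len(input_ids).
    for key, values in packed.items():
        if values is not None:
            assert len(values) == len(input_ids), f"{key} is misaligned"
    return packed


grpo_sample = {"input_ids": [1, 2, 3]}  # ref-less
opd_sample = {"input_ids": [4, 5], "ref_logprobs": [-0.5, -0.25]}  # ref-bearing
ref_after = pack_bin([grpo_sample, opd_sample])
ref_before = pack_bin([opd_sample, grpo_sample])
```

Both pack orders are exercised because the backfill path (ref-less first) and the pad path (ref-bearing first) are separate branches.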

Breaking changes (intentionally, no deprecation aliases)

orchestrator.training_mode is deleted (was rl/opd/sft) → [orchestrator.algo.advantage] type = ....
The advantage type names the algorithm: group_norm → grpo, ref_kl → opd, demo_ref_kl → opsd, supervised → sft (config classes renamed to match). The preset name field is deleted — type-plus-defaults is the algorithm.
Echo selection is a role table: observations / observation_weight are replaced by roles.<role>.alpha (tool bodies @ 0.1 default; setting any role replaces the table) plus an optional filter hook. Echo always requires orchestrator.renderer (the no-attribution "all" mode is gone).
sft requires a frozen sampling source — CE on the policy's own tokens is rejected at validation (previously silently assemblable).
[orchestrator.teacher] is deleted → inline [orchestrator.algo.model] (alias: [orchestrator.algo.teacher]) with name + base_url.
The policy sub-config is [orchestrator.model], flat: name / lora / vlm / trust_remote_code directly on it, the deployment under [orchestrator.model.client] — no more model.model stutter, in configs or resolved dumps (HostedModelConfig is now ModelConfig + a client field, mirroring FrozenModelConfig). The [orchestrator.policy] / [orchestrator.student] aliases, the [orchestrator.client] shorthand, and the flat-key re-nesting shim are deleted.
algorithm.token_scorer is deleted: scorers are advantage strategies now (logprobs → opd, demo_logprobs → opsd).
algo.loss does not exist: the action-token loss component derives from the advantage strategy, and observation-token routing is the echo advantage type (roles.<role>.alpha).
Advantage types: default → grpo; none is deleted (the ref_kl-family strategies carry its no-scalar semantics; nothing else used it).
Wire/trainer: TrainingSample.advantage (scalar) + token_advantages → one per-token advantages stream; AdvantageOutputs is deleted — custom advantage functions return list[list[float]] aligned to AdvantageInputs.completion_lengths (inputs.broadcast(...) for uniform credit), and advantage-based filters/metrics derive from the streams (zero-advantage filter = all-zero stream; logged distributions = per-rollout means). TrainingSample.teacher_logprobs → ref_logprobs; the loss-type partition (loss_type / token_loss_types / token_loss_weights / loss_type_ids, LOSS_CORE_*) is replaced by the component weight streams rl_weights / ce_weights / ref_kl_weights (the partition was the degenerate disjoint case); metric time/teacher_logprobs → time/scoring.
configs/debug/training_modes/ → configs/debug/algorithms/ (incl. new self_distill.toml running SDFT against the live policy); CI integration configs migrated.

Configs are extra="forbid", so stale configs fail loudly at parse time with the exact unknown key.

Merged with main including #2720 (teacherless SFT). Its spelling is deliberately not carried forward: sft means distillation from a frozen teacher, and pointing it at "policy" is rejected at validation — CE on the policy's own rollouts is a different algorithm, and if wanted it gets its own advantage type/class rather than a degenerate spelling of sft.

Also merged main including #2641 (multiplexed trainer token export): the export machinery composes with the component weight streams — each exported record carries the per-token rl/ce/ref_kl arrays, and micro-batches carry run_id/run_step alongside the streams (training_mode stays deleted).

Note: this PR absorbed #2764 (named algorithm classes → two-component model → Sampler split), #2778 (MaxRL), and #2782 (echo per-role selection + filter hook); those branches were merged into this one and GitHub marked their PRs merged.

Validation

Pre-review polish round: a full-branch review pass (parsimony / stale-reference / docs-accuracy sweep) landed two follow-ups — 48ac7ffb9 scrubs leftover preset vocabulary and stale type names from docs and debug configs (ref_kl → opd in the advantage tables, missing max_rl/reward/custom rows, one wandb project for the debug folder), and f5be6f4b3 fixes a degenerate-batch crash: a micro batch whose components are all empty (e.g. a distillation sample whose prompt alone exceeds trainer seq_len, so truncation strips every nonzero ce/ref_kl token while the stamped all-zero rl stream skips the rl branch) returned a Python float from compute_loss and crashed backward(). The loss is now seeded with a graph-attached zero, so the batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync — pinned by test_empty_components_keep_backward_valid, which fails on the prior code. A follow-up (0606f1272) namespaces ref_kl_loss_fn's trust-region metrics as ref_kl/* (mixed batches previously averaged the rl and ref_kl trust-region definitions into one wandb series), turns a missing advantage on an rl-member sample into a loud ValueError instead of silent zero-gradient training, pads weight streams with 0.0 so padded pure-ce batches read as rl-empty in token export, and pins the pack-boundary STREAM_FILL backfill with positional asserts in both pack orders. Verified by the full trainer+orchestrator unit suites (188 passed) and a 5-step opsd GPU smoke with the renamed keys confirmed in the wandb run.

On the type-rename + preset-layer deletion (full CPU unit suite 485 green excl. the pre-existing tilelang test_qwen3_5_moe* box failures; ruff clean; --dry-run parse of every flipped debug TOML; 5-step grpo GPU smoke via type = "grpo", reward 0.10→0.41, 128/128 trainable, exit 0).

039b4009f collapses the scalar/per-token advantage duality entirely — advantage is a per-token stream from the advantage function to the trainer. The wire's advantage scalar + optional token_advantages become one advantages stream; TrainRollout carries the same single field; AdvantageOutputs is deleted in favor of functions returning list[list[float]] aligned to AdvantageInputs.completion_lengths (with inputs.broadcast(...) for uniform group credit — GRPO's reward-minus-mean is internal math broadcast on the way out); the zero-advantage filter checks all-zero streams and logged distributions use per-rollout means. Verified by the full unit suite (488 passed) and two GPU smokes on the new wire: grpo (broadcast stream, eval 0.82 in 10 steps) and self_distill (advantages=None + ref_kl, all samples trainable, the missing-advantage tripwire silent).

On the echo selection surface (20-step GPU run on multi-turn alphabet-sort with an assembled user-role table, exit 0) — verified at the wire: all 32 samples of a shipped batch carry ce_weights with 5,240 nonzero tokens at exactly the configured α=0.1, and the orchestrator-internal completion_obs_weights field leaked into zero shipped samples. The filter hook is covered by shape-violation and composition unit tests.

On the final two-component shape (full CPU unit suite green — 526 passed, pre-existing test_qwen3_5_moe* tilelang failures on this box excluded; ruff clean; 5-step GPU smokes, 2-GPU, all exit 0):

echo on multi-turn alphabet-sort (3.1–3.6 turns, 32/32 trainable every step) — EchoAlgorithm + Sampler split end-to-end: observation tokens tagged, ce-weighted, and trained while staying out of the rl denominator.
self_distill — OPSDAlgorithm.score() against the live policy pool, demo-conditioned prefix rendered via renderer.render_ids (the policy's own prompt-rendering path).
grpo — the no-stream hot path (eval 0.16→0.40 in 5 steps).

On the preset-resolution refactor: full unit suite incl. the config-load sweep over every shipped TOML, resolved-config round-trip (model_dump → re-validate), legacy [[env]] layout, env-shorthand-on-inherited-preset and conflict tests, plus a 5-step grpo smoke (eval 0.13→0.55, 128/128 trainable).

On the component-sum shape (5-step smokes, all exit 0): grpo eval 0.13→0.53 (no-stream hot path — math identical to the old single-denominator path for single-component runs); echo with the ce stream carrying observation tokens (loss ≈ 0.03 while GRPO advantages are ~0, i.e. the gradient is the CE component); opd eval 0.067→0.684 with the frozen reference on :8001 (ref_kl stream on action tokens, rl stream zeroed, frozen pool initialized from the inline config).

Earlier 50-step runs validated the same orchestrator machinery on the loss-partition iteration of this branch (per-component normalization is numerically identical for these single-component runs):

grpo: eval reward 0.139→0.847 — default path end-to-end, wire unchanged.
opd: 0.106→0.828 with the frozen Qwen3-0.6B-Reverse-Text-RL server on :8001 declared inline — algorithm-owned pool wiring + bounded ship-time scoring + ref_kl routing.
sft_distill: 0.113→0.835 — frozen-sourced sampling (stable prefix cache, no off-policy aging) + supervised advantage + CE.
mixed grpo+opd (new configs/debug/algorithms/mixed_grpo_opd.toml): two envs with different algorithms in one run, 0.091→0.835 — every batch mixes ref-bearing (opd, 30–60% per step) and ref-less (grpo) samples, exercising heterogeneous packing end-to-end with the post-pack alignment assert live.
custom per-token advantages: 5-step run with a custom strategy emitting alternating scalar / scalar·0.5 per-token advantages (eval 0.099→0.681), verified via trainer token export: all 128 step-0 sequences carry the exact alternating pattern over the loss-masked region — custom-fn per-token lists → TrainingSample.advantages → trainer, prompt positions padded out (API has since collapsed to per-token streams everywhere).
self_distill vs policy: wiring healthy (no frozen pool created, demo_ref_kl scores against the live policy, no-scalar rollouts stay trainable). Known issue (not this PR's machinery): over 50 steps it degrades (eval 0.078→0.0, truncation →99%) — a verbosity spiral plausibly driven by the debug config's 128-token cap (~80% truncation at step 0) + the self-referential reference. Tracked for follow-up (truncation-aware demo scoring / larger budget / EMA reference).
atomic presets + echo tool-mode (full unit suite 483 green; 5-step grpo smoke on the refactored preset path, eval 0.10→0.53, exit 0). A second 50-step round of all presets on the merged branch reproduced round-1 results (grpo 0.845, opd 0.837, sft_distill 0.822, self_distill collapse).
opd without scalars (50-step A/B vs the group-norm baseline): group-norm scalars on OPD rollouts were dead weight — ref_kl_loss_fn zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter, which wrongly dropped uniform-reward OPD groups carrying full teacher-KL signal. Removing them ties the baseline (0.825 vs 0.828); OPD now ships advantage=None like OPSD and ref_kl_loss_fn reads no scalars (its trust region is the low side explicitly — bit-identical math).

Deferred (next PRs)

The two-component model is built so each of these lands as its own PR without re-touching the wire:

Sampling strategies behind the Sampler — today the Sampler answers one question (which pool, and is it live); next it absorbs within-env example iteration and group production, plus a sink→sampler observe() feedback edge at group finalization. In increasing order of machinery that unlocks: difficulty-pool curricula (per-example selection state fed back from group rewards), static dataset sources (supervised training from demonstrations instead of a frozen endpoint), and replay buffers / offline experience (a store between stamping and batch assembly — advantages and weight streams are already stamped-then-frozen at group finalization, which is the prerequisite).
Trainer collapse — the three fixed components reduce to two cores (policy-gradient × supervised) composed with per-token factor streams. Dissolves the rl-slot-only custom-loss wart and lets the stability guards (importance correction, trust region) compose around custom losses instead of being replaced by them.
Remaining ECHO surface — per-role α, arbitrary roles, and content/sampling-logprob filters shipped in this PR; still open: tool_names filtering (non-breaking optional field on the tool role; message_tool_names already rides in the attribution) and θ-dependent low-probability filters, which need a trainer-side per-component knob since the denominator collective happens pre-forward.
Smaller: config knobs for component sums (per-env α/β/γ folded into the streams orchestrator-side — the wire and trainer already support overlap); opt-in weight-sum normalization (Σw denominators — a globally-folded λ cancels under it, so it pairs with a trainer-side coefficient); per-slot loss-fn configs for ce/ref_kl (the ref_kl trust region — one-sided today — becomes a deliberate choice there); barrier-blind sink (the group wait becomes a dependency declared by the algorithm; finalize_group → finalize); EMA/lagged reference endpoints; exposing frozen endpoints to envs as judge clients.
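The remark that a globally folded λ cancels under weight-sum normalization can be checked in one line. As a sketch, write $\ell_t$ for one component's per-token loss, $w_t$ for its weight, and $N$ for the number of tokens with nonzero weight (symbols introduced here for illustration), and fold a coefficient $\lambda > 0$ into every weight:

$$\frac{\sum_t \lambda w_t\,\ell_t}{N} = \lambda\,\frac{\sum_t w_t\,\ell_t}{N}, \qquad \frac{\sum_t \lambda w_t\,\ell_t}{\sum_t \lambda w_t} = \frac{\sum_t w_t\,\ell_t}{\sum_t w_t}$$

With the token-count denominator (left) the coefficient survives as a scale on the component; with the $\Sigma w$ denominator (right) it drops out, which is why that option would need a separate trainer-side coefficient to keep the component's strength tunable.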

🤖 Generated with Claude Code

Note

High Risk
Breaking orchestrator/trainer config and wire format at the core RL training path; incorrect migration or mixed-algorithm packing could silently change gradients or crash distributed training.

Overview
Replaces orchestrator.training_mode (rl / opd / sft) with a composable [orchestrator.algo] bundle: sampling (policy vs inline frozen endpoint) and advantage.type (grpo, max_rl, opd, opsd, sft, echo, reward, custom), which names the algorithm and routes tokens to rl, ce, or ref_kl loss components.

Config breaking changes: top-level [orchestrator.model] (flat policy + client) replaces student / teacher blocks; frozen teachers move to [orchestrator.algo.teacher] (name + base_url). Debug configs move from training_modes/ to algorithms/ with expanded recipes (echo, self-distill, mixed GRPO+OPD).

Runtime: new orchestrator/algo/ named classes per type (assign / score / observation_weights), per-env Sampler for rollout source, dispatcher off-policy rules keyed on liveness, batch score_train_batch instead of mode-specific teacher logprobs. Trainer treats batches via per-token rl_weights / ce_weights / ref_kl_weights and ref_logprobs (renamed from teacher logprobs); docs and skills updated accordingly.

Reviewed by Cursor Bugbot for commit 639b7b5.

…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction: a preset bundle of (1) sampling source, (2) scoring (group advantage + async token scorer), and (3) per-token loss routing. The trainer becomes algorithm-blind: routing ships per token on the wire and the trainer executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets, component-level overrides, compatibility validation (incl. the group-relative-advantage-with-group_size=1 footgun warning); training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool per env (no mode branches); OPD teacher logprobs moved out of finalize_train_batch into a bounded-concurrency token scorer; demo-conditioned teacher scorer for SDFT; interleave_rollout can tag env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token cores/weights/advantages (omit_defaults: plain GRPO wire unchanged); packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke runs for grpo (reverse_text), opd (teacher pool + alias path), and echo (multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…hm strategy object

"Teacher" is no longer a concept anywhere in the system. There is the live policy (reserved registry key "policy") and named frozen hosted models under [orchestrator.models.<key>]; algorithm components hold references into that registry. The same entry can serve any number of envs' algorithms, and self_distill can point its demo scorer at "policy" itself (the SDFT paper's setting, zero extra deployments).

- configs: scorer types logprobs/demo_logprobs with required model refs; sampling.source is a registry key; algorithm.model shorthand folds into the unresolved component; orchestrator.teacher and training_mode deleted; student renamed policy; registry validation (refs resolve, entries used, "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and never branch on algorithm config; liveness drives cache salting, sampling logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl, time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands are now write-only / excluded from dumps) and apply_chat_template returning BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml running SDFT against the live policy); docs/skills updated

Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry 0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32 trainable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…gy ontology

Every training signal is an advantage, varying in granularity (group-scalar vs per-token) and evaluation site (orchestrator vs trainer). The advantage union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs -> demo_ref_kl), the action-token loss core derives from the strategy instead of being configured (loss.action deleted), and runtime AdvantageStrategy objects own both execution points: group-time assign() and ship-time score().

Also fixes a shorthand-folding regression: resolve_preset's component assignment polluted model_fields_set, so any [orchestrator.advantage] shorthand differing from the preset raised a bogus conflict error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ction # Conflicts: # packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py # src/prime_rl/orchestrator/dispatcher.py # src/prime_rl/orchestrator/orchestrator.py

A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones (grpo/echo) extended ref_logprobs without backfilling or padding, shifting it out of alignment with input_ids. Mirror the rewards/loss_core_ids pattern with 0.0 placeholders (already the outside-the-mask filler used by the demo scorer and pad_micro_batch). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
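A minimal sketch of the backfill pattern this commit describes, assuming a much-simplified packer. `Sample`, `pack_bin`, and `REF_FILL` are hypothetical names for illustration, not prime-rl's actual API; only the 0.0 placeholder and the alignment requirement come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass

REF_FILL = 0.0  # placeholder for positions that carry no reference logprob


@dataclass
class Sample:
    input_ids: list[int]
    # None for ref-less samples (e.g. grpo/echo); one value per token otherwise.
    ref_logprobs: list[float] | None = None


def pack_bin(samples: list[Sample]) -> tuple[list[int], list[float] | None]:
    """Concatenate samples into one bin, keeping ref_logprobs aligned to input_ids."""
    input_ids: list[int] = []
    ref_logprobs: list[float] | None = None
    for sample in samples:
        if sample.ref_logprobs is not None and ref_logprobs is None:
            # First ref-bearing sample in the bin: backfill everything packed so far.
            ref_logprobs = [REF_FILL] * len(input_ids)
        if ref_logprobs is not None:
            # Once the stream exists, ref-less samples are padded with placeholders.
            if sample.ref_logprobs is not None:
                ref_logprobs.extend(sample.ref_logprobs)
            else:
                ref_logprobs.extend([REF_FILL] * len(sample.input_ids))
        input_ids.extend(sample.input_ids)
    return input_ids, ref_logprobs
```

With this shape, a bin of only ref-less samples keeps `ref_logprobs` as `None`, while a mixed bin always ends with one entry per token regardless of pack order.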

Misaligned parallel arrays (the ref_logprobs packing bug class) now fail loudly at pack time instead of corrupting training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
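The check could look roughly like the sketch below, assuming the packed micro-batch is available as a mapping from stream name to per-token list. `assert_streams_aligned` and the stream names in the usage note are illustrative, not the trainer's real function or field list.

```python
from __future__ import annotations

from typing import Sequence


def assert_streams_aligned(
    streams: dict[str, Sequence | None], length_key: str = "input_ids"
) -> None:
    """Raise if any present per-token stream differs in length from the reference stream."""
    expected = len(streams[length_key])
    for name, values in streams.items():
        if values is None:
            continue  # absent optional streams are fine
        if len(values) != expected:
            raise ValueError(
                f"{name} has {len(values)} entries but {length_key} has {expected}"
            )
```

Calling it right after packing, for example `assert_streams_aligned({"input_ids": ids, "ref_logprobs": refs})`, turns a silent off-by-N shift into an immediate error.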

mikasenghaas · 2026-06-10T00:28:47Z

can we put this into the orch configs? i think rn we have one config per entrypoint

mikasenghaas · 2026-06-10T00:33:34Z

-Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields.
+Set `[trainer.loss] type = "default"` and configure via the knobs above. The `ce` and `ref_kl` cores are fixed and unaffected by `[trainer.loss]`.

 ### Custom Loss


isn't this obsolete?

we could make it obsolete by making the RL loss more configurable in the algo config I guess

The config surface key is now [orchestrator.algo] (per-env: algo = {...}); the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL, TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids, advantage.action_loss_type). Also scrubs stale token-scorer mentions from the ref_kl error message and the configs skill. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and `[orchestrator.student]` fold in as aliases, with the canonical key winning at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys still re-nest ([orchestrator.model] name = ...). Shared-field propagation checks all spellings for conflicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Two envs with different algorithms in one run — exercises heterogeneous train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…r references Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… registry prime-rl now assumes it only ever hosts the trainable policy. Frozen models are external endpoints declared inline on the algorithm component that uses them (FrozenModelConfig: model.name + required client.base_url) — no more [orchestrator.models] namespace or runtime ModelRegistry. Each env's Algorithm builds and readies its own frozen pools in async setup(); the dispatcher reads algorithm.sampling_pool and gets the policy pool directly. References are "policy" | inline config; demo_ref_kl now defaults to "policy" (the SDFT setting needs zero config). The algo.model shorthand folds with fill-or-agree semantics, which also fixes the two Bugbot findings (redundant-but-consistent model rejected; advantage shorthand clearing a folded model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
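One way to read "fill-or-agree" is sketched below. This is an assumption about the semantics based on the two Bugbot findings mentioned above, and `fold_shorthand` is a hypothetical helper, not the config package's validator.

```python
from __future__ import annotations

from typing import TypeVar

T = TypeVar("T")


def fold_shorthand(field: str, explicit: T | None, shorthand: T | None) -> T | None:
    """Fold a shorthand value into a component field.

    - fill: the component left the field unset, so the shorthand supplies it
    - agree: both are set and equal, which is redundant but accepted
    - anything else is a real conflict and is rejected
    """
    if shorthand is None:
        return explicit
    if explicit is None:
        return shorthand
    if explicit == shorthand:
        return explicit
    raise ValueError(f"{field}: shorthand {shorthand!r} conflicts with {explicit!r}")
```

Under this reading, a redundant-but-consistent model is no longer rejected, and folding never clears a value that is already set.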

A frozen model reference is the client config we already have plus the one request-level datum it lacks: the served model's name. Drops the nested {model, client} shape — TOML reads `[orchestrator.algo.model]` with name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning, which still read the deleted [orchestrator.models] dict. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e, no aliases HostedModelConfig was a pairing wrapper ({model, client}) that made the canonical path stutter (orchestrator.model.model.name) and needed a before-validator apologizing for it (re-nesting flat keys, folding the policy/student aliases and the [orchestrator.client] shorthand). The flat spelling was what everyone wrote anyway — make it the schema: HostedModelConfig is now ModelConfig + a client field, mirroring FrozenModelConfig (name + endpoint, no new declaration scheme), and fold_policy_shortcuts is deleted along with the policy/student aliases and the client shorthand. Resolved dumps stop stuttering too. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Echo's selection surface generalizes from the observations="tool"|"all" binary to a role table: each env-provided message role (system / user / assistant / tool) trains at its own alpha, selected via the renderer's per-token attribution. An optional filter hook (import_path + kwargs, matching the custom advantage/loss precedent) narrows the selection per rollout with one keep-mask per trajectory step.

- completion_obs_mask (bool) -> completion_obs_weights (float): the per-token weight carries its role's alpha, so stamping folds it into ce_weights directly and stamp_loss_routing drops the scalar observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1. Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs attribution); the blanket "all" mode is gone, so assemble the roles you want instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
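A rough sketch of how a role table could turn per-token role attribution into CE weights, assuming the renderer yields one role string per observation token. `obs_weights_for_step` and `DEFAULT_ECHO_ROLES` are illustrative names; only the tool-at-0.1 default and the optional keep-mask come from the commit message.

```python
from __future__ import annotations

# Default table per the commit message: tool-response bodies train at 0.1.
DEFAULT_ECHO_ROLES: dict[str, float] = {"tool": 0.1}


def obs_weights_for_step(
    token_roles: list[str],
    roles: dict[str, float] | None = None,
    keep_mask: list[bool] | None = None,
) -> list[float]:
    """Map each observation token to its role's alpha; unselected tokens get 0.0."""
    table = DEFAULT_ECHO_ROLES if roles is None else roles  # any override replaces the table
    if keep_mask is not None and len(keep_mask) != len(token_roles):
        raise ValueError("keep_mask must have one entry per observation token")
    weights = [table.get(role, 0.0) for role in token_roles]
    if keep_mask is not None:
        # The user filter can only narrow the selection, never widen it.
        weights = [w if keep else 0.0 for w, keep in zip(weights, keep_mask)]
    return weights
```

Because the weight already carries the role's alpha, a downstream stamping step can add it to the CE weight stream without a separate scalar.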

…eted The advantage type now names the algorithm — group_norm -> grpo, ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes renamed to match) — and each type's class defaults are its vetted setting, so 'type = "opd"' with a teacher IS on-policy distillation. With type-plus-defaults equal to the preset for every algorithm, the preset layer had nothing left to do: AlgorithmName, _PRESETS, the name field, and the atomicity guard are deleted. The model/teacher shorthand survives and now folds by the type's own declarations (model_role -> advantage.model for opd/opsd; source_role -> sampling.source for sft). sampling.source loses its None state (it existed only for preset resolution); sft without a frozen source is rejected at validation — CE on the policy's own tokens was never the vetted meaning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8e7e51f.

- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the missing max_rl/reward/custom rows, fix the frozen-model base_url and custom-Algorithm wording, make the length_penalty example self-sufficient, drop the Per-Env Advantage section (duplicate of Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type names, comments lose leftover preset vocabulary, one wandb project for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and a union example that parses on its own

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A fully truncated distillation sample (prompt >= trainer seq_len) loses all its nonzero ce/ref_kl tokens to prepare_sample's truncation while its stamped all-zero rl_weights suppress the rl branch; with every component empty, compute_loss returned the Python float 0.0 and loss.backward() crashed. Seed the rl accumulator with a graph-attached zero so the degenerate batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- ref_kl_loss_fn emitted the same trust-region metric keys as the rl loss fn into one shared dict, so mixed batches (per-env algorithms) averaged two different trust-region definitions into one wandb series. Namespaced as ref_kl/*; the wandb noise filter gets matching prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now raises instead of silently training with advantage 0.0. The orchestrator always stamps a scalar, so a missing one is a producer bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead of the pack-boundary defaults; padding is loss-masked so this is training-equivalent, and padded pure-ce batches now read as rl-empty in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks replaced with exact positional asserts in both pack orders, pinning STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
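The prepare_sample tripwire can be pictured as below. This is a simplified sketch assuming a scalar advantage and a per-token rl weight list, as the commit message describes at this point in the PR; `resolve_advantage` is a hypothetical name.

```python
from __future__ import annotations


def resolve_advantage(rl_weights: list[float], advantage: float | None) -> float:
    """Return the advantage to train with, failing loudly on a producer bug."""
    has_rl_tokens = any(w != 0.0 for w in rl_weights)
    if advantage is None:
        if has_rl_tokens:
            # rl member tokens without a stamped advantage would silently train at 0.0.
            raise ValueError("sample has rl member tokens but no advantage")
        return 0.0  # ce/ref_kl-only samples legitimately default to 0.0
    return advantage
```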

mikasenghaas · 2026-06-12T18:00:42Z

+        super().__init__(config, policy_pool, renderer)
+        assert isinstance(config.advantage, OPSDAdvantageConfig)
+        assert renderer is not None, "opsd requires the renderer (validated at config time)"
+        self.demo_key = config.advantage.demo_key


is this the privileged information? if so, don't like it

mikasenghaas · 2026-06-12T18:02:52Z

                continue
+            # Frozen-sourced rollouts never go stale — their sampler doesn't
+            # change with policy updates.
+            if not self.train_envs.get(meta.env_name).sampler.samples_from_live_policy:


Staleness measures drift between the sampling policy and the trainer policy — a frozen sampler doesn't drift, so off-policy aging doesn't apply to its rollouts (and salting its prefix cache per policy version would just evict a perfectly valid cache every weight update). Frozen-sourced rollouts (sft) train as fixed-teacher data, like offline SFT; the importance-ratio machinery is off for them anyway (ce component, no sampling logprobs).
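As a sketch of that rule, assuming staleness is measured in policy versions: `RolloutMeta`, `is_stale`, and `max_off_policy_steps` are hypothetical names, and only the `samples_from_live_policy` flag mirrors the diff above.

```python
from dataclasses import dataclass


@dataclass
class RolloutMeta:
    policy_version: int  # trainer step of the weights that sampled the rollout
    samples_from_live_policy: bool  # False when a frozen endpoint generated it


def is_stale(meta: RolloutMeta, current_version: int, max_off_policy_steps: int) -> bool:
    """Off-policy aging applies only to rollouts sampled from the live policy."""
    if not meta.samples_from_live_policy:
        # A frozen sampler does not drift with policy updates, so its rollouts never age.
        return False
    return current_version - meta.policy_version > max_off_policy_steps
```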

mikasenghaas · 2026-06-12T18:03:37Z

    config: TrainEnvConfig

-    def __init__(self, config: TrainEnvConfig):
+    def __init__(self, config: TrainEnvConfig, sampler: Sampler, algorithm: Algorithm):


seems weird to have train env contain sampler + algorithm? maybe not.. have to think

I think it actually makes sense. It should contain everything that is needed to return training samples and just be the env + all the stuff you need at train time, hence train env

algo/algorithm.py (all eight classes in one file) splits into one module per algorithm — grpo, echo, max_rl, opd, opsd, sft, reward, custom — plus base.py holding Algorithm, connect_frozen_pool, and score_train_batch. The dispatch table and build_algorithm move to the package __init__; the shared group-norm assign moves to advantage.py as assign_group_norm. No behavior change; external imports all go through the package and are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Echo's selection state (echo_roles / echo_filter_fn on the base class) was pipeline-visible configuration for behavior that lived in trajectories.py. Replace it with Algorithm.observation_weights(output) — one per-token ce-weight list per trajectory step; None (the default) masks all observations out. EchoAlgorithm owns the whole selection (role table, attribution lookup, user filter + its shape validation); interleave_rollout just validates alignment and slices each extension span; the train sink calls the hook and passes data. A custom algorithm can now implement any observation-token policy by overriding one method instead of forking interleave_rollout. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… phase entry points The three hooks are stages of one compilation (rollouts in, component weight streams out), but the sink still hand-composed phase 1 (observation_weights + interleave_rollout). Algorithm.build_samples now drives it, completing the pattern: the pipeline hands the algorithm its rollout / group / batch (build_samples / finalize_group / score) and never composes algorithm internals; subclasses override only the hooks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…cher] The alias existed; the shipped configs still used the role-neutral 'model' spelling. Configs should say what they mean — every teacher-meaning table flips to 'teacher' (per Mika's review). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

samsja

prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:

should we still have a utility or docs to help with doing this, though?

mikasenghaas

i still feel the algo api is a bit complicated to understand

mikasenghaas · 2026-06-12T20:48:34Z

+
+    # Per-token advantages (full sequence length). ``None`` broadcasts the
+    # rollout-level ``advantage`` scalar over the sequence.
+    token_advantages: list[float] | None = None


just advantages?

yeah let's unify and always do token-level

Done in 039b400 — and it went further than a rename: the scalar is gone everywhere, not just on the wire. One advantages: list[float] | None stream end to end (wire, rollout, advantage-fn API — AdvantageOutputs deleted, fns return per-token lists with inputs.broadcast(...) for uniform group credit). None = no rl credit (opd/opsd), exactly like the absent weight streams.

mikasenghaas · 2026-06-12T20:49:08Z

+    # Set up the loss function for the RL loss type (ce / ref_kl are fixed)
    logger.info(f"Setting up loss function ({config.loss})")
-    loss_fns = setup_loss_fns(config.loss)
+    rl_loss_fn = setup_rl_loss_fn(config.loss)


just loss_fn?

no, because it's just the custom loss fn for rl

There is never a scalar advantage anywhere in the pipeline:

- Wire: TrainingSample.advantage + token_advantages collapse into one advantages: list[float] | None stream (the fourth stream next to the rl/ce/ref_kl weights). None = no rl credit assigned (opd/opsd), legal only for samples without live rl member tokens; prepare_sample keeps the producer-bug tripwire.
- TrainRollout carries the same single field, aligned to its samples' completion tokens (concatenated in step order); rollout dumps keep a scalar view (mean) for logging.
- Advantage-fn API: AdvantageOutputs deleted. Functions return list[list[float]] aligned to inputs.completion_lengths, with inputs.broadcast(...) spreading uniform group credit; GRPO's reward-minus-mean is internal math the fn broadcasts on the way out.
- stamp_advantages (replaces spread_token_advantages) pads prompt positions with 0.0 and slices the stream across samples.
- ZeroAdvantageFilter checks for all-zero streams; logged advantage distributions use per-rollout means.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
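To make the per-token stream concrete, here is a small sketch of group-relative credit broadcast over completion tokens and then padded over the prompt. The function names are illustrative and the single-sample stamping is a simplification; the real stamp_advantages also slices the stream across a rollout's samples.

```python
from __future__ import annotations


def group_relative_advantages(
    rewards: list[float], completion_lengths: list[int]
) -> list[list[float]]:
    """Reward minus group mean, broadcast to one value per completion token."""
    mean = sum(rewards) / len(rewards)
    return [[r - mean] * n for r, n in zip(rewards, completion_lengths)]


def stamp_single_sample(prompt_len: int, completion_advantages: list[float]) -> list[float]:
    """Pad prompt positions with 0.0 so the stream spans the full sequence."""
    return [0.0] * prompt_len + list(completion_advantages)


def is_all_zero(advantages: list[float] | None) -> bool:
    """A filter on zero advantage checks the whole stream; None means no rl credit."""
    return advantages is None or all(a == 0.0 for a in advantages)
```

For two rollouts with rewards 1.0 and 0.0, the group mean is 0.5, so every completion token of the first rollout carries 0.5 and every token of the second carries -0.5.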

hallerite force-pushed the feat/algorithm-abstraction branch from fb0da20 to f35e3a8 Compare June 9, 2026 19:59

hallerite changed the title ~~feat: Algorithm abstraction — sampling/scoring/loss presets (grpo, opd, sft_distill, self_distill, echo)~~ feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026

hallerite changed the title ~~feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo)~~ feat: algorithm abstraction — model registry + unified advantage strategies (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026

hallerite marked this pull request as ready for review June 9, 2026 22:50

cursor Bot reviewed Jun 9, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py

Merge remote-tracking branch 'origin/main' into feat/algorithm-abstra…

3c3c3d1

…ction # Conflicts: # packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py # src/prime_rl/orchestrator/dispatcher.py # src/prime_rl/orchestrator/orchestrator.py

cursor Bot reviewed Jun 10, 2026

Comment thread src/prime_rl/trainer/batch.py

cursor Bot reviewed Jun 10, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated

feat(trainer): assert per-token array alignment after packing

fc79651

mikasenghaas reviewed Jun 10, 2026

hallerite and others added 2 commits June 10, 2026 00:57