feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo) #2746
…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction: a preset bundle of (1) sampling source, (2) scoring (group advantage + async token scorer), and (3) per-token loss routing. The trainer becomes algorithm-blind: routing ships per token on the wire and the trainer executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets, component-level overrides, compatibility validation (incl. the group-relative-advantage-with-group_size=1 footgun warning); training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool per env (no mode branches); OPD teacher logprobs moved out of finalize_train_batch into a bounded-concurrency token scorer; demo-conditioned teacher scorer for SDFT; interleave_rollout can tag env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token cores/weights/advantages (omit_defaults: plain GRPO wire unchanged); packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke runs for grpo (reverse_text), opd (teacher pool + alias path), and echo (multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
fb0da20 to f35e3a8
…hm strategy object "Teacher" is no longer a concept anywhere in the system. There is the live policy (reserved registry key "policy") and named frozen hosted models under [orchestrator.models.<key>]; algorithm components hold references into that registry. The same entry can serve any number of envs' algorithms, and self_distill can point its demo scorer at "policy" itself — the SDFT paper's setting, zero extra deployments. - configs: scorer types logprobs/demo_logprobs with required model refs; sampling.source is a registry key; algorithm.model shorthand folds into the unresolved component; orchestrator.teacher and training_mode deleted; student renamed policy; registry validation (refs resolve, entries used, "policy" reserved, degenerate logprobs@policy rejected) - runtime: ModelRegistry + per-env Algorithm strategy object as the sole interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and never branch on algorithm config; liveness drives cache salting, sampling logprobs, and off-policy aging (frozen-sourced rollouts no longer age) - wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl, time/scoring metric - fixes found by the new SDFT smoke: resolved-config round-trip (shorthands are now write-only / excluded from dumps) and apply_chat_template returning BatchEncoding on newer transformers - configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml running SDFT against the live policy); docs/skills updated Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry 0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32 trainable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gy ontology Every training signal is an advantage — varying in granularity (group-scalar vs per-token) and evaluation site (orchestrator vs trainer). The advantage union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs -> demo_ref_kl), the action-token loss core derives from the strategy instead of being configured (loss.action deleted), and runtime AdvantageStrategy objects own both execution points: group-time assign() and ship-time score(). Also fixes a shorthand-folding regression: resolve_preset's component assignment polluted model_fields_set, so any [orchestrator.advantage] shorthand differing from the preset raised a bogus conflict error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction

# Conflicts:
#   packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
#   src/prime_rl/orchestrator/dispatcher.py
#   src/prime_rl/orchestrator/orchestrator.py
A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones (grpo/echo) extended ref_logprobs without backfilling or padding, shifting it out of alignment with input_ids. Mirror the rewards/loss_core_ids pattern with 0.0 placeholders (already the outside-the-mask filler used by the demo scorer and pad_micro_batch). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Misaligned parallel arrays (the ref_logprobs packing bug class) now fail loudly at pack time instead of corrupting training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
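The two packing commits above (0.0 backfill for `ref_logprobs`, then a loud pack-time check) can be pictured with a small framework-free sketch. This is not the packer from the PR: `pack_bin` and the dict-shaped samples are hypothetical stand-ins, and only the backfill-or-pad idea and the length invariant are taken from the commit messages.

```python
def pack_bin(samples):
    """Concatenate samples into one bin, keeping ref_logprobs aligned with input_ids.

    Each sample is a dict with "input_ids" and, optionally, "ref_logprobs"
    (hypothetical shape, for illustration only).
    """
    input_ids = []
    ref_logprobs = None  # stays None while no sample in the bin carries refs

    for sample in samples:
        ids = sample["input_ids"]
        ref = sample.get("ref_logprobs")
        if ref is not None and len(ref) != len(ids):
            raise ValueError("sample ref_logprobs must match its input_ids")

        if ref is not None and ref_logprobs is None:
            # First ref-bearing sample: backfill 0.0 for everything packed so far.
            ref_logprobs = [0.0] * len(input_ids)
        if ref_logprobs is not None:
            # Ref-less samples get 0.0 placeholders so positions stay aligned.
            ref_logprobs.extend(ref if ref is not None else [0.0] * len(ids))
        input_ids.extend(ids)

    # Post-pack invariant: parallel arrays must have the same length.
    if ref_logprobs is not None and len(ref_logprobs) != len(input_ids):
        raise AssertionError("ref_logprobs is misaligned with input_ids")
    return input_ids, ref_logprobs
```

Packing a ref-less sample before or after a ref-bearing one gives the same alignment guarantee; only the position of the placeholders changes.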
can put into orch configs? i think rn we have one config per entrypoint
| Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields. | ||
| Set `[trainer.loss] type = "default"` and configure via the knobs above. The `ce` and `ref_kl` cores are fixed and unaffected by `[trainer.loss]`. | ||
| ### Custom Loss |
we could make it obsolete by making the RL loss more configurable in the algo config I guess
The config surface key is now [orchestrator.algo] (per-env: algo = {...});
the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL,
TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids,
advantage.action_loss_type). Also scrubs stale token-scorer mentions from
the ref_kl error message and the configs skill.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and `[orchestrator.student]` fold in as aliases, with the canonical key winning at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys still re-nest ([orchestrator.model] name = ...). Shared-field propagation checks all spellings for conflicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two envs with different algorithms in one run — exercises heterogeneous train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…r references Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… registry prime-rl now assumes it only ever hosts the trainable policy. Frozen models are external endpoints declared inline on the algorithm component that uses them (FrozenModelConfig: model.name + required client.base_url) — no more [orchestrator.models] namespace or runtime ModelRegistry. Each env's Algorithm builds and readies its own frozen pools in async setup(); the dispatcher reads algorithm.sampling_pool and gets the policy pool directly. References are "policy" | inline config; demo_ref_kl now defaults to "policy" (the SDFT setting needs zero config). The algo.model shorthand folds with fill-or-agree semantics, which also fixes the two Bugbot findings (redundant-but-consistent model rejected; advantage shorthand clearing a folded model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
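The "fill-or-agree" folding mentioned in this commit can be sketched in a few lines. The function name and signature below are hypothetical; the only behavior taken from the commit message is that a shorthand fills an empty slot, is accepted when it agrees with an explicit value, and is rejected when it contradicts one.

```python
def fold_fill_or_agree(slot, shorthand, slot_name="advantage.model"):
    """Fold a shorthand value into a config slot (illustrative sketch only)."""
    if shorthand is None:
        return slot            # nothing to fold
    if slot is None:
        return shorthand       # fill: the slot was left unset
    if slot == shorthand:
        return slot            # agree: redundant but consistent
    raise ValueError(
        f"shorthand {shorthand!r} contradicts explicit {slot_name} = {slot!r}"
    )
```

Under this reading, a redundant-but-consistent `model` is no longer an error, which matches the first of the two Bugbot findings described above.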
A frozen model reference is the client config we already have plus the one
request-level datum it lacks: the served model's name. Drops the nested
{model, client} shape — TOML reads `[orchestrator.algo.model]` with
name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning,
which still read the deleted [orchestrator.models] dict.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
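As a rough picture of the flat reference shape this commit describes (served model `name` plus endpoint), here is a stdlib-only sketch. `FrozenModelRef` and `parse_model_reference` are hypothetical stand-ins, not the PR's `FrozenModelConfig`, which per the description is the existing client config plus `name` and also admits an elastic deployment in place of `base_url`.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class FrozenModelRef:
    """Hypothetical stand-in for a frozen, externally hosted model reference."""
    name: str       # served model name sent with each request
    base_url: str   # OpenAI-compatible endpoint; required in this sketch

    def __post_init__(self):
        if not self.name or not self.base_url:
            raise ValueError("a frozen reference needs both name and base_url")


ModelReference = Union[Literal["policy"], FrozenModelRef]


def parse_model_reference(raw) -> ModelReference:
    """Accept the literal "policy" or an inline {name, base_url} table."""
    if raw == "policy":
        return "policy"
    if isinstance(raw, dict):
        unknown = set(raw) - {"name", "base_url"}
        if unknown:
            # Mirrors the fail-loudly-on-unknown-keys behavior in spirit.
            raise ValueError(f"unknown keys: {sorted(unknown)}")
        return FrozenModelRef(raw.get("name", ""), raw.get("base_url", ""))
    raise ValueError(f"not a model reference: {raw!r}")
```

A reference with no endpoint fails at construction time here, which is the sketch's analogue of failing at parse time.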
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, no aliases
HostedModelConfig was a pairing wrapper ({model, client}) that made the
canonical path stutter (orchestrator.model.model.name) and needed a
before-validator apologizing for it (re-nesting flat keys, folding the
policy/student aliases and the [orchestrator.client] shorthand). The
flat spelling was what everyone wrote anyway — make it the schema:
HostedModelConfig is now ModelConfig + a client field, mirroring
FrozenModelConfig (name + endpoint, no new declaration scheme), and
fold_policy_shortcuts is deleted along with the policy/student aliases
and the client shorthand. Resolved dumps stop stuttering too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection surface generalizes from the observations="tool"|"all" binary to a role table: each env-provided message role (system / user / assistant / tool) trains at its own alpha, selected via the renderer's per-token attribution. An optional filter hook (import_path + kwargs, matching the custom advantage/loss precedent) narrows the selection per rollout with one keep-mask per trajectory step. - completion_obs_mask (bool) -> completion_obs_weights (float): the per-token weight carries its role's alpha, so stamping folds it into ce_weights directly and stamp_loss_routing drops the scalar observation_weight parameter. Orchestrator-internal as before. - The echo preset is unchanged in meaning: tool-response bodies at 0.1. Setting any role replaces the whole table. - Echo now always requires the renderer (role selection needs attribution); the blanket "all" mode is gone — assemble the roles you want instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
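To make the role table concrete, here is a minimal sketch of turning per-token role attribution into per-token CE weights. It assumes a list `token_roles` with one entry per token (`None` for tokens that are not env-provided) and a flat keep-mask; the function name and both input shapes are illustrative assumptions, not the PR's internals.

```python
# Default table per the commit message: tool-response bodies at 0.1.
DEFAULT_ROLE_ALPHAS = {"tool": 0.1}


def role_ce_weights(token_roles, role_alphas=None, keep_mask=None):
    """Return one CE weight per token; 0.0 means the token is masked out."""
    # Setting any role replaces the whole table rather than merging into it.
    table = DEFAULT_ROLE_ALPHAS if role_alphas is None else role_alphas
    if keep_mask is not None and len(keep_mask) != len(token_roles):
        raise ValueError("keep_mask must have one entry per token")

    weights = []
    for i, role in enumerate(token_roles):
        alpha = table.get(role, 0.0) if role is not None else 0.0
        if keep_mask is not None and not keep_mask[i]:
            alpha = 0.0  # the optional filter can only narrow the selection
        weights.append(alpha)
    return weights
```

Because the weight already carries its role's alpha, downstream stamping can copy it into the CE weight stream without a separate scalar.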
…eted The advantage type now names the algorithm — group_norm -> grpo, ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes renamed to match) — and each type's class defaults are its vetted setting, so 'type = "opd"' with a teacher IS on-policy distillation. With type-plus-defaults equal to the preset for every algorithm, the preset layer had nothing left to do: AlgorithmName, _PRESETS, the name field, and the atomicity guard are deleted. The model/teacher shorthand survives and now folds by the type's own declarations (model_role -> advantage.model for opd/opsd; source_role -> sampling.source for sft). sampling.source loses its None state (it existed only for preset resolution); sft without a frozen source is rejected at validation — CE on the policy's own tokens was never the vetted meaning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8e7e51f.
- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the missing max_rl/reward/custom rows, fix the frozen-model base_url and custom-Algorithm wording, make the length_penalty example self-sufficient, drop the Per-Env Advantage section (duplicate of Per-Env Algorithms) - configs/debug/algorithms: README gains max_rl and uses the real type names, comments lose leftover preset vocabulary, one wandb project for the whole folder - docs/training.md / skills/configs/SKILL.md: complete type lists and a union example that parses on its own Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fully truncated distillation sample (prompt >= trainer seq_len) loses all its nonzero ce/ref_kl tokens to prepare_sample's truncation while its stamped all-zero rl_weights suppress the rl branch; with every component empty, compute_loss returned the Python float 0.0 and loss.backward() crashed. Seed the rl accumulator with a graph-attached zero so the degenerate batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
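The degenerate case is easier to see against the shape of the loss: a sum of per-component means, each over its own nonzero-weight tokens. The sketch below is plain Python on one process, so it models neither the global all-reduce nor the autograd graph the fix is about; the function names are hypothetical, and counting nonzero-weight tokens as the denominator is one reading of the PR description.

```python
def component_mean(token_losses, weights):
    """Weighted mean over tokens whose weight is nonzero; 0.0 if none remain."""
    total, count = 0.0, 0
    for loss, weight in zip(token_losses, weights):
        if weight != 0.0:      # a 0.0 weight drops the token from mask and denominator
            total += weight * loss
            count += 1
    return total / count if count else 0.0


def total_loss(components):
    """components maps a name (rl / ce / ref_kl) to (token_losses, weights)."""
    return sum(component_mean(losses, weights) for losses, weights in components.values())
```

When every component is empty, this returns a bare `0.0`. In a tensor framework that value has no graph behind it, which is the situation the commit addresses by seeding the accumulator with a graph-attached zero.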
- ref_kl_loss_fn emitted the same trust-region metric keys as the rl loss fn into one shared dict, so mixed batches (per-env algorithms) averaged two different trust-region definitions into one wandb series. Namespaced as ref_kl/*; the wandb noise filter gets matching prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now raises instead of silently training with advantage 0.0. The orchestrator always stamps a scalar, so a missing one is a producer bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead of the pack-boundary defaults; padding is loss-masked so this is training-equivalent, and padded pure-ce batches now read as rl-empty in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks replaced with exact positional asserts in both pack orders, pinning STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
| super().__init__(config, policy_pool, renderer) | ||
| assert isinstance(config.advantage, OPSDAdvantageConfig) | ||
| assert renderer is not None, "opsd requires the renderer (validated at config time)" | ||
| self.demo_key = config.advantage.demo_key |
is this the privileged information? if so, don't like it
| continue | ||
| # Frozen-sourced rollouts never go stale — their sampler doesn't | ||
| # change with policy updates. | ||
| if not self.train_envs.get(meta.env_name).sampler.samples_from_live_policy: |
Staleness measures drift between the sampling policy and the trainer policy — a frozen sampler doesn't drift, so off-policy aging doesn't apply to its rollouts (and salting its prefix cache per policy version would just evict a perfectly valid cache every weight update). Frozen-sourced rollouts (sft) train as fixed-teacher data, like offline SFT; the importance-ratio machinery is off for them anyway (ce component, no sampling logprobs).
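A minimal sketch of that rule, assuming integer policy versions and a step budget for off-policy rollouts. The function and parameter names are hypothetical; only the `samples_from_live_policy` flag and the no-salting-for-frozen-samplers point come from the discussion above.

```python
from typing import Optional


def is_stale(rollout_version: int, trainer_version: int,
             max_off_policy_steps: int, samples_from_live_policy: bool) -> bool:
    """Off-policy aging applies only to rollouts sampled from the live policy."""
    if not samples_from_live_policy:
        return False  # a frozen sampler does not drift with policy updates
    return (trainer_version - rollout_version) > max_off_policy_steps


def prefix_cache_salt(policy_version: int,
                      samples_from_live_policy: bool) -> Optional[str]:
    """Salt the prefix cache per policy version only for the live policy."""
    return f"policy-v{policy_version}" if samples_from_live_policy else None
```

Returning `None` for frozen samplers keeps their prefix cache valid across weight updates, which is the eviction concern raised in the reply.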
| config: TrainEnvConfig | ||
|
| def __init__(self, config: TrainEnvConfig): | ||
| def __init__(self, config: TrainEnvConfig, sampler: Sampler, algorithm: Algorithm): |
seems weird to have train env contain sampler + algorithm? maybe not.. have to think
I think it actually makes sense. It should contain everything that is needed to return training samples and just be the env + all the stuff you need at train time, hence train env
algo/algorithm.py (all eight classes in one file) splits into one module per algorithm — grpo, echo, max_rl, opd, opsd, sft, reward, custom — plus base.py holding Algorithm, connect_frozen_pool, and score_train_batch. The dispatch table and build_algorithm move to the package __init__; the shared group-norm assign moves to advantage.py as assign_group_norm. No behavior change; external imports all go through the package and are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection state (echo_roles / echo_filter_fn on the base class) was pipeline-visible configuration for behavior that lived in trajectories.py. Replace it with Algorithm.observation_weights(output) — one per-token ce-weight list per trajectory step; None (the default) masks all observations out. EchoAlgorithm owns the whole selection (role table, attribution lookup, user filter + its shape validation); interleave_rollout just validates alignment and slices each extension span; the train sink calls the hook and passes data. A custom algorithm can now implement any observation-token policy by overriding one method instead of forking interleave_rollout. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… phase entry points The three hooks are stages of one compilation (rollouts in, component weight streams out), but the sink still hand-composed phase 1 (observation_weights + interleave_rollout). Algorithm.build_samples now drives it, completing the pattern: the pipeline hands the algorithm its rollout / group / batch (build_samples / finalize_group / score) and never composes algorithm internals; subclasses override only the hooks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cher] The alias existed; the shipped configs still used the role-neutral 'model' spelling. Configs should say what they mean — every teacher-meaning table flips to 'teacher' (per Mika's review). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
samsja left a comment
prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:
should we have utility or docs tho to still help doing this ?
mikasenghaas left a comment
i still feel the algo api is a bit complicated to understand
| # Per-token advantages (full sequence length). ``None`` broadcasts the | ||
| # rollout-level ``advantage`` scalar over the sequence. | ||
| token_advantages: list[float] | None = None |
yeah lets unify and always do token-level
Done in 039b400 — and it went further than a rename: the scalar is gone everywhere, not just on the wire. One advantages: list[float] | None stream end to end (wire, rollout, advantage-fn API — AdvantageOutputs deleted, fns return per-token lists with inputs.broadcast(...) for uniform group credit). None = no rl credit (opd/opsd), exactly like the absent weight streams.
| # Set up the loss function for the RL loss type (ce / ref_kl are fixed) | ||
| logger.info(f"Setting up loss function ({config.loss})") | ||
| loss_fns = setup_loss_fns(config.loss) | ||
| rl_loss_fn = setup_rl_loss_fn(config.loss) |
no because its just the custom loss fn for rl
There is never a scalar advantage anywhere in the pipeline:

- Wire: TrainingSample.advantage + token_advantages collapse into one advantages: list[float] | None stream (the fourth stream next to the rl/ce/ref_kl weights). None = no rl credit assigned (opd/opsd), legal only for samples without live rl member tokens; prepare_sample keeps the producer-bug tripwire.
- TrainRollout carries the same single field, aligned to its samples' completion tokens (concatenated in step order); rollout dumps keep a scalar view (mean) for logging.
- Advantage-fn API: AdvantageOutputs deleted. Functions return list[list[float]] aligned to inputs.completion_lengths, with inputs.broadcast(...) spreading uniform group credit; GRPO's reward-minus-mean is internal math the fn broadcasts on the way out.
- stamp_advantages (replaces spread_token_advantages) pads prompt positions with 0.0 and slices the stream across samples.
- ZeroAdvantageFilter checks for all-zero streams; logged advantage distributions use per-rollout means.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
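The per-token stream described in this commit can be sketched as two small steps: broadcast group credit over completion tokens, then pad prompt positions and slice across samples. The free functions below only borrow the names `broadcast` and `stamp_advantages` from the commit message; their signatures and the `(prompt_len, completion_len)` tuples are assumptions made for illustration.

```python
def broadcast(credits, completion_lengths):
    """Spread one scalar credit per rollout uniformly over its completion tokens."""
    return [[credit] * n for credit, n in zip(credits, completion_lengths)]


def grpo_like_advantages(rewards, completion_lengths):
    """Reward minus group mean, broadcast to per-token lists on the way out."""
    mean = sum(rewards) / len(rewards)
    return broadcast([r - mean for r in rewards], completion_lengths)


def stamp_advantages(stream, samples):
    """Pad prompt positions with 0.0 and slice a rollout's stream across samples.

    stream covers the rollout's completion tokens in step order;
    samples is a list of (prompt_len, completion_len) pairs.
    """
    if len(stream) != sum(completion_len for _, completion_len in samples):
        raise ValueError("advantage stream must cover every completion token")
    stamped, offset = [], 0
    for prompt_len, completion_len in samples:
        stamped.append([0.0] * prompt_len + stream[offset:offset + completion_len])
        offset += completion_len
    return stamped
```

A custom advantage function that wants non-uniform credit would skip `broadcast` and return its own per-token lists of the same lengths.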
ca2f729 to 039b400
What
Makes prime-rl's training algorithm a first-class, hackable abstraction: removes model roles from the pipeline, unifies every training signal under one concept (the advantage: credit assignment and loss routing, fused), and makes each algorithm a named runtime class that owns its methods. There is no registry and no separate "token scorer": there is the live policy (the only model prime-rl ever hosts) and per-env algorithms whose model references are either `"policy"` or an inline externally-hosted frozen endpoint.

An algorithm is a bundle of two components, configured under `[orchestrator.algo]` (per-env: `algo = {...}` or the `advantage = {...}` shorthand on the env) and resolved per env. The advantage `type` names the algorithm (`grpo`, `max_rl`, `opd`, `opsd`, `sft`, `echo`, `reward`, `custom`) and each type's class defaults are its vetted setting; there is no separate preset layer:

- Sampling: `"policy"` or an inline frozen endpoint. At runtime this builds the env's `Sampler` (`orchestrator/sampler.py`): the pool rollouts come from, the liveness consequences (logprobs, prefix-cache salting, staleness), and the home of future sampling strategies (replay buffers, branching).
- Advantage: credit assignment and loss routing (`rl` / `ce` / `ref_kl`), and what happens to env-provided observation tokens (masked out by default; `echo` trains on them with weighted CE, selected by message role via the renderer's per-token attribution, each role at its own `alpha`, tool-response bodies at λ=0.1 by default, optionally narrowed by a user-supplied per-rollout token filter).

The training loss is a sum of three components, `rl` (DPPO+KL or custom), `ce` (masked NLL), and `ref_kl` (reverse KL to a reference as the PG signal), each normalized by its own global token count.

The orchestrator stamps per-token component weight streams (`rl_weights` / `ce_weights` / `ref_kl_weights`) plus the per-token advantage stream (`advantages`; there is no scalar advantage anywhere, and uniform group credit is broadcast over completion tokens at assignment). A weight scales that component's per-token loss, `0.0` removes the token from the component's mask and denominator, and components may overlap on the same token (gradients sum). Per-component normalization means components never dilute each other: echo's observation tokens no longer shrink the rl term's effective per-token learning rate (previously both shared one global denominator, so the rl gradient scaled with the batch's obs/action ratio), and a supervised env packed next to a GRPO env doesn't soften its gradient.

The trainer is algorithm-blind (component weight streams ship per token on the wire; the trainer executes the three fixed components), and the orchestrator pipeline is too: dispatcher, train sink, and orchestrator call hooks on each env's `Sampler` + `Algorithm` objects and never branch on algorithm config or model roles.

The algorithm classes
Each algorithm is a named runtime class: the algorithm object is the algorithm. Dispatch is keyed on `advantage.type`, which names the algorithm, and each config class's defaults are its vetted parameterization:

| `advantage.type` | class | components | `assign` (group time) / `score` (ship time), surviving text |
| --- | --- | --- | --- |
| `grpo` | `GRPOAlgorithm` | `rl` | |
| `max_rl` | `MaxRLAlgorithm` | `rl` | … `group_size` truncation of the max-likelihood objective) |
| `echo` | `EchoAlgorithm` | `rl` + `ce` | … `roles.<role>.alpha`, tool bodies @ 0.1 default), optional user token filter |
| `reward` | `RewardAlgorithm` | `rl` | |
| `opd` | `OPDAlgorithm` | `ref_kl` | … `advantages=None`, filters skip) |
| `opsd` | `OPSDAlgorithm` | `ref_kl` | … `advantages=None`, filters skip) … `"policy"`) |
| `sft` | `SFTDistillAlgorithm` | `ce` | |
| `custom` | `CustomAlgorithm` | `rl` | … `inputs.completion_lengths` (`inputs.broadcast(...)` spreads uniform group credit) |

Reading a class top to bottom reads the algorithm: one module per algorithm under `orchestrator/algo/` (`grpo.py`, `opd.py`, …, with the base class and pipeline hooks in `base.py` and dispatch in the package `__init__`). Writing your own is subclassing `Algorithm` and overriding the same hooks. A third hook, `observation_weights(output)`, decides per-token ce weights for env-provided observation tokens at sample-construction time (default `None` = masked; echo overrides it with role selection + the user filter). The hooks are stages of one compilation the base class drives: the pipeline hands the algorithm its rollout / group / batch (`build_samples` / `finalize_group` / `score`) and never composes algorithm internals itself. Duplication of orchestration between similar algorithms (OPD vs OPSD) is accepted so each class stays self-contained; shared math (group normalization, prefill alignment, length penalties) lives as plain functions in `algo/advantage.py`. Algorithms take exactly two runtime resources, `Algorithm(config, policy_pool, renderer)`; text → token ids always goes through the renderer, the same path the policy's own prompts take (`opsd` and `echo` require one, validated at config time: demo-conditioned scoring or role-attributed selection under MITO would diverge from the policy's own rendering, so they're rejected rather than approximated).

One deliberate expressiveness trade: loss routing is not a free config axis (there is no `algo.loss`; you can't write `opd` + observation-CE in TOML). Routing variation is algorithm variation, expressed as a class. ECHO is a proper advantage type, not a flag on GRPO. Its selection surface is configurable where it matters: per-role `alpha` (`roles.system/user/assistant/tool`; setting any role replaces the whole table) and an optional `filter` hook (`import_path` + `kwargs`, called once per rollout as `filter_fn(rollout, **kwargs) -> list[list[bool]]`, one keep-mask per trajectory step spanning that step's `prompt_ids + completion_ids`). The raw rollout exposes message text and sampling logprobs, so warning filters and low-probability filters are user code, no framework surface.

The algorithms
| `advantage.type` | loss components |
| --- | --- |
| `grpo` (default) | `rl` |
| `max_rl` (arXiv:2602.02710) | `rl` |
| `opd` | `ref_kl` (needs `teacher`) |
| `sft` (… `teacher` folds into `sampling.source`) | `ce` |
| `opsd` (SDFT, arXiv:2601.19897) | `ref_kl` vs `"policy"` by default |
| `echo` (ECHO) | `rl` on actions + per-role α·`ce` on env-provided tokens |

There is no preset layer: the type IS the algorithm, its class defaults ARE the vetted setting, and every key beyond `type` is visibly the user's own assembly (an earlier iteration had named atomic presets; with type-plus-defaults equal to the preset for every algorithm, the layer had nothing left to do and was deleted). Per-env algorithms compose in one run. Because a reference can be the literal `"policy"`, `opsd` runs the SDFT paper's setting with zero extra deployments; that's its default.

Model references: no registry; roles are algorithm-local
prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:
- `ModelReference = "policy" | FrozenModelConfig`, where `FrozenModelConfig` is just the existing `ClientConfig` plus the served model's `name`; no new declaration scheme. `base_url` (or an elastic deployment) is required: a frozen reference with no endpoint fails at parse time.
- `algo.model` is shorthand that folds into the slot the advantage type declares for its reference (`model_role` → `advantage.model` for opd/opsd, `source_role` → `sampling.source` for sft); redundant-but-consistent explicit settings are accepted, contradictions rejected.
- The role name is `"teacher"` (the `model_role` / `source_role` ClassVars), which makes `[orchestrator.algo.teacher]` a parse-time alias for the `model` shorthand and puts the same word in validation errors ("advantage 'opd' needs a teacher — set 'teacher' on the algorithm ..."). Roles stay strictly algorithm-local: no role ever reaches flow code or the wire.
- Validation: `opd` pointed at `"policy"` is rejected as degenerate (KL ≡ 0), frozen sampling can't feed an `rl`-type strategy (no policy sampling logprobs for importance ratios), `sft` without a frozen source is rejected (CE on the policy's own tokens is not a distillation target), `opsd` and `echo` require a renderer, and group-relative advantage with `group_size=1` warns loudly.

How
- Configs (`prime_rl.configs.algorithm`): one `advantage` discriminated union absorbs the former token scorers (`logprobs` / `demo_logprobs` → `opd` / `opsd` with a `ModelReference`) and loss routing (`EchoAdvantageConfig(GRPOAdvantageConfig)` carries the role table + filter); the action-token loss component is an `action_loss_type` class property of the strategy, never configured; `algo.model` (alias `teacher`) folds fill-or-agree into the slot the type's ClassVars declare (`model_role` / `source_role`). `advantage` shorthands (`[orchestrator.advantage]`, per-env `advantage = {...}`) fold into `algo.advantage` on raw input; an env-level shorthand assembles the env's own algorithm rather than copy-modifying the inherited one, and env algorithm inheritance is a one-loop after-validator. Every `AlgorithmConfig` is built exactly once with everything in place: no preset tables, no name field, no merge machinery, no `__pydantic_fields_set__` surgery.
- Runtime (`orchestrator/algo/` package + `orchestrator/sampler.py`):
  - `Algorithm`: base class with the two execution points, `assign(rollouts)` at group finalization (per-token advantage streams, synchronous) and `async score(rollouts)` at batch-ship time (reference prefill logprobs, bounded concurrency); `finalize_group` stamps the wire fields (the advantage stream, prompt-padded and sliced across samples, plus component weight streams). `setup()` connects an `InferencePool` to the algorithm's inline frozen reference (`connect_frozen_pool`; client-side only: prime-rl connects and waits, never launches) and tracks `connected_pools` for shutdown. The named classes above own their `assign` / `score` bodies outright; `build_algorithm` dispatches on `advantage.type`.
  - `Sampler`: one per env, owns the rollout source: the generating pool (policy, or a frozen pool it connects in its own `setup()`), `samples_from_live_policy`, and `sampling_args()` (strips the logprobs knob for frozen endpoints).
  - The dispatcher reads `env.sampler` for pool/liveness; the algorithm never sees sampling. … `finalize_group`; orchestrator setup is one `asyncio.gather` over both objects' `setup()` and `finalize_train_batch` does one unconditional per-env `score_batch` gather (`time/scoring`).
- Wire: `TrainingSample` / `MicroBatch` carry optional per-token `rl_weights` / `ce_weights` / `ref_kl_weights`, the per-token `advantages` stream, and `ref_logprobs`. Advantage is per-token from end to end: the scalar `advantage` field and the optional `token_advantages` are collapsed into one `advantages: list[float] | None` (None = no rl credit assigned, as for opd/opsd; legal only for samples without live rl member tokens, the trainer raises otherwise). Absent weight streams mean rl weight 1.0 on every trainable token, so plain GRPO ships one per-token stream (advantages, in the same order as `completion_logprobs`) and nothing else extra. Membership is per token, so samples of different algorithms pack freely into one micro batch (no mode-segregated bins). Reference-KL strategies must ship reference logprobs (not precomputed advantages) because the trainer evaluates the KL against live policy logprobs each microbatch.
- Trainer: `compute_loss` runs each component over its weight stream (rl mask = `loss_mask & rl_weights != 0`; ce/ref_kl mask = `weights != 0`) and sums the per-component means; the three global denominators come from one batched all-reduce (`[N_rl, N_ce, N_ref_kl]`), so every rank issues the same collective. The no-stream hot path is byte-identical to the old single-`loss_scale` math with zero extra device syncs. Mixed-algorithm packing fix: bins mixing ref-bearing (e.g. OPD) and ref-less (e.g. GRPO) samples now keep `ref_logprobs` position-aligned with `input_ids` (0.0 placeholders, backfill-or-pad like the sibling per-token arrays), with a regression test and a post-pack invariant assert that every per-token array on every micro batch matches `len(input_ids)`.

Breaking changes (intentionally, no deprecation aliases)
orchestrator.training_modeis deleted (wasrl/opd/sft) →[orchestrator.algo.advantage] type = ....typenames the algorithm:group_norm→grpo,ref_kl→opd,demo_ref_kl→opsd,supervised→sft(config classes renamed to match). The presetnamefield is deleted — type-plus-defaults is the algorithm.observations/observation_weightare replaced byroles.<role>.alpha(tool bodies @ 0.1 default; setting any role replaces the table) plus an optionalfilterhook. Echo always requiresorchestrator.renderer(the no-attribution"all"mode is gone).sftrequires a frozen sampling source — CE on the policy's own tokens is rejected at validation (previously silently assemblable).[orchestrator.teacher]is deleted → inline[orchestrator.algo.model](alias:[orchestrator.algo.teacher]) withname+base_url.[orchestrator.model], flat:name/lora/vlm/trust_remote_codedirectly on it, the deployment under[orchestrator.model.client]— no moremodel.modelstutter, in configs or resolved dumps (HostedModelConfigis nowModelConfig+ aclientfield, mirroringFrozenModelConfig). 
- The `[orchestrator.policy]`/`[orchestrator.student]` aliases, the `[orchestrator.client]` shorthand, and the flat-key re-nesting shim are deleted.
- `algorithm.token_scorer` is deleted: scorers are advantage strategies now (`logprobs`→`ref_kl`, `demo_logprobs`→`demo_ref_kl`).
- `algo.loss` does not exist: the action-token loss component derives from the advantage strategy, and observation-token routing is the `echo` advantage type (`observation_weight`).
- `default`→`group_norm`; `none` is deleted (the ref_kl-family strategies carry its no-scalar semantics; nothing else used it).
- `TrainingSample.advantage` (scalar) + `token_advantages` → one per-token `advantages` stream; `AdvantageOutputs` is deleted. Custom advantage functions return `list[list[float]]` aligned to `AdvantageInputs.completion_lengths` (`inputs.broadcast(...)` for uniform credit), and advantage-based filters/metrics derive from the streams (zero-advantage filter = all-zero stream; logged distributions = per-rollout means).
- `TrainingSample.teacher_logprobs`→`ref_logprobs`; the loss-type partition (`loss_type`/`token_loss_types`/`token_loss_weights`/`loss_type_ids`, `LOSS_CORE_*`) is replaced by the component weight streams `rl_weights`/`ce_weights`/`ref_kl_weights` (the partition was the degenerate disjoint case); metric `time/teacher_logprobs`→`time/scoring`.
- `configs/debug/training_modes/`→`configs/debug/algorithms/` (incl. new `self_distill.toml` running SDFT against the live policy); CI integration configs migrated.

Configs are `extra="forbid"`, so stale configs fail loudly at parse time with the exact unknown key.

Merged with `main` including #2720 (teacherless SFT). Its spelling is deliberately not carried forward: `sft` means distillation from a frozen teacher, and pointing it at `"policy"` is rejected at validation. CE on the policy's own rollouts is a different algorithm, and if wanted it gets its own advantage type/class rather than a degenerate spelling of `sft`.

Also merged `main` including #2641 (multiplexed trainer token export): the export machinery composes with the component weight streams. Each exported record carries the per-token rl/ce/ref_kl arrays, and micro-batches carry `run_id`/`run_step` alongside the streams (`training_mode` stays deleted).

Validation
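For reference, the per-token advantage contract from the breaking-changes list above can be sketched as follows. This is a minimal illustration, not the PR's code: `AdvantageInputsSketch`, its `rewards` field, and the exact signature of `broadcast` are assumptions made for the example; only the return shape (`list[list[float]]` aligned to `completion_lengths`), the reward-minus-mean broadcast, the all-zero-stream filter, and the per-rollout means come from the description.

```python
from dataclasses import dataclass


@dataclass
class AdvantageInputsSketch:
    """Hypothetical stand-in for AdvantageInputs; only the fields used here."""

    rewards: list[float]           # assumed field: one scalar reward per rollout
    completion_lengths: list[int]  # completion token count per rollout

    def broadcast(self, per_rollout: list[float]) -> list[list[float]]:
        """Repeat each rollout's scalar across its completion tokens (assumed signature)."""
        return [[value] * length for value, length in zip(per_rollout, self.completion_lengths)]


def group_mean_advantage(inputs: AdvantageInputsSketch) -> list[list[float]]:
    """Reward minus group mean, broadcast to one advantage per completion token."""
    mean = sum(inputs.rewards) / len(inputs.rewards)
    return inputs.broadcast([reward - mean for reward in inputs.rewards])


def is_zero_advantage(stream: list[float]) -> bool:
    """Stream form of the zero-advantage filter: every token advantage is 0."""
    return all(a == 0.0 for a in stream)


inputs = AdvantageInputsSketch(rewards=[1.0, 0.0, 0.5], completion_lengths=[3, 2, 4])
advantages = group_mean_advantage(inputs)
# Logged distributions would use one mean per rollout rather than one scalar per sample.
per_rollout_means = [sum(stream) / len(stream) for stream in advantages]
```

A custom function that assigns non-uniform credit would build the inner lists directly instead of calling `broadcast`, as long as each inner list has exactly `completion_lengths[i]` entries.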
Pre-review polish round: a full-branch review pass (parsimony / stale-reference / docs-accuracy sweep) landed two follow-ups: `48ac7ffb9` scrubs leftover preset vocabulary and stale type names from docs and debug configs (`ref_kl`→`opd` in the advantage tables, missing `max_rl`/`reward`/`custom` rows, one wandb project for the debug folder), and `f5be6f4b3` fixes a degenerate-batch crash: a micro batch whose components are all empty (e.g. a distillation sample whose prompt alone exceeds trainer `seq_len`, so truncation strips every nonzero ce/ref_kl token while the stamped all-zero rl stream skips the rl branch) returned a Python float from `compute_loss` and crashed `backward()`. The loss is now seeded with a graph-attached zero, so the batch trains as a zero-gradient no-op (`main`'s behavior) and every rank still runs backward, keeping FSDP collectives in sync. This is pinned by `test_empty_components_keep_backward_valid`, which fails on the prior code. A follow-up (`0606f1272`) namespaces `ref_kl_loss_fn`'s trust-region metrics as `ref_kl/*` (mixed batches previously averaged the rl and ref_kl trust-region definitions into one wandb series), turns a missing advantage on an rl-member sample into a loud `ValueError` instead of silent zero-gradient training, pads weight streams with 0.0 so padded pure-ce batches read as rl-empty in token export, and pins the pack-boundary `STREAM_FILL` backfill with positional asserts in both pack orders. Verified by the full trainer+orchestrator unit suites (188 passed) and a 5-step opsd GPU smoke with the renamed keys confirmed in the wandb run.

On the type-rename + preset-layer deletion (full CPU unit suite 485 green excl. the pre-existing tilelang `test_qwen3_5_moe*` box failures; ruff clean; `--dry-run` parse of every flipped debug TOML; 5-step grpo GPU smoke via `type = "grpo"`, reward 0.10→0.41, 128/128 trainable, exit 0).

`039b4009f` collapses the scalar/per-token advantage duality entirely: advantage is a per-token stream from the advantage function to the trainer. The wire's `advantage` scalar + optional `token_advantages` become one `advantages` stream; `TrainRollout` carries the same single field; `AdvantageOutputs` is deleted in favor of functions returning `list[list[float]]` aligned to `AdvantageInputs.completion_lengths` (with `inputs.broadcast(...)` for uniform group credit; GRPO's reward-minus-mean is internal math broadcast on the way out); the zero-advantage filter checks all-zero streams and logged distributions use per-rollout means. Verified by the full unit suite (488 passed) and two GPU smokes on the new wire: grpo (broadcast stream, eval 0.82 in 10 steps) and self_distill (`advantages=None` + ref_kl, all samples trainable, the missing-advantage tripwire silent).

On the echo selection surface (20-step GPU run on multi-turn `alphabet-sort` with an assembled user-role table, exit 0), verified at the wire: all 32 samples of a shipped batch carry `ce_weights` with 5,240 nonzero tokens at exactly the configured α=0.1, and the orchestrator-internal `completion_obs_weights` field leaked into zero shipped samples. The filter hook is covered by shape-violation and composition unit tests.

On the final two-component shape (full CPU unit suite green, 526 passed, pre-existing `test_qwen3_5_moe*` tilelang failures on this box excluded; ruff clean; 5-step GPU smokes, 2-GPU, all exit 0):

- `alphabet-sort` (3.1–3.6 turns, 32/32 trainable every step): `EchoAlgorithm` + `Sampler` split end-to-end, with observation tokens tagged, ce-weighted, and trained while staying out of the rl denominator.
- `OPSDAlgorithm.score()` against the live policy pool, demo-conditioned prefix rendered via `renderer.render_ids` (the policy's own prompt-rendering path).

On the preset-resolution refactor: full unit suite incl. the config-load sweep over every shipped TOML, resolved-config round-trip (`model_dump` → re-validate), legacy `[[env]]` layout, env-shorthand-on-inherited-preset and conflict tests, plus a 5-step grpo smoke (eval 0.13→0.55, 128/128 trainable).

On the component-sum shape (5-step smokes, all exit 0): grpo eval 0.13→0.53 (no-stream hot path; math identical to the old single-denominator path for single-component runs); echo with the ce stream carrying observation tokens (loss ≈ 0.03 while GRPO advantages are ~0, i.e. the gradient is the CE component); opd eval 0.067→0.684 with the frozen reference on :8001 (`ref_kl` stream on action tokens, rl stream zeroed, frozen pool initialized from the inline config).

Earlier 50-step runs validated the same orchestrator machinery on the loss-partition iteration of this branch (per-component normalization is numerically identical for these single-component runs):
- `Qwen3-0.6B-Reverse-Text-RL` server on :8001 declared inline: algorithm-owned pool wiring + bounded ship-time scoring + `ref_kl` routing.
- `supervised` advantage + CE.
- (`configs/debug/algorithms/mixed_grpo_opd.toml`): two envs with different algorithms in one run, 0.091→0.835. Every batch mixes ref-bearing (opd, 30–60% per step) and ref-less (grpo) samples, exercising heterogeneous packing end-to-end with the post-pack alignment assert live.
- `scalar / scalar·0.5` per-token advantages (eval 0.099→0.681), verified via trainer token export: all 128 step-0 sequences carry the exact alternating pattern over the loss-masked region; custom-fn per-token lists → `TrainingSample.advantages` → trainer, prompt positions padded out (API has since collapsed to per-token streams everywhere).
- (`demo_ref_kl` scores against the live policy, no-scalar rollouts stay trainable). Known issue (not this PR's machinery): over 50 steps it degrades (eval 0.078→0.0, truncation →99%), a verbosity spiral plausibly driven by the debug config's 128-token cap (~80% truncation at step 0) + the self-referential reference. Tracked for follow-up (truncation-aware demo scoring / larger budget / EMA reference).
- `ref_kl_loss_fn` zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter, which wrongly dropped uniform-reward OPD groups carrying full teacher-KL signal. Removing them ties the baseline (0.825 vs 0.828); OPD now ships `advantage=None` like OPSD and `ref_kl_loss_fn` reads no scalars (its trust region is the low side explicitly; bit-identical math).

Deferred (next PRs)
The two-component model is built so each of these lands as its own PR without re-touching the wire:
- `Sampler`: today the Sampler answers one question (which pool, and is it live); next it absorbs within-env example iteration and group production, plus a sink→sampler `observe()` feedback edge at group finalization. In increasing order of machinery that unlocks: difficulty-pool curricula (per-example selection state fed back from group rewards), static dataset sources (supervised training from demonstrations instead of a frozen endpoint), and replay buffers / offline experience (a store between stamping and batch assembly; advantages and weight streams are already stamped-then-frozen at group finalization, which is the prerequisite).
- `tool_names` filtering (non-breaking optional field on the tool role; `message_tool_names` already rides in the attribution) and θ-dependent low-probability filters, which need a trainer-side per-component knob since the denominator collective happens pre-forward.
- (`Σw` denominators: a globally-folded λ cancels under it, so it pairs with a trainer-side coefficient); per-slot loss-fn configs for ce/ref_kl (the ref_kl trust region, one-sided today, becomes a deliberate choice there); barrier-blind sink (the group wait becomes a dependency declared by the algorithm; `finalize_group`→`finalize`); EMA/lagged reference endpoints; exposing frozen endpoints to envs as judge clients.

🤖 Generated with Claude Code
Note
High Risk
Breaking orchestrator/trainer config and wire format at the core RL training path; incorrect migration or mixed-algorithm packing could silently change gradients or crash distributed training.
Overview
Replaces `orchestrator.training_mode` (`rl`/`opd`/`sft`) with a composable `[orchestrator.algo]` bundle: `sampling` (policy vs inline frozen endpoint) and `advantage.type` (`grpo`, `max_rl`, `opd`, `opsd`, `sft`, `echo`, `reward`, `custom`), which names the algorithm and routes tokens to `rl`, `ce`, or `ref_kl` loss components.

Config breaking changes: top-level `[orchestrator.model]` (flat policy + client) replaces `student`/`teacher` blocks; frozen teachers move to `[orchestrator.algo.teacher]` (`name` + `base_url`). Debug configs move from `training_modes/` to `algorithms/` with expanded recipes (echo, self-distill, mixed GRPO+OPD).

Runtime: new `orchestrator/algo/` named classes per type (`assign`/`score`/`observation_weights`), per-env `Sampler` for rollout source, dispatcher off-policy rules keyed on liveness, batch `score_train_batch` instead of mode-specific teacher logprobs. Trainer treats batches via per-token `rl_weights`/`ce_weights`/`ref_kl_weights` and `ref_logprobs` (renamed from teacher logprobs); docs and skills updated accordingly.

Reviewed by Cursor Bugbot for commit 639b7b5.
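The per-token routing summarized in the overview can be illustrated with a small sketch. It assumes, based on the description's mentions of per-component normalization and `Σw` denominators, that each component (`rl`, `ce`, `ref_kl`) is a weighted mean over its own weight stream and that the components are summed. The function name `component_sum_loss` and the dict-based layout are hypothetical; the real trainer works on tensors and computes denominators with a cross-rank collective, which this sketch omits.

```python
def component_sum_loss(
    token_losses: dict[str, list[float]],
    weights: dict[str, list[float]],
) -> float:
    """Sum of per-component weighted means, each over its own sum-of-weights denominator."""
    total = 0.0
    for name, stream in weights.items():
        denom = sum(stream)
        if denom == 0.0:
            continue  # an empty component contributes nothing
        losses = token_losses[name]
        total += sum(w * loss for w, loss in zip(stream, losses)) / denom
    return total


# Four tokens: two action tokens routed to rl, two observation tokens routed to ce.
token_losses = {
    "rl": [2.0, 4.0, 0.0, 0.0],
    "ce": [0.0, 0.0, 1.0, 3.0],
    "ref_kl": [0.0, 0.0, 0.0, 0.0],
}
weights = {
    "rl": [1.0, 1.0, 0.0, 0.0],
    "ce": [0.0, 0.0, 0.1, 0.1],  # observation tokens at alpha = 0.1
    "ref_kl": [0.0, 0.0, 0.0, 0.0],
}
loss = component_sum_loss(token_losses, weights)

# Scaling one component's weights uniformly leaves the result unchanged under this
# normalization, which is why a global coefficient would need a separate trainer-side knob.
scaled = dict(weights, ce=[0.0, 0.0, 1.0, 1.0])
loss_scaled = component_sum_loss(token_losses, scaled)
```

In this sketch the observation tokens never enter the rl denominator, and a component whose stream is all zeros is skipped rather than dividing by zero.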