feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo)#2746
Open
hallerite wants to merge 47 commits into
…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction: a preset bundle of (1) sampling source, (2) scoring (group advantage + async token scorer), and (3) per-token loss routing. The trainer becomes algorithm-blind: routing ships per token on the wire and the trainer executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets, component-level overrides, compatibility validation (incl. the group-relative-advantage-with-group_size=1 footgun warning); training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool per env (no mode branches); OPD teacher logprobs moved out of finalize_train_batch into a bounded-concurrency token scorer; demo-conditioned teacher scorer for SDFT; interleave_rollout can tag env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token cores/weights/advantages (omit_defaults: plain GRPO wire unchanged); packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke runs for grpo (reverse_text), opd (teacher pool + alias path), and echo (multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
fb0da20 to f35e3a8
…hm strategy object

"Teacher" is no longer a concept anywhere in the system. There is the live policy (reserved registry key "policy") and named frozen hosted models under [orchestrator.models.<key>]; algorithm components hold references into that registry. The same entry can serve any number of envs' algorithms, and self_distill can point its demo scorer at "policy" itself: the SDFT paper's setting, zero extra deployments.

- configs: scorer types logprobs/demo_logprobs with required model refs; sampling.source is a registry key; algorithm.model shorthand folds into the unresolved component; orchestrator.teacher and training_mode deleted; student renamed policy; registry validation (refs resolve, entries used, "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and never branch on algorithm config; liveness drives cache salting, sampling logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl, time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands are now write-only / excluded from dumps) and apply_chat_template returning BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml running SDFT against the live policy); docs/skills updated

Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry 0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32 trainable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gy ontology Every training signal is an advantage — varying in granularity (group-scalar vs per-token) and evaluation site (orchestrator vs trainer). The advantage union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs -> demo_ref_kl), the action-token loss core derives from the strategy instead of being configured (loss.action deleted), and runtime AdvantageStrategy objects own both execution points: group-time assign() and ship-time score(). Also fixes a shorthand-folding regression: resolve_preset's component assignment polluted model_fields_set, so any [orchestrator.advantage] shorthand differing from the preset raised a bogus conflict error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ction

# Conflicts:
#   packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
#   src/prime_rl/orchestrator/dispatcher.py
#   src/prime_rl/orchestrator/orchestrator.py
A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones (grpo/echo) extended ref_logprobs without backfilling or padding, shifting it out of alignment with input_ids. Mirror the rewards/loss_core_ids pattern with 0.0 placeholders (already the outside-the-mask filler used by the demo scorer and pad_micro_batch). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Misaligned parallel arrays (the ref_logprobs packing bug class) now fail loudly at pack time instead of corrupting training silently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
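The backfill-or-pad rule and the pack-time invariant from the two commits above can be sketched roughly as follows. This is a simplified illustration, not the branch's packer: `pack_bin`, `check_aligned`, and the dict-shaped samples are hypothetical stand-ins, and only the `ref_logprobs` stream is modelled.

```python
from typing import Optional


def check_aligned(batch: dict) -> None:
    """Raise if any per-token stream differs in length from input_ids."""
    n = len(batch["input_ids"])
    for key, stream in batch.items():
        if stream is not None and len(stream) != n:
            raise ValueError(f"{key}: {len(stream)} entries, expected {n}")


def pack_bin(samples: list[dict]) -> dict:
    """Concatenate samples into one micro batch (hypothetical, simplified)."""
    input_ids: list[int] = []
    ref_logprobs: Optional[list[float]] = None  # stays None if no sample has refs

    for sample in samples:
        ids = sample["input_ids"]
        ref = sample.get("ref_logprobs")
        if ref is not None and ref_logprobs is None:
            # First ref-bearing sample: backfill all earlier positions.
            ref_logprobs = [0.0] * len(input_ids)
        if ref_logprobs is not None:
            # Ref-less samples get 0.0 placeholders so positions stay aligned.
            ref_logprobs.extend(ref if ref is not None else [0.0] * len(ids))
        input_ids.extend(ids)

    batch = {"input_ids": input_ids, "ref_logprobs": ref_logprobs}
    check_aligned(batch)  # fail loudly at pack time instead of training on garbage
    return batch


grpo = {"input_ids": [1, 2, 3]}  # ref-less sample
opd = {"input_ids": [4, 5], "ref_logprobs": [-0.5, -0.25]}  # ref-bearing sample
print(pack_bin([grpo, opd])["ref_logprobs"])
print(pack_bin([opd, grpo])["ref_logprobs"])
```

Packing in both orders matters because the bug class is order-dependent: a ref-less sample first needs backfill, a ref-less sample second needs padding.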
Member
can put into orch configs? i think rn we have one config per entrypoint
| Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields. | ||
| Set `[trainer.loss] type = "default"` and configure via the knobs above. The `ce` and `ref_kl` cores are fixed and unaffected by `[trainer.loss]`. | ||
| ### Custom Loss |
Member
Author
we could make it obsolete by making the RL loss more configurable in the algo config I guess
The config surface key is now [orchestrator.algo] (per-env: algo = {...});
the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL,
TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids,
advantage.action_loss_type). Also scrubs stale token-scorer mentions from
the ref_kl error message and the configs skill.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and `[orchestrator.student]` fold in as aliases, with the canonical key winning at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys still re-nest ([orchestrator.model] name = ...). Shared-field propagation checks all spellings for conflicts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
samsja
reviewed
Jun 10, 2026
Two envs with different algorithms in one run — exercises heterogeneous train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…r references Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… registry prime-rl now assumes it only ever hosts the trainable policy. Frozen models are external endpoints declared inline on the algorithm component that uses them (FrozenModelConfig: model.name + required client.base_url) — no more [orchestrator.models] namespace or runtime ModelRegistry. Each env's Algorithm builds and readies its own frozen pools in async setup(); the dispatcher reads algorithm.sampling_pool and gets the policy pool directly. References are "policy" | inline config; demo_ref_kl now defaults to "policy" (the SDFT setting needs zero config). The algo.model shorthand folds with fill-or-agree semantics, which also fixes the two Bugbot findings (redundant-but-consistent model rejected; advantage shorthand clearing a folded model). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A frozen model reference is the client config we already have plus the one
request-level datum it lacks: the served model's name. Drops the nested
{model, client} shape — TOML reads `[orchestrator.algo.model]` with
name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning,
which still read the deleted [orchestrator.models] dict.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
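As a rough sketch of the flat reference shape this commit describes (`"policy"` or an inline `name` + `base_url`), the parse-time rule might look like the following. `FrozenModelRef` and `parse_model_reference` are hypothetical names; the real `FrozenModelConfig` carries the full client config, which is not modelled here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FrozenModelRef:
    """Hypothetical stand-in for a frozen, externally hosted endpoint."""
    name: str      # served model name sent with each request
    base_url: str  # endpoint prime-rl connects to but never launches


ModelReference = Union[str, FrozenModelRef]


def parse_model_reference(raw) -> ModelReference:
    """Accept the literal "policy" or a flat {name, base_url} table."""
    if raw == "policy":
        return "policy"
    if not isinstance(raw, dict):
        raise ValueError(f"expected 'policy' or a table, got {raw!r}")
    unknown = set(raw) - {"name", "base_url"}
    if unknown:
        # Stale or misspelled keys fail loudly rather than being ignored.
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if not raw.get("name") or not raw.get("base_url"):
        raise ValueError("a frozen reference needs both 'name' and 'base_url'")
    return FrozenModelRef(name=raw["name"], base_url=raw["base_url"])


# Placeholder values, not taken from the PR.
ref = parse_model_reference(
    {"name": "teacher-model", "base_url": "http://localhost:8001/v1"}
)
print(ref)
```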
…ction

# Conflicts:
#   skills/configs/SKILL.md
#   src/prime_rl/trainer/rl/data.py
#   tests/unit/test_configs.py
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the implicit success probability instead of pass@1: the policy gradient averaged over successful rollouts only is unbiased for the order-group_size truncation of the ML objective's pass@k expansion. In estimator form that is one change to GRPO — normalize the centered group reward by the group MEAN instead of the standard deviation, upweighting low-pass-rate examples like 1/p. group_size becomes the truncation order (REINFORCE at 1, exact ML in the limit). New 'max_rl' advantage type + preset: MaxRLAdvantageConfig, max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs rows, and a unit test for the estimator. Groups with zero mean reward carry zero advantages (the paper's no-success convention — the zero-advantage filter drops them). Everything else rides the existing GRPO path: policy sampling (enforced by the rl-component guard), rl loss component, group barrier. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
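The estimator change described in this commit can be illustrated with a small sketch. The functions below are illustrative only: they are not the branch's `max_rl_advantage_fn`, and the GRPO variant assumes population standard deviation plus a small `eps`, which may differ from the real implementation.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Centered group reward divided by the group standard deviation (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def max_rl_advantages(rewards: list[float]) -> list[float]:
    """Centered group reward divided by the group MEAN instead of the std."""
    mu = mean(rewards)
    if mu == 0:
        # No-success convention: the whole group carries zero advantage.
        return [0.0] * len(rewards)
    return [(r - mu) / mu for r in rewards]


# One success in four: pass rate p = 0.25, so the success gets (1 - p) / p.
print(max_rl_advantages([1.0, 0.0, 0.0, 0.0]))
# Two successes in four: p = 0.5, a smaller upweight.
print(max_rl_advantages([1.0, 1.0, 0.0, 0.0]))
```

With binary rewards the successful rollout's advantage is (1 - p) / p, which is the 1/p-like upweighting of low-pass-rate examples the commit message mentions.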
…self
A preset name with explicit advantage/sampling keys is now a parse-time
error instead of a merge: a modified preset is not the preset, so the
config must state what it actually runs. Only the model/teacher
shorthand may accompany a name (the distillation presets are incomplete
without an endpoint by design). Assembly stays cheap — presets are thin
deltas, so a variant costs one explicit 'type' key.
Deletes the merge machinery: _merge_preset_delta and the
discriminator-aware typeless override (advantage = { max_concurrent }
under opd silently inheriting ref_kl) are gone; the preset validator
inserts components, never merges. 'name' becomes write-only input sugar
(excluded from dumps, like 'model') so resolved configs round-trip as
plain component assemblies; the orchestrator startup log now reports
advantage types instead of preset labels. The advantage shorthand gets
a preset-aware error instead of silently relabeling an inherited
preset.
echo.toml's lambda override becomes the assembled spelling, and the
debug configs spell algo as an [orchestrator.algo] section instead of
an inline table.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The echo preset now means the vetted ECHO setting: weighted CE on tool/terminal response bodies (lambda = 0.1), selected via the renderer's per-token attribution (message_indices / message_roles / is_content from RendererClient rollouts). EchoAdvantageConfig grows observations: 'tool' (default; requires the renderer, validated at config time) | 'all' (every env-provided token — the previous behavior). interleave_rollout takes the mode instead of a bool and tags tool spans token-exactly: response bodies when is_content is available, whole tool messages otherwise; MITO rollouts raise loudly. Env-shorthand assembly fix: an env advantage shorthand now assembles the env's own algorithm instead of copy-modifying the inherited preset (atomicity); self_distill.toml spells its demo_key variant assembled; echo.toml (alphabet-sort, user-role feedback) assembles with observations = 'all'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, no aliases
HostedModelConfig was a pairing wrapper ({model, client}) that made the
canonical path stutter (orchestrator.model.model.name) and needed a
before-validator apologizing for it (re-nesting flat keys, folding the
policy/student aliases and the [orchestrator.client] shorthand). The
flat spelling was what everyone wrote anyway — make it the schema:
HostedModelConfig is now ModelConfig + a client field, mirroring
FrozenModelConfig (name + endpoint, no new declaration scheme), and
fold_policy_shortcuts is deleted along with the policy/student aliases
and the client shorthand. Resolved dumps stop stuttering too.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection surface generalizes from the observations="tool"|"all" binary to a role table: each env-provided message role (system / user / assistant / tool) trains at its own alpha, selected via the renderer's per-token attribution. An optional filter hook (import_path + kwargs, matching the custom advantage/loss precedent) narrows the selection per rollout with one keep-mask per trajectory step.

- completion_obs_mask (bool) -> completion_obs_weights (float): the per-token weight carries its role's alpha, so stamping folds it into ce_weights directly and stamp_loss_routing drops the scalar observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1. Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs attribution); the blanket "all" mode is gone, so assemble the roles you want instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
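A minimal sketch of the role table and keep-mask described in this commit, under simplifying assumptions: `observation_weights` is a hypothetical helper, token roles are passed in directly rather than derived from the renderer's attribution, and the mask covers a single flat token sequence rather than one mask per trajectory step.

```python
from typing import Optional

# Default table as the commit describes the echo preset: tool bodies at 0.1.
ECHO_DEFAULT_ROLES = {"tool": 0.1}


def observation_weights(
    token_roles: list[str],
    env_provided: list[bool],
    roles: Optional[dict] = None,
    keep_mask: Optional[list[bool]] = None,
) -> list[float]:
    """Per-token CE weight: the role's alpha on env-provided tokens, else 0.0."""
    # Setting any role replaces the whole table, so there is no merge here.
    table = ECHO_DEFAULT_ROLES if roles is None else roles
    weights = []
    for i, (role, from_env) in enumerate(zip(token_roles, env_provided)):
        alpha = table.get(role, 0.0) if from_env else 0.0
        if keep_mask is not None and not keep_mask[i]:
            alpha = 0.0  # the user filter can only narrow the selection
        weights.append(alpha)
    return weights


token_roles = ["user", "assistant", "assistant", "tool", "tool"]
env_provided = [True, False, False, True, True]

print(observation_weights(token_roles, env_provided))
print(observation_weights(token_roles, env_provided, roles={"user": 0.5}))
print(observation_weights(token_roles, env_provided,
                          keep_mask=[True, True, True, True, False]))
```

Because the weight already carries the role's alpha, it can be copied straight into the CE weight stream without a separate scalar multiplier.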
…eted The advantage type now names the algorithm — group_norm -> grpo, ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes renamed to match) — and each type's class defaults are its vetted setting, so 'type = "opd"' with a teacher IS on-policy distillation. With type-plus-defaults equal to the preset for every algorithm, the preset layer had nothing left to do: AlgorithmName, _PRESETS, the name field, and the atomicity guard are deleted. The model/teacher shorthand survives and now folds by the type's own declarations (model_role -> advantage.model for opd/opsd; source_role -> sampling.source for sft). sampling.source loses its None state (it existed only for preset resolution); sft without a frozen source is rejected at validation — CE on the policy's own tokens was never the vetted meaning. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
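The shorthand folding described in this commit (`model`/`teacher` landing in the slot the type declares) might be sketched like this. The `REFERENCE_SLOT` table and `fold_model_shorthand` are hypothetical, plain dicts stand in for the config classes, and rejecting the shorthand on types that declare no reference is an assumption of this sketch.

```python
import copy

# Slot each advantage type declares for its frozen reference (illustrative table,
# standing in for the model_role / source_role declarations).
REFERENCE_SLOT = {
    "opd": ("advantage", "model"),
    "opsd": ("advantage", "model"),
    "sft": ("sampling", "source"),
}


def fold_model_shorthand(algo: dict) -> dict:
    """Fold the 'model'/'teacher' shorthand into the slot its type declares."""
    algo = copy.deepcopy(algo)  # never mutate the caller's raw config
    given = [algo.pop(key) for key in ("model", "teacher") if key in algo]
    if not given:
        return algo
    if any(ref != given[0] for ref in given[1:]):
        raise ValueError("'model' and 'teacher' disagree")
    adv_type = algo.get("advantage", {}).get("type", "grpo")
    if adv_type not in REFERENCE_SLOT:
        # Assumption of this sketch: types without a reference reject it.
        raise ValueError(f"advantage {adv_type!r} declares no model reference")
    section, field = REFERENCE_SLOT[adv_type]
    slot = algo.setdefault(section, {})
    if field in slot and slot[field] != given[0]:
        raise ValueError(f"shorthand contradicts {section}.{field}")
    slot[field] = given[0]  # fill, or agree with an identical explicit value
    return algo


# Placeholder endpoint values, not taken from the PR.
teacher = {"name": "teacher-model", "base_url": "http://localhost:8001/v1"}
print(fold_model_shorthand({"advantage": {"type": "opd"}, "teacher": teacher}))
print(fold_model_shorthand({"advantage": {"type": "sft"}, "model": teacher}))
```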
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8e7e51f.
- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the missing max_rl/reward/custom rows, fix the frozen-model base_url and custom-Algorithm wording, make the length_penalty example self-sufficient, drop the Per-Env Advantage section (duplicate of Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type names, comments lose leftover preset vocabulary, one wandb project for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and a union example that parses on its own

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fully truncated distillation sample (prompt >= trainer seq_len) loses all its nonzero ce/ref_kl tokens to prepare_sample's truncation while its stamped all-zero rl_weights suppress the rl branch; with every component empty, compute_loss returned the Python float 0.0 and loss.backward() crashed. Seed the rl accumulator with a graph-attached zero so the degenerate batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
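The degenerate case in this commit is easier to see against a sketch of the component sum itself. The following is a plain-Python stand-in with no autograd and no cross-rank reduction, so "global token count" collapses to a local count. `component_sum_loss` is a hypothetical name, and where it starts from a float `0.0` the commit describes the real trainer needing a graph-attached zero so `backward()` still runs.

```python
COMPONENTS = ("rl", "ce", "ref_kl")


def component_sum_loss(token_losses: dict, weights: dict, loss_mask: list) -> float:
    """Sum of per-component means, each with its own token-count denominator."""
    total = 0.0  # plain float here; the trainer seeds a graph-attached zero
    for name in COMPONENTS:
        stream = weights.get(name)
        if stream is None:
            # Absent streams: rl weight 1.0 on every token, other components empty.
            stream = [1.0 if name == "rl" else 0.0] * len(loss_mask)
        if name == "rl":
            mask = [m and w != 0.0 for m, w in zip(loss_mask, stream)]
        else:
            mask = [w != 0.0 for w in stream]
        count = sum(mask)
        if count == 0:
            continue  # an empty component adds nothing to the loss
        losses = token_losses[name]
        total += sum(w * l for w, l, m in zip(stream, losses, mask) if m) / count
    return total


mixed = component_sum_loss(
    token_losses={"rl": [2.0, 4.0, 9.0, 9.0], "ce": [7.0, 7.0, 2.0, 6.0]},
    weights={"rl": [1.0, 1.0, 0.0, 0.0], "ce": [0.0, 0.0, 0.5, 0.5]},
    loss_mask=[True, True, True, True],
)
empty = component_sum_loss({}, {"rl": [0.0, 0.0]}, [True, True])
print(mixed, empty)
```

In the mixed call the rl term averages over its two tokens and the ce term over its own two, so neither dilutes the other; the second call is the all-empty batch, which contributes zero.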
- ref_kl_loss_fn emitted the same trust-region metric keys as the rl loss fn into one shared dict, so mixed batches (per-env algorithms) averaged two different trust-region definitions into one wandb series. Namespaced as ref_kl/*; the wandb noise filter gets matching prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now raises instead of silently training with advantage 0.0. The orchestrator always stamps a scalar, so a missing one is a producer bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead of the pack-boundary defaults; padding is loss-masked so this is training-equivalent, and padded pure-ce batches now read as rl-empty in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks replaced with exact positional asserts in both pack orders, pinning STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

What
Makes prime-rl's training algorithm a first-class, hackable abstraction: it removes model roles from the pipeline, unifies every training signal under one concept (the advantage: credit assignment and loss routing, fused), and makes each algorithm a named runtime class that owns its methods. There is no registry and no separate "token scorer": there is the live policy (the only model prime-rl ever hosts) and per-env algorithms whose model references are either `"policy"` or an inline externally-hosted frozen endpoint.

An algorithm is a bundle of two components, configured under
`[orchestrator.algo]` (per-env: `algo = {...}` or the `advantage = {...}` shorthand on the env) and resolved per env. The advantage `type` names the algorithm (`grpo`, `max_rl`, `opd`, `opsd`, `sft`, `echo`, `reward`, `custom`), and each type's class defaults are its vetted setting; there is no separate preset layer:

- … `"policy"` or an inline frozen endpoint. At runtime this builds the env's `Sampler` (`orchestrator/sampler.py`): the pool rollouts come from, the liveness consequences (logprobs, prefix-cache salting, staleness), and the home of future sampling strategies (replay buffers, branching).
- … (`rl` / `ce` / `ref_kl`) and what happens to env-provided observation tokens (masked out by default; `echo` trains on them with weighted CE, selected by message role via the renderer's per-token attribution, each role at its own `alpha`, tool-response bodies at λ=0.1 by default, optionally narrowed by a user-supplied per-rollout token filter).

The training loss is a sum of three components,
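`rl` (DPPO+KL or custom), `ce` (masked NLL), `ref_kl` (reverse KL to a reference as the PG signal), each normalized by its own global token count:

$$
\mathcal{L} = \sum_{c \in \{\mathrm{rl},\, \mathrm{ce},\, \mathrm{ref\_kl}\}} \frac{1}{N_c} \sum_{t} w_{c,t}\, \ell_{c,t}
$$

with $w_{c,t}$ the per-token weight, $\ell_{c,t}$ the per-token loss, and $N_c$ the global token count of component $c$.

The orchestrator stamps per-token component weight streams (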
`rl_weights` / `ce_weights` / `ref_kl_weights`); a weight scales that component's per-token loss, `0.0` removes the token from the component's mask and denominator, and components may overlap on the same token (gradients sum). Per-component normalization means components never dilute each other: echo's observation tokens no longer shrink the rl term's effective per-token learning rate (previously both shared one global denominator, so the rl gradient scaled with the batch's obs/action ratio), and a supervised env packed next to a GRPO env doesn't soften its gradient.

The trainer is algorithm-blind (component weight streams ship per token on the wire; the trainer executes the three fixed components), and the orchestrator pipeline is too: dispatcher, train sink, and orchestrator call hooks on each env's
`Sampler` + `Algorithm` objects and never branch on algorithm config or model roles.

The algorithm classes

Each algorithm is a named runtime class: the algorithm object is the algorithm. Dispatch is keyed on
`advantage.type`; it names the algorithm, and each config class's defaults are its vetted parameterization:

| `advantage.type` | class | loss component(s) | `assign` (group time) / `score` (ship time) |
|---|---|---|---|
| `grpo` | `GRPOAlgorithm` | `rl` | |
| `max_rl` | `MaxRLAlgorithm` | `rl` | … (`group_size` truncation of the max-likelihood objective) |
| `echo` | `EchoAlgorithm` | `rl` + `ce` | … (`roles.<role>.alpha`, tool bodies @ 0.1 default), optional user token filter |
| `reward` | `RewardAlgorithm` | `rl` | |
| `opd` | `OPDAlgorithm` | `ref_kl` | … (`advantage=None`, filters skip) |
| `opsd` | `OPSDAlgorithm` | `ref_kl` | … (`advantage=None`, filters skip); … `"policy"`) |
| `sft` | `SFTDistillAlgorithm` | `ce` | |
| `custom` | `CustomAlgorithm` | `rl` | … `AdvantageOutputs.token_advantages`, completion-aligned) |

Reading a class top to bottom reads the algorithm; writing your own is subclassing
`Algorithm` and overriding the same two methods. Duplication of orchestration between similar algorithms (OPD vs OPSD) is accepted so each class stays self-contained; shared math (group normalization, prefill alignment, length penalties) lives as plain functions in `algo/advantage.py`. Algorithms take exactly two runtime resources, `Algorithm(config, policy_pool, renderer)`; text → token ids always goes through the renderer, the same path the policy's own prompts take (`opsd` and `echo` require one, validated at config time: demo-conditioned scoring or role-attributed selection under MITO would diverge from the policy's own rendering, so they're rejected rather than approximated).

One deliberate expressiveness trade: loss routing is not a free config axis (there is no
`algo.loss`; you can't write `opd` + observation-CE in TOML): routing variation is algorithm variation, expressed as a class. ECHO is a proper advantage type, not a flag on GRPO. Its selection surface is configurable where it matters: per-role `alpha` (`roles.system/user/assistant/tool`; setting any role replaces the whole table) and an optional `filter` hook (`import_path` + `kwargs`, called once per rollout as `filter_fn(rollout, **kwargs) -> list[list[bool]]`, one keep-mask per trajectory step spanning that step's `prompt_ids + completion_ids`). The raw rollout exposes message text and sampling logprobs, so warning filters and low-probability filters are user code, no framework surface.

The algorithms

| `advantage.type` | loss component(s) |
|---|---|
| `grpo` (default) | `rl` |
| `max_rl` (arXiv:2602.02710) | `rl` |
| `opd` | `ref_kl` (needs `teacher`) |
| `sft` | … (`teacher` folds into `sampling.source`); `ce` |
| `opsd` (SDFT, arXiv:2601.19897) | `ref_kl` vs `"policy"` by default |
| `echo` (ECHO) | `rl` on actions + per-role α·`ce` on env-provided tokens |

There is no preset layer: the type IS the algorithm, its class defaults ARE the vetted setting, and every key beyond
`type` is visibly the user's own assembly (an earlier iteration had named atomic presets; with type-plus-defaults equal to the preset for every algorithm, the layer had nothing left to do and was deleted). Per-env algorithms compose in one run. Because a reference can be the literal `"policy"`, `opsd` runs the SDFT paper's setting with zero extra deployments; that's its default.

Model references: no registry; roles are algorithm-local

prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:
- `ModelReference = "policy" | FrozenModelConfig`, where `FrozenModelConfig` is just the existing `ClientConfig` plus the served model's `name`; no new declaration scheme. `base_url` (or an elastic deployment) is required: a frozen reference with no endpoint fails at parse time.
- `algo.model` is shorthand that folds into the slot the advantage type declares for its reference (`model_role` → `advantage.model` for opd/opsd, `source_role` → `sampling.source` for sft); redundant-but-consistent explicit settings are accepted, contradictions rejected.
- … `"teacher"` (`model_role`/`source_role` ClassVars), which makes `[orchestrator.algo.teacher]` a parse-time alias for the `model` shorthand and puts the same word in validation errors ("advantage 'opd' needs a teacher — set 'teacher' on the algorithm ..."). Roles stay strictly algorithm-local: no role ever reaches flow code or the wire.
- `opd` pointed at `"policy"` is rejected as degenerate (KL ≡ 0), frozen sampling can't feed an `rl`-type strategy (no policy sampling logprobs for importance ratios), `sft` without a frozen source is rejected (CE on the policy's own tokens is not a distillation target), `opsd` and `echo` require a renderer, and group-relative advantage with `group_size=1` warns loudly.

How
- … (`prime_rl.configs.algorithm`): one `advantage` discriminated union absorbs the former token scorers (`logprobs`/`demo_logprobs` → `opd`/`opsd` with a `ModelReference`) and loss routing (`EchoAdvantageConfig(GRPOAdvantageConfig)` carries the role table + filter); the action-token loss component is an `action_loss_type` class property of the strategy, never configured; `algo.model` (alias `teacher`) folds fill-or-agree into the slot the type's ClassVars declare (`model_role`/`source_role`). `advantage` shorthands (`[orchestrator.advantage]`, per-env `advantage = {...}`) fold into `algo.advantage` on raw input; an env-level shorthand assembles the env's own algorithm rather than copy-modifying the inherited one, and env algorithm inheritance is a one-loop after-validator. Every `AlgorithmConfig` is built exactly once with everything in place: no preset tables, no name field, no merge machinery, no `__pydantic_fields_set__` surgery.
- … (`orchestrator/algo/` package + `orchestrator/sampler.py`):
  - `Algorithm`: base class with the two execution points: `assign(rollouts)` at group finalization (scalar advantages, synchronous) and `async score(rollouts)` at batch-ship time (reference prefill logprobs, bounded concurrency); `finalize_group` stamps the wire fields (advantage spreading + component weight streams). `setup()` connects an `InferencePool` to the algorithm's inline frozen reference (`connect_frozen_pool`; client-side only: prime-rl connects and waits, never launches) and tracks `connected_pools` for shutdown. The named classes above own their `assign`/`score` bodies outright; `build_algorithm` dispatches on `advantage.type`.
  - `Sampler`: one per env, owns the rollout source: the generating pool (policy, or a frozen pool it connects in its own `setup()`), `samples_from_live_policy`, and `sampling_args()` (strips the logprobs knob for frozen endpoints).
The dispatcher reads `env.sampler` for pool/liveness; the algorithm never sees sampling.
- … `finalize_group`; orchestrator setup is one `asyncio.gather` over both objects' `setup()` and `finalize_train_batch` does one unconditional per-env `score_batch` gather (`time/scoring`).
- `TrainingSample`/`MicroBatch` carry optional per-token `rl_weights`/`ce_weights`/`ref_kl_weights`/`token_advantages` and `ref_logprobs`; absent streams mean rl weight 1.0 on every trainable token, so plain GRPO ships nothing extra (one nil byte per trailing field: `array_like` structs encode positionally; `omit_defaults` doesn't trim them). Membership is per token, so samples of different algorithms pack freely into one micro batch (no mode-segregated bins). Reference-KL strategies must ship reference logprobs (not precomputed advantages) because the trainer evaluates the KL against live policy logprobs each microbatch.
- `compute_loss` runs each component over its weight stream (rl mask = `loss_mask & rl_weights != 0`; ce/ref_kl mask = `weights != 0`) and sums the per-component means; the three global denominators come from one batched all-reduce (`[N_rl, N_ce, N_ref_kl]`), so every rank issues the same collective. The no-stream hot path is byte-identical to the old single-`loss_scale` math with zero extra device syncs. Mixed-algorithm packing fix: bins mixing ref-bearing (e.g. OPD) and ref-less (e.g. GRPO) samples now keep `ref_logprobs` position-aligned with `input_ids` (0.0 placeholders, backfill-or-pad like the sibling per-token arrays), with a regression test and a post-pack invariant assert that every per-token array on every micro batch matches `len(input_ids)`.

Breaking changes (intentionally, no deprecation aliases)
- `orchestrator.training_mode` is deleted (was `rl`/`opd`/`sft`) → `[orchestrator.algo.advantage] type = ...`.
- … `type` names the algorithm: `group_norm` → `grpo`, `ref_kl` → `opd`, `demo_ref_kl` → `opsd`, `supervised` → `sft` (config classes renamed to match). The preset `name` field is deleted; type-plus-defaults is the algorithm.
- … `observations`/`observation_weight` are replaced by `roles.<role>.alpha` (tool bodies @ 0.1 default; setting any role replaces the table) plus an optional `filter` hook. Echo always requires `orchestrator.renderer` (the no-attribution `"all"` mode is gone).
- `sft` requires a frozen sampling source: CE on the policy's own tokens is rejected at validation (previously silently assemblable).
- `[orchestrator.teacher]` is deleted → inline `[orchestrator.algo.model]` (alias: `[orchestrator.algo.teacher]`) with `name` + `base_url`.
- … `[orchestrator.model]`, flat: `name`/`lora`/`vlm`/`trust_remote_code` directly on it, the deployment under `[orchestrator.model.client]`; no more `model.model` stutter, in configs or resolved dumps (`HostedModelConfig` is now `ModelConfig` + a `client` field, mirroring `FrozenModelConfig`). The `[orchestrator.policy]` / `[orchestrator.student]` aliases, the `[orchestrator.client]` shorthand, and the flat-key re-nesting shim are deleted.
- `algorithm.token_scorer` is deleted: scorers are advantage strategies now (`logprobs` → `ref_kl`, `demo_logprobs` → `demo_ref_kl`).
- `algo.loss` does not exist: the action-token loss component derives from the advantage strategy, and observation-token routing is the `echo` advantage type (`observation_weight`).
- … `default` → `group_norm`; `none` is deleted (the ref_kl-family strategies carry its no-scalar semantics; nothing else used it).
- `TrainingSample.teacher_logprobs` → `ref_logprobs`; the loss-type partition (`loss_type`/`token_loss_types`/`token_loss_weights`/`loss_type_ids`, `LOSS_CORE_*`) is replaced by the component weight streams `rl_weights`/`ce_weights`/`ref_kl_weights` (the partition was the degenerate disjoint case); metric `time/teacher_logprobs` → `time/scoring`.
- `configs/debug/training_modes/` → `configs/debug/algorithms/` (incl.
new `self_distill.toml` running SDFT against the live policy); CI integration configs migrated.

Configs are
`extra="forbid"`, so stale configs fail loudly at parse time with the exact unknown key.

Merged with
`main` including #2720 (teacherless SFT). Its spelling is deliberately not carried forward: `sft` means distillation from a frozen teacher, and pointing it at `"policy"` is rejected at validation. CE on the policy's own rollouts is a different algorithm, and if wanted it gets its own advantage type/class rather than a degenerate spelling of `sft`.

Also merged
`main` including #2641 (multiplexed trainer token export): the export machinery composes with the component weight streams. Each exported record carries the per-token rl/ce/ref_kl arrays, and micro-batches carry `run_id`/`run_step` alongside the streams (`training_mode` stays deleted).

Validation

Pre-review polish round: a full-branch review pass (parsimony / stale-reference / docs-accuracy sweep) landed two follow-ups:
`48ac7ffb9` scrubs leftover preset vocabulary and stale type names from docs and debug configs (`ref_kl` → `opd` in the advantage tables, missing `max_rl`/`reward`/`custom` rows, one wandb project for the debug folder), and `f5be6f4b3` fixes a degenerate-batch crash: a micro batch whose components are all empty (e.g. a distillation sample whose prompt alone exceeds trainer `seq_len`, so truncation strips every nonzero ce/ref_kl token while the stamped all-zero rl stream skips the rl branch) returned a Python float from `compute_loss` and crashed `backward()`. The loss is now seeded with a graph-attached zero, so the batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync. This is pinned by `test_empty_components_keep_backward_valid`, which fails on the prior code. A follow-up (`0606f1272`) namespaces `ref_kl_loss_fn`'s trust-region metrics as `ref_kl/*` (mixed batches previously averaged the rl and ref_kl trust-region definitions into one wandb series), turns a missing advantage on an rl-member sample into a loud `ValueError` instead of silent zero-gradient training, pads weight streams with 0.0 so padded pure-ce batches read as rl-empty in token export, and pins the pack-boundary `STREAM_FILL` backfill with positional asserts in both pack orders. Verified by the full trainer+orchestrator unit suites (188 passed) and a 5-step opsd GPU smoke with the renamed keys confirmed in the wandb run.

On the type-rename + preset-layer deletion (full CPU unit suite 485 green excl. the pre-existing tilelang
test_qwen3_5_moe*box failures; ruff clean;--dry-runparse of every flipped debug TOML; 5-step grpo GPU smoke viatype = "grpo", reward 0.10→0.41, 128/128 trainable, exit 0).On the echo selection surface (20-step GPU run on multi-turn
alphabet-sortwith an assembled user-role table, exit 0) — verified at the wire: all 32 samples of a shipped batch carryce_weightswith 5,240 nonzero tokens at exactly the configured α=0.1, and the orchestrator-internalcompletion_obs_weightsfield leaked into zero shipped samples. The filter hook is covered by shape-violation and composition unit tests.On the final two-component shape (full CPU unit suite green — 526 passed, pre-existing
test_qwen3_5_moe*tilelang failures on this box excluded; ruff clean; 5-step GPU smokes, 2-GPU, all exit 0):alphabet-sort(3.1–3.6 turns, 32/32 trainable every step) —EchoAlgorithm+Samplersplit end-to-end: observation tokens tagged, ce-weighted, and trained while staying out of the rl denominator.OPSDAlgorithm.score()against the live policy pool, demo-conditioned prefix rendered viarenderer.render_ids(the policy's own prompt-rendering path).On the preset-resolution refactor: full unit suite incl. the config-load sweep over every shipped TOML, resolved-config round-trip (
model_dump→ re-validate), legacy[[env]]layout, env-shorthand-on-inherited-preset and conflict tests, plus a 5-step grpo smoke (eval 0.13→0.55, 128/128 trainable).On the component-sum shape (5-step smokes, all exit 0): grpo eval 0.13→0.53 (no-stream hot path — math identical to the old single-denominator path for single-component runs); echo with the ce stream carrying observation tokens (loss ≈ 0.03 while GRPO advantages are ~0, i.e. the gradient is the CE component); opd eval 0.067→0.684 with the frozen reference on :8001 (
ref_klstream on action tokens, rl stream zeroed, frozen pool initialized from the inline config).Earlier 50-step runs validated the same orchestrator machinery on the loss-partition iteration of this branch (per-component normalization is numerically identical for these single-component runs):
- `Qwen3-0.6B-Reverse-Text-RL` server on :8001 declared inline: algorithm-owned pool wiring + bounded ship-time scoring + `ref_kl` routing.
- `supervised` advantage + CE.
- `configs/debug/algorithms/mixed_grpo_opd.toml`): two envs with different algorithms in one run, 0.091→0.835. Every batch mixes ref-bearing (opd, 30–60% per step) and ref-less (grpo) samples, exercising heterogeneous packing end-to-end with the post-pack alignment assert live.
- `scalar / scalar·0.5` per-token advantages (eval 0.099→0.681), verified via trainer token export: all 128 step-0 sequences carry the exact alternating pattern over the loss-masked region (`AdvantageOutputs.token_advantages` → `TrainingSample.token_advantages` → trainer, prompt positions padded out).
- `demo_ref_kl` scores against the live policy, no-scalar rollouts stay trainable). Known issue (not this PR's machinery): over 50 steps it degrades (eval 0.078→0.0, truncation →99%), a verbosity spiral plausibly driven by the debug config's 128-token cap (~80% truncation at step 0) + the self-referential reference. Tracked for follow-up (truncation-aware demo scoring / larger budget / EMA reference).
- `ref_kl_loss_fn` zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter, which wrongly dropped uniform-reward OPD groups carrying full teacher-KL signal. Removing them ties the baseline (0.825 vs 0.828); OPD now ships `advantage=None` like OPSD and `ref_kl_loss_fn` reads no scalars (its trust region is the low side explicitly; bit-identical math).

Deferred (next PRs)

The two-component model is built so each of these lands as its own PR without re-touching the wire:
- `Sampler`: today the Sampler answers one question (which pool, and is it live); next it absorbs within-env example iteration and group production, plus a sink→sampler `observe()` feedback edge at group finalization. In increasing order of machinery, that unlocks: difficulty-pool curricula (per-example selection state fed back from group rewards), static dataset sources (supervised training from demonstrations instead of a frozen endpoint), and replay buffers / offline experience (a store between stamping and batch assembly; advantages and weight streams are already stamped-then-frozen at group finalization, which is the prerequisite).
- `tool_names` filtering (non-breaking optional field on the tool role; `message_tool_names` already rides in the attribution) and θ-dependent low-probability filters, which need a trainer-side per-component knob since the denominator collective happens pre-forward.
- `Σw` denominators: a globally-folded λ cancels under it, so it pairs with a trainer-side coefficient); per-slot loss-fn configs for ce/ref_kl (the ref_kl trust region, one-sided today, becomes a deliberate choice there); barrier-blind sink (the group wait becomes a dependency declared by the algorithm; `finalize_group` → `finalize`); EMA/lagged reference endpoints; exposing frozen endpoints to envs as judge clients.

🤖 Generated with Claude Code
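
The remark in the deferred list that a globally folded λ cancels under `Σw` denominators can be checked with a short derivation. This is a sketch under an assumption not spelled out in this description: that each component's loss is a weighted mean of per-token losses $\ell_{c,t}$ with weights $w_{c,t}$ (both symbols are illustrative, not names from the code).

$$
L_c = \frac{\sum_t w_{c,t}\,\ell_{c,t}}{\sum_t w_{c,t}},
\qquad
\frac{\sum_t (\lambda\, w_{c,t})\,\ell_{c,t}}{\sum_t \lambda\, w_{c,t}}
= \frac{\lambda \sum_t w_{c,t}\,\ell_{c,t}}{\lambda \sum_t w_{c,t}}
= L_c \quad (\lambda > 0).
$$

Under that assumed form, scaling every weight of a component by the same λ leaves $L_c$ unchanged, so a per-component coefficient would have to sit outside the ratio, for example $L = \sum_c \lambda_c L_c$. That is consistent with pairing it with a trainer-side coefficient.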
Note
High Risk
Large breaking config and orchestrator/trainer contract change (wire fields, validation, multi-algorithm packing); incorrect migration or edge cases in mixed batches could affect training correctness.
Overview
Replaces the `rl/opd/sft` training-mode switch with a first-class `[orchestrator.algo]` bundle: sampling (policy vs inline frozen endpoint) and advantage (`grpo`, `max_rl`, `opd`, `opsd`, `sft`, `echo`, `reward`, `custom`) that jointly pick credit assignment and which loss component trains each token.

Config & hosting: Drops `orchestrator.teacher` and nested `student` in favor of flat `[orchestrator.model]` (trainable policy only) plus `[orchestrator.algo.model]` / `teacher` for external frozen servers. CI and debug TOMLs move from `training_modes/` to `configs/debug/algorithms/` with new recipes (MaxRL, ECHO, self-distill, mixed GRPO+OPD).

Runtime: Per-env `Sampler` + named `Algorithm` classes (`assign` at group time, `score` at batch ship for reference prefill). Dispatcher/sink/orchestrator branch on liveness (policy vs frozen), not modes. Wire ships `ref_logprobs` and optional `rl_weights`/`ce_weights`/`ref_kl_weights`; trainer sums rl + ce + ref_kl with per-component token normalization. Metrics rename `time/teacher_logprobs` → `time/scoring`.

Docs (`algorithms.md`, `training.md`, scaling/inference paths) and `skills/configs/SKILL.md` are rewritten around the new abstraction.

Reviewed by Cursor Bugbot for commit 0606f12. Bugbot is set up for automated code reviews on this repo.
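
As a rough illustration of "sums rl + ce + ref_kl with per-component token normalization", the following plain-Python sketch computes one weighted mean per component and adds them up. It is not the trainer's code: `component_mean`, `total_loss`, and the toy numbers are hypothetical, and the weighted-mean form with a zero contribution for an all-zero stream is an assumption based on the description above.

```python
from typing import Dict, Sequence

# Component names taken from the PR description; everything else is illustrative.
COMPONENTS = ("rl", "ce", "ref_kl")


def component_mean(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-token losses; an all-zero weight stream contributes 0.0."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    denom = sum(weights)
    if denom == 0.0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / denom


def total_loss(
    losses: Dict[str, Sequence[float]],
    weights: Dict[str, Sequence[float]],
) -> float:
    """Sum of per-component means, each normalized by its own weight total."""
    return sum(component_mean(losses[name], weights[name]) for name in COMPONENTS)


# Four tokens: two action tokens (rl) followed by two observation tokens (ce).
losses = {
    "rl": [0.5, 1.5, 9.0, 9.0],
    "ce": [7.0, 7.0, 0.25, 0.75],
    "ref_kl": [3.0, 3.0, 3.0, 3.0],
}
weights = {
    "rl": [1.0, 1.0, 0.0, 0.0],
    "ce": [0.0, 0.0, 1.0, 1.0],
    "ref_kl": [0.0, 0.0, 0.0, 0.0],
}

# rl mean = (0.5 + 1.5) / 2, ce mean = (0.25 + 0.75) / 2, ref_kl is empty.
mixed = total_loss(losses, weights)

# A batch whose components are all empty yields a zero loss.
empty = total_loss(losses, {name: [0.0] * 4 for name in COMPONENTS})
```

In this toy batch the observation tokens carry ce weight only, so they change the ce term without entering the rl denominator. A plain float zero is enough here, but as the validation notes point out, a real autograd trainer needs that zero attached to the graph so that `backward()` still runs on every rank.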