feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo)#2746

Open
hallerite wants to merge 47 commits into main from feat/algorithm-abstraction

hallerite (Member) commented Jun 9, 2026

What

Makes prime-rl's training algorithm a first-class, hackable abstraction — removes model roles from the pipeline, unifies every training signal under one concept (the advantage: credit assignment and loss routing, fused), and makes each algorithm a named runtime class that owns its methods. There is no registry and no separate "token scorer": there is the live policy — the only model prime-rl ever hosts — and per-env algorithms whose model references are either "policy" or an inline externally-hosted frozen endpoint.

An algorithm is a bundle of two components, configured under [orchestrator.algo] (per-env: algo = {...} or the advantage = {...} shorthand on the env) and resolved per env. The advantage type names the algorithm (grpo, max_rl, opd, opsd, sft, echo, reward, custom), and each type's class defaults are its vetted setting; there is no separate preset layer:

  1. Sampling — how train rollouts are produced: which model generates them, "policy" or an inline frozen endpoint. At runtime this builds the env's Sampler (orchestrator/sampler.py) — the pool rollouts come from, the liveness consequences (logprobs, prefix-cache salting, staleness), and the home of future sampling strategies (replay buffers, branching).
  2. Advantage — the per-token training signal: one mapping from a finalized rollout to per-token (loss component, weight) pairs. Credit assignment computes the magnitude, loss routing picks the component — two coordinates of the same output, so they are one config component and one runtime object. Group-relative strategies compute scalars on the orchestrator and ship numbers; reference-KL strategies query a reference model at batch-ship time and ship its prefill logprobs for the trainer to evaluate against the live policy. The strategy determines the action-token loss component (rl / ce / ref_kl) and what happens to env-provided observation tokens (masked out by default; echo trains on them with weighted CE — selected by message role via the renderer's per-token attribution, each role at its own alpha, tool-response bodies at λ=0.1 by default, optionally narrowed by a user-supplied per-rollout token filter).

The training loss is a sum of three components, rl (DPPO+KL or custom), ce (masked NLL), and ref_kl (reverse KL to a reference as the PG signal), each normalized by its own global token count:

$$\mathcal{L} = \frac{\sum \mathcal{L}_{\mathrm{rl}}}{N_{\mathrm{rl}}} + \frac{\sum \mathcal{L}_{\mathrm{ce}}}{N_{\mathrm{ce}}} + \frac{\sum \mathcal{L}_{\mathrm{ref\_kl}}}{N_{\mathrm{ref\_kl}}}$$

The orchestrator stamps per-token component weight streams (rl_weights / ce_weights / ref_kl_weights); a weight scales that component's per-token loss, 0.0 removes the token from the component's mask and denominator, and components may overlap on the same token (gradients sum). Per-component normalization means components never dilute each other: echo's observation tokens no longer shrink the rl term's effective per-token learning rate (previously both shared one global denominator, so the rl gradient scaled with the batch's obs/action ratio), and a supervised env packed next to a GRPO env doesn't soften its gradient.
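The per-component normalization described above can be illustrated with a small pure-Python sketch. The function names (`component_term`, `total_loss`) and the toy numbers are illustrative assumptions, not prime-rl's trainer code; the real trainer works on tensors and takes the three token counts from an all-reduce.

```python
from typing import Optional, Sequence


def component_term(per_token_loss: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted per-token loss, averaged over the tokens whose weight is nonzero."""
    active = [(w, l) for w, l in zip(weights, per_token_loss) if w != 0.0]
    if not active:
        return 0.0  # an empty component contributes nothing
    return sum(w * l for w, l in active) / len(active)


def total_loss(
    rl_loss: Sequence[float],
    ce_loss: Sequence[float],
    ref_kl_loss: Sequence[float],
    loss_mask: Sequence[bool],
    rl_weights: Optional[Sequence[float]] = None,
    ce_weights: Optional[Sequence[float]] = None,
    ref_kl_weights: Optional[Sequence[float]] = None,
) -> float:
    n = len(loss_mask)
    # Absent rl stream: weight 1.0 on every trainable token.
    if rl_weights is None:
        rl_weights = [1.0] * n
    # rl membership is additionally gated by the loss mask.
    rl_weights = [w if m else 0.0 for w, m in zip(rl_weights, loss_mask)]
    ce_weights = ce_weights if ce_weights is not None else [0.0] * n
    ref_kl_weights = ref_kl_weights if ref_kl_weights is not None else [0.0] * n
    return (
        component_term(rl_loss, rl_weights)
        + component_term(ce_loss, ce_weights)
        + component_term(ref_kl_loss, ref_kl_weights)
    )


# Toy sequence: two observation tokens followed by two action tokens.
loss_mask = [False, False, True, True]
rl_loss = [0.0, 0.0, 2.0, 4.0]
ce_loss = [1.0, 3.0, 0.0, 0.0]
zeros = [0.0] * 4

grpo_only = total_loss(rl_loss, ce_loss, zeros, loss_mask)
with_echo = total_loss(
    rl_loss, ce_loss, zeros, loss_mask,
    rl_weights=[0.0, 0.0, 1.0, 1.0],
    ce_weights=[0.1, 0.1, 0.0, 0.0],
)
```

In this toy case the rl term is averaged over its two action tokens whether or not the ce stream is present. Under a single shared denominator the same batch would divide the rl sum by four tokens instead of two.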

[orchestrator.algo.advantage]
type = "opd"

[orchestrator.algo.teacher]   # alias for `model`; folds into advantage.model
name = "Qwen/Qwen3-32B"
base_url = ["http://localhost:8001/v1"]
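
The alias folding in the config above (teacher is an alias for model and folds into advantage.model) might look roughly like the sketch below. `fold_model_alias` is a hypothetical helper operating on plain dicts, not the actual pydantic validator, and it only covers the opd/opsd slot (for sft the PR folds into sampling.source instead).

```python
def fold_model_alias(algo: dict) -> dict:
    """Fold the `model` / `teacher` shorthand into advantage.model (fill-or-agree)."""
    # Copy one level deep so the caller's dict is left untouched.
    algo = {k: (dict(v) if isinstance(v, dict) else v) for k, v in algo.items()}

    shorthand = None
    for key in ("model", "teacher"):
        value = algo.pop(key, None)
        if value is None:
            continue
        if shorthand is not None and shorthand != value:
            raise ValueError("'model' and 'teacher' disagree")
        shorthand = value
    if shorthand is None:
        return algo

    advantage = algo.setdefault("advantage", {})
    explicit = advantage.get("model")
    # Redundant-but-consistent settings are accepted, contradictions rejected.
    if explicit is not None and explicit != shorthand:
        raise ValueError("advantage.model contradicts the model/teacher shorthand")
    advantage["model"] = shorthand
    return algo


# The TOML above, written as the dict a TOML parser would produce.
raw = {
    "advantage": {"type": "opd"},
    "teacher": {"name": "Qwen/Qwen3-32B", "base_url": ["http://localhost:8001/v1"]},
}
resolved = fold_model_alias(raw)
```

After folding, the shorthand key is gone and the reference sits in the slot the advantage type declares, so downstream code never sees the word "teacher".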

The trainer is algorithm-blind (component weight streams ship per token on the wire; the trainer executes the three fixed components), and the orchestrator pipeline is too: dispatcher, train sink, and orchestrator call hooks on each env's Sampler + Algorithm objects and never branch on algorithm config or model roles.

The algorithm classes

Each algorithm is a named runtime class — the algorithm object is the algorithm. Dispatch is keyed on advantage.type — it names the algorithm, and each config class's defaults are its vetted parameterization:

| advantage.type | Class | Component | assign (group time) | score (ship time) |
| --- | --- | --- | --- | --- |
| grpo | GRPOAlgorithm | rl | group-norm scalars (optional length penalty) | |
| max_rl | MaxRLAlgorithm | rl | mean-normalized group scalars (arXiv:2602.02710: unbiased for the order-group_size truncation of the max-likelihood objective) | |
| echo | EchoAlgorithm | rl + ce | group-norm scalars; env-provided tokens selected by message role (roles.<role>.alpha, tool bodies @ 0.1 default), optional user token filter | |
| reward | RewardAlgorithm | rl | raw reward | |
| opd | OPDAlgorithm | ref_kl | none (advantage=None, filters skip) | own-context prefill under the teacher |
| opsd | OPSDAlgorithm | ref_kl | none (advantage=None, filters skip) | demo-conditioned prefill under the teacher (default "policy") |
| sft | SFTDistillAlgorithm | ce | group-norm (feeds reward-based filtering) | |
| custom | CustomAlgorithm | rl | your function: scalar per rollout, optionally per-token (AdvantageOutputs.token_advantages, completion-aligned) | |

Reading a class top to bottom reads the algorithm; writing your own is subclassing Algorithm and overriding the same two methods. Duplication of orchestration between similar algorithms (OPD vs OPSD) is accepted so each class stays self-contained; shared math (group normalization, prefill alignment, length penalties) lives as plain functions in algo/advantage.py. Algorithms take exactly two runtime resources — Algorithm(config, policy_pool, renderer); text → token ids always goes through the renderer, the same path the policy's own prompts take (opsd and echo require one, validated at config time — demo-conditioned scoring or role-attributed selection under MITO would diverge from the policy's own rendering, so they're rejected rather than approximated).
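As a rough illustration of the two-method shape described above, here is a self-contained sketch. The `Rollout` dataclass, both subclasses, and `fake_reference` are hypothetical stand-ins; the real base class takes `(config, policy_pool, renderer)` and its normalization math lives in algo/advantage.py. Mean-centering is used here only as a simple stand-in for group normalization.

```python
import asyncio
from dataclasses import dataclass, field
from statistics import mean
from typing import Awaitable, Callable, Optional


@dataclass
class Rollout:  # hypothetical stand-in for a finalized rollout
    reward: float
    token_ids: list[int] = field(default_factory=list)
    advantage: Optional[float] = None
    ref_logprobs: Optional[list[float]] = None


class Algorithm:
    """Two execution points: assign at group time, score at ship time."""

    def assign(self, rollouts: list[Rollout]) -> None:
        """Synchronous; the default leaves advantage=None."""

    async def score(self, rollouts: list[Rollout]) -> None:
        """Asynchronous; the default attaches nothing."""


class MeanCenteredSketch(Algorithm):
    def assign(self, rollouts: list[Rollout]) -> None:
        baseline = mean(r.reward for r in rollouts)
        for r in rollouts:
            r.advantage = r.reward - baseline


class ReferenceKLSketch(Algorithm):
    def __init__(self, reference: Callable[[list[int]], Awaitable[list[float]]], max_concurrency: int = 4):
        self.reference = reference
        self.max_concurrency = max_concurrency

    async def score(self, rollouts: list[Rollout]) -> None:
        gate = asyncio.Semaphore(self.max_concurrency)  # bounded concurrency

        async def score_one(r: Rollout) -> None:
            async with gate:
                r.ref_logprobs = await self.reference(r.token_ids)

        await asyncio.gather(*(score_one(r) for r in rollouts))


async def fake_reference(token_ids: list[int]) -> list[float]:
    await asyncio.sleep(0)  # stands in for a request to a frozen endpoint
    return [-1.0] * len(token_ids)


group = [Rollout(0.0), Rollout(1.0), Rollout(1.0), Rollout(0.0)]
MeanCenteredSketch().assign(group)

shipped = [Rollout(0.0, token_ids=[1, 2, 3]), Rollout(0.0, token_ids=[4, 5])]
asyncio.run(ReferenceKLSketch(fake_reference).score(shipped))
```

The scalar strategy only overrides `assign`, and the reference-KL strategy only overrides `score` while leaving `advantage=None`, which mirrors the split between the grpo-style and opd-style rows in the table.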

One deliberate expressiveness trade: loss routing is not a free config axis (there is no algo.loss; you can't write opd + observation-CE in TOML) — routing variation is algorithm variation, expressed as a class. ECHO is a proper advantage type, not a flag on GRPO. Its selection surface is configurable where it matters: per-role alpha (roles.system/user/assistant/tool — setting any role replaces the whole table) and an optional filter hook (import_path + kwargs, called once per rollout as filter_fn(rollout, **kwargs) -> list[list[bool]], one keep-mask per trajectory step spanning that step's prompt_ids + completion_ids) — the raw rollout exposes message text and sampling logprobs, so warning filters and low-probability filters are user code, no framework surface.
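A filter hook with the `filter_fn(rollout, **kwargs) -> list[list[bool]]` contract could be as small as the sketch below. The rollout layout used here (a dict with a `trajectory` list whose steps carry `prompt_ids` and `completion_ids`) is an assumption made for illustration; only the return shape, one keep-mask per step spanning prompt plus completion tokens, comes from the description above.

```python
from typing import Any


def keep_first_steps(rollout: dict[str, Any], max_steps: int = 2) -> list[list[bool]]:
    """Keep every token of the first `max_steps` trajectory steps, drop the rest."""
    masks: list[list[bool]] = []
    for index, step in enumerate(rollout["trajectory"]):
        span = len(step["prompt_ids"]) + len(step["completion_ids"])
        masks.append([index < max_steps] * span)
    return masks


def check_mask_shape(rollout: dict[str, Any], masks: list[list[bool]]) -> None:
    """Raise if the masks do not line up with the trajectory, step by step."""
    steps = rollout["trajectory"]
    if len(masks) != len(steps):
        raise ValueError("expected one keep-mask per trajectory step")
    for step, mask in zip(steps, masks):
        if len(mask) != len(step["prompt_ids"]) + len(step["completion_ids"]):
            raise ValueError("keep-mask must span prompt_ids + completion_ids")


# Hypothetical three-step rollout.
rollout = {
    "trajectory": [
        {"prompt_ids": [1, 2], "completion_ids": [3]},
        {"prompt_ids": [1, 2, 3], "completion_ids": [4, 5]},
        {"prompt_ids": [6], "completion_ids": [7]},
    ]
}
masks = keep_first_steps(rollout, max_steps=2)
check_mask_shape(rollout, masks)
```

A content-based or logprob-based filter would follow the same pattern, computing each boolean from the step's own fields instead of its index.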

The algorithms

| advantage.type | Sampling | Loss |
| --- | --- | --- |
| grpo (default) | policy | rl |
| max_rl (arXiv:2602.02710) | policy | rl |
| opd | policy | ref_kl (needs teacher) |
| sft | frozen model (teacher folds into sampling.source) | ce |
| opsd (SDFT, arXiv:2601.19897) | policy | ref_kl vs "policy" by default |
| echo (ECHO) | policy | rl on actions + per-role α·ce on env-provided tokens |

There is no preset layer: the type IS the algorithm, its class defaults ARE the vetted setting, and every key beyond type is visibly the user's own assembly (an earlier iteration had named atomic presets; with type-plus-defaults equal to the preset for every algorithm, the layer had nothing left to do and was deleted). Per-env algorithms compose in one run. Because a reference can be the literal "policy", opsd runs the SDFT paper's setting with zero extra deployments — that's its default.

Model references — no registry; roles are algorithm-local

prime-rl assumes it is never responsible for hosting any model other than the trainable policy. Everything else is an external OpenAI-compatible endpoint, declared inline where it's used:

  • ModelReference = "policy" | FrozenModelConfig, where FrozenModelConfig is just the existing ClientConfig plus the served model's name — no new declaration scheme. base_url (or an elastic deployment) is required: a frozen reference with no endpoint fails at parse time.
  • algo.model is shorthand that folds into the slot the advantage type declares for its reference (model_role → advantage.model for opd/opsd, source_role → sampling.source for sft); redundant-but-consistent explicit settings are accepted, contradictions rejected.
  • Algorithms declare what they need. The distillation algorithms declare their reference's role as "teacher" (model_role / source_role ClassVars), which makes [orchestrator.algo.teacher] a parse-time alias for the model shorthand and puts the same word in validation errors ("advantage 'opd' needs a teacher — set 'teacher' on the algorithm ..."). Roles stay strictly algorithm-local: no role ever reaches flow code or the wire.
  • Liveness is a property of the reference, not a role: policy-sourced rollouts get version-salted prefix caches, carry sampling logprobs, and age off-policy; frozen-sourced rollouts get a stable prefix cache, skip the logprobs knob, and are never off-policy-cancelled. The pipeline branches on liveness alone.
  • Config-time validation: opd pointed at "policy" is rejected as degenerate (KL ≡ 0), frozen sampling can't feed an rl-type strategy (no policy sampling logprobs for importance ratios), sft without a frozen source is rejected (CE on the policy's own tokens is not a distillation target), opsd and echo require a renderer, and group-relative advantage with group_size=1 warns loudly.
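The reference type and a few of the config-time checks listed above can be sketched with plain dataclasses. `FrozenModel`, `is_live`, and `validate` are hypothetical names; the real configs are pydantic models, and the set of rl-type strategies is read off the algorithm table rather than taken from the code.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class FrozenModel:  # stand-in for FrozenModelConfig
    name: str
    base_url: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.base_url:
            raise ValueError(f"frozen reference {self.name!r} has no endpoint")


ModelReference = Union[str, FrozenModel]  # "policy" | frozen endpoint
RL_TYPES = {"grpo", "max_rl", "echo", "reward", "custom"}  # assumed from the table


def is_live(ref: ModelReference) -> bool:
    """Liveness is a property of the reference, not of a role."""
    if isinstance(ref, FrozenModel):
        return False
    if ref == "policy":
        return True
    raise ValueError(f"unknown model reference {ref!r}")


def validate(
    advantage_type: str,
    model: Optional[ModelReference] = None,
    sampling_source: ModelReference = "policy",
) -> None:
    if advantage_type == "opd":
        if model is None:
            raise ValueError("advantage 'opd' needs a teacher")
        if is_live(model):
            raise ValueError("opd against 'policy' is degenerate (KL is identically 0)")
    if advantage_type == "sft" and is_live(sampling_source):
        raise ValueError("sft needs a frozen sampling source")
    if advantage_type in RL_TYPES and not is_live(sampling_source):
        raise ValueError("frozen sampling cannot feed an rl-type strategy")


teacher = FrozenModel("Qwen/Qwen3-32B", ["http://localhost:8001/v1"])
validate("opd", model=teacher)             # accepted
validate("sft", sampling_source=teacher)   # accepted
validate("opsd", model="policy")           # accepted: the SDFT default
```

Every rejection happens before any pool is connected, which is the point of doing these checks at parse time.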

How

  • Configs (prime_rl.configs.algorithm): one advantage discriminated union absorbs the former token scorers (logprobs/demo_logprobs → opd/opsd with a ModelReference) and loss routing (EchoAdvantageConfig(GRPOAdvantageConfig) carries the role table + filter); the action-token loss component is an action_loss_type class property of the strategy, never configured; algo.model (alias teacher) folds fill-or-agree into the slot the type's ClassVars declare (model_role / source_role).
  • No preset machinery at all: the advantage shorthands ([orchestrator.advantage], per-env advantage = {...}) fold into algo.advantage on raw input; an env-level shorthand assembles the env's own algorithm rather than copy-modifying the inherited one, and env algorithm inheritance is a one-loop after-validator. Every AlgorithmConfig is built exactly once with everything in place: no preset tables, no name field, no merge machinery, no __pydantic_fields_set__ surgery.
  • Runtime (the orchestrator/algo/ package + orchestrator/sampler.py):
    • Algorithm — base class with the two execution points: assign(rollouts) at group finalization (scalar advantages, synchronous) and async score(rollouts) at batch-ship time (reference prefill logprobs, bounded concurrency); finalize_group stamps the wire fields (advantage spreading + component weight streams). setup() connects an InferencePool to the algorithm's inline frozen reference (connect_frozen_pool — client-side only: prime-rl connects and waits, never launches) and tracks connected_pools for shutdown. The named classes above own their assign/score bodies outright; build_algorithm dispatches on advantage.type.
    • Sampler — one per env, owns the rollout source: the generating pool (policy, or a frozen pool it connects in its own setup()), samples_from_live_policy, and sampling_args() (strips the logprobs knob for frozen endpoints). The dispatcher reads env.sampler for pool/liveness; the algorithm never sees sampling.
    • The pipeline has zero algorithm conditionals: the dispatcher derives cache-salt/aging from the sampler's liveness; the sink calls finalize_group; orchestrator setup is one asyncio.gather over both objects' setup() and finalize_train_batch does one unconditional per-env score_batch gather (time/scoring).
  • Wire: unchanged from before this PR for the GRPO path — TrainingSample/MicroBatch carry optional per-token rl_weights / ce_weights / ref_kl_weights / token_advantages and ref_logprobs; absent streams mean rl weight 1.0 on every trainable token, so plain GRPO ships nothing extra (one nil byte per trailing field — array_like structs encode positionally; omit_defaults doesn't trim them). Membership is per token, so samples of different algorithms pack freely into one micro batch (no mode-segregated bins). Reference-KL strategies must ship reference logprobs (not precomputed advantages) because the trainer evaluates the KL against live policy logprobs each microbatch.
  • Trainer: compute_loss runs each component over its weight stream (rl mask = loss_mask & rl_weights != 0; ce/ref_kl mask = weights != 0) and sums the per-component means; the three global denominators come from one batched all-reduce ([N_rl, N_ce, N_ref_kl]), so every rank issues the same collective. The no-stream hot path is byte-identical to the old single-loss_scale math with zero extra device syncs. Mixed-algorithm packing fix: bins mixing ref-bearing (e.g. OPD) and ref-less (e.g. GRPO) samples now keep ref_logprobs position-aligned with input_ids (0.0 placeholders, backfill-or-pad like the sibling per-token arrays), with a regression test and a post-pack invariant assert that every per-token array on every micro batch matches len(input_ids).
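The mixed-algorithm packing fix in the last bullet (backfill-or-pad with 0.0 placeholders, plus a post-pack length invariant) can be sketched as follows. `pack` and the dict-based sample layout are illustrative assumptions, not the code in src/prime_rl/trainer/batch.py.

```python
from typing import Optional


def pack(samples: list[dict]) -> dict:
    """Concatenate samples into one bin, keeping ref_logprobs aligned with input_ids."""
    input_ids: list[int] = []
    ref_logprobs: Optional[list[float]] = None

    for sample in samples:
        ids = sample["input_ids"]
        ref = sample.get("ref_logprobs")
        if ref is not None and ref_logprobs is None:
            # Backfill: earlier ref-less samples get 0.0 placeholders.
            ref_logprobs = [0.0] * len(input_ids)
        if ref_logprobs is not None:
            # Pad: a ref-less sample after a ref-bearing one also gets placeholders.
            ref_logprobs.extend(ref if ref is not None else [0.0] * len(ids))
        input_ids.extend(ids)

    packed = {"input_ids": input_ids, "ref_logprobs": ref_logprobs}
    # Post-pack invariant: every per-token array matches len(input_ids).
    for name, array in packed.items():
        assert array is None or len(array) == len(input_ids), name
    return packed


grpo_sample = {"input_ids": [1, 2, 3]}
opd_sample = {"input_ids": [4, 5], "ref_logprobs": [-0.5, -0.7]}

grpo_first = pack([grpo_sample, opd_sample])
opd_first = pack([opd_sample, grpo_sample])
refless = pack([grpo_sample, grpo_sample])
```

Both pack orders are exercised because the original bug only appears when the array starts being tracked partway through a bin.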

Breaking changes (intentionally, no deprecation aliases)

  • orchestrator.training_mode is deleted (was rl/opd/sft) → [orchestrator.algo.advantage] type = ....
  • The advantage type names the algorithm: group_norm → grpo, ref_kl → opd, demo_ref_kl → opsd, supervised → sft (config classes renamed to match). The preset name field is deleted; type-plus-defaults is the algorithm.
  • Echo selection is a role table: observations / observation_weight are replaced by roles.<role>.alpha (tool bodies @ 0.1 default; setting any role replaces the table) plus an optional filter hook. Echo always requires orchestrator.renderer (the no-attribution "all" mode is gone).
  • sft requires a frozen sampling source — CE on the policy's own tokens is rejected at validation (previously silently assemblable).
  • [orchestrator.teacher] is deleted → inline [orchestrator.algo.model] (alias: [orchestrator.algo.teacher]) with name + base_url.
  • The policy sub-config is [orchestrator.model], flat: name / lora / vlm / trust_remote_code directly on it, the deployment under [orchestrator.model.client] — no more model.model stutter, in configs or resolved dumps (HostedModelConfig is now ModelConfig + a client field, mirroring FrozenModelConfig). The [orchestrator.policy] / [orchestrator.student] aliases, the [orchestrator.client] shorthand, and the flat-key re-nesting shim are deleted.
  • algorithm.token_scorer is deleted; scorers are advantage strategies now (logprobs → ref_kl, demo_logprobs → demo_ref_kl).
  • algo.loss does not exist: the action-token loss component derives from the advantage strategy, and observation-token routing is the echo advantage type (roles.<role>.alpha, formerly observation_weight).
  • Advantage types: default → group_norm; none is deleted (the ref_kl-family strategies carry its no-scalar semantics; nothing else used it).
  • Wire/trainer: TrainingSample.teacher_logprobs → ref_logprobs; the loss-type partition (loss_type / token_loss_types / token_loss_weights / loss_type_ids, LOSS_CORE_*) is replaced by the component weight streams rl_weights / ce_weights / ref_kl_weights (the partition was the degenerate disjoint case); metric time/teacher_logprobs → time/scoring.
  • configs/debug/training_modes/ → configs/debug/algorithms/ (incl. new self_distill.toml running SDFT against the live policy); CI integration configs migrated.

Configs are extra="forbid", so stale configs fail loudly at parse time with the exact unknown key.

Merged with main including #2720 (teacherless SFT). Its spelling is deliberately not carried forward: sft means distillation from a frozen teacher, and pointing it at "policy" is rejected at validation — CE on the policy's own rollouts is a different algorithm, and if wanted it gets its own advantage type/class rather than a degenerate spelling of sft.

Also merged main including #2641 (multiplexed trainer token export): the export machinery composes with the component weight streams — each exported record carries the per-token rl/ce/ref_kl arrays, and micro-batches carry run_id/run_step alongside the streams (training_mode stays deleted).

Note: this PR absorbed #2764 (named algorithm classes → two-component model → Sampler split), #2778 (MaxRL), and #2782 (echo per-role selection + filter hook); those branches were merged into this one and GitHub marked their PRs merged.

Validation

Pre-review polish round: a full-branch review pass (parsimony / stale-reference / docs-accuracy sweep) landed two follow-ups. 48ac7ffb9 scrubs leftover preset vocabulary and stale type names from docs and debug configs (ref_kl → opd in the advantage tables, missing max_rl/reward/custom rows, one wandb project for the debug folder), and f5be6f4b3 fixes a degenerate-batch crash: a micro batch whose components are all empty (e.g. a distillation sample whose prompt alone exceeds trainer seq_len, so truncation strips every nonzero ce/ref_kl token while the stamped all-zero rl stream skips the rl branch) returned a Python float from compute_loss and crashed backward(). The loss is now seeded with a graph-attached zero, so the batch trains as a zero-gradient no-op (main's behavior) and every rank still runs backward, keeping FSDP collectives in sync; this is pinned by test_empty_components_keep_backward_valid, which fails on the prior code. A follow-up (0606f1272) namespaces ref_kl_loss_fn's trust-region metrics as ref_kl/* (mixed batches previously averaged the rl and ref_kl trust-region definitions into one wandb series), turns a missing advantage on an rl-member sample into a loud ValueError instead of silent zero-gradient training, pads weight streams with 0.0 so padded pure-ce batches read as rl-empty in token export, and pins the pack-boundary STREAM_FILL backfill with positional asserts in both pack orders. Verified by the full trainer+orchestrator unit suites (188 passed) and a 5-step opsd GPU smoke with the renamed keys confirmed in the wandb run.

On the type-rename + preset-layer deletion (full CPU unit suite 485 green excl. the pre-existing tilelang test_qwen3_5_moe* box failures; ruff clean; --dry-run parse of every flipped debug TOML; 5-step grpo GPU smoke via type = "grpo", reward 0.10→0.41, 128/128 trainable, exit 0).

On the echo selection surface (20-step GPU run on multi-turn alphabet-sort with an assembled user-role table, exit 0) — verified at the wire: all 32 samples of a shipped batch carry ce_weights with 5,240 nonzero tokens at exactly the configured α=0.1, and the orchestrator-internal completion_obs_weights field leaked into zero shipped samples. The filter hook is covered by shape-violation and composition unit tests.

On the final two-component shape (full CPU unit suite green — 526 passed, pre-existing test_qwen3_5_moe* tilelang failures on this box excluded; ruff clean; 5-step GPU smokes, 2-GPU, all exit 0):

  • echo on multi-turn alphabet-sort (3.1–3.6 turns, 32/32 trainable every step) — EchoAlgorithm + Sampler split end-to-end: observation tokens tagged, ce-weighted, and trained while staying out of the rl denominator.
  • self_distill: OPSDAlgorithm.score() against the live policy pool, demo-conditioned prefix rendered via renderer.render_ids (the policy's own prompt-rendering path).
  • grpo — the no-stream hot path (eval 0.16→0.40 in 5 steps).

On the preset-resolution refactor: full unit suite incl. the config-load sweep over every shipped TOML, resolved-config round-trip (model_dump → re-validate), legacy [[env]] layout, env-shorthand-on-inherited-preset and conflict tests, plus a 5-step grpo smoke (eval 0.13→0.55, 128/128 trainable).

On the component-sum shape (5-step smokes, all exit 0): grpo eval 0.13→0.53 (no-stream hot path — math identical to the old single-denominator path for single-component runs); echo with the ce stream carrying observation tokens (loss ≈ 0.03 while GRPO advantages are ~0, i.e. the gradient is the CE component); opd eval 0.067→0.684 with the frozen reference on :8001 (ref_kl stream on action tokens, rl stream zeroed, frozen pool initialized from the inline config).

Earlier 50-step runs validated the same orchestrator machinery on the loss-partition iteration of this branch (per-component normalization is numerically identical for these single-component runs):

  • grpo: eval reward 0.139→0.847 — default path end-to-end, wire unchanged.
  • opd: 0.106→0.828 with the frozen Qwen3-0.6B-Reverse-Text-RL server on :8001 declared inline — algorithm-owned pool wiring + bounded ship-time scoring + ref_kl routing.
  • sft_distill: 0.113→0.835 — frozen-sourced sampling (stable prefix cache, no off-policy aging) + supervised advantage + CE.
  • mixed grpo+opd (new configs/debug/algorithms/mixed_grpo_opd.toml): two envs with different algorithms in one run, 0.091→0.835 — every batch mixes ref-bearing (opd, 30–60% per step) and ref-less (grpo) samples, exercising heterogeneous packing end-to-end with the post-pack alignment assert live.
  • custom per-token advantages: 5-step run with a custom strategy emitting alternating scalar / scalar·0.5 per-token advantages (eval 0.099→0.681), verified via trainer token export: all 128 step-0 sequences carry the exact alternating pattern over the loss-masked region (AdvantageOutputs.token_advantages → TrainingSample.token_advantages → trainer, prompt positions padded out).
  • self_distill vs policy: wiring healthy (no frozen pool created, demo_ref_kl scores against the live policy, no-scalar rollouts stay trainable). Known issue (not this PR's machinery): over 50 steps it degrades (eval 0.078→0.0, truncation →99%) — a verbosity spiral plausibly driven by the debug config's 128-token cap (~80% truncation at step 0) + the self-referential reference. Tracked for follow-up (truncation-aware demo scoring / larger budget / EMA reference).
  • atomic presets + echo tool-mode (full unit suite 483 green; 5-step grpo smoke on the refactored preset path, eval 0.10→0.53, exit 0). A second 50-step round of all presets on the merged branch reproduced round-1 results (grpo 0.845, opd 0.837, sft_distill 0.822, self_distill collapse).
  • opd without scalars (50-step A/B vs the group-norm baseline): group-norm scalars on OPD rollouts were dead weight — ref_kl_loss_fn zeroes the scalar gradient, so they only steered the DPPO mask direction and the zero-advantage filter, which wrongly dropped uniform-reward OPD groups carrying full teacher-KL signal. Removing them ties the baseline (0.825 vs 0.828); OPD now ships advantage=None like OPSD and ref_kl_loss_fn reads no scalars (its trust region is the low side explicitly — bit-identical math).

Deferred (next PRs)

The two-component model is built so each of these lands as its own PR without re-touching the wire:

  • Sampling strategies behind the Sampler — today the Sampler answers one question (which pool, and is it live); next it absorbs within-env example iteration and group production, plus a sink→sampler observe() feedback edge at group finalization. In increasing order of machinery that unlocks: difficulty-pool curricula (per-example selection state fed back from group rewards), static dataset sources (supervised training from demonstrations instead of a frozen endpoint), and replay buffers / offline experience (a store between stamping and batch assembly — advantages and weight streams are already stamped-then-frozen at group finalization, which is the prerequisite).
  • Trainer collapse — the three fixed components reduce to two cores (policy-gradient × supervised) composed with per-token factor streams. Dissolves the rl-slot-only custom-loss wart and lets the stability guards (importance correction, trust region) compose around custom losses instead of being replaced by them.
  • Remaining ECHO surface — per-role α, arbitrary roles, and content/sampling-logprob filters shipped in this PR; still open: tool_names filtering (non-breaking optional field on the tool role; message_tool_names already rides in the attribution) and θ-dependent low-probability filters, which need a trainer-side per-component knob since the denominator collective happens pre-forward.
  • Smaller: config knobs for component sums (per-env α/β/γ folded into the streams orchestrator-side; the wire and trainer already support overlap); opt-in weight-sum normalization (Σw denominators; a globally-folded λ cancels under it, so it pairs with a trainer-side coefficient); per-slot loss-fn configs for ce/ref_kl (the ref_kl trust region, one-sided today, becomes a deliberate choice there); barrier-blind sink (the group wait becomes a dependency declared by the algorithm; finalize_group → finalize); EMA/lagged reference endpoints; exposing frozen endpoints to envs as judge clients.

🤖 Generated with Claude Code


Note

High Risk
Large breaking config and orchestrator/trainer contract change (wire fields, validation, multi-algorithm packing); incorrect migration or edge cases in mixed batches could affect training correctness.

Overview
Replaces the rl / opd / sft training-mode switch with a first-class [orchestrator.algo] bundle: sampling (policy vs inline frozen endpoint) and advantage (grpo, max_rl, opd, opsd, sft, echo, reward, custom) that jointly pick credit assignment and which loss component trains each token.

Config & hosting: Drops orchestrator.teacher and nested student in favor of flat [orchestrator.model] (trainable policy only) plus [orchestrator.algo.model] / teacher for external frozen servers. CI and debug TOMLs move from training_modes/ to configs/debug/algorithms/ with new recipes (MaxRL, ECHO, self-distill, mixed GRPO+OPD).

Runtime: Per-env Sampler + named Algorithm classes (assign at group time, score at batch ship for reference prefill). Dispatcher/sink/orchestrator branch on liveness (policy vs frozen), not modes. Wire ships ref_logprobs and optional rl_weights / ce_weights / ref_kl_weights; trainer sums rl + ce + ref_kl with per-component token normalization. Metrics rename time/teacher_logprobs → time/scoring.

Docs (algorithms.md, training.md, scaling/inference paths) and skills/configs/SKILL.md are rewritten around the new abstraction.

Reviewed by Cursor Bugbot for commit 0606f12. Bugbot is set up for automated code reviews on this repo.

…d, sft_distill, self_distill, echo)

Replace the global training_mode enum with a per-env Algorithm abstraction:
a preset bundle of (1) sampling source, (2) scoring (group advantage +
async token scorer), and (3) per-token loss routing. The trainer becomes
algorithm-blind: routing ships per token on the wire and the trainer
executes three fixed loss cores (rl / ce / teacher_kl).

- configs: new prime_rl.configs.algorithm with AlgorithmConfig presets,
  component-level overrides, compatibility validation (incl. the
  group-relative-advantage-with-group_size=1 footgun warning);
  training_mode kept as a deprecated alias
- orchestrator: per-env algorithm; dispatcher selects student/teacher pool
  per env (no mode branches); OPD teacher logprobs moved out of
  finalize_train_batch into a bounded-concurrency token scorer;
  demo-conditioned teacher scorer for SDFT; interleave_rollout can tag
  env-observation tokens for ECHO
- wire: TrainingSample/MicroBatch carry loss_core + optional per-token
  cores/weights/advantages (omit_defaults — plain GRPO wire unchanged);
  packer no longer bins by mode
- trainer: unified per-token loss routing, bit-for-bit with the previous
  rl/opd/sft loss fns on pure batches

Validated: 443 CPU unit tests + GPU loss/batch tests; live 2-GPU smoke
runs for grpo (reverse_text), opd (teacher pool + alias path), and echo
(multi-turn alphabet-sort, per-token routing verified on the wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/algorithm-abstraction branch from fb0da20 to f35e3a8 Compare June 9, 2026 19:59
…hm strategy object

"Teacher" is no longer a concept anywhere in the system. There is the live
policy (reserved registry key "policy") and named frozen hosted models under
[orchestrator.models.<key>]; algorithm components hold references into that
registry. The same entry can serve any number of envs' algorithms, and
self_distill can point its demo scorer at "policy" itself — the SDFT paper's
setting, zero extra deployments.

- configs: scorer types logprobs/demo_logprobs with required model refs;
  sampling.source is a registry key; algorithm.model shorthand folds into the
  unresolved component; orchestrator.teacher and training_mode deleted;
  student renamed policy; registry validation (refs resolve, entries used,
  "policy" reserved, degenerate logprobs@policy rejected)
- runtime: ModelRegistry + per-env Algorithm strategy object as the sole
  interpreter of AlgorithmConfig; dispatcher/sink/orchestrator call hooks and
  never branch on algorithm config; liveness drives cache salting, sampling
  logprobs, and off-policy aging (frozen-sourced rollouts no longer age)
- wire/trainer: ref_logprobs, LOSS_CORE_REF_KL, loss action ref_kl,
  time/scoring metric
- fixes found by the new SDFT smoke: resolved-config round-trip (shorthands
  are now write-only / excluded from dumps) and apply_chat_template returning
  BatchEncoding on newer transformers
- configs/debug/training_modes -> configs/debug/algorithms (+ self_distill.toml
  running SDFT against the live policy); docs/skills updated

Smokes (2 GPU, 5 steps each): grpo 0.120->0.382, opd-via-registry
0.147->0.647, self_distill-vs-policy 0.068->0.181, echo multi-turn 32/32
trainable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite changed the title feat: Algorithm abstraction — sampling/scoring/loss presets (grpo, opd, sft_distill, self_distill, echo) feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026
…gy ontology

Every training signal is an advantage — varying in granularity (group-scalar
vs per-token) and evaluation site (orchestrator vs trainer). The advantage
union absorbs the token scorers (logprobs -> ref_kl, demo_logprobs ->
demo_ref_kl), the action-token loss core derives from the strategy instead of
being configured (loss.action deleted), and runtime AdvantageStrategy objects
own both execution points: group-time assign() and ship-time score().

Also fixes a shorthand-folding regression: resolve_preset's component
assignment polluted model_fields_set, so any [orchestrator.advantage]
shorthand differing from the preset raised a bogus conflict error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite changed the title feat: algorithm abstraction — model registry (no model roles) + per-env presets (grpo, opd, sft_distill, self_distill, echo) feat: algorithm abstraction — model registry + unified advantage strategies (grpo, opd, sft_distill, self_distill, echo) Jun 9, 2026
@hallerite hallerite marked this pull request as ready for review June 9, 2026 22:50
Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
…ction

# Conflicts:
#	packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
#	src/prime_rl/orchestrator/dispatcher.py
#	src/prime_rl/orchestrator/orchestrator.py
Comment thread src/prime_rl/trainer/batch.py
A bin mixing ref-bearing samples (opd/self_distill) with ref-less ones
(grpo/echo) extended ref_logprobs without backfilling or padding, shifting
it out of alignment with input_ids. Mirror the rewards/loss_core_ids
pattern with 0.0 placeholders (already the outside-the-mask filler used by
the demo scorer and pad_micro_batch).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
Misaligned parallel arrays (the ref_logprobs packing bug class) now fail
loudly at pack time instead of corrupting training silently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
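
A pack-time alignment guard of the kind this commit describes might look like the sketch below. The helper name `check_aligned` and its signature are assumptions for illustration; the PR's real check may be structured differently.

```python
from typing import Optional, Sequence


def check_aligned(n_tokens: int, **streams: Optional[Sequence]) -> None:
    """Raise if any non-None per-token stream differs in length from input_ids.

    Hypothetical helper: `streams` maps a stream name (e.g. ref_logprobs)
    to its per-token values, or to None when the stream is absent.
    """
    bad = {
        name: len(values)
        for name, values in streams.items()
        if values is not None and len(values) != n_tokens
    }
    if bad:
        raise ValueError(
            f"per-token streams misaligned with input_ids (len {n_tokens}): {bad}"
        )
```

Failing here, before the batch ships, turns a silent training corruption into an immediate error that names the offending stream.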

Member

Can this go into the orchestrator configs? I think right now we have one config file per entrypoint.

Comment thread examples/glm5_pd_disag/rl.toml Outdated
Comment thread configs/debug/algorithms/echo.toml Outdated
Comment thread docs/algorithms.md Outdated
Comment thread docs/algorithms.md Outdated
Comment thread docs/algorithms.md
- Set `[trainer.loss] type = "default"` and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields.
+ Set `[trainer.loss] type = "default"` and configure via the knobs above. The `ce` and `ref_kl` cores are fixed and unaffected by `[trainer.loss]`.

### Custom Loss

Member

Isn't this obsolete?

Member Author

We could make it obsolete by making the RL loss more configurable in the algo config, I guess.

hallerite and others added 2 commits June 10, 2026 00:57
The config surface key is now [orchestrator.algo] (per-env: algo = {...});
the wire/trainer routing vocabulary is loss_type (LOSS_TYPE_RL/CE/REF_KL,
TrainingSample.loss_type, token_loss_types, MicroBatch.loss_type_ids,
advantage.action_loss_type). Also scrubs stale token-scorer mentions from
the ref_kl error message and the configs skill.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The field is now `model` (HostedModelConfig); `[orchestrator.policy]` and
`[orchestrator.student]` fold in as aliases, with the canonical key winning
at the leaf so CLI --model.<k> overrides aliased TOML. Flat ModelConfig keys
still re-nest ([orchestrator.model] name = ...). Shared-field propagation
checks all spellings for conflicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread configs/ci/integration/reverse_text_rl_opd/start.toml Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py Outdated
Two envs with different algorithms in one run — exercises heterogeneous
train batches (ref_logprobs-bearing OPD samples packed with ref-less GRPO
samples). Validated 50 steps on 2 GPUs, eval 0.652->0.836.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py Outdated
…r references

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread src/prime_rl/transport/types.py Outdated
… registry

prime-rl now assumes it only ever hosts the trainable policy. Frozen models
are external endpoints declared inline on the algorithm component that uses
them (FrozenModelConfig: model.name + required client.base_url) — no more
[orchestrator.models] namespace or runtime ModelRegistry. Each env's
Algorithm builds and readies its own frozen pools in async setup(); the
dispatcher reads algorithm.sampling_pool and gets the policy pool directly.

References are "policy" | inline config; demo_ref_kl now defaults to
"policy" (the SDFT setting needs zero config). The algo.model shorthand
folds with fill-or-agree semantics, which also fixes the two Bugbot
findings (redundant-but-consistent model rejected; advantage shorthand
clearing a folded model).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
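
The "fill-or-agree" folding of the `algo.model` shorthand can be read as the following rule. This is a hedged sketch with a hypothetical helper name and plain dicts standing in for the config objects; the actual validator in `algorithm.py` is not shown in this thread.

```python
def fold_fill_or_agree(component: dict, key: str, value: object) -> dict:
    """Fold a shorthand value into component[key].

    - unset  -> fill it with the shorthand value
    - equal  -> accept (redundant but consistent)
    - differ -> raise a conflict error
    Returns a new dict; the input is not mutated.
    """
    current = component.get(key)
    if current is None:
        return {**component, key: value}
    if current == value:
        return dict(component)
    raise ValueError(
        f"shorthand {key}={value!r} conflicts with explicit {key}={current!r}"
    )
```

Under this rule a redundant-but-consistent model is accepted rather than rejected, and an already-folded model is never cleared, which matches the two findings the commit says it fixes.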
Comment thread src/prime_rl/entrypoints/rl.py
A frozen model reference is the client config we already have plus the one
request-level datum it lacks: the served model's name. Drops the nested
{model, client} shape — TOML reads `[orchestrator.algo.model]` with
name + base_url. Also fixes the rl entrypoint's frozen-endpoint warning,
which still read the deleted [orchestrator.models] dict.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
hallerite and others added 2 commits June 11, 2026 21:50
…ction

# Conflicts:
#	skills/configs/SKILL.md
#	src/prime_rl/trainer/rl/data.py
#	tests/unit/test_configs.py
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the
implicit success probability instead of pass@1: the policy gradient
averaged over successful rollouts only is unbiased for the
order-group_size truncation of the ML objective's pass@k expansion. In
estimator form that is one change to GRPO — normalize the centered group
reward by the group MEAN instead of the standard deviation, upweighting
low-pass-rate examples like 1/p. group_size becomes the truncation order
(REINFORCE at 1, exact ML in the limit).

New 'max_rl' advantage type + preset: MaxRLAdvantageConfig,
max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs
rows, and a unit test for the estimator. Groups with zero mean reward
carry zero advantages (the paper's no-success convention — the
zero-advantage filter drops them). Everything else rides the existing
GRPO path: policy sampling (enforced by the rl-component guard), rl
loss component, group barrier.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
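
As a worked illustration of the estimator change described above, here is a minimal sketch. It is not the PR's `max_rl_advantage_fn`: the function name, the pure-Python list interface, and the assumption of non-negative (e.g. binary) rewards are all illustrative.

```python
from statistics import fmean


def max_rl_advantages(rewards: list[float]) -> list[float]:
    """Center each reward on the group mean, then divide by the group MEAN.

    Sketch only. Assumes non-negative rewards, so a zero mean implies
    a group with no successes.
    """
    mean = fmean(rewards)
    if mean == 0.0:
        # No-success convention: the group carries zero advantages.
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


# One success in four (pass rate p = 0.25): the success gets (1 - p) / p = 3.
print(max_rl_advantages([1.0, 0.0, 0.0, 0.0]))  # [3.0, -1.0, -1.0, -1.0]
# Two successes in four (p = 0.5): the success weight drops to 1.
print(max_rl_advantages([1.0, 1.0, 0.0, 0.0]))  # [1.0, 1.0, -1.0, -1.0]
```

With binary rewards the successful rollouts receive `(1 - p) / p` and the failed ones `-1`, which is the 1/p-style upweighting of low-pass-rate examples mentioned in the commit message.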
hallerite and others added 4 commits June 11, 2026 22:28
…self

A preset name with explicit advantage/sampling keys is now a parse-time
error instead of a merge: a modified preset is not the preset, so the
config must state what it actually runs. Only the model/teacher
shorthand may accompany a name (the distillation presets are incomplete
without an endpoint by design). Assembly stays cheap — presets are thin
deltas, so a variant costs one explicit 'type' key.

Deletes the merge machinery: _merge_preset_delta and the
discriminator-aware typeless override (advantage = { max_concurrent }
under opd silently inheriting ref_kl) are gone; the preset validator
inserts components, never merges. 'name' becomes write-only input sugar
(excluded from dumps, like 'model') so resolved configs round-trip as
plain component assemblies; the orchestrator startup log now reports
advantage types instead of preset labels. The advantage shorthand gets
a preset-aware error instead of silently relabeling an inherited
preset.

echo.toml's lambda override becomes the assembled spelling, and the
debug configs spell algo as an [orchestrator.algo] section instead of
an inline table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread configs/debug/algorithms/self_distill.toml Outdated
hallerite and others added 8 commits June 11, 2026 22:44
The echo preset now means the vetted ECHO setting: weighted CE on
tool/terminal response bodies (lambda = 0.1), selected via the
renderer's per-token attribution (message_indices / message_roles /
is_content from RendererClient rollouts). EchoAdvantageConfig grows
observations: 'tool' (default; requires the renderer, validated at
config time) | 'all' (every env-provided token — the previous
behavior). interleave_rollout takes the mode instead of a bool and
tags tool spans token-exactly: response bodies when is_content is
available, whole tool messages otherwise; MITO rollouts raise loudly.

Env-shorthand assembly fix: an env advantage shorthand now assembles
the env's own algorithm instead of copy-modifying the inherited
preset (atomicity); self_distill.toml spells its demo_key variant
assembled; echo.toml (alphabet-sort, user-role feedback) assembles
with observations = 'all'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e, no aliases

HostedModelConfig was a pairing wrapper ({model, client}) that made the
canonical path stutter (orchestrator.model.model.name) and needed a
before-validator apologizing for it (re-nesting flat keys, folding the
policy/student aliases and the [orchestrator.client] shorthand). The
flat spelling was what everyone wrote anyway — make it the schema:
HostedModelConfig is now ModelConfig + a client field, mirroring
FrozenModelConfig (name + endpoint, no new declaration scheme), and
fold_policy_shortcuts is deleted along with the policy/student aliases
and the client shorthand. Resolved dumps stop stuttering too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
hallerite and others added 4 commits June 11, 2026 23:09
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Echo's selection surface generalizes from the observations="tool"|"all"
binary to a role table: each env-provided message role (system / user /
assistant / tool) trains at its own alpha, selected via the renderer's
per-token attribution. An optional filter hook (import_path + kwargs,
matching the custom advantage/loss precedent) narrows the selection per
rollout with one keep-mask per trajectory step.

- completion_obs_mask (bool) -> completion_obs_weights (float): the
  per-token weight carries its role's alpha, so stamping folds it into
  ce_weights directly and stamp_loss_routing drops the scalar
  observation_weight parameter. Orchestrator-internal as before.
- The echo preset is unchanged in meaning: tool-response bodies at 0.1.
  Setting any role replaces the whole table.
- Echo now always requires the renderer (role selection needs
  attribution); the blanket "all" mode is gone — assemble the roles
  you want instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
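
The role table can be pictured as a per-token lookup. The sketch below is an assumption-laden simplification: the function name, the flat per-token inputs, and the restriction to content tokens for every role are illustrative, and the optional filter hook is left out. Only the default (tool-response bodies at 0.1) and the "setting any role replaces the whole table" rule come from the commit message.

```python
from typing import Optional

DEFAULT_ROLE_ALPHAS = {"tool": 0.1}  # the echo default: tool-response bodies at 0.1


def observation_weights(
    token_roles: list[str],
    is_env_token: list[bool],
    is_content: list[bool],
    role_alphas: Optional[dict[str, float]] = None,
) -> list[float]:
    """Per-token CE weight: the role's alpha on env-provided content tokens, else 0.0.

    Passing role_alphas replaces the whole table rather than merging with the default.
    """
    alphas = DEFAULT_ROLE_ALPHAS if role_alphas is None else role_alphas
    return [
        alphas.get(role, 0.0) if (env and content) else 0.0
        for role, env, content in zip(token_roles, is_env_token, is_content)
    ]
```

Because each weight already carries its role's alpha, a consumer can fold it straight into the CE weights without a separate scalar observation weight, as the first bullet above notes.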
hallerite and others added 2 commits June 12, 2026 00:18
…eted

The advantage type now names the algorithm — group_norm -> grpo,
ref_kl -> opd, demo_ref_kl -> opsd, supervised -> sft (config classes
renamed to match) — and each type's class defaults are its vetted
setting, so 'type = "opd"' with a teacher IS on-policy distillation.

With type-plus-defaults equal to the preset for every algorithm, the
preset layer had nothing left to do: AlgorithmName, _PRESETS, the name
field, and the atomicity guard are deleted. The model/teacher shorthand
survives and now folds by the type's own declarations (model_role ->
advantage.model for opd/opsd; source_role -> sampling.source for sft).
sampling.source loses its None state (it existed only for preset
resolution); sft without a frozen source is rejected at validation —
CE on the policy's own tokens was never the vetted meaning.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8e7e51f.

Comment thread packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
@hallerite hallerite requested review from mikasenghaas and samsja June 12, 2026 00:53
hallerite and others added 3 commits June 12, 2026 01:04
- docs/algorithms.md: ref_kl -> opd in the advantage tables, add the
  missing max_rl/reward/custom rows, fix the frozen-model base_url and
  custom-Algorithm wording, make the length_penalty example
  self-sufficient, drop the Per-Env Advantage section (duplicate of
  Per-Env Algorithms)
- configs/debug/algorithms: README gains max_rl and uses the real type
  names, comments lose leftover preset vocabulary, one wandb project
  for the whole folder
- docs/training.md / skills/configs/SKILL.md: complete type lists and
  a union example that parses on its own

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fully truncated distillation sample (prompt >= trainer seq_len) loses
all its nonzero ce/ref_kl tokens to prepare_sample's truncation while
its stamped all-zero rl_weights suppress the rl branch; with every
component empty, compute_loss returned the Python float 0.0 and
loss.backward() crashed. Seed the rl accumulator with a graph-attached
zero so the degenerate batch trains as a zero-gradient no-op (main's
behavior) and every rank still runs backward, keeping FSDP collectives
in sync.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- ref_kl_loss_fn emitted the same trust-region metric keys as the rl
  loss fn into one shared dict, so mixed batches (per-env algorithms)
  averaged two different trust-region definitions into one wandb
  series. Namespaced as ref_kl/*; the wandb noise filter gets matching
  prefixes and the ref_kl value series is unchanged.
- prepare_sample: a sample with rl member tokens but no advantage now
  raises instead of silently training with advantage 0.0 — the
  orchestrator always stamps a scalar, so a missing one is a producer
  bug (ce/ref_kl-only samples still default to 0.0 legitimately).
- pad_micro_batch: padding fills every weight stream with 0.0 instead
  of the pack-boundary defaults; padding is loss-masked so this is
  training-equivalent, and padded pure-ce batches now read as rl-empty
  in token export, which keys off nonzero weights.
- test_prepare_batch_packs_mixed_components: sorted() multiset checks
  replaced with exact positional asserts in both pack orders, pinning
  STREAM_FILL backfill alignment across the bin boundary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
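
The `prepare_sample` rule in the second bullet can be sketched as below. The helper name and the string values for the loss types are assumptions (the PR names constants `LOSS_TYPE_RL/CE/REF_KL` but this thread does not show their values).

```python
from typing import Optional

# Assumed values for illustration; the real constants may differ.
LOSS_TYPE_RL, LOSS_TYPE_CE, LOSS_TYPE_REF_KL = "rl", "ce", "ref_kl"


def resolve_advantage(token_loss_types: list[str], advantage: Optional[float]) -> float:
    """Return the scalar advantage to train with, or raise on a producer bug.

    A sample with rl tokens must arrive with an advantage; ce/ref_kl-only
    samples may omit it and default to 0.0.
    """
    if advantage is not None:
        return advantage
    if any(t == LOSS_TYPE_RL for t in token_loss_types):
        raise ValueError("sample has rl tokens but no advantage was stamped")
    return 0.0
```

The point of raising is that a silent 0.0 on rl tokens would train as a no-op and hide the upstream bug.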

3 participants