4 changes: 4 additions & 0 deletions configs/debug/algorithms/README.md
@@ -13,6 +13,7 @@ Minimal end-to-end configs for the algorithms against bundled verifiers envs, us
| `sft_distill_external.toml` | `sft` | PI inference (`openai/gpt-5-mini`) | external OAI endpoint; no local server |
| `self_distill.toml` | `opsd` | none (`model = "policy"`) | SDFT against the live policy; demo from reverse-text's `answer` field |
| `echo.toml` | `echo` | none | multi-turn `alphabet-sort`; CE on observation tokens |
| `rlcsd.toml` | `rlcsd` | none (`model = "policy"`) | contrastive self-distillation modulating GRPO; hints from sibling rollouts |
| `mixed_grpo_opd.toml` | `grpo` + `opd` (per env) | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | two envs, one run; heterogeneous batches (with/without `ref_logprobs`) |

The policy inference server is auto-launched on GPU 0 at `http://localhost:8000/v1` with `gpu_memory_utilization=0.5`. The local frozen model (used by `opd*.toml`, `sft_distill.toml` / `sft_distill_lora.toml`, and `mixed_grpo_opd.toml`) is **not** auto-launched — start it manually on GPU 1.
@@ -58,6 +59,9 @@ uv run rl @ configs/debug/algorithms/self_distill.toml
# ECHO (no frozen model; multi-turn env)
uv run rl @ configs/debug/algorithms/echo.toml

# RLCSD (no frozen model; teacher = live policy on sibling hints)
uv run rl @ configs/debug/algorithms/rlcsd.toml

# Mixed per-env algorithms: GRPO + OPD in one run (needs the frozen model on port 8001)
uv run rl @ configs/debug/algorithms/mixed_grpo_opd.toml
```
65 changes: 65 additions & 0 deletions configs/debug/algorithms/rlcsd.toml
@@ -0,0 +1,65 @@
# RLCSD (arXiv:2606.11709) on reverse-text: GRPO anchored by the verifier,
# with a contrastive self-distillation signal modulating the advantage at
# high-signal tokens. The teacher is the live policy conditioned on correct /
# incorrect sibling rollouts — no extra server needed.
# uv run rl @ configs/debug/algorithms/rlcsd.toml

max_steps = 20
seq_len = 2048

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[wandb]
project = "algorithms-debug"
name = "debug-rlcsd"

[orchestrator]
batch_size = 128
group_size = 16

# The default template's continuation instruction is math-flavored
# (\boxed{}); reverse-text just wants the answer re-attempted. Reverse-text's
# reward is continuous (LCS) and exact reversals are rare at 0.6B, so the
# binary-verifier default threshold of 1.0 would leave every correct-hint
# pool empty — mostly-right reversals (>= 0.5) serve as positive hints.
[orchestrator.algo.advantage]
type = "rlcsd"
correct_threshold = 0.5
# Negative hints must be clearly failed reversals (< 0.2), not borderline:
# the band in between serves as neither hint, so noise contrasts stop firing
# as groups tighten around the threshold.
min_contrast_gap = 0.3
template = """{question}

Here is a reference solution to this problem:
=== Reference Solution Begin ===
{hint}
=== Reference Solution End ===

After reading the reference solution above, answer the original problem yourself."""

[orchestrator.renderer]
name = "qwen3"

[orchestrator.train.sampling]
max_completion_tokens = 128

[[orchestrator.train.env]]
id = "reverse-text"

[orchestrator.eval]
interval = 5
num_examples = 128

[orchestrator.eval.sampling]
max_completion_tokens = 128

[[orchestrator.eval.env]]
id = "reverse-text"

[trainer.optim]
lr = 1e-6

[inference]
gpu_memory_utilization = 0.5
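
To make the two thresholds in this config concrete, here is a minimal sketch of how one group's rewards would split into hint pools with `correct_threshold = 0.5` and `min_contrast_gap = 0.3`. The function name `split_hint_pools` and the sample rewards are hypothetical and not part of prime-rl; only the two inequalities are taken from the config docstrings.

```python
def split_hint_pools(rewards, correct_threshold=0.5, min_contrast_gap=0.3):
    """Partition rollout indices into correct hints, incorrect hints, and the unused band."""
    # Negative hints must fall strictly below this cutoff (0.2 for the values above).
    negative_cutoff = correct_threshold - min_contrast_gap
    correct = [i for i, r in enumerate(rewards) if r >= correct_threshold]
    incorrect = [i for i, r in enumerate(rewards) if r < negative_cutoff]
    # Rollouts in the band serve as neither kind of hint.
    band = [i for i, r in enumerate(rewards) if negative_cutoff <= r < correct_threshold]
    return correct, incorrect, band


# Hypothetical LCS rewards for a group of six rollouts.
rewards = [0.9, 0.55, 0.35, 0.25, 0.1, 0.0]
correct, incorrect, band = split_hint_pools(rewards)
print(correct, incorrect, band)  # [0, 1] [4, 5] [2, 3]
```

With the class default `correct_threshold = 1.0`, the same group would have an empty correct pool, which is the situation the config comment describes.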
5 changes: 5 additions & 0 deletions docs/algorithms.md
@@ -71,6 +71,7 @@ type = "grpo" # the default
| `sft` | *(the teacher)* | `ce` on actions | Hard distillation: a frozen model generates rollouts, the policy trains with CE on its tokens. Needs a `teacher` (folds into `sampling.source`). |
| `opsd` | policy | `ref_kl` on actions | SDFT ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)): the model is its own reference, conditioned on an expert demonstration. Defaults to the live policy (the paper's setting, no extra deployment); set an inline `model` to score under a frozen copy instead. |
| `echo` | policy | `rl` on actions + weighted `ce` on observations | ECHO: standard GRPO plus a cross-entropy loss on env-provided tokens already present in the rollout, selected by message role (needs the renderer's role attribution). Defaults to tool-response bodies at `alpha = 0.1` (ECHO's λ); set `roles` to train other roles, each at its own weight. |
| `rlcsd` | policy | `rl` on actions | RLCSD ([arXiv:2606.11709](https://arxiv.org/abs/2606.11709)): GRPO anchored by the verifier, with a contrastive self-distillation signal — the teacher's logprobs under a correct sibling-rollout hint vs. under K incorrect sibling hints — modulating the advantage magnitude at high-signal tokens (sign-preserving). The identical hint template makes the privilege-induced style shift cancel in the subtraction, concentrating the signal on task-bearing tokens. Teacher defaults to the live policy. |
| `reward` | policy | `rl` on actions | REINFORCE-style: advantage = raw reward, no group baseline. |
| `custom` | policy | `rl` on actions | Your own advantage function (`import_path`), per-token advantages per rollout — see [Custom Advantage](#custom-advantage). |

@@ -140,6 +141,7 @@ At runtime, each env's resolved config builds two objects: a `Sampler` (`prime_r
| `max_rl` | `MaxRLAlgorithm` | mean-normalized group credit | — |
| `opd` | `OPDAlgorithm` | — | own-context prefill under the teacher |
| `opsd` | `OPSDAlgorithm` | — | demo-conditioned prefill under the teacher |
| `rlcsd` | `RLCSDAlgorithm` | std-normalized group credit | contrastive hinted prefills → per-token advantage modulation |
| `sft` | `SFTDistillAlgorithm` | group-norm credit (feeds filters) | — |
| `reward` | `RewardAlgorithm` | raw reward | — |
| `custom` | `CustomAlgorithm` | your function | — |
@@ -282,6 +284,7 @@ The advantage strategy is the `advantage` component of the [algorithm](#the-algo
| `reward` | `rl` | Advantage = raw reward, no baseline. |
| `opd` | `ref_kl` | On-policy distillation: per-token reverse KL to a reference model (`model`, an inline frozen hosted model), evaluated in the trainer from shipped reference logprobs. No credit — rollouts keep `advantages = None` (advantage-based filters never fire) and ship no advantage stream; `group_size` only fans out sampling. |
| `opsd` | `ref_kl` | SDFT: per-token reverse KL to a demo-conditioned reference. No credit — rollouts keep `advantages = None` (advantage-based filters never fire) and ship no advantage stream. |
| `rlcsd` | `rl` | Std-normalized group credit, modulated per token at ship time by the contrastive hinted-teacher signal (`λ·tanh(e_ctr/τ)`, masked at `δ`, sign-preserving clamp, two-path normalization via `η`). |
| `sft` | `ce` | Cross-entropy on the sampled tokens. The loss ignores advantages, but group-relative credit is still assigned so reward-based filtering keeps working. |
| `custom` | `rl` | Your function (below); per-token advantages per rollout. |
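
The modulation term in the `rlcsd` row can be sketched numerically. This is a rough illustration, not the code in `RLCSDAlgorithm`: the function name `squashed_contrast`, the log-space subtraction used for `e_ctr`, and the sample probabilities are assumptions. The averaging of probabilities over the K negative hints, the `λ·tanh(e_ctr/τ)` squash, and the `δ` mask follow the descriptions in this PR.

```python
import math


def squashed_contrast(pos_logprob, neg_logprobs, lam=0.5, tau=0.02, delta=0.02):
    """Return (r_t, is_modulated) for a single token.

    pos_logprob:  teacher logprob of the token under the correct-hint context.
    neg_logprobs: teacher logprobs of the same token under each incorrect-hint context.
    """
    # Negative branch: average the probabilities (not the logprobs) over the K hints.
    neg_prob = sum(math.exp(lp) for lp in neg_logprobs) / len(neg_logprobs)
    # Assumed form of the raw contrast: log-ratio of the two branches.
    e_ctr = pos_logprob - math.log(neg_prob)
    r_t = lam * math.tanh(e_ctr / tau)
    # Only tokens whose squashed contrast clears delta are modulated.
    return r_t, abs(r_t) > delta


# A token the correct hint favours, and one where both branches agree.
favoured = squashed_contrast(math.log(0.9), [math.log(0.3)] * 4)
neutral = squashed_contrast(math.log(0.5), [math.log(0.5)] * 4)
print(favoured, neutral)
```

Because `τ` is small, the squash saturates quickly: any clear disagreement between the branches lands near `±λ`, while tokens on which the two contexts agree stay under the mask.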

@@ -356,6 +359,8 @@ demo_key = "demonstration"
max_concurrent = 64
```

`rlcsd` also scores at ship time — `1 + num_negative_hints` hinted prefills per rollout, hints drawn from the rollout's own group siblings — but ships modulated per-token advantages instead of reference logprobs.

Only batch survivors get scored — rollouts that are filtered or cancelled never cost reference compute. The time shows up as `time/scoring` in the step timing.
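
As a rough sizing example for the `rlcsd` scoring cost, the sketch below counts prefills for one batch. It assumes all 128 rollouts of a `batch_size = 128` batch survive filtering and uses the `RLCSDAdvantageConfig` defaults (`num_negative_hints = 4`, `max_concurrent = 32`). The helper name is hypothetical, and the "waves" figure is only a lower bound on sequential rounds, not a timing model.

```python
import math


def rlcsd_prefill_budget(num_survivors, num_negative_hints=4, max_concurrent=32):
    """Upper bound on hinted prefills for one batch, and the minimum number of request waves."""
    # One correct-hint prefill plus K incorrect-hint prefills per surviving rollout.
    prefills = num_survivors * (1 + num_negative_hints)
    waves = math.ceil(prefills / max_concurrent)
    return prefills, waves


prefills, waves = rlcsd_prefill_budget(128)
print(prefills, waves)  # 640 20
```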

## Filters
1 change: 1 addition & 0 deletions docs/training.md
@@ -93,6 +93,7 @@ The RL entrypoint supports several training algorithms, switched via `[orchestra
| `sft` | Required, any OpenAI-compatible endpoint | Hard-distill: a frozen model generates rollouts, the policy trains on its tokens |
| `opsd` | `"policy"` (the default, no deployment) or a vLLM endpoint serving a frozen copy | [SDFT](https://arxiv.org/abs/2601.19897): the model is its own reference conditioned on expert demonstrations |
| `echo` | None | GRPO plus cross-entropy on env-observation tokens |
| `rlcsd` | `"policy"` (the default) or a vLLM endpoint | [RLCSD](https://arxiv.org/abs/2606.11709): GRPO with a contrastive self-distillation signal from sibling-rollout hints modulating per-token advantages |

`reward` (raw-reward credit, no baseline) and `custom` (your own advantage function) complete the set — see [Algorithms § The Algorithms](algorithms.md#the-algorithms).

75 changes: 75 additions & 0 deletions packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
@@ -261,6 +261,80 @@ class OPSDAdvantageConfig(BaseConfig):
"""Maximum concurrent prefill requests per batch."""


class RLCSDAdvantageConfig(BaseConfig):
type: Literal["rlcsd"] = "rlcsd"
"""RLCSD (arXiv:2606.11709): GRPO with a contrastive self-distillation
modulation. The teacher scores each rollout's tokens under a correct
sibling rollout as a hint and under ``num_negative_hints`` incorrect
sibling hints (identical template, so the privilege-induced style shift
cancels in the subtraction); the squashed contrast ``λ·tanh(e/τ)``
modulates the group-relative advantage at tokens where it exceeds
``delta``, with a sign-preserving clamp so the verifier keeps the update
direction. Ships per-token advantages on the ``rl`` loss component.
Groups without both correct and incorrect rollouts get no modulation
(uniform groups already die in the zero-advantage filter, matching the
paper's group-discard rule)."""

action_loss_type: ClassVar[ActionLossType] = "rl"
group_relative: ClassVar[bool] = True
model_role: ClassVar[str] = "teacher"

model: ModelReference = "policy"
"""The teacher the hinted distributions are computed under. ``"policy"``
(the default) approximates the paper's setting — there the teacher is a
snapshot of the student refreshed every 10 steps; the live policy
refreshes every weight update. Set an inline frozen hosted model to
contrast under a fixed teacher instead."""

num_negative_hints: int = Field(4, ge=1)
"""K: incorrect sibling hints whose probabilities average into the
negative branch — marginalizing over error types stabilizes the
contrast."""

tau: float = Field(0.02, gt=0)
"""Soft-threshold slope of the tanh squash on the raw contrast."""

lam: float = Field(0.5, gt=0)
"""Scale of the modulation: ``r_t = lam · tanh(e_ctr / tau)`` ∈ (-lam, lam)."""

delta: float = Field(0.02, ge=0)
"""Modulation mask threshold: only tokens with ``|r_t| > delta`` get
their advantage modulated (~20-30% of tokens at the defaults)."""

eta: float = Field(1.0, ge=0)
"""Weight of the modulated path relative to the unmodulated path; both
paths are normalized independently per rollout, so the modulated tokens
are never diluted by the unmodulated ones."""

correct_threshold: float = 1.0
"""Rollouts with ``reward >= correct_threshold`` form the correct hint
pool — the binary verifier generalized to continuous rewards."""

min_contrast_gap: float = Field(0.0, ge=0)
"""Exclusion band below ``correct_threshold``: negative hints need
``reward < correct_threshold - min_contrast_gap``, so borderline rollouts
never serve as wrong hints and near-threshold noise stops producing
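contrast as the group tightens. ``0.0`` (the default) disables the band;
on binary {0, 1} rewards with the default ``correct_threshold``, any value
in (0, 1) behaves the same as ``0.0``, while ``1.0`` would leave no
negative hints because of the strict inequality."""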

template: str = (
"{question}\n\n"
"Here is a reference solution to this problem:\n"
"=== Reference Solution Begin ===\n{hint}\n=== Reference Solution End ===\n\n"
"After reading the reference solution above, make sure you understand the "
"reasoning behind each step. Please reason step by step, and put your final "
"answer within \\boxed{{}}."
)
"""Template for the hinted teacher context. Receives ``{question}`` (the
original user message text) and ``{hint}`` (a sibling rollout's full
completion text). Byte-for-byte identical for correct and incorrect
hints — that symmetry is what cancels the style component."""

max_concurrent: int = Field(32, ge=1)
"""Maximum concurrent prefill requests per batch (each rollout costs
``1 + num_negative_hints`` prefills)."""
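
The doubled braces in the default `template` matter if the template is filled with Python's `str.format`, which the `{question}` / `{hint}` placeholders suggest but this diff does not show. Under that assumption, here is a minimal sketch of rendering the hinted teacher context. `render_hinted_prompt`, `DEFAULT_TEMPLATE`, and the sample strings are hypothetical names chosen for illustration.

```python
# Copy of the default template from RLCSDAdvantageConfig above.
DEFAULT_TEMPLATE = (
    "{question}\n\n"
    "Here is a reference solution to this problem:\n"
    "=== Reference Solution Begin ===\n{hint}\n=== Reference Solution End ===\n\n"
    "After reading the reference solution above, make sure you understand the "
    "reasoning behind each step. Please reason step by step, and put your final "
    "answer within \\boxed{{}}."
)


def render_hinted_prompt(question, hint, template=DEFAULT_TEMPLATE):
    """Fill the hint template; '{{}}' in the template becomes a literal '{}'."""
    return template.format(question=question, hint=hint)


question = "What is 2 + 2?"
correct_hint = "2 + 2 = 4, so \\boxed{4}."
wrong_hint = "2 + 2 = 5, so \\boxed{5}."

correct_ctx = render_hinted_prompt(question, correct_hint)
wrong_ctx = render_hinted_prompt(question, wrong_hint)
print(correct_ctx)
```

The two contexts differ only in the hint text, which is the symmetry the docstring relies on. Braces inside the substituted hint are left alone, because `str.format` does not re-parse substituted values.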


class SFTAdvantageConfig(BaseConfig):
type: Literal["sft"] = "sft"
"""SFT distillation: cross-entropy on the sampled tokens. The ``ce``
@@ -299,6 +373,7 @@ class CustomAdvantageConfig(BaseConfig):
| RewardAdvantageConfig
| OPDAdvantageConfig
| OPSDAdvantageConfig
| RLCSDAdvantageConfig
| SFTAdvantageConfig
| CustomAdvantageConfig,
Field(discriminator="type"),
@@ -744,10 +744,10 @@ def validate_renderer_for_demo_scoring(self):
if self.renderer is not None:
return self
for env in self.train.env:
if env.algo is not None and env.algo.advantage.type == "opsd":
if env.algo is not None and env.algo.advantage.type in ("opsd", "rlcsd"):
raise ValueError(
f"env '{env.resolved_name}' uses opsd, which renders its demo-conditioned "
"scoring prefix client-side and requires orchestrator.renderer — remove "
f"env '{env.resolved_name}' uses {env.algo.advantage.type}, which renders its "
"hinted scoring prefixes client-side and requires orchestrator.renderer — remove "
"'renderer = \"None\"'."
)
if env.algo is not None and env.algo.advantage.type == "echo":
2 changes: 1 addition & 1 deletion skills/configs/SKILL.md
@@ -51,7 +51,7 @@ CLI: `--env.0.id reverse-text --env.1.id math-env`.

**Discriminated unions** — set the `type` field to pick the variant (`[orchestrator.advantage] type = "max_rl"`). Omit `type` to keep the default variant.

**Algorithms** — `[orchestrator.algo.advantage] type = "grpo" | "max_rl" | "opd" | "opsd" | "sft" | "echo" | "reward" | "custom"` — the advantage type names the algorithm (credit assignment + loss routing, fused), and each type's class defaults are its vetted setting; any other key you set is your own assembly (e.g. `[orchestrator.algo.advantage.roles.user] alpha = 0.1` for echo — setting any echo role replaces the whole role table). There is no preset layer. Per-env override: `[[orchestrator.train.env]]` `advantage = { type = "echo" }` (the env assembles its own algorithm). prime-rl only hosts the trainable policy; frozen models are inline external endpoints on the algorithm — `[orchestrator.algo.teacher]` (alias for `model`) with `name` + `base_url` folds into the slot the type declares (`advantage.model` for opd/opsd, `sampling.source` for sft). `model = "policy"` points a component at the live policy (opsd's default). See `docs/algorithms.md`.
**Algorithms**: `[orchestrator.algo.advantage] type = "grpo" | "max_rl" | "opd" | "opsd" | "rlcsd" | "sft" | "echo" | "reward" | "custom"`. The advantage type names the algorithm (credit assignment + loss routing, fused), and each type's class defaults are its vetted setting; any other key you set is your own assembly (e.g. `[orchestrator.algo.advantage.roles.user] alpha = 0.1` for echo; setting any echo role replaces the whole role table). There is no preset layer. Per-env override: `[[orchestrator.train.env]]` `advantage = { type = "echo" }` (the env assembles its own algorithm). prime-rl only hosts the trainable policy; frozen models are inline external endpoints on the algorithm: `[orchestrator.algo.teacher]` (alias for `model`) with `name` + `base_url` folds into the slot the type declares (`advantage.model` for opd/opsd/rlcsd, `sampling.source` for sft). `model = "policy"` points a component at the live policy (the default for opsd and rlcsd). See `docs/algorithms.md`.

**`BaseModel | None` fields** — bare flag enables defaults; nested override enables and sets:

12 changes: 12 additions & 0 deletions src/prime_rl/orchestrator/algo/__init__.py
@@ -6,12 +6,21 @@
:class:`~prime_rl.orchestrator.sampler.Sampler`):

- one module per algorithm (``grpo``, ``echo``, ``max_rl``, ``opd``,
``opsd``, ``rlcsd``, ``sft``, ``reward``, ``custom``); each named class
owns its hooks (``observation_weights`` / ``assign_advantages`` /
``score``) and declares what it needs (loss component, a "teacher", ...).
One instance per env, built by :func:`build_algorithm`. Custom credit
assignment plugs in through the ``custom`` advantage type
(:class:`CustomAlgorithm` imports a user function by path).
- ``base`` — the :class:`Algorithm` base class and the pipeline phase
functions (:func:`finalize_group` / :func:`finalize_batch`).
- ``advantage`` — pure advantage math: the custom-function interface
@@ -45,6 +54,7 @@
from prime_rl.orchestrator.algo.opd import OPDAlgorithm
from prime_rl.orchestrator.algo.opsd import OPSDAlgorithm
from prime_rl.orchestrator.algo.reward import RewardAlgorithm
from prime_rl.orchestrator.algo.rlcsd import RLCSDAlgorithm
from prime_rl.orchestrator.algo.routing import stamp_advantages, stamp_loss_routing
from prime_rl.orchestrator.algo.sft import SFTDistillAlgorithm

@@ -62,6 +72,7 @@
"max_rl": MaxRLAlgorithm,
"opd": OPDAlgorithm,
"opsd": OPSDAlgorithm,
"rlcsd": RLCSDAlgorithm,
"sft": SFTDistillAlgorithm,
"reward": RewardAlgorithm,
"custom": CustomAlgorithm,
@@ -86,6 +97,7 @@ def build_algorithm(config: AlgorithmConfig, policy_pool: InferencePool, rendere
"MaxRLAlgorithm",
"OPDAlgorithm",
"OPSDAlgorithm",
"RLCSDAlgorithm",
"RewardAlgorithm",
"SFTDistillAlgorithm",
"apply_advantage_fn",