PrimeIntellect-ai · hallerite · Jun 12, 2026
diff --git a/configs/debug/algorithms/README.md b/configs/debug/algorithms/README.md
@@ -13,6 +13,7 @@ Minimal end-to-end configs for the algorithms against bundled verifiers envs, us
 | `sft_distill_external.toml` | `sft` | PI inference (`openai/gpt-5-mini`) | external OAI endpoint; no local server |
 | `self_distill.toml` | `opsd` | none (`model = "policy"`) | SDFT against the live policy; demo from reverse-text's `answer` field |
 | `echo.toml` | `echo` | none | multi-turn `alphabet-sort`; CE on observation tokens |
+| `rlcsd.toml` | `rlcsd` | none (`model = "policy"`) | contrastive self-distillation modulating GRPO; hints from sibling rollouts |
 | `mixed_grpo_opd.toml` | `grpo` + `opd` (per env) | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | two envs, one run; heterogeneous batches (with/without `ref_logprobs`) |
 
 The policy inference server is auto-launched on GPU 0 at `http://localhost:8000/v1` with `gpu_memory_utilization=0.5`. The local frozen model (used by `opd*.toml`, `sft_distill.toml` / `sft_distill_lora.toml`, and `mixed_grpo_opd.toml`) is **not** auto-launched — start it manually on GPU 1.
@@ -58,6 +59,9 @@ uv run rl @ configs/debug/algorithms/self_distill.toml
 # ECHO (no frozen model; multi-turn env)
 uv run rl @ configs/debug/algorithms/echo.toml
 
+# RLCSD (no frozen model; teacher = live policy on sibling hints)
+uv run rl @ configs/debug/algorithms/rlcsd.toml
+
 # Mixed per-env algorithms: GRPO + OPD in one run (needs the frozen model on port 8001)
 uv run rl @ configs/debug/algorithms/mixed_grpo_opd.toml
 ```

diff --git a/configs/debug/algorithms/rlcsd.toml b/configs/debug/algorithms/rlcsd.toml
@@ -0,0 +1,65 @@
+# RLCSD (arXiv:2606.11709) on reverse-text: GRPO anchored by the verifier,
+# with a contrastive self-distillation signal modulating the advantage at
+# high-signal tokens. The teacher is the live policy conditioned on correct /
+# incorrect sibling rollouts — no extra server needed.
+#   uv run rl @ configs/debug/algorithms/rlcsd.toml
+
+max_steps = 20
+seq_len = 2048
+
+[model]
+name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"
+
+[wandb]
+project = "algorithms-debug"
+name = "debug-rlcsd"
+
+[orchestrator]
+batch_size = 128
+group_size = 16
+
+# The default template's continuation instruction is math-flavored
+# (\boxed{}); reverse-text just wants the answer re-attempted. Reverse-text's
+# reward is continuous (LCS) and exact reversals are rare at 0.6B, so the
+# binary-verifier default threshold of 1.0 would leave every correct-hint
+# pool empty — mostly-right reversals (>= 0.5) serve as positive hints.
+[orchestrator.algo.advantage]
+type = "rlcsd"
+correct_threshold = 0.5
+# Negative hints must be clearly failed reversals (< 0.2), not borderline:
+# the band in between serves as neither hint, so noise contrasts stop firing
+# as groups tighten around the threshold.
+min_contrast_gap = 0.3
+template = """{question}
+
+Here is a reference solution to this problem:
+=== Reference Solution Begin ===
+{hint}
+=== Reference Solution End ===
+
+After reading the reference solution above, answer the original problem yourself."""
+
+[orchestrator.renderer]
+name = "qwen3"
+
+[orchestrator.train.sampling]
+max_completion_tokens = 128
+
+[[orchestrator.train.env]]
+id = "reverse-text"
+
+[orchestrator.eval]
+interval = 5
+num_examples = 128
+
+[orchestrator.eval.sampling]
+max_completion_tokens = 128
+
+[[orchestrator.eval.env]]
+id = "reverse-text"
+
+[trainer.optim]
+lr = 1e-6
+
+[inference]
+gpu_memory_utilization = 0.5
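Review note: a minimal sketch of how the two thresholds in this config would partition a group, assuming the semantics documented for `correct_threshold` and `min_contrast_gap` in this PR (correct pool: `reward >= correct_threshold`; negative pool: `reward < correct_threshold - min_contrast_gap`). `split_hint_pools` is a hypothetical helper written for illustration, not a function in prime-rl.

```python
def split_hint_pools(rewards, correct_threshold=0.5, min_contrast_gap=0.3):
    """Partition rollout indices into (correct, negative, neither) hint pools.

    Hypothetical helper: mirrors the documented thresholds only, not the
    actual RLCSDAlgorithm code, which this diff does not show.
    """
    negative_cutoff = correct_threshold - min_contrast_gap
    correct, negative, neither = [], [], []
    for i, reward in enumerate(rewards):
        if reward >= correct_threshold:
            correct.append(i)    # usable as a positive hint
        elif reward < negative_cutoff:
            negative.append(i)   # clearly failed: usable as a negative hint
        else:
            neither.append(i)    # exclusion band: serves as neither hint
    return correct, negative, neither


# LCS-style continuous rewards for one group of six rollouts.
pools = split_hint_pools([0.9, 0.5, 0.35, 0.25, 0.1, 0.0])
```

With the values in this config, rollouts scoring between the negative cutoff and 0.5 land in the third pool and are never shown to the teacher as hints.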
diff --git a/docs/algorithms.md b/docs/algorithms.md
@@ -71,6 +71,7 @@ type = "grpo"  # the default
 | `sft` | *(the teacher)* | `ce` on actions | Hard distillation: a frozen model generates rollouts, the policy trains with CE on its tokens. Needs a `teacher` (folds into `sampling.source`). |
 | `opsd` | policy | `ref_kl` on actions | SDFT ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)): the model is its own reference, conditioned on an expert demonstration. Defaults to the live policy (the paper's setting, no extra deployment); set an inline `model` to score under a frozen copy instead. |
 | `echo` | policy | `rl` on actions + weighted `ce` on observations | ECHO: standard GRPO plus a cross-entropy loss on env-provided tokens already present in the rollout, selected by message role (needs the renderer's role attribution). Defaults to tool-response bodies at `alpha = 0.1` (ECHO's λ); set `roles` to train other roles, each at its own weight. |
+| `rlcsd` | policy | `rl` on actions | RLCSD ([arXiv:2606.11709](https://arxiv.org/abs/2606.11709)): GRPO anchored by the verifier, with a contrastive self-distillation signal — the teacher's logprobs under a correct sibling-rollout hint vs. under K incorrect sibling hints — modulating the advantage magnitude at high-signal tokens (sign-preserving). The identical hint template makes the privilege-induced style shift cancel in the subtraction, concentrating the signal on task-bearing tokens. Teacher defaults to the live policy. |
 | `reward` | policy | `rl` on actions | REINFORCE-style: advantage = raw reward, no group baseline. |
 | `custom` | policy | `rl` on actions | Your own advantage function (`import_path`), per-token advantages per rollout — see [Custom Advantage](#custom-advantage). |
 
@@ -140,6 +141,7 @@ At runtime, each env's resolved config builds two objects: a `Sampler` (`prime_r
 | `max_rl` | `MaxRLAlgorithm` | mean-normalized group credit | — |
 | `opd` | `OPDAlgorithm` | — | own-context prefill under the teacher |
 | `opsd` | `OPSDAlgorithm` | — | demo-conditioned prefill under the teacher |
+| `rlcsd` | `RLCSDAlgorithm` | std-normalized group credit | contrastive hinted prefills → per-token advantage modulation |
 | `sft` | `SFTDistillAlgorithm` | group-norm credit (feeds filters) | — |
 | `reward` | `RewardAlgorithm` | raw reward | — |
 | `custom` | `CustomAlgorithm` | your function | — |
@@ -282,6 +284,7 @@ The advantage strategy is the `advantage` component of the [algorithm](#the-algo
 | `reward` | `rl` | Advantage = raw reward, no baseline. |
 | `opd` | `ref_kl` | On-policy distillation: per-token reverse KL to a reference model (`model`, an inline frozen hosted model), evaluated in the trainer from shipped reference logprobs. No credit — rollouts keep `advantages = None` (advantage-based filters never fire) and ship no advantage stream; `group_size` only fans out sampling. |
 | `opsd` | `ref_kl` | SDFT: per-token reverse KL to a demo-conditioned reference. No credit — rollouts keep `advantages = None` (advantage-based filters never fire) and ship no advantage stream. |
+| `rlcsd` | `rl` | Std-normalized group credit, modulated per token at ship time by the contrastive hinted-teacher signal (`λ·tanh(e_ctr/τ)`, masked at `δ`, sign-preserving clamp, two-path normalization via `η`). |
 | `sft` | `ce` | Cross-entropy on the sampled tokens. The loss ignores advantages, but group-relative credit is still assigned so reward-based filtering keeps working. |
 | `custom` | `rl` | Your function (below); per-token advantages per rollout. |
 
@@ -356,6 +359,8 @@ demo_key = "demonstration"
 max_concurrent = 64
 ```
 
+`rlcsd` also scores at ship time — `1 + num_negative_hints` hinted prefills per rollout, hints drawn from the rollout's own group siblings — but ships modulated per-token advantages instead of reference logprobs.
+
 Only batch survivors get scored — rollouts that are filtered or cancelled never cost reference compute. The time shows up as `time/scoring` in the step timing.
 
 ## Filters

diff --git a/docs/training.md b/docs/training.md
@@ -93,6 +93,7 @@ The RL entrypoint supports several training algorithms, switched via `[orchestra
 | `sft` | Required, any OpenAI-compatible endpoint | Hard-distill: a frozen model generates rollouts, the policy trains on its tokens |
 | `opsd` | `"policy"` (the default, no deployment) or a vLLM endpoint serving a frozen copy | [SDFT](https://arxiv.org/abs/2601.19897): the model is its own reference conditioned on expert demonstrations |
 | `echo` | None | GRPO plus cross-entropy on env-observation tokens |
+| `rlcsd` | `"policy"` (the default) or a vLLM endpoint | [RLCSD](https://arxiv.org/abs/2606.11709): GRPO with a contrastive self-distillation signal from sibling-rollout hints modulating per-token advantages |
 
 `reward` (raw-reward credit, no baseline) and `custom` (your own advantage function) complete the set — see [Algorithms § The Algorithms](algorithms.md#the-algorithms).
 

diff --git a/packages/prime-rl-configs/src/prime_rl/configs/algorithm.py b/packages/prime-rl-configs/src/prime_rl/configs/algorithm.py
@@ -261,6 +261,80 @@ class OPSDAdvantageConfig(BaseConfig):
     """Maximum concurrent prefill requests per batch."""
 
 
+class RLCSDAdvantageConfig(BaseConfig):
+    type: Literal["rlcsd"] = "rlcsd"
+    """RLCSD (arXiv:2606.11709): GRPO with a contrastive self-distillation
+    modulation. The teacher scores each rollout's tokens under a correct
+    sibling rollout as a hint and under ``num_negative_hints`` incorrect
+    sibling hints (identical template, so the privilege-induced style shift
+    cancels in the subtraction); the squashed contrast ``λ·tanh(e/τ)``
+    modulates the group-relative advantage at tokens where its magnitude
+    exceeds ``delta``, with a sign-preserving clamp so the verifier keeps the
+    update direction. Ships per-token advantages on the ``rl`` loss component.
+    Groups without both correct and incorrect rollouts get no modulation
+    (uniform groups already die in the zero-advantage filter, matching the
+    paper's group-discard rule)."""
+
+    action_loss_type: ClassVar[ActionLossType] = "rl"
+    group_relative: ClassVar[bool] = True
+    model_role: ClassVar[str] = "teacher"
+
+    model: ModelReference = "policy"
+    """The teacher the hinted distributions are computed under. ``"policy"``
+    (the default) approximates the paper's setting — there the teacher is a
+    snapshot of the student refreshed every 10 steps; the live policy
+    refreshes every weight update. Set an inline frozen hosted model to
+    contrast under a fixed teacher instead."""
+
+    num_negative_hints: int = Field(4, ge=1)
+    """K: incorrect sibling hints whose probabilities average into the
+    negative branch — marginalizing over error types stabilizes the
+    contrast."""
+
+    tau: float = Field(0.02, gt=0)
+    """Soft-threshold slope of the tanh squash on the raw contrast."""
+
+    lam: float = Field(0.5, gt=0)
+    """Scale of the modulation: ``r_t = lam · tanh(e_ctr / tau)`` ∈ (-lam, lam)."""
+
+    delta: float = Field(0.02, ge=0)
+    """Modulation mask threshold: only tokens with ``|r_t| > delta`` get
+    their advantage modulated (~20-30% of tokens at the defaults)."""
+
+    eta: float = Field(1.0, ge=0)
+    """Weight of the modulated path relative to the unmodulated path; both
+    paths are normalized independently per rollout so the modulated tokens
+    are never diluted."""
+
+    correct_threshold: float = 1.0
+    """Rollouts with ``reward >= correct_threshold`` form the correct hint
+    pool — the binary verifier generalized to continuous rewards."""
+
+    min_contrast_gap: float = Field(0.0, ge=0)
+    """Exclusion band below ``correct_threshold``: negative hints need
+    ``reward < correct_threshold - min_contrast_gap``, so borderline rollouts
+    never serve as wrong hints and near-threshold noise stops producing
+    contrast as the group tightens. ``0.0`` (the default) disables the band;
+    on binary rewards any value in (0, 1) is equivalent to it."""
+
+    template: str = (
+        "{question}\n\n"
+        "Here is a reference solution to this problem:\n"
+        "=== Reference Solution Begin ===\n{hint}\n=== Reference Solution End ===\n\n"
+        "After reading the reference solution above, make sure you understand the "
+        "reasoning behind each step. Please reason step by step, and put your final "
+        "answer within \\boxed{{}}."
+    )
+    """Template for the hinted teacher context. Receives ``{question}`` (the
+    original user message text) and ``{hint}`` (a sibling rollout's full
+    completion text). Byte-for-byte identical for correct and incorrect
+    hints — that symmetry is what cancels the style component."""
+
+    max_concurrent: int = Field(32, ge=1)
+    """Maximum concurrent prefill requests per batch (each rollout costs
+    ``1 + num_negative_hints`` prefills)."""
+
+
 class SFTAdvantageConfig(BaseConfig):
     type: Literal["sft"] = "sft"
     """SFT distillation: cross-entropy on the sampled tokens. The ``ce``
@@ -299,6 +373,7 @@ class CustomAdvantageConfig(BaseConfig):
     | RewardAdvantageConfig
     | OPDAdvantageConfig
     | OPSDAdvantageConfig
+    | RLCSDAdvantageConfig
     | SFTAdvantageConfig
     | CustomAdvantageConfig,
     Field(discriminator="type"),
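Review note: a minimal sketch of the squash and mask described by `lam`, `tau`, and `delta` above. Two things here are assumptions and not taken from the diff: the helper name `contrast_modulation`, and the definition of the raw contrast as the correct-hint logprob minus the log of the mean probability over the K incorrect hints (one reading of "probabilities average into the negative branch"). How `r_t` is then combined with the group advantage (the sign-preserving clamp and the `eta` two-path normalization) is deliberately left out, because the diff does not show that code.

```python
import math


def contrast_modulation(logp_correct, logp_negatives, lam=0.5, tau=0.02, delta=0.02):
    """Return per-token (r_t, modulated?) for one rollout.

    logp_correct:   teacher logprob of each rollout token under the correct hint.
    logp_negatives: K lists of teacher logprobs, one list per incorrect hint.
    """
    r, mask = [], []
    for t, lp_pos in enumerate(logp_correct):
        # ASSUMPTION: negative branch = mean probability across the K hints.
        p_neg = sum(math.exp(lps[t]) for lps in logp_negatives) / len(logp_negatives)
        e_ctr = lp_pos - math.log(p_neg)
        r_t = lam * math.tanh(e_ctr / tau)  # bounded by lam in magnitude
        r.append(r_t)
        mask.append(abs(r_t) > delta)       # only these tokens get modulated
    return r, mask


logp_correct = [math.log(0.9), math.log(0.5), math.log(0.2)]
logp_negatives = [
    [math.log(0.3), math.log(0.5), math.log(0.6)],
    [math.log(0.1), math.log(0.5), math.log(0.8)],
]
r, mask = contrast_modulation(logp_correct, logp_negatives)
```

In this toy input the middle token is equally likely under every hint, so its contrast is near zero and it falls under the `delta` mask; the other two tokens saturate the tanh at the small default `tau`.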

diff --git a/packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py b/packages/prime-rl-configs/src/prime_rl/configs/orchestrator.py
@@ -744,10 +744,10 @@ def validate_renderer_for_demo_scoring(self):
         if self.renderer is not None:
             return self
         for env in self.train.env:
-            if env.algo is not None and env.algo.advantage.type == "opsd":
+            if env.algo is not None and env.algo.advantage.type in ("opsd", "rlcsd"):
                 raise ValueError(
-                    f"env '{env.resolved_name}' uses opsd, which renders its demo-conditioned "
-                    "scoring prefix client-side and requires orchestrator.renderer — remove "
+                    f"env '{env.resolved_name}' uses {env.algo.advantage.type}, which renders its "
+                    "hinted scoring prefixes client-side and requires orchestrator.renderer — remove "
                     "'renderer = \"None\"'."
                 )
             if env.algo is not None and env.algo.advantage.type == "echo":
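Review note: a minimal sketch of what "hinted scoring prefixes" could look like before the renderer turns them into tokens, assuming the template is filled with `str.format` (suggested by the `{{}}` escaping in the default template, but not confirmed by this diff). `hinted_user_messages` is a hypothetical helper; the template text is the one from `configs/debug/algorithms/rlcsd.toml`.

```python
TEMPLATE = (
    "{question}\n\n"
    "Here is a reference solution to this problem:\n"
    "=== Reference Solution Begin ===\n{hint}\n=== Reference Solution End ===\n\n"
    "After reading the reference solution above, answer the original problem yourself."
)


def hinted_user_messages(question, correct_hint, incorrect_hints, template=TEMPLATE):
    """One hinted user message for the correct hint, then one per incorrect hint.

    Hypothetical helper: every message uses the same template, so the messages
    differ only in the hint text.
    """
    hints = [correct_hint, *incorrect_hints]
    return [template.format(question=question, hint=h) for h in hints]


msgs = hinted_user_messages("Reverse the text: abc", "cba", ["bac", "acb"])
```

Each rollout would need `1 + num_negative_hints` such messages, which matches the prefill cost documented on `max_concurrent`.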

diff --git a/skills/configs/SKILL.md b/skills/configs/SKILL.md
@@ -51,7 +51,7 @@ CLI: `--env.0.id reverse-text --env.1.id math-env`.
 
 **Discriminated unions** — set the `type` field to pick the variant (`[orchestrator.advantage] type = "max_rl"`). Omit `type` to keep the default variant.
 
-**Algorithms** — `[orchestrator.algo.advantage] type = "grpo" | "max_rl" | "opd" | "opsd" | "sft" | "echo" | "reward" | "custom"` — the advantage type names the algorithm (credit assignment + loss routing, fused), and each type's class defaults are its vetted setting; any other key you set is your own assembly (e.g. `[orchestrator.algo.advantage.roles.user] alpha = 0.1` for echo — setting any echo role replaces the whole role table). There is no preset layer. Per-env override: `[[orchestrator.train.env]]` `advantage = { type = "echo" }` (the env assembles its own algorithm). prime-rl only hosts the trainable policy; frozen models are inline external endpoints on the algorithm — `[orchestrator.algo.teacher]` (alias for `model`) with `name` + `base_url` folds into the slot the type declares (`advantage.model` for opd/opsd, `sampling.source` for sft). `model = "policy"` points a component at the live policy (opsd's default). See `docs/algorithms.md`.
+**Algorithms** — `[orchestrator.algo.advantage] type = "grpo" | "max_rl" | "opd" | "opsd" | "rlcsd" | "sft" | "echo" | "reward" | "custom"` — the advantage type names the algorithm (credit assignment + loss routing, fused), and each type's class defaults are its vetted setting; any other key you set is your own assembly (e.g. `[orchestrator.algo.advantage.roles.user] alpha = 0.1` for echo — setting any echo role replaces the whole role table). There is no preset layer. Per-env override: `[[orchestrator.train.env]]` `advantage = { type = "echo" }` (the env assembles its own algorithm). prime-rl only hosts the trainable policy; frozen models are inline external endpoints on the algorithm — `[orchestrator.algo.teacher]` (alias for `model`) with `name` + `base_url` folds into the slot the type declares (`advantage.model` for opd/opsd/rlcsd, `sampling.source` for sft). `model = "policy"` points a component at the live policy (the default for opsd and rlcsd). See `docs/algorithms.md`.
 
 **`BaseModel | None` fields** — bare flag enables defaults; nested override enables and sets:
 

diff --git a/src/prime_rl/orchestrator/algo/__init__.py b/src/prime_rl/orchestrator/algo/__init__.py
@@ -6,12 +6,12 @@
 :class:`~prime_rl.orchestrator.sampler.Sampler`):
 
 - one module per algorithm (``grpo``, ``echo``, ``max_rl``, ``opd``,
-  ``opsd``, ``sft``, ``reward``, ``custom``) — each named class owns its
-  two hooks (``assign_advantages`` / ``query_references``) and declares what
-  it needs (loss component, a "teacher", ...). One instance per env, built
-  by :func:`build_algorithm`. Custom credit assignment plugs in through the
-  ``custom`` advantage type (:class:`CustomAlgorithm` imports a user
-  function by path).
+  ``opsd``, ``rlcsd``, ``sft``, ``reward``, ``custom``) — each named class
+  owns its hooks (``observation_weights`` / ``assign_advantages`` /
+  ``score``) and declares what it needs (loss component, a "teacher", ...).
+  One instance per env, built by :func:`build_algorithm`. Custom credit
+  assignment plugs in through the ``custom`` advantage type
+  (:class:`CustomAlgorithm` imports a user function by path).
 - ``base`` — the :class:`Algorithm` base class and the pipeline phase
   functions (:func:`finalize_group` / :func:`finalize_batch`).
 - ``advantage`` — pure advantage math: the custom-function interface
@@ -45,6 +45,7 @@
 from prime_rl.orchestrator.algo.opd import OPDAlgorithm
 from prime_rl.orchestrator.algo.opsd import OPSDAlgorithm
 from prime_rl.orchestrator.algo.reward import RewardAlgorithm
+from prime_rl.orchestrator.algo.rlcsd import RLCSDAlgorithm
 from prime_rl.orchestrator.algo.routing import stamp_advantages, stamp_loss_routing
 from prime_rl.orchestrator.algo.sft import SFTDistillAlgorithm
 
@@ -62,6 +63,7 @@
     "max_rl": MaxRLAlgorithm,
     "opd": OPDAlgorithm,
     "opsd": OPSDAlgorithm,
+    "rlcsd": RLCSDAlgorithm,
     "sft": SFTDistillAlgorithm,
     "reward": RewardAlgorithm,
     "custom": CustomAlgorithm,
@@ -86,6 +88,7 @@ def build_algorithm(config: AlgorithmConfig, policy_pool: InferencePool, rendere
     "MaxRLAlgorithm",
     "OPDAlgorithm",
     "OPSDAlgorithm",
+    "RLCSDAlgorithm",
     "RewardAlgorithm",
     "SFTDistillAlgorithm",
     "apply_advantage_fn",