feat(algo): RLCSD — contrastive on-policy self-distillation by hallerite · Pull Request #2788 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-12T20:07:48Z

What

Implements RLCSD (arXiv:2606.11709, THU-BPM/Tongyi) as advantage type rlcsd, stacked on the algorithm abstraction (#2746).

The paper's diagnosis: OPSD's teacher-student gap concentrates on style tokens — a hinted model writes shorter, more direct text regardless of what the hint says — which destabilizes training and shrinks response length ("privilege-induced style drift"). The fix: contrast the teacher's distribution under a correct sibling-rollout hint against its mean distribution under K incorrect sibling hints, rendered with a byte-identical template so the style shift cancels in the subtraction. The cleaned signal modulates the GRPO advantage at high-signal tokens; the verifier keeps the update direction (sign-preserving clamp).

[orchestrator.algo.advantage]
type = "rlcsd"   # teacher defaults to the live policy — no extra deployment
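
One way to read the cancellation claim above, as an idealized sketch rather than the paper's own derivation: assume the teacher's per-token log-probability under a hint $h$ splits into a hint-dependent content term $c_t(h)$ and a style shift $s_t$ that is the same for every hint rendered with the same template. The symbols $p_T$, $c_t$, $s_t$, $h^{+}$ and $h^{-}_k$ are notation introduced here for illustration.

$$
\log p_T(y_t \mid h) \approx c_t(h) + s_t
$$

$$
e^{\mathrm{ctr}}_t
= \log p_T(y_t \mid h^{+}) - \log \frac{1}{K}\sum_{k=1}^{K} p_T(y_t \mid h^{-}_k)
= c_t(h^{+}) + s_t - \Big(s_t + \log \frac{1}{K}\sum_{k=1}^{K} e^{c_t(h^{-}_k)}\Big)
= c_t(h^{+}) - \log \frac{1}{K}\sum_{k=1}^{K} e^{c_t(h^{-}_k)}
$$

Under that assumption the shared $s_t$ drops out exactly, which is why the template has to be byte-identical across the correct and incorrect hints: any template difference would make $s_t$ hint-dependent and leave a residue in the contrast.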

How it maps onto the abstraction

One config class + one algorithm module; zero pipeline or trainer changes:

assign_advantages (group time): std-normalized group-relative credit (Eq. 8), broadcast per token. Uniform groups die in the existing zero-advantage filter; the paper itself notes that this is equivalent to its group-discard rule.
query_references (ship time, batch survivors only): regroup by group_id, draw one correct hint and K incorrect hints from the rollout's siblings (self excluded, because self-conditioning makes the teacher degenerately over-confident), render the hinted contexts through the policy renderer, and prefill-score the rollout's tokens under each (1+K prefills per rollout, bounded by max_concurrent). Then, per token: e_ctr = log p(pos) − log mean_k p(neg_k) (Eq. 7, log-mean-exp); r_t = λ·tanh(e_ctr/τ) (Eq. 9); mask |r_t| > δ (Eq. 10); sign-preserving clamp onto the per-token base advantages (Eq. 11), overwriting the sample's advantages stream on the rl component. (Since the base merged the per-token-stream collapse in 039b400, the modulation composes with any base credit, not just uniform group-norm.)
Two-path normalization (Eq. 15) folds exactly into the advantage magnitudes: the clipped surrogate is positively homogeneous in A, so per-rollout weights L/|U| (plain path) and η·L/|M| (modulated path) reproduce the independent normalization without touching the trainer.
Rollouts whose group offers no contrast keep their plain group-norm stream (logged per batch).
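
The two hooks above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in this PR: the function names, the population standard deviation, the `eps` stabilizer, and the exact form of the clamp (add `r_t`, then stop at zero rather than cross it) are all assumptions. The commit message only says the signal is "added to the scalar advantage with a sign-preserving clamp".

```python
import math
from statistics import mean, pstdev


def std_norm_advantages(rewards, eps=1e-6):
    """Group-relative, std-normalized scalar advantage per rollout (cf. Eq. 8).

    Assumption: population std plus a small eps; the PR may differ.
    A uniform group yields all-zero advantages and is filtered downstream.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def log_mean_exp(xs):
    """Numerically stable log of the mean of exp(x) over xs."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def contrast_evidence(pos_logprob, neg_logprobs):
    """e_ctr = log p(pos) - log mean_k p(neg_k) for one token (cf. Eq. 7)."""
    return pos_logprob - log_mean_exp(neg_logprobs)


def modulate(base_adv, e_ctr, tau=0.02, lam=0.5, delta=0.02):
    """Squash, mask, and clamp one token's advantage (cf. Eqs. 9-11).

    Assumed clamp: the modulated value may shrink to zero but never
    changes sign, so the verifier keeps the update direction.
    """
    r = lam * math.tanh(e_ctr / tau)
    if abs(r) <= delta:          # low-signal token: keep the plain advantage
        return base_adv
    out = base_adv + r
    if base_adv > 0:
        return max(out, 0.0)
    if base_adv < 0:
        return min(out, 0.0)
    return base_adv              # zero base advantage carries no direction
```

For example, `modulate(0.3, -1.0)` saturates the tanh at `r = -0.5`, so the raw sum would be `-0.2`; the clamp returns `0.0` instead of flipping the sign.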

Knobs at the paper's defaults: num_negative_hints=4, tau=0.02, lam=0.5, delta=0.02, eta=1.0, plus the hint template and correct_threshold (generalizes the binary verifier to continuous rewards: reward >= threshold → correct hint pool). Requires the renderer; single-step trajectories only (like opsd).

Deliberate deviations from the paper

Trust region: the paper wraps the modulated advantages in PPO clipping; here they feed the framework's rl component (DPPO + KL by default), which plays the same role. The paper's exact clip is reachable via [trainer.loss] type = "custom".
Teacher refresh: the paper snapshots the student every 10 steps; model = "policy" (the default) refreshes every weight update. A frozen endpoint gives a fixed teacher.
Cross-rollout weighting: folding the path normalization into advantages makes rollouts contribute length-weighted under the global rl normalization (the paper averages rollouts unweighted) — same class of choice as DR-GRPO vs GRPO token weighting.
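
The folding argument rests on one property: for any weight w > 0, the clipped surrogate satisfies f(ρ, w·A) = w·f(ρ, A). The sketch below checks that numerically for a single rollout. It assumes L is the rollout's trainable-token count and U and M are its unmodulated and modulated token sets; that is a reading of the description, not something confirmed against the code. The helper names and the clip range `eps=0.2` are hypothetical.

```python
def clipped_surrogate(ratio, adv, eps=0.2):
    """Standard PPO-style clipped objective for one token."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped_ratio * adv)


def fold_path_weights(advs, modulated, eta=1.0):
    """Scale each token's advantage by L/|U| (plain) or eta*L/|M| (modulated)."""
    L = len(advs)
    n_mod = sum(1 for m in modulated if m)
    n_plain = L - n_mod
    out = []
    for a, m in zip(advs, modulated):
        weight = eta * L / n_mod if m else L / n_plain
        out.append(a * weight)
    return out


def mean_surrogate(ratios, advs):
    """Single token-mean over the whole rollout (what an untouched trainer does)."""
    return sum(clipped_surrogate(r, a) for r, a in zip(ratios, advs)) / len(advs)


def independent_two_path(ratios, advs, modulated, eta=1.0):
    """Each path normalized by its own token count, then combined."""
    plain = [clipped_surrogate(r, a) for r, a, m in zip(ratios, advs, modulated) if not m]
    mod = [clipped_surrogate(r, a) for r, a, m in zip(ratios, advs, modulated) if m]
    total = 0.0
    if plain:
        total += sum(plain) / len(plain)
    if mod:
        total += eta * sum(mod) / len(mod)
    return total


ratios = [0.7, 1.0, 1.3, 1.5]
advs = [1.0, 1.2, -0.8, -1.0]
modulated = [False, True, True, False]
folded = mean_surrogate(ratios, fold_path_weights(advs, modulated, eta=0.5))
reference = independent_two_path(ratios, advs, modulated, eta=0.5)
```

Here `folded` and `reference` agree up to floating-point error. Across rollouts the global rl normalization divides by the total token count rather than per rollout, which is where the length-weighting noted above comes from.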

Validation

Unit: log-mean-exp identity, two-path weights, sign-preserving clamp, mask threshold, no-trainable-tokens guard; rlcsd row in the vetted-defaults parametrization; full orchestrator+config suites green (227) incl. the debug TOML parse.
20-step GPU runs on reverse-text (0.6B, group 16, teacher = live policy), wandb algo-debug:
- correct_threshold at the binary default never partitions reverse-text's continuous rewards → the no-contrast degradation path verified end-to-end: trains as clean std-normalized GRPO (0.16 → 0.63 reward, eval 0.81), no crash. This run motivated the per-batch contrast-availability log.
- At correct_threshold = 0.5: 98.9% of 2,544 shipped samples carry non-uniform per-token advantages (wire-verified; pre-dates the field rename to advantages), eval 0.095 → 0.81, stable throughout. Modulated-token fraction ~66% vs the paper's 20-30% — δ was calibrated on math at 1.7B-8B; on a toy task the contrast is small-but-everywhere. δ likely wants per-task calibration; the knob exists.

A 50-step run (rlcsd-50step, exit 0) added two findings:

The contrast self-anneals. Modulation rate by step: 99% → 100% → 96% → 84% → 35% → 48% (steps 0/10/20/30/40/49) as the pass rate climbed to 125/128 above threshold — groups go uniform-correct, the incorrect-hint pools empty, and the algorithm slides back to plain GRPO exactly when there is nothing left to contrast. No knob does this; it falls out of sourcing hints from the group.
No length collapse (the OPSD pathology the paper targets): completion length 114 → ~57 tokens vs pure GRPO's 115 → ~45 on the same task — the shortening is task-driven convergence, and RLCSD stays slightly longer than the baseline. Eval 0.099 → 0.834.

Gap-gated negatives A/B (min_contrast_gap, added in 02afca9 after discussion): negatives must be clearly wrong (reward < threshold − gap), the band between serves as neither hint; positives stay absolutely gated. 50-step A/B at gap 0.3 vs 0.0, modulation rate by step:

| step | 0 | 10 | 20 | 30 | 40 | 49 |
|---|---|---|---|---|---|---|
| gap 0.0 | 99% | 100% | 96% | 84% | 35% | 48% |
| gap 0.3 | 100% | 100% | 86% | 37% | 36% | 23% |

Identical early signal (wide groups have plenty of clear failures), then a sharper, near-monotone anneal — the late-training noise tail (contrast against near-miss negatives) halves, eval unchanged (0.825 vs 0.834). The cut contrast was confirmed non-contributing, and fewer modulated rollouts also means fewer ship-time prefills as training saturates. Default stays 0.0 (= the paper exactly); on binary rewards the knob is a no-op by construction.
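
A minimal sketch of the pool split that correct_threshold and min_contrast_gap describe. The function name and signature are hypothetical; only the two inequalities (reward >= threshold for positives, reward < threshold − gap for negatives) and the self-exclusion come from the description. It also shows why the contrast self-anneals: once every sibling is above threshold, the negative pool is empty and the rollout keeps its plain advantage.

```python
def split_hint_pools(rewards, self_index, correct_threshold, min_contrast_gap=0.0):
    """Partition a rollout's siblings into correct and incorrect hint pools.

    Returns (positives, negatives) as lists of sibling indices.
    Siblings with threshold - gap <= reward < threshold serve as neither hint.
    """
    positives, negatives = [], []
    for i, reward in enumerate(rewards):
        if i == self_index:
            continue  # a rollout never hints itself
        if reward >= correct_threshold:
            positives.append(i)
        elif reward < correct_threshold - min_contrast_gap:
            negatives.append(i)
    return positives, negatives


def has_contrast(positives, negatives):
    """Contrast needs at least one hint on each side."""
    return bool(positives) and bool(negatives)


group = [0.9, 0.6, 0.45, 0.1, 0.55]
plain = split_hint_pools(group, self_index=0, correct_threshold=0.5)
gated = split_hint_pools(group, self_index=0, correct_threshold=0.5, min_contrast_gap=0.3)
```

With the gap at 0.0 the near-miss sibling at 0.45 counts as a negative; at 0.3 only the clear failure at 0.1 does, while the positive pool is unchanged.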

reverse-text can't show the paper's headline gains (it saturates; the wins are on hard reasoning with binary verifiers) — a real test wants a math env at ≥1.7B.

🤖 Generated with Claude Code

Implements RLCSD (arXiv:2606.11709) as advantage type 'rlcsd': GRPO anchored by the verifier, with a contrastive self-distillation signal modulating the advantage magnitude at high-signal tokens.

- assign: std-normalized group-relative scalars (paper Eq. 8).
- score (ship time, survivors only): each rollout's tokens are prefill-scored under the teacher conditioned on a correct sibling rollout and on K incorrect siblings (byte-identical hint template, so the privilege-induced style shift cancels in the subtraction); e_ctr = pos logprob - log mean prob over negatives (Eq. 7), squashed via lam*tanh(e/tau), masked at delta, added to the scalar advantage with a sign-preserving clamp (Eq. 11), and shipped as per-token advantages on the rl component.
- The paper's two-path independent normalization (Eq. 15) folds into the advantage magnitudes (the clipped surrogate is positively homogeneous in A), so the trainer stays untouched.
- Teacher defaults to the live policy (the paper snapshots the student every 10 steps; a frozen endpoint works for a fixed teacher).
- Groups without contrast (no correct or no incorrect sibling) keep plain GRPO scalars, logged per batch; uniform groups die in the zero-advantage filter, matching the paper's group-discard rule.
- correct_threshold generalizes the binary verifier to continuous rewards (reward >= threshold -> correct hint pool).

Knobs (paper defaults): num_negative_hints=4, tau=0.02, lam=0.5, delta=0.02, eta=1.0, template, max_concurrent=32. Requires the renderer (hinted prefixes are rendered client-side); single-step trajectories only, like opsd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

min_contrast_gap adds an exclusion band below correct_threshold: negative hints need reward < threshold - gap, so borderline rollouts never serve as wrong hints and near-threshold noise stops producing contrast as groups tighten. 0.0 (the default) reproduces the plain threshold split exactly; on binary rewards any gap in (0, 1] is equivalent to it. The positive pool stays absolutely gated — a hint presented to the teacher as correct must be verified correct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… to end RLCSD adapted to the stream API: _std_norm_advantage_fn broadcasts via inputs.broadcast, _modulated_token_advantages takes the per-token base stream (composes with any base credit, not just uniform group-norm), and score overwrites sample.advantages directly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hallerite and others added 3 commits June 12, 2026 20:07

hallerite force-pushed the feat/algorithm-abstraction branch from ca2f729 to 039b400 Compare June 12, 2026 22:24

hallerite force-pushed the feat/rlcsd-algorithm branch from 3352e1e to c8b3ae8 Compare June 12, 2026 22:24

hallerite and others added 6 commits June 12, 2026 23:19

Merge feat/algorithm-abstraction: renderer decouple + Algorithm(advan…

25581c4

…tage, ...) RLCSD adapted: constructed from its advantage config, owns self.teacher_pool via the setup()/connect() pattern like opd/opsd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge feat/algorithm-abstraction: hooks-only class, phase functions

10d8af3

RLCSD adapted: assign -> assign_advantages, apply_advantage_fn helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge feat/algorithm-abstraction: two hooks pinned by the filter barrier

614996e

RLCSD adapted: score -> query_references. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'feat/algorithm-abstraction' into feat/rlcsd-algorithm

c3bf7ec

Merge branch 'feat/algorithm-abstraction' into feat/rlcsd-algorithm

0a58791

Merge feat/algorithm-abstraction: one rollout vocabulary, finalize_batch

8997a87

RLCSD adapted: _std_norm_advantage_fn takes TrainRollouts, broadcast helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hallerite wants to merge 9 commits into feat/algorithm-abstraction from feat/rlcsd-algorithm
