feat(algo): RLCSD — contrastive on-policy self-distillation#2788
Draft
hallerite wants to merge 9 commits into
Implements RLCSD (arXiv:2606.11709) as advantage type 'rlcsd': GRPO anchored by the verifier, with a contrastive self-distillation signal modulating the advantage magnitude at high-signal tokens.

- assign: std-normalized group-relative scalars (paper Eq. 8).
- score (ship time, survivors only): each rollout's tokens are prefill-scored under the teacher conditioned on a correct sibling rollout and on K incorrect siblings (byte-identical hint template, so the privilege-induced style shift cancels in the subtraction); e_ctr = pos logprob - log mean prob over negatives (Eq. 7), squashed via lam*tanh(e/tau), masked at delta, added to the scalar advantage with a sign-preserving clamp (Eq. 11), and shipped as per-token advantages on the rl component.
- The paper's two-path independent normalization (Eq. 15) folds into the advantage magnitudes (the clipped surrogate is positively homogeneous in A), so the trainer stays untouched.
- Teacher defaults to the live policy (the paper snapshots the student every 10 steps; a frozen endpoint works for a fixed teacher).
- Groups without contrast (no correct or no incorrect sibling) keep plain GRPO scalars (logged per batch); uniform groups die in the zero-advantage filter, matching the paper's group-discard rule.
- correct_threshold generalizes the binary verifier to continuous rewards (reward >= threshold -> correct hint pool).

Knobs (paper defaults): num_negative_hints=4, tau=0.02, lam=0.5, delta=0.02, eta=1.0, template, max_concurrent=32.

Requires the renderer (hinted prefixes are rendered client-side); single-step trajectories only, like opsd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
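For reviewers who want the per-token arithmetic in one place, here is a minimal sketch of the assign and score steps this commit describes. It is not the code in this PR: the function names `std_norm`, `log_mean_exp` and `modulate` are hypothetical, the `eps` term and the use of the population standard deviation are assumptions, and the clamp is only one plausible reading of "sign-preserving" (the modulated value may shrink to zero but never crosses it). The Eq. 15 normalization and `eta` are left out.

```python
import math
from statistics import fmean, pstdev


def std_norm(rewards, eps=1e-6):
    """Group-relative, std-normalized scalar advantages (eps is an assumption)."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def log_mean_exp(xs):
    """Numerically stable log(mean(exp(x))), used over the K negative hints."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def modulate(base_adv, pos_logprobs, neg_logprobs, tau=0.02, lam=0.5, delta=0.02):
    """Return per-token advantages after contrastive modulation.

    base_adv and pos_logprobs hold one value per token;
    neg_logprobs holds K lists, one per incorrect hint.
    """
    out = []
    for t, a in enumerate(base_adv):
        # Contrast: log p(pos) - log mean_k p(neg_k)
        e_ctr = pos_logprobs[t] - log_mean_exp([neg[t] for neg in neg_logprobs])
        r = lam * math.tanh(e_ctr / tau)  # bounded in (-lam, lam)
        if abs(r) <= delta:
            out.append(a)  # low-signal token keeps the base advantage
            continue
        shifted = a + r
        # Assumed clamp: the verifier's sign is kept, so the value stops at zero.
        if a > 0:
            out.append(max(shifted, 0.0))
        elif a < 0:
            out.append(min(shifted, 0.0))
        else:
            out.append(0.0)
    return out
```

With base advantages `[1.0, 1.0, -0.2]`, positive-hint logprobs `[-1.0, -2.0, -0.5]` and two identical negative-hint rows `[-1.0, -1.0, -3.0]`, the sketch returns `[1.0, 0.5, 0.0]`. Token 0 has zero contrast and is left alone, token 1 is pulled down by `lam`, and token 2 would flip sign and is clamped to zero instead.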
min_contrast_gap adds an exclusion band below correct_threshold: negative hints need reward < threshold - gap, so borderline rollouts never serve as wrong hints and near-threshold noise stops producing contrast as groups tighten. 0.0 (the default) reproduces the plain threshold split exactly; on binary rewards any gap in (0, 1] is equivalent to it. The positive pool stays absolutely gated — a hint presented to the teacher as correct must be verified correct. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
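The pool split this commit describes can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `hint_pools` and `draw_hints` are hypothetical names, and drawing without replacement (capped at the pool size) when fewer than `num_negative_hints` negatives exist is a guess.

```python
import random


def hint_pools(rewards, self_idx, correct_threshold, min_contrast_gap=0.0):
    """Split a rollout's siblings into correct and clearly-wrong hint pools."""
    siblings = [(i, r) for i, r in enumerate(rewards) if i != self_idx]
    # Positives stay absolutely gated: they must clear the threshold itself.
    positives = [i for i, r in siblings if r >= correct_threshold]
    # Negatives must fall below the exclusion band under the threshold.
    negatives = [i for i, r in siblings if r < correct_threshold - min_contrast_gap]
    return positives, negatives


def draw_hints(rewards, self_idx, correct_threshold, min_contrast_gap=0.0,
               num_negative_hints=4, rng=random):
    """Return (positive index, negative indices), or None when there is no contrast."""
    positives, negatives = hint_pools(
        rewards, self_idx, correct_threshold, min_contrast_gap
    )
    if not positives or not negatives:
        return None  # the rollout keeps its plain GRPO advantage
    k = min(num_negative_hints, len(negatives))  # assumption: no replacement
    return rng.choice(positives), rng.sample(negatives, k)
```

For rewards `[0.9, 0.45, 0.1, 0.6, 0.3]`, rollout 0 and threshold 0.5, a gap of 0.0 gives negatives `[1, 2, 4]`, while a gap of 0.3 keeps only `[2]`; the positive pool is `[3]` in both cases. On binary rewards `[1, 0, 0, 1]` with threshold 1.0, a gap of 0.3 leaves both pools unchanged.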
… to end RLCSD adapted to the stream API: _std_norm_advantage_fn broadcasts via inputs.broadcast, _modulated_token_advantages takes the per-token base stream (composes with any base credit, not just uniform group-norm), and score overwrites sample.advantages directly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ca2f729 to 039b400
3352e1e to c8b3ae8
…tage, ...) RLCSD adapted: constructed from its advantage config, owns self.teacher_pool via the setup()/connect() pattern like opd/opsd. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RLCSD adapted: assign -> assign_advantages, apply_advantage_fn helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RLCSD adapted: score -> query_references. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RLCSD adapted: _std_norm_advantage_fn takes TrainRollouts, broadcast helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Implements RLCSD (arXiv:2606.11709, THU-BPM/Tongyi) as advantage type `rlcsd`, stacked on the algorithm abstraction (#2746).

The paper's diagnosis: OPSD's teacher-student gap concentrates on style tokens (a hinted model writes shorter, more direct text regardless of what the hint says), which destabilizes training and shrinks response length ("privilege-induced style drift"). The fix: contrast the teacher's distribution under a correct sibling-rollout hint against its mean distribution under K incorrect sibling hints, rendered with a byte-identical template so the style shift cancels in the subtraction. The cleaned signal modulates the GRPO advantage at high-signal tokens; the verifier keeps the update direction (sign-preserving clamp).
How it maps onto the abstraction
One config class + one algorithm module; zero pipeline or trainer changes:
- `assign_advantages` (group time): std-normalized group-relative credit (Eq. 8), broadcast per token. Uniform groups die in the existing zero-advantage filter; the paper itself notes that this is equivalent to its group-discard rule.
- `query_references` (ship time, batch survivors only): regroup by `group_id`, draw a correct hint and K incorrect hints from the rollout's siblings (self excluded, because self-conditioning makes the teacher degenerately over-confident), render the hinted contexts through the policy renderer, and prefill-score the rollout's tokens under each (1+K prefills per rollout, bounded by `max_concurrent`). Then per token: `e_ctr = log p(pos) − log mean_k p(neg_k)` (Eq. 7, log-mean-exp), `r_t = λ·tanh(e_ctr/τ)` (Eq. 9), mask `|r_t| > δ` (Eq. 10), sign-preserving clamp onto the per-token base advantages (Eq. 11), overwriting the sample's `advantages` stream on the `rl` component. (Since the base merged the per-token-stream collapse in 039b400, the modulation composes with any base credit, not just uniform group-norm.)
- The paper's two-path independent normalization (Eq. 15) folds into the advantage magnitudes: `L/|U|` (plain path) and `η·L/|M|` (modulated path) reproduce the independent normalization without touching the trainer.

Knobs at the paper's defaults: `num_negative_hints=4`, `tau=0.02`, `lam=0.5`, `delta=0.02`, `eta=1.0`, plus the hint `template` and `correct_threshold` (generalizes the binary verifier to continuous rewards: `reward >= threshold` → correct hint pool). Requires the renderer; single-step trajectories only (like `opsd`).

Deliberate deviations from the paper

- Uses the `rl` component (DPPO + KL by default), which plays the same role. The paper's exact clip is reachable via `[trainer.loss] type = "custom"`.
- Teacher: the paper snapshots the student every 10 steps; `model = "policy"` (the default) refreshes every weight update. A frozen endpoint gives a fixed teacher.

Validation
- `rlcsd` row in the vetted-defaults parametrization; full orchestrator+config suites green (227) incl. the debug TOML parse.
- `algo-debug`: `correct_threshold` at the binary default never partitions reverse-text's continuous rewards → the no-contrast degradation path verified end-to-end: trains as clean std-normalized GRPO (0.16 → 0.63 reward, eval 0.81), no crash. This run motivated the per-batch contrast-availability log.
- `correct_threshold = 0.5`: 98.9% of 2,544 shipped samples carry non-uniform per-token advantages (wire-verified; pre-dates the field rename to `advantages`), eval 0.095 → 0.81, stable throughout. Modulated-token fraction is ~66% vs the paper's 20-30%: δ was calibrated on math at 1.7B-8B, and on a toy task the contrast is small-but-everywhere. δ likely wants per-task calibration; the knob exists.

A 50-step run (`rlcsd-50step`, exit 0) added two findings:

Gap-gated negatives A/B (`min_contrast_gap`, added in 02afca9 after discussion): negatives must be clearly wrong (`reward < threshold − gap`), rollouts in the band between serve as neither kind of hint, and positives stay absolutely gated. 50-step A/B at gap 0.3 vs 0.0, modulation rate by step:

Identical early signal (wide groups have plenty of clear failures), then a sharper, near-monotone anneal: the late-training noise tail (contrast against near-miss negatives) halves, with eval unchanged (0.825 vs 0.834). The cut contrast was confirmed non-contributing, and fewer modulated rollouts also means fewer ship-time prefills as training saturates. Default stays 0.0 (= the paper exactly); on binary rewards the knob is a no-op by construction.
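A small arithmetic sketch of why the mask fires so often at the default knobs. It assumes the modulated-token fraction is the share of scored tokens with `|r_t| > δ`, where `r_t = λ·tanh(e_ctr/τ)` as stated above; `contrast_threshold` and `modulated_fraction` are hypothetical helper names, not functions from this PR.

```python
import math


def contrast_threshold(tau=0.02, lam=0.5, delta=0.02):
    """Smallest |e_ctr| (in nats) that passes the mask |lam*tanh(e/tau)| > delta."""
    return tau * math.atanh(delta / lam)


def modulated_fraction(e_ctr_per_rollout, tau=0.02, lam=0.5, delta=0.02):
    """Share of tokens, across all rollouts, whose modulation passes the mask."""
    total = modulated = 0
    for e_ctr in e_ctr_per_rollout:
        for e in e_ctr:
            total += 1
            if abs(lam * math.tanh(e / tau)) > delta:
                modulated += 1
    return modulated / total if total else 0.0
```

At `tau=0.02, lam=0.5, delta=0.02` the threshold works out to roughly 0.0008 nats of contrast, so even a very small but consistent teacher preference counts as modulated. That is in line with the small-but-everywhere reading of the 66% figure, and it suggests that raising `delta` or `tau` are the levers for per-task calibration.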
reverse-text can't show the paper's headline gains (it saturates; the wins are on hard reasoning with binary verifiers) — a real test wants a math env at ≥1.7B.
🤖 Generated with Claude Code