feat(algo): RLCSD — contrastive on-policy self-distillation #2788

Draft
hallerite wants to merge 9 commits into feat/algorithm-abstraction from feat/rlcsd-algorithm

@hallerite hallerite commented Jun 12, 2026

What

Implements RLCSD (arXiv:2606.11709, THU-BPM/Tongyi) as advantage type rlcsd, stacked on the algorithm abstraction (#2746).

The paper's diagnosis: OPSD's teacher-student gap concentrates on style tokens — a hinted model writes shorter, more direct text regardless of what the hint says — which destabilizes training and shrinks response length ("privilege-induced style drift"). The fix: contrast the teacher's distribution under a correct sibling-rollout hint against its mean distribution under K incorrect sibling hints, rendered with a byte-identical template so the style shift cancels in the subtraction. The cleaned signal modulates the GRPO advantage at high-signal tokens; the verifier keeps the update direction (sign-preserving clamp).

[orchestrator.algo.advantage]
type = "rlcsd"   # teacher defaults to the live policy — no extra deployment

How it maps onto the abstraction

One config class + one algorithm module; zero pipeline or trainer changes:

  • assign_advantages (group time): std-normalized group-relative credit (Eq. 8), broadcast per token. Uniform groups are dropped by the existing zero-advantage filter; the paper itself notes that this is equivalent to its group-discard rule.
  • query_references (ship time, batch survivors only): regroup by group_id, draw a correct hint and K incorrect hints from the rollout's siblings (self excluded — self-conditioning makes the teacher degenerately over-confident), render the hinted contexts through the policy renderer, prefill-score the rollout's tokens under each (1+K prefills per rollout, bounded by max_concurrent). Then per token: e_ctr = log p(pos) − log mean_k p(neg_k) (Eq. 7, log-mean-exp), r_t = λ·tanh(e_ctr/τ) (Eq. 9), mask |r_t| > δ (Eq. 10), sign-preserving clamp onto the per-token base advantages (Eq. 11) — overwriting the sample's advantages stream on the rl component. (Since the base merged the per-token-stream collapse in 039b400, the modulation composes with any base credit, not just uniform group-norm.)
  • Two-path normalization (Eq. 15) folds exactly into the advantage magnitudes: the clipped surrogate is positively homogeneous in A, so per-rollout weights L/|U| (plain path) and η·L/|M| (modulated path) reproduce the independent normalization without touching the trainer.
  • Rollouts whose group offers no contrast keep their plain group-norm stream (logged per batch).
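
As a rough illustration of the per-token math in the list above, the sketch below computes the group-normalized base credit and then the Eq. 7/9/10/11 modulation for a single rollout, using only the standard library. The function names, the `eps` term in the normalization, and in particular the exact form of the clamp (add r_t, then clip at zero so the sign of the base advantage never flips) are assumptions made for illustration; the PR's `_modulated_token_advantages` and the paper's Eq. 11 may differ in detail.

```python
import math

def group_norm_advantages(rewards, eps=1e-6):
    """Std-normalized group-relative credit; eps is an assumed stabilizer."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def log_mean_exp(xs):
    """log(mean(exp(x))), computed stably by factoring out the max."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def sign_preserving_add(base, r):
    """Assumed clamp: shift the base advantage by r but never cross zero."""
    if base > 0:
        return max(base + r, 0.0)
    if base < 0:
        return min(base + r, 0.0)
    return 0.0

def modulate(base_adv, pos_logprobs, neg_logprobs, lam=0.5, tau=0.02, delta=0.02):
    """Return (per-token advantages, modulation mask) for one rollout.

    base_adv:     per-token base advantages
    pos_logprobs: per-token teacher logprobs under the correct hint
    neg_logprobs: K lists of per-token teacher logprobs under incorrect hints
    """
    out, mask = [], []
    for t, base in enumerate(base_adv):
        # Contrast: log p(pos) - log mean_k p(neg_k)
        e_ctr = pos_logprobs[t] - log_mean_exp([neg[t] for neg in neg_logprobs])
        r = lam * math.tanh(e_ctr / tau)   # bounded modulation term
        keep = abs(r) > delta              # high-signal tokens only
        out.append(sign_preserving_add(base, r) if keep else base)
        mask.append(keep)
    return out, mask
```

With tau at 0.02 the tanh saturates for all but tiny contrasts, so under these assumptions a modulated token moves by close to ±lam and the verifier-derived base advantage still decides the sign.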

Knobs at the paper's defaults: num_negative_hints=4, tau=0.02, lam=0.5, delta=0.02, eta=1.0, plus the hint template and correct_threshold (generalizes the binary verifier to continuous rewards: reward >= threshold → correct hint pool). Requires the renderer; single-step trajectories only (like opsd).
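
The hint draw that `correct_threshold` and `num_negative_hints` control might look roughly like the sketch below. The function name `draw_hints`, the uniform sampling, and the choice to cap K at the number of available incorrect siblings are assumptions for illustration, not the PR's code; only the `reward >= threshold` rule and the self-exclusion come from the description above.

```python
import random

def draw_hints(rewards, idx, correct_threshold, num_negative_hints=4, rng=random):
    """Pick hint siblings for rollout `idx` within one group.

    Returns (positive_index, negative_indices), or None when the siblings
    offer no contrast (no correct or no incorrect sibling).
    """
    # The rollout never hints itself.
    siblings = [j for j in range(len(rewards)) if j != idx]
    correct = [j for j in siblings if rewards[j] >= correct_threshold]
    incorrect = [j for j in siblings if rewards[j] < correct_threshold]
    if not correct or not incorrect:
        return None
    positive = rng.choice(correct)
    # Assumption: sample without replacement, capped at the pool size.
    k = min(num_negative_hints, len(incorrect))
    return positive, rng.sample(incorrect, k)

# A threshold of 1.0 never partitions purely continuous rewards below 1.0,
# so every rollout in such a group falls back to its plain stream.
continuous = [0.3, 0.6, 0.7, 0.2]
fallbacks = [draw_hints(continuous, i, correct_threshold=1.0) for i in range(4)]
```

In this sketch a `None` result corresponds to the "no contrast" case, where the rollout keeps its plain group-norm advantages.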

Deliberate deviations from the paper

  1. Trust region: the paper wraps the modulated advantages in PPO clipping; here they feed the framework's rl component (DPPO + KL by default), which plays the same role. The paper's exact clip is reachable via [trainer.loss] type = "custom".
  2. Teacher refresh: the paper snapshots the student every 10 steps; model = "policy" (the default) refreshes the teacher on every weight update. A frozen endpoint gives a fixed teacher.
  3. Cross-rollout weighting: folding the path normalization into advantages makes rollouts contribute length-weighted under the global rl normalization (the paper averages rollouts unweighted) — same class of choice as DR-GRPO vs GRPO token weighting.
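
The folding argument behind deviation 3 can be checked numerically. The sketch below uses a standard PPO-style clipped surrogate purely for illustration (the framework's default rl component is DPPO + KL, which is not modeled here), and it reads L as the rollout's token count, U as its unmodulated tokens, and M as its modulated tokens; that reading and all function names are assumptions.

```python
import math

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * adv, clipped_ratio * adv)

def fold_two_path(adv, modulated, eta=1.0):
    """Scale advantages by L/|U| (plain) or eta*L/|M| (modulated)."""
    length = len(adv)
    n_mod = sum(1 for m in modulated if m)
    n_plain = length - n_mod
    folded = []
    for a, m in zip(adv, modulated):
        weight = eta * length / n_mod if m else length / n_plain
        folded.append(a * weight)
    return folded

ratios = [1.1, 0.7, 1.5, 0.9]
adv = [1.0, 1.5, 1.0, 0.4]
modulated = [False, True, False, True]
eta = 0.5

# Independent normalization: mean over plain tokens + eta * mean over modulated.
s = [clipped_surrogate(r, a) for r, a in zip(ratios, adv)]
two_path = (s[0] + s[2]) / 2 + eta * (s[1] + s[3]) / 2

# Folded version: one mean over all L tokens of the rescaled advantages.
folded = fold_two_path(adv, modulated, eta)
folded_mean = sum(clipped_surrogate(r, a) for r, a in zip(ratios, folded)) / len(adv)
```

Because the surrogate scales linearly with any positive factor on A, `folded_mean` matches `two_path`. The token sum for the rollout is therefore L times its two-path objective, which is why a global token normalization ends up weighting rollouts by length.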

Validation

  • Unit: log-mean-exp identity, two-path weights, sign-preserving clamp, mask threshold, no-trainable-tokens guard; rlcsd row in the vetted-defaults parametrization; full orchestrator+config suites green (227) incl. the debug TOML parse.
  • 20-step GPU runs on reverse-text (0.6B, group 16, teacher = live policy), wandb algo-debug:
    • correct_threshold at the binary default never partitions reverse-text's continuous rewards → the no-contrast degradation path verified end-to-end: trains as clean std-normalized GRPO (0.16 → 0.63 reward, eval 0.81), no crash. This run motivated the per-batch contrast-availability log.
    • At correct_threshold = 0.5: 98.9% of 2,544 shipped samples carry non-uniform per-token advantages (wire-verified; pre-dates the field rename to advantages), eval 0.095 → 0.81, stable throughout. Modulated-token fraction ~66% vs the paper's 20-30% — δ was calibrated on math at 1.7B-8B; on a toy task the contrast is small-but-everywhere. δ likely wants per-task calibration; the knob exists.

A 50-step run (rlcsd-50step, exit 0) added two findings:

  • The contrast self-anneals. Modulation rate by step: 99% → 100% → 96% → 84% → 35% → 48% (steps 0/10/20/30/40/49) as the pass rate climbed to 125/128 above threshold — groups go uniform-correct, the incorrect-hint pools empty, and the algorithm slides back to plain GRPO exactly when there is nothing left to contrast. No knob does this; it falls out of sourcing hints from the group.
  • No length collapse (the OPSD pathology the paper targets): completion length 114 → ~57 tokens vs pure GRPO's 115 → ~45 on the same task — the shortening is task-driven convergence, and RLCSD stays slightly longer than the baseline. Eval 0.099 → 0.834.

Gap-gated negatives A/B (min_contrast_gap, added in 02afca9 after discussion): negatives must be clearly wrong (reward < threshold − gap), the band between serves as neither hint; positives stay absolutely gated. 50-step A/B at gap 0.3 vs 0.0, modulation rate by step:

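| min_contrast_gap | step 0 | step 10 | step 20 | step 30 | step 40 | step 49 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 99% | 100% | 96% | 84% | 35% | 48% |
| 0.3 | 100% | 100% | 86% | 37% | 36% | 23% |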

Identical early signal (wide groups have plenty of clear failures), then a sharper, near-monotone anneal — the late-training noise tail (contrast against near-miss negatives) halves, eval unchanged (0.825 vs 0.834). The cut contrast was confirmed non-contributing, and fewer modulated rollouts also means fewer ship-time prefills as training saturates. Default stays 0.0 (= the paper exactly); on binary rewards the knob is a no-op by construction.
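
A minimal sketch of the gap-gated split, assuming strict `<` for negatives and `>=` for positives as stated above; the function name `split_pools` and the example rewards are made up for illustration.

```python
def split_pools(rewards, correct_threshold, min_contrast_gap=0.0):
    """Partition one group's rollouts into positive and negative hint pools.

    Rollouts in the band [threshold - gap, threshold) land in neither pool.
    """
    positives = [i for i, r in enumerate(rewards) if r >= correct_threshold]
    negatives = [
        i for i, r in enumerate(rewards)
        if r < correct_threshold - min_contrast_gap
    ]
    return positives, negatives

# Continuous rewards: the gap removes near-miss rollouts from the negatives.
rewards = [0.9, 0.45, 0.3, 0.1]
no_gap = split_pools(rewards, correct_threshold=0.5)
with_gap = split_pools(rewards, correct_threshold=0.5, min_contrast_gap=0.3)

# Binary rewards: a gap of 0.3 leaves both pools unchanged.
binary = [1.0, 0.0, 1.0, 0.0]
binary_no_gap = split_pools(binary, correct_threshold=1.0)
binary_with_gap = split_pools(binary, correct_threshold=1.0, min_contrast_gap=0.3)
```

Here `no_gap` keeps rollouts 1, 2 and 3 as negatives, while `with_gap` keeps only rollout 3. The positive pool is the same in both cases, and the two binary splits are identical.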

reverse-text can't show the paper's headline gains (it saturates; the wins are on hard reasoning with binary verifiers) — a real test wants a math env at ≥1.7B.

🤖 Generated with Claude Code

hallerite and others added 3 commits June 12, 2026 20:07
Implements RLCSD (arXiv:2606.11709) as advantage type 'rlcsd': GRPO
anchored by the verifier, with a contrastive self-distillation signal
modulating the advantage magnitude at high-signal tokens.

- assign: std-normalized group-relative scalars (paper Eq. 8).
- score (ship time, survivors only): each rollout's tokens are
  prefill-scored under the teacher conditioned on a correct sibling
  rollout and on K incorrect siblings (byte-identical hint template, so
  the privilege-induced style shift cancels in the subtraction);
  e_ctr = pos logprob - log mean prob over negatives (Eq. 7), squashed
  via lam*tanh(e/tau), masked at delta, added to the scalar advantage
  with a sign-preserving clamp (Eq. 11), and shipped as per-token
  advantages on the rl component.
- The paper's two-path independent normalization (Eq. 15) folds into
  the advantage magnitudes (the clipped surrogate is positively
  homogeneous in A), so the trainer stays untouched.
- Teacher defaults to the live policy (the paper snapshots the student
  every 10 steps; a frozen endpoint works for a fixed teacher).
- Groups without contrast (no correct or no incorrect sibling) keep
  plain GRPO scalars — logged per batch; uniform groups die in the
  zero-advantage filter, matching the paper's group-discard rule.
- correct_threshold generalizes the binary verifier to continuous
  rewards (reward >= threshold -> correct hint pool).

Knobs (paper defaults): num_negative_hints=4, tau=0.02, lam=0.5,
delta=0.02, eta=1.0, template, max_concurrent=32. Requires the
renderer (hinted prefixes are rendered client-side); single-step
trajectories only, like opsd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

min_contrast_gap adds an exclusion band below correct_threshold:
negative hints need reward < threshold - gap, so borderline rollouts
never serve as wrong hints and near-threshold noise stops producing
contrast as groups tighten. 0.0 (the default) reproduces the plain
threshold split exactly; on binary rewards any gap in (0, 1] is
equivalent to it. The positive pool stays absolutely gated — a hint
presented to the teacher as correct must be verified correct.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… to end

RLCSD adapted to the stream API: _std_norm_advantage_fn broadcasts via
inputs.broadcast, _modulated_token_advantages takes the per-token base
stream (composes with any base credit, not just uniform group-norm), and
score overwrites sample.advantages directly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/algorithm-abstraction branch from ca2f729 to 039b400 Compare June 12, 2026 22:24
@hallerite hallerite force-pushed the feat/rlcsd-algorithm branch from 3352e1e to c8b3ae8 Compare June 12, 2026 22:24
hallerite and others added 6 commits June 12, 2026 23:19
…tage, ...)

RLCSD adapted: constructed from its advantage config, owns
self.teacher_pool via the setup()/connect() pattern like opd/opsd.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

RLCSD adapted: assign -> assign_advantages, apply_advantage_fn helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

RLCSD adapted: score -> query_references.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

RLCSD adapted: _std_norm_advantage_fn takes TrainRollouts, broadcast helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>