feat(orchestrator): MaxRL advantage strategy (stacked on #2746) #2778
Merged
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the implicit success probability instead of pass@1: the policy gradient averaged over successful rollouts only is unbiased for the order-group_size truncation of the ML objective's pass@k expansion. In estimator form that is one change to GRPO: normalize the centered group reward by the group MEAN instead of the standard deviation, upweighting low-pass-rate examples like 1/p. group_size becomes the truncation order (REINFORCE at 1, exact ML in the limit).

New 'max_rl' advantage type + preset: MaxRLAdvantageConfig, max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs rows, and a unit test for the estimator. Groups with zero mean reward carry zero advantages (the paper's no-success convention; the zero-advantage filter drops them). Everything else rides the existing GRPO path: policy sampling (enforced by the rl-component guard), rl loss component, group barrier.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Adds MaxRL (arXiv:2602.02710, Tajwar & Zeng et al.) as an advantage strategy and preset.
MaxRL's observation: RL on expected reward optimizes only the first-order term of the maximum-likelihood objective over the model's implicit success probability. The ML objective expands as a harmonic mixture of pass@k gradients ($\nabla J_{ML} = \sum_k \frac{1}{k}\nabla\text{pass@k}$); truncating at order $T$ gives a compute-indexed family interpolating from REINFORCE ($T{=}1$) to exact ML ($T{\to}\infty$). Their Theorem 2: averaging score functions over successful rollouts only is an unbiased estimator of the order-$T{=}N$ truncation, where $N$ is `group_size`. With a control variate, the whole method reduces to a one-line change to the GRPO advantage:

$$A_i = \frac{r_i - \bar{r}}{\bar{r}}, \qquad \bar{r} = \frac{1}{N}\sum_{j=1}^{N} r_j$$

that is, the centered reward normalized by the group mean instead of the standard deviation. The population-level weighting becomes $w(p) = \frac{1-(1-p)^T}{p}$ (vs GRPO's $\frac{1}{\sqrt{p(1-p)}}$): hard, low-pass-rate prompts get ~$1/p$ weight, and unlike GRPO the weight doesn't invert as $p \to 1$.
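The weighting claim can be checked numerically. The sketch below is illustrative only and not part of the PR: it assumes pass@k $= 1-(1-p)^k$ for a prompt with per-sample success probability $p$, so each term $\frac{1}{k}\,\frac{d}{dp}\text{pass@k}$ equals $(1-p)^{k-1}$ and the order-$T$ sum collapses to $w(p)$. All function names are hypothetical.

```python
import math


def maxrl_weight(p: float, T: int) -> float:
    """Closed form w(p) = (1 - (1 - p)^T) / p, for 0 < p <= 1 and T >= 1."""
    if not 0.0 < p <= 1.0 or T < 1:
        raise ValueError("need 0 < p <= 1 and T >= 1")
    return (1.0 - (1.0 - p) ** T) / p


def truncated_harmonic_weight(p: float, T: int) -> float:
    """Sum over k = 1..T of (1/k) * d/dp pass@k.

    Assumes pass@k = 1 - (1 - p)^k, whose derivative is k * (1 - p)^(k - 1),
    so each harmonic term reduces to (1 - p)^(k - 1).
    """
    return sum((1.0 - p) ** (k - 1) for k in range(1, T + 1))


def grpo_weight(p: float) -> float:
    """GRPO's population-level weight 1 / sqrt(p * (1 - p)), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("need 0 < p < 1")
    return 1.0 / math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    T = 8  # plays the role of group_size (the truncation order)
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(p, maxrl_weight(p, T), truncated_harmonic_weight(p, T), grpo_weight(p))
```

At $T = 1$ the weight is constant in $p$ (REINFORCE), and as $T$ grows it approaches $1/p$. GRPO's weight, by contrast, grows again for $p$ near 1.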
`group_size` is the truncation order: more rollouts per prompt improve the objective, not just the variance.

How

- `MaxRLAdvantageConfig` (`type = "max_rl"`, `group_relative = True`, rl component) in the advantage union, plus a `max_rl` preset (delta from `grpo`: the advantage type swap).
- `max_rl_advantage_fn` in `algo/advantage.py`: the estimator, with the paper's no-success convention. A group with mean reward 0 gets zero advantages everywhere, which the existing zero-advantage filter then drops (all-success groups center to zero exactly like GRPO).
- `MaxRLAlgorithm` in `algo/algorithm.py`: `assign` calls the fn; everything else inherits.
- `configs/debug/algorithms/max_rl.toml` (reverse-text, mirrors `grpo.toml`) for smoke testing.

Existing machinery this composes with, for free: policy sampling enforced by the rl-component guard (frozen sources rejected, since importance ratios need policy logprobs), group barrier from `group_size`, DPPO trust region + KL regularizer in the rl loss, per-env mixing (a `max_rl` env can pack next to a `grpo` env in one batch).

Caveat for users: designed for non-negative (canonically binary) rewards; mean-normalization is meaningless for signed rewards. The paper's estimator drops both terms on no-success groups; we match that exactly.
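For reviewers who want the estimator's behavior in isolation, here is a minimal pure-Python sketch of the mean-normalized advantage with the no-success convention. It is an illustration under stated assumptions, not the code in `algo/advantage.py`: the name `max_rl_advantages_sketch` is hypothetical, it operates on a single group given as a plain list, and it assumes non-negative rewards as the caveat above requires.

```python
from typing import List, Sequence


def max_rl_advantages_sketch(rewards: Sequence[float]) -> List[float]:
    """Centered group reward divided by the group mean (hypothetical helper).

    Assumes non-negative rewards for one group of rollouts from one prompt.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # No-success convention: a group with zero mean reward gets zero
    # advantages everywhere (a downstream filter can then drop it).
    if mean == 0:
        return [0.0] * n
    # All-success groups center to zero here, since r - mean == 0 for every r.
    return [(r - mean) / mean for r in rewards]


if __name__ == "__main__":
    print(max_rl_advantages_sketch([1, 0, 0, 0]))  # [3.0, -1.0, -1.0, -1.0]
    print(max_rl_advantages_sketch([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
    print(max_rl_advantages_sketch([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
```

The single success in a group of four gets advantage 3 because the group mean is 0.25; a rarer success in a larger group would be weighted more heavily still, which is the ~$1/p$ upweighting in estimator form.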
Validation
`uv run pytest tests/unit/`: 482 passed, 6 skipped (includes the config-load sweep over the new TOML and the estimator unit test: `[1,0,0,0] → [3,-1,-1,-1]`, no-success → zeros, all-success → zeros).

🤖 Generated with Claude Code