feat(orchestrator): MaxRL advantage strategy (stacked on #2746)#2778

Merged
hallerite merged 9 commits into feat/algorithm-abstraction from feat/maxrl-advantage on Jun 12, 2026

@hallerite

Member

Stacked on #2746 (feat/algorithm-abstraction) — and deliberately so: this PR is the expressiveness test for that abstraction. A brand-new advantage estimator from a February 2026 paper lands as one advantage function + one config union member + one small algorithm class, with zero changes to the sampler, dispatcher, wire, or trainer.

What

Adds MaxRL (arXiv:2602.02710, Tajwar & Zeng et al.) as an advantage strategy and preset:

```toml
[orchestrator.algo]
name = "max_rl"
```

MaxRL's observation: RL on expected reward optimizes only the first-order term of the maximum-likelihood objective over the model's implicit success probability. The ML objective expands as a harmonic mixture of pass@k gradients ($\nabla J_{ML} = \sum_k \frac{1}{k}\nabla\text{pass@k}$); truncating at order $T$ gives a compute-indexed family interpolating REINFORCE ($T{=}1$) → exact ML ($T{\to}\infty$). Their Theorem 2: averaging score functions over successful rollouts only is an unbiased estimator of the order-$T{=}N$ truncation. With a control variate, the whole method reduces to a one-line change to the GRPO advantage:

$$A_j = \frac{r_j - \bar{r}}{\bar{r}} \qquad (\text{zero when } \bar{r} = 0)$$

— centered reward normalized by the group mean instead of the standard deviation. Population-level weighting becomes $w(p) = \frac{1-(1-p)^T}{p}$ (vs GRPO's $\frac{1}{\sqrt{p(1-p)}}$): hard, low-pass-rate prompts get ~$1/p$ weight, and unlike GRPO the weight doesn't invert as $p \to 1$. group_size is the truncation order — more rollouts per prompt improves the objective, not just the variance.
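To make the weighting comparison concrete, here is a minimal sketch that evaluates the two closed forms quoted above. The function names `maxrl_weight` and `grpo_weight` are illustrative only and are not part of this PR; the GRPO form assumes binary rewards, so that the group standard deviation is $\sqrt{p(1-p)}$.

```python
import math


def maxrl_weight(p: float, T: int) -> float:
    """w(p) = (1 - (1 - p)^T) / p for pass rate 0 < p <= 1 and truncation order T >= 1."""
    return (1.0 - (1.0 - p) ** T) / p


def grpo_weight(p: float) -> float:
    """1 / sqrt(p * (1 - p)) for 0 < p < 1 (binary-reward assumption)."""
    return 1.0 / math.sqrt(p * (1.0 - p))


if __name__ == "__main__":
    # Tabulate both weights across pass rates for two truncation orders.
    for p in (0.01, 0.1, 0.5, 0.9, 0.99):
        print(
            f"p={p:<5} maxrl(T=1)={maxrl_weight(p, 1):.3f} "
            f"maxrl(T=16)={maxrl_weight(p, 16):.3f} grpo={grpo_weight(p):.3f}"
        )
```

At $T{=}1$ the MaxRL weight is 1 for every $p$ (the REINFORCE end of the family), for large $T$ it approaches $1/p$, and it stays close to 1 as $p \to 1$, where the GRPO weight grows without bound.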

How

  • MaxRLAdvantageConfig (type = "max_rl", group_relative = True, rl component) in the advantage union + a max_rl preset (delta from grpo: the advantage type swap).
  • max_rl_advantage_fn in algo/advantage.py — the estimator, with the paper's no-success convention: a group with mean reward 0 gets zero advantages everywhere, which the existing zero-advantage filter then drops (all-success groups center to zero exactly like GRPO).
  • MaxRLAlgorithm in algo/algorithm.py: assign calls the fn; everything else inherits.
  • configs/debug/algorithms/max_rl.toml (reverse-text, mirrors grpo.toml) for smoke testing.
  • Docs rows in the preset + class tables; one unit test for the estimator.
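The estimator described in the list above can be sketched in a few lines. This is an illustration of the formula and the no-success convention for a single group, not the actual `max_rl_advantage_fn` from `algo/advantage.py`, whose signature and batching are not shown in this PR description; the name `max_rl_advantages_sketch` is hypothetical.

```python
from typing import Sequence


def max_rl_advantages_sketch(rewards: Sequence[float]) -> list[float]:
    """Mean-normalized advantages for one prompt's rollouts (illustrative only).

    Assumes non-negative rewards, so a zero mean implies no successful rollout.
    """
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    # No-success convention: a group with mean reward 0 gets zero advantages.
    if mean == 0:
        return [0.0 for _ in rewards]
    # Centered reward divided by the group mean (GRPO divides by the std instead).
    return [(r - mean) / mean for r in rewards]


if __name__ == "__main__":
    print(max_rl_advantages_sketch([1, 0, 0, 0]))  # one success in four
    print(max_rl_advantages_sketch([0, 0, 0, 0]))  # no success
    print(max_rl_advantages_sketch([1, 1, 1, 1]))  # all success
```

The three calls mirror the unit-test cases listed under Validation: one success in four, a no-success group, and an all-success group.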

Existing machinery this composes with, for free: policy sampling enforced by the rl-component guard (frozen sources rejected — importance ratios need policy logprobs), group barrier from group_size, DPPO trust region + KL regularizer in the rl loss, per-env mixing (a max_rl env can pack next to a grpo env in one batch).

Caveat for users: designed for non-negative (canonically binary) rewards — mean-normalization is meaningless for signed rewards. The paper's estimator drops both terms on no-success groups; we match that exactly.
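A small worked example of why signed rewards break the estimator, assuming a hypothetical ±1 reward scheme (not something this PR configures). The helper `mean_normalized` is illustrative and simply applies $(r_j - \bar{r})/\bar{r}$ without any guard.

```python
def mean_normalized(rewards: list[float]) -> list[float]:
    """Apply (r - mean) / mean with no guard; illustrative only, requires mean != 0."""
    mean = sum(rewards) / len(rewards)
    return [(r - mean) / mean for r in rewards]


if __name__ == "__main__":
    # One success (+1) and three failures (-1): the group mean is negative,
    # so dividing by it flips the sign of every centered reward.
    print(mean_normalized([1.0, -1.0, -1.0, -1.0]))
```

With a negative group mean the single successful rollout receives the negative advantage and the failures receive positive ones, which is the opposite of the intended update. A group such as `[1, -1]` has mean 0 and would be treated as a no-success group even though it contains a success.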

Validation

  • uv run pytest tests/unit/ — 482 passed, 6 skipped (includes the config-load sweep over the new TOML and the estimator unit test: [1,0,0,0] → [3,-1,-1,-1], no-success → zeros, all-success → zeros).
  • ruff clean.
  • No GPU smoke yet — config-identical to the validated grpo debug path except the advantage numbers; happy to run a 50-step reverse-text A/B vs grpo on request.

🤖 Generated with Claude Code

hallerite and others added 9 commits June 11, 2026 21:59
MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the
implicit success probability instead of pass@1: the policy gradient
averaged over successful rollouts only is unbiased for the
order-group_size truncation of the ML objective's pass@k expansion. In
estimator form that is one change to GRPO — normalize the centered group
reward by the group MEAN instead of the standard deviation, upweighting
low-pass-rate examples like 1/p. group_size becomes the truncation order
(REINFORCE at 1, exact ML in the limit).

New 'max_rl' advantage type + preset: MaxRLAdvantageConfig,
max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs
rows, and a unit test for the estimator. Groups with zero mean reward
carry zero advantages (the paper's no-success convention — the
zero-advantage filter drops them). Everything else rides the existing
GRPO path: policy sampling (enforced by the rl-component guard), rl
loss component, group barrier.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite merged commit e55f067 into feat/algorithm-abstraction Jun 12, 2026
4 of 5 checks passed
@hallerite

Member Author

Folded into #2746 by merging this branch into its base (same flow as #2764) — GitHub auto-marked it merged. MaxRL ships with the main algorithm-abstraction PR.
