feat(orchestrator): MaxRL advantage strategy (stacked on #2746) by hallerite · Pull Request #2778 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-11T21:59:30Z

Stacked on #2746 (feat/algorithm-abstraction) — and deliberately so: this PR is the expressiveness test for that abstraction. A brand-new advantage estimator from a February 2026 paper lands as one advantage function + one config union member + one small algorithm class, with zero changes to the sampler, dispatcher, wire, or trainer.

What

Adds MaxRL (arXiv:2602.02710, Tajwar & Zeng et al.) as an advantage strategy and preset:

[orchestrator.algo]
name = "max_rl"

MaxRL's observation: RL on expected reward optimizes only the first-order term of the maximum-likelihood objective over the model's implicit success probability. The ML objective expands as a harmonic mixture of pass@k gradients ($\nabla J_{ML} = \sum_k \frac{1}{k}\nabla\text{pass@k}$); truncating at order $T$ gives a compute-indexed family interpolating REINFORCE ($T{=}1$) → exact ML ($T{\to}\infty$). Their Theorem 2: averaging score functions over successful rollouts only is an unbiased estimator of the order-$T{=}N$ truncation. With a control variate, the whole method reduces to a one-line change to the GRPO advantage:

$$A_j = \frac{r_j - \bar{r}}{\bar{r}} \qquad (\text{zero when } \bar{r} = 0)$$

That is, the centered reward is normalized by the group mean instead of the standard deviation. Population-level weighting becomes $w(p) = \frac{1-(1-p)^T}{p}$ (vs GRPO's $\frac{1}{\sqrt{p(1-p)}}$): hard, low-pass-rate prompts get ~$1/p$ weight, and unlike GRPO's the weight does not grow again as $p \to 1$ (it tends to 1). group_size is the truncation order, so more rollouts per prompt improve the objective, not just the variance.
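
The weighting $w(p)$ can be recovered from the harmonic expansion in a few lines. This is a sketch reconstructed from the formulas quoted in this description, assuming binary rewards with per-prompt success probability $p \in (0, 1]$; it is not copied from the paper:

$$\log p = -\sum_{k \ge 1} \frac{(1-p)^k}{k}, \qquad \text{pass@k} = 1-(1-p)^k \;\Rightarrow\; \nabla \log p = \sum_{k \ge 1} \frac{1}{k}\nabla\text{pass@k}$$

$$\sum_{k=1}^{T} \frac{1}{k}\nabla\text{pass@k} = \sum_{k=1}^{T} (1-p)^{k-1}\,\nabla p = \frac{1-(1-p)^T}{p}\,\nabla p = w(p)\,\nabla p$$

Setting $T{=}1$ gives $w(p) = 1$ (plain REINFORCE on expected reward), and $T{\to}\infty$ gives $w(p) = 1/p$, i.e. $\nabla \log p$.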

How

MaxRLAdvantageConfig (type = "max_rl", group_relative = True, rl component) in the advantage union + a max_rl preset (delta from grpo: the advantage type swap).
max_rl_advantage_fn in algo/advantage.py — the estimator, with the paper's no-success convention: a group with mean reward 0 gets zero advantages everywhere, which the existing zero-advantage filter then drops (all-success groups center to zero exactly like GRPO).
MaxRLAlgorithm in algo/algorithm.py — assign calls the fn; everything else inherits.
configs/debug/algorithms/max_rl.toml (reverse-text, mirrors grpo.toml) for smoke testing.
Docs rows in the preset + class tables; one unit test for the estimator.
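
For readers who want the estimator in executable form, here is a minimal pure-Python sketch of the formula and the no-success convention described above. The name `max_rl_advantages_sketch` and its list-based signature are hypothetical; the actual `max_rl_advantage_fn` in algo/advantage.py is not shown in this description and may have a different signature.

```python
from typing import Sequence


def max_rl_advantages_sketch(rewards: Sequence[float]) -> list[float]:
    """Hypothetical sketch: A_j = (r_j - mean) / mean for one group of rollouts.

    Assumes non-negative rewards. A group whose mean reward is 0 gets all-zero
    advantages (the no-success convention), so a downstream zero-advantage
    filter can drop it.
    """
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    if mean == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


# The three cases named in the Validation section of this PR.
print(max_rl_advantages_sketch([1, 0, 0, 0]))  # [3.0, -1.0, -1.0, -1.0]
print(max_rl_advantages_sketch([0, 0, 0, 0]))  # [0.0, 0.0, 0.0, 0.0]
print(max_rl_advantages_sketch([1, 1, 1, 1]))  # [0.0, 0.0, 0.0, 0.0]
```

With one success in four rollouts the group mean is 0.25, so the successful rollout gets (1 - 0.25) / 0.25 = 3, which is the ~$1/p$ upweighting of hard prompts.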

Existing machinery this composes with, for free: policy sampling enforced by the rl-component guard (frozen sources rejected — importance ratios need policy logprobs), group barrier from group_size, DPPO trust region + KL regularizer in the rl loss, per-env mixing (a max_rl env can pack next to a grpo env in one batch).

Caveat for users: designed for non-negative (canonically binary) rewards — mean-normalization is meaningless for signed rewards. The paper's estimator drops both terms on no-success groups; we match that exactly.
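
The following small sketch illustrates why signed rewards break mean-normalization. The helper `mean_normalized` is a hypothetical stand-in for the formula in this description, not the repository's function.

```python
def mean_normalized(rewards):
    """Hypothetical stand-in for A_j = (r_j - mean) / mean, zeros when mean == 0."""
    mean = sum(rewards) / len(rewards)
    if mean == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / mean for r in rewards]


# Binary rewards: the single successful rollout gets the largest advantage.
binary = mean_normalized([1.0, 0.0, 0.0, 0.0])
print(binary)  # [3.0, -1.0, -1.0, -1.0]

# Signed rewards with a negative group mean: dividing by the mean flips the
# sign, so the best rollout is pushed down and the worst ones are pushed up.
signed = mean_normalized([1.0, -1.0, -1.0, -1.0])
print(signed)  # [-3.0, 1.0, 1.0, 1.0]

# Signed rewards that cancel to a zero mean: the group is treated as
# "no success" and zeroed out even though one rollout clearly did better.
cancelled = mean_normalized([1.0, -1.0])
print(cancelled)  # [0.0, 0.0]
```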

Validation

uv run pytest tests/unit/ — 482 passed, 6 skipped (includes the config-load sweep over the new TOML and the estimator unit test: [1,0,0,0] → [3,-1,-1,-1], no-success → zeros, all-success → zeros).
ruff clean.
No GPU smoke yet — config-identical to the validated grpo debug path except the advantage numbers; happy to run a 50-step reverse-text A/B vs grpo on request.

🤖 Generated with Claude Code

MaxRL (arXiv:2602.02710) approximates maximum-likelihood training of the implicit success probability instead of pass@1: the policy gradient averaged over successful rollouts only is unbiased for the order-group_size truncation of the ML objective's pass@k expansion. In estimator form that is one change to GRPO: normalize the centered group reward by the group MEAN instead of the standard deviation, upweighting low-pass-rate examples like 1/p. group_size becomes the truncation order (REINFORCE at 1, exact ML in the limit).

New 'max_rl' advantage type + preset: MaxRLAdvantageConfig, max_rl_advantage_fn, MaxRLAlgorithm, a reverse-text debug config, docs rows, and a unit test for the estimator. Groups with zero mean reward carry zero advantages (the paper's no-success convention; the zero-advantage filter drops them). Everything else rides the existing GRPO path: policy sampling (enforced by the rl-component guard), rl loss component, group barrier.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hallerite · 2026-06-12T00:19:14Z

Folded into #2746 by merging this branch into its base (same flow as #2764) — GitHub auto-marked it merged. MaxRL ships with the main algorithm-abstraction PR.

hallerite and others added 9 commits June 11, 2026 21:59

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

ba05022

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

3ff860c

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

eeaf4f0

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

b9af98c

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

1395858

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

26cdbd3

Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage

a718d59

chore: section-style algo config in max_rl debug toml

262baf2

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

hallerite merged commit e55f067 into feat/algorithm-abstraction Jun 12, 2026
4 of 5 checks passed

hallerite mentioned this pull request Jun 12, 2026

feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo) #2746

Open

hallerite merged 9 commits into feat/algorithm-abstraction from feat/maxrl-advantage
