feat(loss): add `--pg-loss-divisor`: first-class constant-divisor pg_loss normalization (Dr.GRPO) by EazyReal · Pull Request #2060 · THUDM/slime

EazyReal · 2026-06-12T04:21:08Z

Problem

The default pg_loss reducer normalizes each sample by its own active-token count (a sum of per-sample token means). Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies this data-dependent denominator as a source of length bias and replaces it with a constant divisor (e.g. the max context length); DeepSWE trains with the same rule.

slime had no way to express this directly. docs/en/get_started/customization.md §11 pointed at examples/DrGRPO/custom_reducer.py:get_pg_loss_reducer as the way to get it, but that file never existed (the doc link 404s). An earlier revision of this PR filled the gap with a ~265-line example reducer — and writing it exposed the real problem: the --custom-pg-loss-reducer-function-path factory is invoked with only four positional arguments (total_lengths, response_lengths, loss_masks, calculate_per_token_loss) — no args, no max_seq_lens. So a reducer implementing the one-line Dr.GRPO rule had to:

reach into Megatron's global args for its divisor and qkv_format,
re-implement the per-CP-rank loss-mask chunking that the default reducer (get_logits_and_tokens_offset_with_cp) already does, and
still fail loud on qkv_format=bshd under context parallelism, because the call site does not forward max_seq_lens.

The example also had to own a core CLI argument (--drgrpo-divisor in slime/utils/arguments.py) — an example owning a global arg.

Before / After

Before (this PR's earlier shape — or any out-of-tree reducer today):

--custom-pg-loss-reducer-function-path examples.DrGRPO.custom_reducer.get_pg_loss_reducer \
--drgrpo-divisor 40960

~265 lines of example code reading Megatron globals, duplicating CP mask slicing, raising on bshd.

After:

--pg-loss-divisor 40960   # a constant, e.g. the max context length

pg_loss is aggregated as sum(token_loss * loss_mask) / 40960 instead of the default sum of per-sample active-token means. Concretely, for two responses with masked sums s1 (10 active tokens) and s2 (1000 active tokens): default gives s1/10 + s2/1000 (the short response dominates per token); constant mode gives (s1 + s2) / 40960 (every token weighs the same regardless of response length — Dr.GRPO's point).

Flag unset → byte-identical to today. Other metrics (pg_clipfrac, ppo_kl, entropy_loss, kl_loss) keep the default reducer, matching the scope of the custom hook.

Why first-class instead of an example

The core reducer site already has all the layout context the hook lacks, so the whole feature is ~15 lines there:

cp_utils.get_sum_of_sample_mean grows a constant_divisor option that divides the masked token-sum closure it already builds — reusing the existing CP mask chunking, so it is correct for thd and bshd under context parallelism.
No extra allreduce: the divisor is identical on every CP rank, so Megatron's gradient sum-allreduce already yields the full-batch value.
No global-args reach-around: the divisor arrives through args at the selection site in loss.py (policy_loss_function), which also forwards qkv_format and max_seq_lens.
--calculate-per-token-loss: the divisor is intentionally not applied — Megatron already divides by the all-reduced token count (same contract as the default reducer).
Fail at startup, not mid-train: slime_validate_args rejects non-positive (or NaN) divisors.

The --custom-pg-loss-reducer-function-path hook is untouched for normalizations that need more than a constant. Docs §11 (en + zh) now shows the flag for the Dr.GRPO use case — fixing the dangling example reference that originally motivated this PR — and examples/DrGRPO/README.md is a thin pointer to the flag.

Tests

tests/test_cp_utils.py: constant mode = masked token sum / constant (contrasted with the default per-sample means on the same inputs); divisor not applied under calculate_per_token_loss; summing per-CP-rank reducer outputs reproduces the cp=1 value for both thd and bshd.
tests/test_megatron_argument_validation.py: startup validation rejects 0 / negative / NaN divisors, accepts positive ones.

Both files are already in the cpu-unittest matrix, so no CI registration changes. New tests were mutation-checked: ignoring the divisor, applying it in per-token mode, or removing the validation each makes the corresponding test fail.

🤖 Generated with Claude Code

…tion (Dr.GRPO) The default pg_loss reducer normalizes each sample by its own active-token count, which Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies as a source of length bias; Dr.GRPO and DeepSWE instead divide the policy-gradient loss by a constant (e.g. the max context length). Previously the only way to express this in slime was the --custom-pg-loss-reducer-function-path hook, but the hook factory is invoked with only four positional arguments (total_lengths, response_lengths, loss_masks, calculate_per_token_loss) — no args, no max_seq_lens. A reducer implementing the one-line Dr.GRPO rule must therefore reach into Megatron's global args for the divisor and qkv_format, re-implement the per-CP-rank loss-mask chunking, and still fail on qkv_format=bshd under context parallelism because max_seq_lens is not forwarded. The core reducer site already has all of that layout context, so make the constant divisor a first-class mode there: - cp_utils.get_sum_of_sample_mean grows a constant_divisor option that returns sum(x * loss_mask) / constant. It reuses the existing CP mask chunking, so it is correct for thd AND bshd, and the constant is identical on every CP rank, so the gradient sum-allreduce needs no extra denominator correction. - policy_loss_function selects it for pg_loss when --pg-loss-divisor is set (the custom hook still takes precedence; other metrics keep the default reducer). Unset, behavior is byte-identical to before. Under --calculate-per-token-loss the divisor is intentionally not applied — Megatron already divides by the all-reduced token count. - slime_validate_args rejects non-positive (or NaN) divisors at startup instead of training with a wrong denominator. - The --custom-pg-loss-reducer-function-path hook is unchanged; docs section 11 (en+zh) now points the Dr.GRPO use case at the flag instead of the previously missing examples/DrGRPO/custom_reducer.py, and examples/DrGRPO/README.md is a thin pointer to the flag. - Tests: constant-mode value, per-token-loss contract, and CP (thd/bshd) equivalence in tests/test_cp_utils.py; startup validation in tests/test_megatron_argument_validation.py. Both files are already in the cpu-unittest matrix, so no workflow changes are needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}` (+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is aggregated across a training step, riding the existing `sample_denoms` seam in `get_sum_of_sample_mean`. Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2): - sample_mean (default): GRPO sample average — per-rollout token-weighted mean via `rollout_mask_sums`. Byte-identical to the prior default. - prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and variable-GBS-correct). Every prompt group enters the step sum once under the same `/ step_global_batch_size` divisor, so relative per-prompt weighting is uniform; absolute scale differs from a strict 1/P DAPO average by a constant factor (P/N), which the learning rate absorbs. - token_mean: token average — aliased onto `--calculate-per-token-loss` so the whole loss-scaling/reporting path stays per-token. - constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new `constant_divisor` branch in cp_utils. Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean); `--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated > 0 at startup only for constant. Subsumes the open THUDM#2060 `--pg-loss-divisor`. Tests in test_cp_utils.py (constant divides by L; prompt_mean per-group denom distinct from sample/token mean; mutual-exclusion guard; CP rank-sum invariance for prompt_mean and constant) and test_megatron_argument_validation.py (divisor validation; token_mean alias). Mutation-verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}` (+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is aggregated across a training step, riding the existing `sample_denoms` seam in `get_sum_of_sample_mean`. Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2): - sample_mean (default): GRPO sample average — per-rollout token-weighted mean via `rollout_mask_sums`. Byte-identical to the prior default. - prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and variable-GBS-correct). Every prompt group enters the step sum once under the same `/ step_global_batch_size` divisor, so relative per-prompt weighting is uniform; absolute scale differs from a strict 1/P DAPO average by a constant factor (P/N), which the learning rate absorbs. - token_mean: token average — aliased onto `--calculate-per-token-loss` so the whole loss-scaling/reporting path stays per-token. - constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new `constant_divisor` branch in cp_utils. Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean); `--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated > 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`. Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group denominator distinct from sample/token mean, the constant/per-token mutual- exclusion guard, and CP rank-sum invariance for the new constant branch (prompt_mean reuses the per-rollout sample-mean CP path already pinned). test_megatron_argument_validation.py pins divisor validation and the token_mean alias. Mutation-verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal · 2026-06-16T08:08:20Z

Superseded by #2090, which adds --loss-aggregation {sample_mean, prompt_mean, token_mean, constant} — the Dr.GRPO constant divisor becomes the first-class --loss-aggregation constant --loss-aggregation-divisor L mode (same constant_divisor design as here), alongside the DAPO prompt-average, GRPO sample-average, and token-average. Closing in favor of the unified flag.

Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}` (+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is aggregated across a training step, riding the existing `sample_denoms` seam in `get_sum_of_sample_mean`. Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2): - sample_mean (default): GRPO sample average — per-rollout token-weighted mean via `rollout_mask_sums`. Byte-identical to the prior default (no extra batch key in any non-prompt_mean mode). - prompt_mean: DAPO prompt average — step-level `prompt_mask_sums` grouped by Sample.group_index, built ONLY under prompt_mean and plumbed like `rollout_mask_sums` (CP- and variable-GBS-correct). The other three modes never read group_index and never build the key. A None group_index under prompt_mean fails loud (the prompt-grouping invariant is broken; silently renumbering it into a singleton group would degrade prompt_mean -> sample_mean for that sample). Every prompt group enters the step sum once under the same `/ step_global_batch_size` divisor, so relative per-prompt weighting is uniform; absolute scale differs from a strict 1/P DAPO average by a constant factor (P/N), which the learning rate absorbs. - token_mean: token average — aliased onto `--calculate-per-token-loss` so the whole loss-scaling/reporting path stays per-token. - constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new `constant_divisor` branch in cp_utils. Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean); `--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated > 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`. Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group denominator distinct from sample/token mean, the constant/per-token mutual- exclusion guard, and CP rank-sum invariance for the new constant branch (prompt_mean reuses the per-rollout sample-mean CP path already pinned). test_megatron_argument_validation.py pins divisor validation and the token_mean alias. test_rollout_validation.py pins the prompt_mask_sums build: fail-loud on a None group_index under prompt_mean, no key built for the other three modes (default batch unchanged), and correct per-group totals. Mutation-verified. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EazyReal force-pushed the upstream-pr/drgrpo-reducer-example branch 2 times, most recently from 07c9c15 to 09d237d Compare June 12, 2026 06:20

EazyReal force-pushed the upstream-pr/drgrpo-reducer-example branch from 09d237d to 0baefb5 Compare June 12, 2026 08:37

EazyReal changed the title ~~docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example~~ docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example (Todo: make it proper loss agg option) Jun 12, 2026

EazyReal changed the title ~~docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example (Todo: make it proper loss agg option)~~ feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO) Jun 12, 2026

EazyReal mentioned this pull request Jun 16, 2026

Add --loss-aggregation for the four ScaleRL pg_loss aggregation modes #2090

Open

EazyReal closed this Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(loss): add `--pg-loss-divisor`: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060

feat(loss): add `--pg-loss-divisor`: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:upstream-pr/drgrpo-reducer-example

EazyReal commented Jun 12, 2026 •

edited

Loading

Uh oh!

EazyReal commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EazyReal commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Before / After

Why first-class instead of an example

Tests

Uh oh!

EazyReal commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

EazyReal commented Jun 12, 2026 •

edited

Loading