Skip to content

feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060

Closed
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:upstream-pr/drgrpo-reducer-example
Closed

feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060
EazyReal wants to merge 1 commit into
THUDM:mainfrom
EazyReal:upstream-pr/drgrpo-reducer-example

Conversation

@EazyReal

@EazyReal EazyReal commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

The default pg_loss reducer normalizes each sample by its own active-token count (a sum of per-sample token means). Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies this data-dependent denominator as a source of length bias and replaces it with a constant divisor (e.g. the max context length); DeepSWE trains with the same rule.

slime had no way to express this directly. docs/en/get_started/customization.md §11 pointed at examples/DrGRPO/custom_reducer.py:get_pg_loss_reducer as the way to get it, but that file never existed (the doc link 404s). An earlier revision of this PR filled the gap with a ~265-line example reducer — and writing it exposed the real problem: the --custom-pg-loss-reducer-function-path factory is invoked with only four positional arguments (total_lengths, response_lengths, loss_masks, calculate_per_token_loss) — no args, no max_seq_lens. So a reducer implementing the one-line Dr.GRPO rule had to:

  • reach into Megatron's global args for its divisor and qkv_format,
  • re-implement the per-CP-rank loss-mask chunking that the default reducer (get_logits_and_tokens_offset_with_cp) already does, and
  • still fail loud on qkv_format=bshd under context parallelism, because the call site does not forward max_seq_lens.

The example also had to own a core CLI argument (--drgrpo-divisor in slime/utils/arguments.py) — an example owning a global arg.

Before / After

Before (this PR's earlier shape — or any out-of-tree reducer today):

--custom-pg-loss-reducer-function-path examples.DrGRPO.custom_reducer.get_pg_loss_reducer \
--drgrpo-divisor 40960

~265 lines of example code reading Megatron globals, duplicating CP mask slicing, raising on bshd.

After:

--pg-loss-divisor 40960   # a constant, e.g. the max context length

pg_loss is aggregated as sum(token_loss * loss_mask) / 40960 instead of the default sum of per-sample active-token means. Concretely, for two responses with masked sums s1 (10 active tokens) and s2 (1000 active tokens): default gives s1/10 + s2/1000 (the short response dominates per token); constant mode gives (s1 + s2) / 40960 (every token weighs the same regardless of response length — Dr.GRPO's point).

Flag unset → byte-identical to today. Other metrics (pg_clipfrac, ppo_kl, entropy_loss, kl_loss) keep the default reducer, matching the scope of the custom hook.

Why first-class instead of an example

The core reducer site already has all the layout context the hook lacks, so the whole feature is ~15 lines there:

  • cp_utils.get_sum_of_sample_mean grows a constant_divisor option that divides the masked token-sum closure it already builds — reusing the existing CP mask chunking, so it is correct for thd and bshd under context parallelism.
  • No extra allreduce: the divisor is identical on every CP rank, so Megatron's gradient sum-allreduce already yields the full-batch value.
  • No global-args reach-around: the divisor arrives through args at the selection site in loss.py (policy_loss_function), which also forwards qkv_format and max_seq_lens.
  • --calculate-per-token-loss: the divisor is intentionally not applied — Megatron already divides by the all-reduced token count (same contract as the default reducer).
  • Fail at startup, not mid-train: slime_validate_args rejects non-positive (or NaN) divisors.

The --custom-pg-loss-reducer-function-path hook is untouched for normalizations that need more than a constant. Docs §11 (en + zh) now shows the flag for the Dr.GRPO use case — fixing the dangling example reference that originally motivated this PR — and examples/DrGRPO/README.md is a thin pointer to the flag.

Tests

  • tests/test_cp_utils.py: constant mode = masked token sum / constant (contrasted with the default per-sample means on the same inputs); divisor not applied under calculate_per_token_loss; summing per-CP-rank reducer outputs reproduces the cp=1 value for both thd and bshd.
  • tests/test_megatron_argument_validation.py: startup validation rejects 0 / negative / NaN divisors, accepts positive ones.

Both files are already in the cpu-unittest matrix, so no CI registration changes. New tests were mutation-checked: ignoring the divisor, applying it in per-token mode, or removing the validation each makes the corresponding test fail.

🤖 Generated with Claude Code

@EazyReal EazyReal force-pushed the upstream-pr/drgrpo-reducer-example branch 2 times, most recently from 07c9c15 to 09d237d Compare June 12, 2026 06:20
…tion (Dr.GRPO)

The default pg_loss reducer normalizes each sample by its own
active-token count, which Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies
as a source of length bias; Dr.GRPO and DeepSWE instead divide the
policy-gradient loss by a constant (e.g. the max context length).

Previously the only way to express this in slime was the
--custom-pg-loss-reducer-function-path hook, but the hook factory is
invoked with only four positional arguments (total_lengths,
response_lengths, loss_masks, calculate_per_token_loss) — no args, no
max_seq_lens. A reducer implementing the one-line Dr.GRPO rule must
therefore reach into Megatron's global args for the divisor and
qkv_format, re-implement the per-CP-rank loss-mask chunking, and still
fail on qkv_format=bshd under context parallelism because max_seq_lens
is not forwarded. The core reducer site already has all of that layout
context, so make the constant divisor a first-class mode there:

- cp_utils.get_sum_of_sample_mean grows a constant_divisor option that
  returns sum(x * loss_mask) / constant. It reuses the existing CP mask
  chunking, so it is correct for thd AND bshd, and the constant is
  identical on every CP rank, so the gradient sum-allreduce needs no
  extra denominator correction.
- policy_loss_function selects it for pg_loss when --pg-loss-divisor is
  set (the custom hook still takes precedence; other metrics keep the
  default reducer). Unset, behavior is byte-identical to before. Under
  --calculate-per-token-loss the divisor is intentionally not applied —
  Megatron already divides by the all-reduced token count.
- slime_validate_args rejects non-positive (or NaN) divisors at
  startup instead of training with a wrong denominator.
- The --custom-pg-loss-reducer-function-path hook is unchanged; docs
  section 11 (en+zh) now points the Dr.GRPO use case at the flag
  instead of the previously missing examples/DrGRPO/custom_reducer.py,
  and examples/DrGRPO/README.md is a thin pointer to the flag.
- Tests: constant-mode value, per-token-loss contract, and CP
  (thd/bshd) equivalence in tests/test_cp_utils.py; startup validation
  in tests/test_megatron_argument_validation.py. Both files are already
  in the cpu-unittest matrix, so no workflow changes are needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal EazyReal force-pushed the upstream-pr/drgrpo-reducer-example branch from 09d237d to 0baefb5 Compare June 12, 2026 08:37
@EazyReal EazyReal changed the title docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example (Todo: make it proper loss agg option) Jun 12, 2026
@EazyReal EazyReal changed the title docs(examples): add Dr.GRPO constant-divisor pg_loss reducer example (Todo: make it proper loss agg option) feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO) Jun 12, 2026
EazyReal added a commit to EazyReal/slime that referenced this pull request Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.

Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
  via `rollout_mask_sums`. Byte-identical to the prior default.
- prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by
  Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and
  variable-GBS-correct). Every prompt group enters the step sum once under the
  same `/ step_global_batch_size` divisor, so relative per-prompt weighting is
  uniform; absolute scale differs from a strict 1/P DAPO average by a constant
  factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
  whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
  `constant_divisor` branch in cp_utils.

Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Subsumes the open THUDM#2060 `--pg-loss-divisor`.

Tests in test_cp_utils.py (constant divides by L; prompt_mean per-group denom
distinct from sample/token mean; mutual-exclusion guard; CP rank-sum invariance
for prompt_mean and constant) and test_megatron_argument_validation.py (divisor
validation; token_mean alias). Mutation-verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
EazyReal added a commit to EazyReal/slime that referenced this pull request Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.

Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
  via `rollout_mask_sums`. Byte-identical to the prior default.
- prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by
  Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and
  variable-GBS-correct). Every prompt group enters the step sum once under the
  same `/ step_global_batch_size` divisor, so relative per-prompt weighting is
  uniform; absolute scale differs from a strict 1/P DAPO average by a constant
  factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
  whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
  `constant_divisor` branch in cp_utils.

Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`.

Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group
denominator distinct from sample/token mean, the constant/per-token mutual-
exclusion guard, and CP rank-sum invariance for the new constant branch
(prompt_mean reuses the per-rollout sample-mean CP path already pinned).
test_megatron_argument_validation.py pins divisor validation and the token_mean
alias. Mutation-verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@EazyReal

Copy link
Copy Markdown
Contributor Author

Superseded by #2090, which adds --loss-aggregation {sample_mean, prompt_mean, token_mean, constant} — the Dr.GRPO constant divisor becomes the first-class --loss-aggregation constant --loss-aggregation-divisor L mode (same constant_divisor design as here), alongside the DAPO prompt-average, GRPO sample-average, and token-average. Closing in favor of the unified flag.

@EazyReal EazyReal closed this Jun 16, 2026
EazyReal added a commit to EazyReal/slime that referenced this pull request Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.

Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
  via `rollout_mask_sums`. Byte-identical to the prior default (no extra batch
  key in any non-prompt_mean mode).
- prompt_mean: DAPO prompt average — step-level `prompt_mask_sums` grouped by
  Sample.group_index, built ONLY under prompt_mean and plumbed like
  `rollout_mask_sums` (CP- and variable-GBS-correct). The other three modes
  never read group_index and never build the key. A None group_index under
  prompt_mean fails loud (the prompt-grouping invariant is broken; silently
  renumbering it into a singleton group would degrade prompt_mean -> sample_mean
  for that sample). Every prompt group enters the step sum once under the same
  `/ step_global_batch_size` divisor, so relative per-prompt weighting is
  uniform; absolute scale differs from a strict 1/P DAPO average by a constant
  factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
  whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
  `constant_divisor` branch in cp_utils.

Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`.

Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group
denominator distinct from sample/token mean, the constant/per-token mutual-
exclusion guard, and CP rank-sum invariance for the new constant branch
(prompt_mean reuses the per-rollout sample-mean CP path already pinned).
test_megatron_argument_validation.py pins divisor validation and the token_mean
alias. test_rollout_validation.py pins the prompt_mask_sums build: fail-loud on a
None group_index under prompt_mean, no key built for the other three modes
(default batch unchanged), and correct per-group totals. Mutation-verified.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant