feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060
Closed
EazyReal wants to merge 1 commit into
Closed
feat(loss): add --pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060EazyReal wants to merge 1 commit into
--pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)#2060EazyReal wants to merge 1 commit into
Conversation
07c9c15 to
09d237d
Compare
…tion (Dr.GRPO) The default pg_loss reducer normalizes each sample by its own active-token count, which Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies as a source of length bias; Dr.GRPO and DeepSWE instead divide the policy-gradient loss by a constant (e.g. the max context length). Previously the only way to express this in slime was the --custom-pg-loss-reducer-function-path hook, but the hook factory is invoked with only four positional arguments (total_lengths, response_lengths, loss_masks, calculate_per_token_loss) — no args, no max_seq_lens. A reducer implementing the one-line Dr.GRPO rule must therefore reach into Megatron's global args for the divisor and qkv_format, re-implement the per-CP-rank loss-mask chunking, and still fail on qkv_format=bshd under context parallelism because max_seq_lens is not forwarded. The core reducer site already has all of that layout context, so make the constant divisor a first-class mode there: - cp_utils.get_sum_of_sample_mean grows a constant_divisor option that returns sum(x * loss_mask) / constant. It reuses the existing CP mask chunking, so it is correct for thd AND bshd, and the constant is identical on every CP rank, so the gradient sum-allreduce needs no extra denominator correction. - policy_loss_function selects it for pg_loss when --pg-loss-divisor is set (the custom hook still takes precedence; other metrics keep the default reducer). Unset, behavior is byte-identical to before. Under --calculate-per-token-loss the divisor is intentionally not applied — Megatron already divides by the all-reduced token count. - slime_validate_args rejects non-positive (or NaN) divisors at startup instead of training with a wrong denominator. - The --custom-pg-loss-reducer-function-path hook is unchanged; docs section 11 (en+zh) now points the Dr.GRPO use case at the flag instead of the previously missing examples/DrGRPO/custom_reducer.py, and examples/DrGRPO/README.md is a thin pointer to the flag. - Tests: constant-mode value, per-token-loss contract, and CP (thd/bshd) equivalence in tests/test_cp_utils.py; startup validation in tests/test_megatron_argument_validation.py. Both files are already in the cpu-unittest matrix, so no workflow changes are needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
09d237d to
0baefb5
Compare
--pg-loss-divisor: first-class constant-divisor pg_loss normalization (Dr.GRPO)
EazyReal
added a commit
to EazyReal/slime
that referenced
this pull request
Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.
Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
via `rollout_mask_sums`. Byte-identical to the prior default.
- prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by
Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and
variable-GBS-correct). Every prompt group enters the step sum once under the
same `/ step_global_batch_size` divisor, so relative per-prompt weighting is
uniform; absolute scale differs from a strict 1/P DAPO average by a constant
factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
`constant_divisor` branch in cp_utils.
Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Subsumes the open THUDM#2060 `--pg-loss-divisor`.
Tests in test_cp_utils.py (constant divides by L; prompt_mean per-group denom
distinct from sample/token mean; mutual-exclusion guard; CP rank-sum invariance
for prompt_mean and constant) and test_megatron_argument_validation.py (divisor
validation; token_mean alias). Mutation-verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
EazyReal
added a commit
to EazyReal/slime
that referenced
this pull request
Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.
Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
via `rollout_mask_sums`. Byte-identical to the prior default.
- prompt_mean: DAPO prompt average — new step-level `prompt_mask_sums` grouped by
Sample.group_index, plumbed exactly like `rollout_mask_sums` (CP- and
variable-GBS-correct). Every prompt group enters the step sum once under the
same `/ step_global_batch_size` divisor, so relative per-prompt weighting is
uniform; absolute scale differs from a strict 1/P DAPO average by a constant
factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
`constant_divisor` branch in cp_utils.
Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`.
Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group
denominator distinct from sample/token mean, the constant/per-token mutual-
exclusion guard, and CP rank-sum invariance for the new constant branch
(prompt_mean reuses the per-rollout sample-mean CP path already pinned).
test_megatron_argument_validation.py pins divisor validation and the token_mean
alias. Mutation-verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
Superseded by #2090, which adds |
EazyReal
added a commit
to EazyReal/slime
that referenced
this pull request
Jun 16, 2026
Add a unified `--loss-aggregation {sample_mean,prompt_mean,token_mean,constant}`
(+ `--loss-aggregation-divisor L` for constant) selecting how pg_loss is
aggregated across a training step, riding the existing `sample_denoms` seam in
`get_sum_of_sample_mean`.
Modes follow the ScaleRL taxonomy (arXiv:2510.13786 §3.2):
- sample_mean (default): GRPO sample average — per-rollout token-weighted mean
via `rollout_mask_sums`. Byte-identical to the prior default (no extra batch
key in any non-prompt_mean mode).
- prompt_mean: DAPO prompt average — step-level `prompt_mask_sums` grouped by
Sample.group_index, built ONLY under prompt_mean and plumbed like
`rollout_mask_sums` (CP- and variable-GBS-correct). The other three modes
never read group_index and never build the key. A None group_index under
prompt_mean fails loud (the prompt-grouping invariant is broken; silently
renumbering it into a singleton group would degrade prompt_mean -> sample_mean
for that sample). Every prompt group enters the step sum once under the same
`/ step_global_batch_size` divisor, so relative per-prompt weighting is
uniform; absolute scale differs from a strict 1/P DAPO average by a constant
factor (P/N), which the learning rate absorbs.
- token_mean: token average — aliased onto `--calculate-per-token-loss` so the
whole loss-scaling/reporting path stays per-token.
- constant: Dr.GRPO (arXiv:2503.20783) — masked token sum / L via a new
`constant_divisor` branch in cp_utils.
Aggregation applies to pg_loss only (metrics keep sum_of_sample_mean);
`--custom-pg-loss-reducer-function-path` still takes precedence. `L` is validated
> 0 at startup only for constant. Supersedes the open THUDM#2060 `--pg-loss-divisor`.
Tests: test_cp_utils.py pins constant divides by L, prompt_mean's per-group
denominator distinct from sample/token mean, the constant/per-token mutual-
exclusion guard, and CP rank-sum invariance for the new constant branch
(prompt_mean reuses the per-rollout sample-mean CP path already pinned).
test_megatron_argument_validation.py pins divisor validation and the token_mean
alias. test_rollout_validation.py pins the prompt_mask_sums build: fail-loud on a
None group_index under prompt_mean, no key built for the other three modes
(default batch unchanged), and correct per-group totals. Mutation-verified.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The default pg_loss reducer normalizes each sample by its own active-token count (a sum of per-sample token means). Dr.GRPO (arXiv:2503.20783, Eq. 2) identifies this data-dependent denominator as a source of length bias and replaces it with a constant divisor (e.g. the max context length); DeepSWE trains with the same rule.
slime had no way to express this directly.
docs/en/get_started/customization.md§11 pointed atexamples/DrGRPO/custom_reducer.py:get_pg_loss_reduceras the way to get it, but that file never existed (the doc link 404s). An earlier revision of this PR filled the gap with a ~265-line example reducer — and writing it exposed the real problem: the--custom-pg-loss-reducer-function-pathfactory is invoked with only four positional arguments (total_lengths, response_lengths, loss_masks, calculate_per_token_loss) — noargs, nomax_seq_lens. So a reducer implementing the one-line Dr.GRPO rule had to:qkv_format,get_logits_and_tokens_offset_with_cp) already does, andqkv_format=bshdunder context parallelism, because the call site does not forwardmax_seq_lens.The example also had to own a core CLI argument (
--drgrpo-divisorinslime/utils/arguments.py) — an example owning a global arg.Before / After
Before (this PR's earlier shape — or any out-of-tree reducer today):
~265 lines of example code reading Megatron globals, duplicating CP mask slicing, raising on
bshd.After:
--pg-loss-divisor 40960 # a constant, e.g. the max context lengthpg_loss is aggregated as
sum(token_loss * loss_mask) / 40960instead of the default sum of per-sample active-token means. Concretely, for two responses with masked sumss1(10 active tokens) ands2(1000 active tokens): default givess1/10 + s2/1000(the short response dominates per token); constant mode gives(s1 + s2) / 40960(every token weighs the same regardless of response length — Dr.GRPO's point).Flag unset → byte-identical to today. Other metrics (pg_clipfrac, ppo_kl, entropy_loss, kl_loss) keep the default reducer, matching the scope of the custom hook.
Why first-class instead of an example
The core reducer site already has all the layout context the hook lacks, so the whole feature is ~15 lines there:
cp_utils.get_sum_of_sample_meangrows aconstant_divisoroption that divides the masked token-sum closure it already builds — reusing the existing CP mask chunking, so it is correct for thd and bshd under context parallelism.argsat the selection site inloss.py(policy_loss_function), which also forwardsqkv_formatandmax_seq_lens.--calculate-per-token-loss: the divisor is intentionally not applied — Megatron already divides by the all-reduced token count (same contract as the default reducer).slime_validate_argsrejects non-positive (or NaN) divisors.The
--custom-pg-loss-reducer-function-pathhook is untouched for normalizations that need more than a constant. Docs §11 (en + zh) now shows the flag for the Dr.GRPO use case — fixing the dangling example reference that originally motivated this PR — andexamples/DrGRPO/README.mdis a thin pointer to the flag.Tests
tests/test_cp_utils.py: constant mode = masked token sum / constant (contrasted with the default per-sample means on the same inputs); divisor not applied undercalculate_per_token_loss; summing per-CP-rank reducer outputs reproduces the cp=1 value for boththdandbshd.tests/test_megatron_argument_validation.py: startup validation rejects 0 / negative / NaN divisors, accepts positive ones.Both files are already in the
cpu-unittestmatrix, so no CI registration changes. New tests were mutation-checked: ignoring the divisor, applying it in per-token mode, or removing the validation each makes the corresponding test fail.🤖 Generated with Claude Code