[megatron] don't re-assert no_sync_func every step with overlap_grad_reduce #2066
Draft
HaozheZhang6 wants to merge 1 commit into
…reduce

`train()` sets up `config.no_sync_func` on every step, but `config` is the model config and persists across steps. With `--overlap-grad-reduce` the first step sets it, then the second step trips `assert config.no_sync_func is None` and crashes.

Guard the setup with `if config.no_sync_func is None:` so the sync funcs are set once (they are constant, so skipping later steps is a no-op).

Fixes THUDM#1779
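For illustration, the failure mode and the guard can be reproduced in a standalone sketch. This is not the slime code: `FakeConfig`, `setup_unguarded`, `setup_guarded`, and `run_steps` are hypothetical stand-ins, and `nullcontext` substitutes for the real sync functions.

```python
from contextlib import nullcontext


class FakeConfig:
    """Hypothetical stand-in for the model config that persists across steps."""

    def __init__(self):
        self.no_sync_func = None


def setup_unguarded(config):
    # Mirrors the per-step pattern described above: assert, then set.
    assert config.no_sync_func is None
    config.no_sync_func = nullcontext


def setup_guarded(config):
    # Set once; later steps see a non-None value and skip the setup.
    if config.no_sync_func is None:
        config.no_sync_func = nullcontext


def run_steps(setup, num_steps):
    """Call `setup` once per step on one shared config; return steps completed."""
    config = FakeConfig()
    completed = 0
    for _ in range(num_steps):
        try:
            setup(config)
        except AssertionError:
            break  # stands in for the training crash
        completed += 1
    return completed


print(run_steps(setup_unguarded, 3))  # 1
print(run_steps(setup_guarded, 3))    # 3
```

The unguarded variant completes only the first step because the config object outlives the step, which matches the behavior reported in the issue.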
Summary

`train()` in `slime/backends/megatron_utils/model.py` sets up `config.no_sync_func` on every step, but `config = get_model_config(model[0])` is the model config and persists across steps. With `--overlap-grad-reduce`, the first step passes `assert config.no_sync_func is None` and sets it; the second step then trips that same assert and crashes.

Fix

Guard the setup with `if config.no_sync_func is None:` so the sync funcs are set once. They're constant (the model chunks' own `no_sync` / `start_grad_sync`), so skipping on later steps is a no-op; the only thing the per-step re-run did was trip the assert.

Reproduction

Run any training with `--overlap-grad-reduce` (from #1779). It crashes on the 2nd step without this change.

Notes
- I have not run this end to end: that needs a working training setup with `--overlap-grad-reduce`, which I can't stand up locally. The fix is small and verified by inspection against the reported repro; happy to add a test if you can point me at a lightweight harness for this path.
- The `assert` also doubled as a guard against a pre-supplied custom `no_sync_func`. Under the `is None` guard it becomes redundant, and a hypothetical pre-set custom func would now be left as-is rather than rejected. If you'd rather keep that rejection explicit (e.g. assert only on first setup), I'll adjust.

Fixes #1779
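One possible shape for keeping the rejection explicit, as a standalone sketch rather than a proposed patch. It assumes a marker recording that the setup already ran; `FakeConfig`, `sync_funcs_installed`, `install_sync_funcs`, and the assert message are all hypothetical and do not exist in slime or Megatron.

```python
from contextlib import nullcontext


class FakeConfig:
    """Hypothetical stand-in for the persistent model config."""

    def __init__(self, no_sync_func=None):
        self.no_sync_func = no_sync_func
        self.sync_funcs_installed = False  # hypothetical marker, not a real field


def install_sync_funcs(config, no_sync_func):
    """Install once; reject a func that was pre-supplied before the first setup."""
    if config.sync_funcs_installed:
        return False  # later steps: already set up, nothing to do
    # First setup only: a non-None value here was supplied by someone else.
    assert config.no_sync_func is None, "pre-supplied no_sync_func is not supported here"
    config.no_sync_func = no_sync_func
    config.sync_funcs_installed = True
    return True


config = FakeConfig()
results = [install_sync_funcs(config, nullcontext) for _ in range(3)]
print(results)  # [True, False, False]
```

With this shape, repeated steps stay a no-op while a config that arrives with its own `no_sync_func` still fails the assert on the first call.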