Fix per-rank training state resume by gq112 · Pull Request #587 · sgl-project/SpecForge

gq112 · 2026-06-20T09:52:21Z

Summary

Fix resume behavior by saving and loading optimizer/scheduler training state per rank.

Changes

Save training state as training_state_rank_{rank}.pt
Load the matching per-rank state on resume
Stop using the shared training_state.pt
Restore full optimizer state for Domino resume
Include current lr in optimizer state
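
The per-rank naming described above can be illustrated with a small sketch. This is not the helper from specforge/utils.py: `training_state_path`, `all_rank_paths`, the explicit `rank` argument, and the `ckpt/step_1000` directory are hypothetical names chosen for illustration. Only the `training_state_rank_{rank}.pt` filename pattern comes from this PR.

```python
import os
from typing import List


def training_state_path(checkpoint_dir: str, rank: int) -> str:
    """Hypothetical stand-in for a per-rank path helper.

    Each rank gets its own file, so no two ranks write to or read
    from the same training state file.
    """
    return os.path.join(checkpoint_dir, f"training_state_rank_{rank}.pt")


def all_rank_paths(checkpoint_dir: str, world_size: int) -> List[str]:
    """List the files a complete checkpoint would contain, one per rank."""
    return [training_state_path(checkpoint_dir, r) for r in range(world_size)]


if __name__ == "__main__":
    # Illustrative directory name only; the real checkpoint layout may differ.
    for path in all_rank_paths(os.path.join("ckpt", "step_1000"), world_size=2):
        print(path)
```

On resume, each rank would pass its own rank index and load only the matching file, which is the behavior the list above describes.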

Validation

python -m py_compile specforge/optimizer.py specforge/utils.py scripts/train_eagle3.py scripts/train_dflash.py scripts/train_domino.py
git diff --check

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist

Code Review

This pull request updates the training scripts (train_dflash.py, train_domino.py, and train_eagle3.py) to save and load rank-specific training states using a new helper function get_training_state_path in specforge/utils.py. It also updates the optimizer state dictionary to include the learning rate and adjusts how the optimizer state is loaded in train_domino.py. The review feedback highlights that removing the os.path.exists check when loading the training state across all three training scripts makes the resume process fragile, as it will crash with a FileNotFoundError if the training state file is missing. The reviewer suggests restoring the existence check to ensure a graceful fallback.

gemini-code-assist · 2026-06-20T09:53:51Z

+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        resume_state = torch.load(
+            training_state_path, map_location="cpu", weights_only=False
+        )
+        print(
+            f"Will resume from epoch {resume_state['epoch']}, "
+            f"step {resume_state['global_step']}"
        )


Removing the os.path.exists check for the training state file makes the resume process fragile. If a user attempts to resume training from a checkpoint that only contains model weights (or if the training state files were deleted/not saved), the script will crash with a FileNotFoundError instead of gracefully falling back to starting with a fresh optimizer state. Restoring the existence check ensures robust fallback behavior.

Suggested change

-        training_state_path = get_training_state_path(draft_model_last_checkpoint)
-        resume_state = torch.load(
-            training_state_path, map_location="cpu", weights_only=False
-        )
-        print(
-            f"Will resume from epoch {resume_state['epoch']}, "
-            f"step {resume_state['global_step']}"
-        )
+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        if os.path.exists(training_state_path):
+            resume_state = torch.load(
+                training_state_path, map_location="cpu", weights_only=False
+            )
+            print(
+                f"Will resume from epoch {resume_state['epoch']}, "
+                f"step {resume_state['global_step']}"
+            )
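
The fallback the reviewer asks for can be sketched end to end as follows. This is an illustration under assumptions, not code from the PR: `load_resume_state` and `resume_point` are hypothetical helpers, and the loader is passed in as `load_fn` so the sketch runs without PyTorch. In the training scripts the loader would be the `torch.load(..., map_location="cpu", weights_only=False)` call shown in the diff.

```python
import os
from typing import Any, Callable, Optional, Tuple


def load_resume_state(path: str, load_fn: Callable[[str], Any]) -> Optional[Any]:
    """Return the saved training state, or None when the file is absent.

    Returning None instead of raising lets the caller fall back to a
    fresh optimizer state for weights-only checkpoints.
    """
    if not os.path.exists(path):
        print(f"No training state at {path}; starting with a fresh optimizer state")
        return None
    return load_fn(path)


def resume_point(resume_state: Optional[dict]) -> Tuple[int, int]:
    """Map an optional state dict to (epoch, global_step), defaulting to (0, 0)."""
    if resume_state is None:
        return 0, 0
    return resume_state["epoch"], resume_state["global_step"]


if __name__ == "__main__":
    # The path below does not exist, so the loader is never called.
    state = load_resume_state("missing/training_state_rank_0.pt", load_fn=lambda p: {})
    print(resume_point(state))
```

Keeping the existence check in one helper would let all three training scripts share the same fallback rather than repeating the `if os.path.exists(...)` block.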

gemini-code-assist · 2026-06-20T09:53:51Z

+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        resume_state = torch.load(
+            training_state_path, map_location="cpu", weights_only=False
+        )
+        print(
+            f"Will resume from epoch {resume_state['epoch']}, "
+            f"step {resume_state['global_step']}"
        )


Removing the os.path.exists check for the training state file makes the resume process fragile. If a user attempts to resume training from a checkpoint that only contains model weights (or if the training state files were deleted/not saved), the script will crash with a FileNotFoundError instead of gracefully falling back to starting with a fresh optimizer state. Restoring the existence check ensures robust fallback behavior.

Suggested change

-        training_state_path = get_training_state_path(draft_model_last_checkpoint)
-        resume_state = torch.load(
-            training_state_path, map_location="cpu", weights_only=False
-        )
-        print(
-            f"Will resume from epoch {resume_state['epoch']}, "
-            f"step {resume_state['global_step']}"
-        )
+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        if os.path.exists(training_state_path):
+            resume_state = torch.load(
+                training_state_path, map_location="cpu", weights_only=False
+            )
+            print(
+                f"Will resume from epoch {resume_state['epoch']}, "
+                f"step {resume_state['global_step']}"
+            )

gemini-code-assist · 2026-06-20T09:53:51Z

+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        resume_state = torch.load(
+            training_state_path, map_location="cpu", weights_only=False
+        )
+        print_on_rank0(
+            f"Loaded training state from {training_state_path}: "
+            f"epoch={resume_state['epoch']}, step={resume_state['global_step']}"
        )


Removing the os.path.exists check for the training state file makes the resume process fragile. If a user attempts to resume training from a checkpoint that only contains model weights (or if the training state files were deleted/not saved), the script will crash with a FileNotFoundError instead of gracefully falling back to starting with a fresh optimizer state. Restoring the existence check ensures robust fallback behavior.

Suggested change

-        training_state_path = get_training_state_path(draft_model_last_checkpoint)
-        resume_state = torch.load(
-            training_state_path, map_location="cpu", weights_only=False
-        )
-        print_on_rank0(
-            f"Loaded training state from {training_state_path}: "
-            f"epoch={resume_state['epoch']}, step={resume_state['global_step']}"
-        )
+        training_state_path = get_training_state_path(draft_model_last_checkpoint)
+        if os.path.exists(training_state_path):
+            resume_state = torch.load(
+                training_state_path, map_location="cpu", weights_only=False
+            )
+            print_on_rank0(
+                f"Loaded training state from {training_state_path}: "
+                f"epoch={resume_state['epoch']}, step={resume_state['global_step']}"
+            )

Fix per-rank training state resume

c122b12

gq112 requested review from FlamingoPg, shuaills and sleepcoo as code owners June 20, 2026 09:52

gemini-code-assist Bot reviewed Jun 20, 2026

gq112 wants to merge 1 commit into sgl-project:main from gq112:fix/resume-pr
