feat: add --save-total-limit to auto-cleanup old checkpoints by jxiaof · Pull Request #590 · sgl-project/SpecForge

jxiaof · 2026-06-23T12:07:57Z

Training scripts save a checkpoint every --save-interval steps with no limit on how many are kept. For long runs (e.g. 500K steps with interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk.

This PR adds a --save-total-limit argument to all three training scripts (train_domino, train_dflash, train_eagle3). When set, the oldest checkpoints beyond the limit are automatically removed after each save. Default is None (no cleanup) to preserve backward compatibility.

Changes:

Add cleanup_checkpoints() utility in specforge/utils.py
Add --save-total-limit arg to all three training scripts
Call cleanup_checkpoints() after each periodic save (not after final)
Add comprehensive unit tests in tests/test_utils/test_checkpoint_cleanup.py

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

Training scripts save a checkpoint every --save-interval steps with no limit on how many are kept. For long runs (e.g. 500K steps with interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk. This PR adds a --save-total-limit argument to all three training scripts (train_domino, train_dflash, train_eagle3). When set, the oldest checkpoints beyond the limit are automatically removed after each save. Default is None (no cleanup) to preserve backward compatibility. Changes: - Add cleanup_checkpoints() utility in specforge/utils.py - Add --save-total-limit arg to all three training scripts - Call cleanup_checkpoints() after each periodic save (not after final) - Add comprehensive unit tests in tests/test_utils/test_checkpoint_cleanup.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

gemini-code-assist

Code Review

This pull request introduces a checkpoint cleanup utility to limit the total number of saved checkpoints during training, adding a --save-total-limit argument to the DFlash, Domino, and Eagle3 training scripts. The utility parses, sorts, and removes older checkpoint directories exceeding the specified limit. The review feedback correctly identifies a critical issue in distributed training settings: the cleanup_checkpoints function does not enforce that only rank 0 performs the deletion, which can lead to race conditions and crashes on shared filesystems. A code suggestion is provided to restrict the cleanup logic to rank 0.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-23T12:08:45Z

+    if save_total_limit is None or save_total_limit <= 0:
+        return 0
+
+    if not os.path.isdir(output_dir):
+        return 0


In distributed training settings with a shared filesystem (e.g., NFS, Lustre), having all ranks attempt to delete the same checkpoint directories simultaneously will cause race conditions, leading to FileNotFoundError or OSError on non-zero ranks and crashing the training run.

Although the docstring mentions that only rank 0 performs the deletion, the implementation does not actually enforce this check. Adding a check for dist.is_initialized() and dist.get_rank() != 0 ensures that only rank 0 executes the cleanup, while other ranks return early and wait at the subsequent dist.barrier() in the training scripts.

Suggested change

if save_total_limit is None or save_total_limit <= 0:

return 0

if not os.path.isdir(output_dir):

return 0

if save_total_limit is None or save_total_limit <= 0:

return 0

if dist.is_initialized() and dist.get_rank() != 0:

return 0

if not os.path.isdir(output_dir):

return 0

Copilot

Pull request overview

Adds an optional checkpoint retention limit to SpecForge training scripts to prevent unbounded disk growth by automatically deleting the oldest checkpoints after periodic saves.

Changes:

Introduces cleanup_checkpoints() in specforge/utils.py to delete oldest epoch_*_step_* checkpoint directories beyond a configured limit.
Adds --save-total-limit CLI flag to train_domino.py, train_dflash.py, and train_eagle3.py, and calls cleanup after periodic checkpoint saves.
Adds unit tests covering limit behavior and ignoring non-checkpoint files/dirs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`specforge/utils.py`	Adds `cleanup_checkpoints()` utility for pruning oldest checkpoint directories.
`scripts/train_domino.py`	Adds `--save-total-limit` argument and calls checkpoint cleanup after periodic saves.
`scripts/train_dflash.py`	Adds `--save-total-limit` argument and calls checkpoint cleanup after periodic saves.
`scripts/train_eagle3.py`	Adds `--save-total-limit` argument and calls checkpoint cleanup after periodic saves.
`tests/test_utils/test_checkpoint_cleanup.py`	Adds unit tests for checkpoint cleanup behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if save_total_limit is None or save_total_limit <= 0:
+        return 0
+
+    if not os.path.isdir(output_dir):
+        return 0
+


+    for ckpt in checkpoints[:num_to_remove]:
+        ckpt_path = os.path.join(output_dir, ckpt)
+        shutil.rmtree(ckpt_path)
+        print_on_rank0(f"Removed old checkpoint: {ckpt_path}")
+        removed += 1


+                cleanup_checkpoints(
+                    args.output_dir, args.save_total_limit
+                )
+                dist.barrier()


+                cleanup_checkpoints(
+                    args.output_dir, args.save_total_limit
+                )
+                dist.barrier()


+                cleanup_checkpoints(
+                    args.output_dir, args.save_total_limit
+                )
+                dist.barrier()


Copilot AI review requested due to automatic review settings June 23, 2026 12:07

jxiaof requested review from FlamingoPg, FrankLeeeee, shuaills and sleepcoo as code owners June 23, 2026 12:07

Copilot started reviewing on behalf of jxiaof June 23, 2026 12:08 View session

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

Copilot AI reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add --save-total-limit to auto-cleanup old checkpoints#590

feat: add --save-total-limit to auto-cleanup old checkpoints#590
jxiaof wants to merge 1 commit into
sgl-project:mainfrom
jxiaof:feat/save-total-limit

jxiaof commented Jun 23, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jxiaof commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jxiaof commented Jun 23, 2026 •

edited

Loading