Skip to content

feat: add --save-total-limit to auto-cleanup old checkpoints#590

Open
jxiaof wants to merge 1 commit into
sgl-project:mainfrom
jxiaof:feat/save-total-limit
Open

feat: add --save-total-limit to auto-cleanup old checkpoints#590
jxiaof wants to merge 1 commit into
sgl-project:mainfrom
jxiaof:feat/save-total-limit

Conversation

@jxiaof

@jxiaof jxiaof commented Jun 23, 2026

Copy link
Copy Markdown

Training scripts save a checkpoint every --save-interval steps with no limit on how many are kept. For long runs (e.g. 500K steps with interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk.

This PR adds a --save-total-limit argument to all three training scripts (train_domino, train_dflash, train_eagle3). When set, the oldest checkpoints beyond the limit are automatically removed after each save. Default is None (no cleanup) to preserve backward compatibility.

Changes:

  • Add cleanup_checkpoints() utility in specforge/utils.py
  • Add --save-total-limit arg to all three training scripts
  • Call cleanup_checkpoints() after each periodic save (not after final)
  • Add comprehensive unit tests in tests/test_utils/test_checkpoint_cleanup.py

Motivation

Modifications

Related Issues

Accuracy Test

Benchmark & Profiling

Checklist

Training scripts save a checkpoint every --save-interval steps with no
limit on how many are kept. For long runs (e.g. 500K steps with
interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk.

This PR adds a --save-total-limit argument to all three training scripts
(train_domino, train_dflash, train_eagle3). When set, the oldest
checkpoints beyond the limit are automatically removed after each save.
Default is None (no cleanup) to preserve backward compatibility.

Changes:
- Add cleanup_checkpoints() utility in specforge/utils.py
- Add --save-total-limit arg to all three training scripts
- Call cleanup_checkpoints() after each periodic save (not after final)
- Add comprehensive unit tests in tests/test_utils/test_checkpoint_cleanup.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 23, 2026 12:07

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a checkpoint cleanup utility to limit the total number of saved checkpoints during training, adding a --save-total-limit argument to the DFlash, Domino, and Eagle3 training scripts. The utility parses, sorts, and removes older checkpoint directories exceeding the specified limit. The review feedback correctly identifies a critical issue in distributed training settings: the cleanup_checkpoints function does not enforce that only rank 0 performs the deletion, which can lead to race conditions and crashes on shared filesystems. A code suggestion is provided to restrict the cleanup logic to rank 0.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread specforge/utils.py
Comment on lines +168 to +172
if save_total_limit is None or save_total_limit <= 0:
return 0

if not os.path.isdir(output_dir):
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In distributed training settings with a shared filesystem (e.g., NFS, Lustre), having all ranks attempt to delete the same checkpoint directories simultaneously will cause race conditions, leading to FileNotFoundError or OSError on non-zero ranks and crashing the training run.

Although the docstring mentions that only rank 0 performs the deletion, the implementation does not actually enforce this check. Adding a check for dist.is_initialized() and dist.get_rank() != 0 ensures that only rank 0 executes the cleanup, while other ranks return early and wait at the subsequent dist.barrier() in the training scripts.

Suggested change
if save_total_limit is None or save_total_limit <= 0:
return 0
if not os.path.isdir(output_dir):
return 0
if save_total_limit is None or save_total_limit <= 0:
return 0
if dist.is_initialized() and dist.get_rank() != 0:
return 0
if not os.path.isdir(output_dir):
return 0

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an optional checkpoint retention limit to SpecForge training scripts to prevent unbounded disk growth by automatically deleting the oldest checkpoints after periodic saves.

Changes:

  • Introduces cleanup_checkpoints() in specforge/utils.py to delete oldest epoch_*_step_* checkpoint directories beyond a configured limit.
  • Adds --save-total-limit CLI flag to train_domino.py, train_dflash.py, and train_eagle3.py, and calls cleanup after periodic checkpoint saves.
  • Adds unit tests covering limit behavior and ignoring non-checkpoint files/dirs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
specforge/utils.py Adds cleanup_checkpoints() utility for pruning oldest checkpoint directories.
scripts/train_domino.py Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves.
scripts/train_dflash.py Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves.
scripts/train_eagle3.py Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves.
tests/test_utils/test_checkpoint_cleanup.py Adds unit tests for checkpoint cleanup behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread specforge/utils.py
Comment on lines +168 to +173
if save_total_limit is None or save_total_limit <= 0:
return 0

if not os.path.isdir(output_dir):
return 0

Comment thread specforge/utils.py
Comment on lines +196 to +200
for ckpt in checkpoints[:num_to_remove]:
ckpt_path = os.path.join(output_dir, ckpt)
shutil.rmtree(ckpt_path)
print_on_rank0(f"Removed old checkpoint: {ckpt_path}")
removed += 1
Comment thread scripts/train_domino.py
Comment on lines +665 to +668
cleanup_checkpoints(
args.output_dir, args.save_total_limit
)
dist.barrier()
Comment thread scripts/train_dflash.py
Comment on lines +615 to +618
cleanup_checkpoints(
args.output_dir, args.save_total_limit
)
dist.barrier()
Comment thread scripts/train_eagle3.py
Comment on lines +1356 to +1359
cleanup_checkpoints(
args.output_dir, args.save_total_limit
)
dist.barrier()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants