feat: add --save-total-limit to auto-cleanup old checkpoints#590
Conversation
Training scripts save a checkpoint every --save-interval steps with no limit on how many are kept. For long runs (e.g. 500K steps with interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk. This PR adds a --save-total-limit argument to all three training scripts (train_domino, train_dflash, train_eagle3). When set, the oldest checkpoints beyond the limit are automatically removed after each save. Default is None (no cleanup) to preserve backward compatibility. Changes: - Add cleanup_checkpoints() utility in specforge/utils.py - Add --save-total-limit arg to all three training scripts - Call cleanup_checkpoints() after each periodic save (not after final) - Add comprehensive unit tests in tests/test_utils/test_checkpoint_cleanup.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a checkpoint cleanup utility to limit the total number of saved checkpoints during training, adding a --save-total-limit argument to the DFlash, Domino, and Eagle3 training scripts. The utility parses, sorts, and removes older checkpoint directories exceeding the specified limit. The review feedback correctly identifies a critical issue in distributed training settings: the cleanup_checkpoints function does not enforce that only rank 0 performs the deletion, which can lead to race conditions and crashes on shared filesystems. A code suggestion is provided to restrict the cleanup logic to rank 0.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if save_total_limit is None or save_total_limit <= 0: | ||
| return 0 | ||
|
|
||
| if not os.path.isdir(output_dir): | ||
| return 0 |
There was a problem hiding this comment.
In distributed training settings with a shared filesystem (e.g., NFS, Lustre), having all ranks attempt to delete the same checkpoint directories simultaneously will cause race conditions, leading to FileNotFoundError or OSError on non-zero ranks and crashing the training run.
Although the docstring mentions that only rank 0 performs the deletion, the implementation does not actually enforce this check. Adding a check for dist.is_initialized() and dist.get_rank() != 0 ensures that only rank 0 executes the cleanup, while other ranks return early and wait at the subsequent dist.barrier() in the training scripts.
| if save_total_limit is None or save_total_limit <= 0: | |
| return 0 | |
| if not os.path.isdir(output_dir): | |
| return 0 | |
| if save_total_limit is None or save_total_limit <= 0: | |
| return 0 | |
| if dist.is_initialized() and dist.get_rank() != 0: | |
| return 0 | |
| if not os.path.isdir(output_dir): | |
| return 0 |
There was a problem hiding this comment.
Pull request overview
Adds an optional checkpoint retention limit to SpecForge training scripts to prevent unbounded disk growth by automatically deleting the oldest checkpoints after periodic saves.
Changes:
- Introduces
cleanup_checkpoints()inspecforge/utils.pyto delete oldestepoch_*_step_*checkpoint directories beyond a configured limit. - Adds
--save-total-limitCLI flag totrain_domino.py,train_dflash.py, andtrain_eagle3.py, and calls cleanup after periodic checkpoint saves. - Adds unit tests covering limit behavior and ignoring non-checkpoint files/dirs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
specforge/utils.py |
Adds cleanup_checkpoints() utility for pruning oldest checkpoint directories. |
scripts/train_domino.py |
Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves. |
scripts/train_dflash.py |
Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves. |
scripts/train_eagle3.py |
Adds --save-total-limit argument and calls checkpoint cleanup after periodic saves. |
tests/test_utils/test_checkpoint_cleanup.py |
Adds unit tests for checkpoint cleanup behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if save_total_limit is None or save_total_limit <= 0: | ||
| return 0 | ||
|
|
||
| if not os.path.isdir(output_dir): | ||
| return 0 | ||
|
|
| for ckpt in checkpoints[:num_to_remove]: | ||
| ckpt_path = os.path.join(output_dir, ckpt) | ||
| shutil.rmtree(ckpt_path) | ||
| print_on_rank0(f"Removed old checkpoint: {ckpt_path}") | ||
| removed += 1 |
| cleanup_checkpoints( | ||
| args.output_dir, args.save_total_limit | ||
| ) | ||
| dist.barrier() |
| cleanup_checkpoints( | ||
| args.output_dir, args.save_total_limit | ||
| ) | ||
| dist.barrier() |
| cleanup_checkpoints( | ||
| args.output_dir, args.save_total_limit | ||
| ) | ||
| dist.barrier() |
Training scripts save a checkpoint every --save-interval steps with no limit on how many are kept. For long runs (e.g. 500K steps with interval=2000), this produces 250+ checkpoints consuming 1.1TB+ of disk.
This PR adds a --save-total-limit argument to all three training scripts (train_domino, train_dflash, train_eagle3). When set, the oldest checkpoints beyond the limit are automatically removed after each save. Default is None (no cleanup) to preserve backward compatibility.
Changes:
Motivation
Modifications
Related Issues
Accuracy Test
Benchmark & Profiling
Checklist