Skip to content

refactor(checkpoint-object-manager): deprecate C++ object_manager layer for lightweight Python subprocess async deletion#102

Merged
Leahlijuan merged 5 commits intomainfrom
feat/cleanup-final
Apr 23, 2026
Merged

refactor(checkpoint-object-manager): deprecate C++ object_manager layer for lightweight Python subprocess async deletion#102
Leahlijuan merged 5 commits intomainfrom
feat/cleanup-final

Conversation

@Leahlijuan
Copy link
Copy Markdown
Collaborator

@Leahlijuan Leahlijuan commented Apr 14, 2026

Refactor: deprecate C++ object_manager layer for lightweight Python subprocess async deletion

This patch modernizes the ML-Flashpoint directory deletion lifecycle by fully replacing the native C++ object_manager module with streamlined subprocess.Popen calls. This drastically simplifies the codebase overhead during checkpoint cleanup routines.

@Leahlijuan Leahlijuan changed the title fix(core): Create new python subprocess to delete older checkpoints Refactor: deprecate C++ object_manager layer for lightweight Python subprocess async deletion Apr 14, 2026
@Leahlijuan Leahlijuan requested review from g-husam and kkkapu April 14, 2026 18:56
@g-husam g-husam changed the title Refactor: deprecate C++ object_manager layer for lightweight Python subprocess async deletion refactor(checkpoint-object-manager): deprecate C++ object_manager layer for lightweight Python subprocess async deletion Apr 14, 2026
Comment thread tests/core/test_checkpoint_saver.py Outdated
Comment thread src/ml_flashpoint/core/checkpoint_saver.py
Comment thread src/ml_flashpoint/core/checkpoint_saver.py Outdated
@Leahlijuan Leahlijuan requested a review from g-husam April 21, 2026 14:44
@Leahlijuan Leahlijuan self-assigned this Apr 21, 2026
Comment thread src/ml_flashpoint/core/checkpoint_saver.py Outdated
Comment thread tests/core/test_checkpoint_saver.py Outdated
return p
except Exception as e:
_LOGGER.exception("Background deletion of old checkpoints failed: %s", e)
return None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm ideally we'd at least return something that informs the caller there was an error. Otherwise they can't distinguish between "nothing to do" and "failed to delete".

Maybe bubble up the exception? Or can return a wrapper type encapsulating an Optional proc and an optional error e.g.

@dataclass
class DeletionResult:
    proc: Optional[subprocess.Popen] = None
    error: Optional[Exception] = None
    skipped: bool = False

Not a blocker, but will make the API more explicit.

Then in the failure test, we can confirm there was an error by asserting on the result, as opposed to just expecting proc to be None, which is the same expectation when there is no error and nothing to do.

@github-actions
Copy link
Copy Markdown

Python Code Coverage Summary

Code Coverage

Package Line Rate Branch Rate Health
src.ml_flashpoint 100% 100%
src.ml_flashpoint.adapter 100% 100%
src.ml_flashpoint.adapter.megatron 97% 95%
src.ml_flashpoint.adapter.nemo 98% 93%
src.ml_flashpoint.adapter.pytorch 99% 92%
src.ml_flashpoint.checkpoint_object_manager 93% 93%
src.ml_flashpoint.core 95% 92%
src.ml_flashpoint.replication 81% 81%
Summary 95% (2390 / 2521) 92% (571 / 624)

Minimum allowed line rate is 90%

@github-actions
Copy link
Copy Markdown

C++ Code Coverage Summary

Code Coverage

Package Line Rate Branch Rate Health
src.ml_flashpoint.checkpoint_object_manager.buffer_object 93% 54%
src.ml_flashpoint.replication.transfer_service 79% 41%
Summary 82% (899 / 1097) 43% (668 / 1539)

Minimum allowed line rate is 80%

@Leahlijuan Leahlijuan merged commit 67fbe4a into main Apr 23, 2026
7 checks passed
@Leahlijuan Leahlijuan deleted the feat/cleanup-final branch April 23, 2026 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants