
feat(adapter/nemo_rl): add NeMo RL adapter and wrapper util#78

Draft
g-husam wants to merge 30 commits into main from feature/nemo-rl-adapter

Conversation

@g-husam (Collaborator) commented Mar 13, 2026

This change adds integration support for checkpointing in NeMo RL.

It introduces MLFlashpointRLCheckpointManager, which subtypes CheckpointManager in a deliberately unusual way: it does not re-initialize the parent class, because a CheckpointManager instance already exists by the time users create an MLFlashpointRLCheckpointManager. Instead, it receives that instance in its __init__ and uses composition to re-expose the behavior of that "parent" instance, overriding selected methods.

In particular, it redirects the given policy's save_checkpoint and load_checkpoint methods to a custom implementation that uses waterfall logic: try ML Flashpoint first, then fall back to the regular checkpointing logic if needed.
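The composition-and-interception pattern described above can be sketched with stand-in classes. Everything here is illustrative: CheckpointManager and Policy are minimal stubs, and the mlf_available flag, method bodies, and return strings are assumptions for demonstration, not the adapter's real implementation.

```python
class CheckpointManager:
    """Stand-in for NeMo RL's CheckpointManager (not the real class)."""
    def save_checkpoint(self, step):
        return f"standard-save@{step}"

class Policy:
    """Stand-in for a NeMo RL policy with native checkpoint methods."""
    def save_checkpoint(self, path):
        return f"saved-to-{path}"
    def load_checkpoint(self, path):
        return f"loaded-from-{path}"

class MLFlashpointRLCheckpointManager:
    """Wraps an already-constructed CheckpointManager via composition
    instead of calling the parent __init__ again."""
    def __init__(self, base_checkpointer, policy, mlf_available=False):
        self._base = base_checkpointer
        self._mlf_available = mlf_available  # hypothetical stand-in for MLF tier state
        # Keep the policy's original methods so the waterfall can fall back.
        self._orig_save = policy.save_checkpoint
        self._orig_load = policy.load_checkpoint
        # Intercept the policy's native methods with waterfall versions.
        policy.save_checkpoint = self._mlf_save
        policy.load_checkpoint = self._mlf_load

    def __getattr__(self, name):
        # Re-expose everything else from the wrapped "parent" instance.
        return getattr(self._base, name)

    def _mlf_save(self, path):
        # Waterfall: ML Flashpoint first, then the original logic.
        if self._mlf_available:
            return f"mlf-saved-{path}"
        return self._orig_save(path)

    def _mlf_load(self, path):
        if self._mlf_available:
            return f"mlf-loaded-{path}"
        return self._orig_load(path)
```

Because the wrapper defines no save_checkpoint of its own, calls like manager.save_checkpoint(step) fall through __getattr__ to the wrapped instance, which is how composition re-exposes the "parent" behavior without re-initialization.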

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates ML Flashpoint with the NeMo RL framework, allowing users to leverage fast, in-memory checkpointing alongside traditional persistent saves. The new adapter transparently manages checkpointing by intercepting NeMo RL's native save mechanisms, providing a flexible and efficient way to handle model state during reinforcement learning training. This enhancement aims to improve training resilience and recovery without requiring significant modifications to existing NeMo RL workflows.

Highlights

  • NeMo RL Adapter: Introduced a new adapter for the NeMo RL framework, enabling seamless integration with ML Flashpoint for enhanced checkpointing capabilities.
  • Dual Checkpoint Management: Implemented a specialized MLFlashpointRLCheckpointManager that coordinates between frequent ML Flashpoint saves (to tmpfs) and less frequent standard saves (to long-term storage).
  • Simplified Integration: Provided a wrap_rl_components_with_mlflashpoint utility function to simplify the injection of ML Flashpoint's dual-checkpointing logic into existing NeMo RL training scripts.
  • Documentation and Testing: Added comprehensive documentation in the user guide for NeMo RL integration and included unit tests to ensure the correctness and robustness of the new adapter.
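The dual-cadence idea in the highlights (frequent ML Flashpoint saves to tmpfs, less frequent standard saves to long-term storage) amounts to a period check at each training step. A minimal sketch, where the function name, period values, and tier labels are illustrative assumptions rather than the adapter's real API:

```python
def choose_save_tier(step, mlf_save_period=10, standard_save_period=100):
    """Return which checkpoint tiers fire at a given training step.

    Hypothetical helper: frequent "mlf" saves go to in-memory tmpfs,
    infrequent "standard" saves go to durable long-term storage.
    Assumes a standard save supersedes an MLF save at the same step.
    """
    tiers = []
    if step % standard_save_period == 0:
        tiers.append("standard")  # durable, slower path
    elif step % mlf_save_period == 0:
        tiers.append("mlf")       # in-memory tmpfs, fast path
    return tiers
```

With these defaults, steps 10, 20, ... trigger fast MLF saves while only every 100th step pays the cost of a standard save.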

Changelog
  • docs/user-guide.md
    • Added a new section 'NeMo RL' detailing how to integrate ML Flashpoint, including import statements, recipe changes, and important limitations/requisites for configuration.
  • src/ml_flashpoint/adapter/nemo_rl/__init__.py
    • Created the __init__.py file to establish the nemo_rl adapter package and expose MLFlashpointRLCheckpointManager.
  • src/ml_flashpoint/adapter/nemo_rl/checkpoint_manager.py
    • Implemented the MLFlashpointRLCheckpointManager class, which extends NeMo RL's CheckpointManager to manage both standard and ML Flashpoint saves.
    • Overrode init_tmp_checkpoint to direct saves based on standard_save_period.
    • Intercepted the policy's save_checkpoint method to route ML Flashpoint saves through save_local_aware_megatron_checkpoint.
    • Added logic to get_latest_checkpoint_path to compare and return the freshest checkpoint between standard and ML Flashpoint saves.
  • src/ml_flashpoint/adapter/nemo_rl/wrapper_util.py
    • Added wrap_rl_components_with_mlflashpoint function to provide a high-level utility for wrapping NeMo RL's CheckpointManager and policy with the ML Flashpoint dual manager.
  • tests/adapter/nemo_rl/test_checkpoint_manager.py
    • Added MockPolicy class for testing purposes.
    • Included fixtures for mocking base_checkpointer, save_strategy, and checkpoint_loader.
    • Implemented tests for wrap_rl_components_with_mlflashpoint utility.
    • Added tests for attribute delegation to the base checkpointer.
    • Created tests to verify correct behavior for standard and ML Flashpoint save periods, including policy interception and checkpoint finalization.
    • Added tests for model eval/train mode toggling during MLF saves.
    • Included tests for get_best_checkpoint_path, get_latest_checkpoint_path, load_training_info, and remove_old_checkpoints delegation.
    • Added tests for error handling and optional policy attributes during save_checkpoint.
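The get_latest_checkpoint_path change in the changelog compares the standard and ML Flashpoint candidates and returns the fresher one. A hedged sketch of that comparison, assuming (purely for illustration) that the step number is encoded in each path as step_<N>; the real path layout and function signature may differ:

```python
import re

def latest_checkpoint(standard_path, mlf_path):
    """Return whichever candidate checkpoint path is fresher.

    Either side may be missing (None); freshness is judged by the
    step number embedded in the path (an assumed naming scheme).
    """
    def step_of(path):
        if path is None:
            return -1  # a missing candidate always loses
        m = re.search(r"step_(\d+)", path)
        return int(m.group(1)) if m else -1
    # max() with a key returns the first argument on ties, so two
    # missing candidates yield None.
    return max(standard_path, mlf_path, key=step_of)
```

This lets recovery pick up a newer tmpfs MLF checkpoint when one exists, while still falling back to the last standard save otherwise.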
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor) left a comment

Code Review

The pull request introduces ML Flashpoint integration for the NeMo RL framework, including a new MLFlashpointRLCheckpointManager for dual checkpointing (frequent MLF saves to tmpfs and infrequent standard saves) and a wrapper_util function to facilitate this integration. The documentation has been updated with usage instructions. Review comments highlight three areas for improvement: adding upfront validation for the save_strategy parameter in wrap_rl_components_with_mlflashpoint to prevent runtime errors, correcting a missing checkpoint_loader argument in the user guide's example code, and fixing an incorrect expected path in a unit test for MLF checkpoint saving.

Comment thread src/ml_flashpoint/adapter/nemo_rl/wrapper_util_rl.py
Comment thread docs/user-guide.md
Comment thread tests/adapter/nemo_rl/test_checkpoint_manager.py Outdated
@g-husam force-pushed the feature/nemo-rl-adapter branch 2 times, most recently from d16556e to cd345ff on March 25, 2026 03:07
@g-husam force-pushed the feature/nemo-rl-adapter branch from 840c26a to 76dabfd on March 26, 2026 23:40
g-husam and others added 21 commits March 27, 2026 15:20
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@g-husam force-pushed the feature/nemo-rl-adapter branch from 402678f to 22c35dc on March 27, 2026 15:21
g-husam added 3 commits March 28, 2026 01:26
- Switch build-nemo-rl to use nvcr.io/nvidia/nemo-rl:v0.5.0 as the base environment.
- Use a login shell (bash -l) to ensure container profiles and virtual environments are correctly loaded.
- Unify standard and nemo-rl builds into a single parameterized 'build' job to reduce duplication.
- Empty the 'nemo-rl' optional dependency in pyproject.toml, as dependencies are pre-installed in the container.
- Add explanatory comments for non-obvious environment configurations.
@github-actions

Python Code Coverage Summary

Code Coverage

Package | Line Rate | Branch Rate
src.ml_flashpoint | 100% | 100%
src.ml_flashpoint.adapter | 100% | 100%
src.ml_flashpoint.adapter.megatron | 97% | 95%
src.ml_flashpoint.adapter.nemo | 98% | 94%
src.ml_flashpoint.adapter.pytorch | 99% | 92%
src.ml_flashpoint.checkpoint_object_manager | 93% | 93%
src.ml_flashpoint.core | 95% | 92%
src.ml_flashpoint.replication | 81% | 81%
Summary | 95% (2335 / 2464) | 92% (549 / 600)

Minimum allowed line rate is 90%

@github-actions
Copy link
Copy Markdown

C++ Code Coverage Summary

Code Coverage

Package | Line Rate | Branch Rate
src.ml_flashpoint.checkpoint_object_manager.buffer_object | 93% | 54%
src.ml_flashpoint.checkpoint_object_manager.object_manager | 70% | 37%
src.ml_flashpoint.replication.transfer_service | 79% | 40%
Summary | 81% (916 / 1126) | 43% (687 / 1604)

Minimum allowed line rate is 80%

