feat(adapter/nemo_rl): add NeMo RL adapter and wrapper util#78
Conversation
Summary of Changes (Gemini Code Assist): This pull request integrates ML Flashpoint with the NeMo RL framework, allowing users to leverage fast, in-memory checkpointing alongside traditional persistent saves. The new adapter transparently manages checkpointing by intercepting NeMo RL's native save mechanisms, providing a flexible and efficient way to handle model state during reinforcement learning training. This enhancement aims to improve training resilience and recovery without requiring significant modifications to existing NeMo RL workflows.
Code Review
The pull request introduces ML Flashpoint integration for the NeMo RL framework, including a new MLFlashpointRLCheckpointManager for dual checkpointing (frequent MLF saves to tmpfs and infrequent standard saves) and a wrapper_util function to facilitate this integration. The documentation has been updated with usage instructions. Review comments highlight three areas for improvement: adding upfront validation for the save_strategy parameter in wrap_rl_components_with_mlflashpoint to prevent runtime errors, correcting a missing checkpoint_loader argument in the user guide's example code, and fixing an incorrect expected path in a unit test for MLF checkpoint saving.
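The first review point, upfront validation of the `save_strategy` parameter, might look like the following sketch. The function name comes from the review; the accepted strategy values and the signature are illustrative assumptions, not the actual ML Flashpoint API.

```python
# Hypothetical sketch of fail-fast validation for
# wrap_rl_components_with_mlflashpoint. The allowed values below are
# assumptions for illustration only.
VALID_SAVE_STRATEGIES = {"mlf_only", "dual", "standard_only"}


def wrap_rl_components_with_mlflashpoint(policy, checkpointer, save_strategy="dual"):
    # Reject an unknown strategy immediately, instead of failing later
    # at save time mid-training.
    if save_strategy not in VALID_SAVE_STRATEGIES:
        raise ValueError(
            f"save_strategy must be one of {sorted(VALID_SAVE_STRATEGIES)}, "
            f"got {save_strategy!r}"
        )
    # ... proceed to intercept policy.save_checkpoint / load_checkpoint
```

Validating eagerly at wrap time keeps the error close to the misconfiguration, which is what the review comment asks for.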
Force-pushed from d16556e to cd345ff
Force-pushed from 840c26a to 76dabfd
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… streamline CI pipeline
… and fix unit test expected paths
…lict with nemo_rl
Force-pushed from 402678f to 22c35dc
- Switch build-nemo-rl to use nvcr.io/nvidia/nemo-rl:v0.5.0 as the base environment.
- Use a login shell (bash -l) to ensure container profiles and virtual environments are correctly loaded.
- Unify standard and nemo-rl builds into a single parameterized 'build' job to reduce duplication.
- Empty the 'nemo-rl' optional dependency in pyproject.toml, as dependencies are pre-installed in the container.
- Add explanatory comments for non-obvious environment configurations.
Python Code Coverage Summary
Minimum allowed line rate is |
C++ Code Coverage Summary
Minimum allowed line rate is |
This change adds integration support for checkpointing in NeMo RL.

It introduces an `MLFlashpointRLCheckpointManager`, which subtypes `CheckpointManager` in a bespoke way: it does not re-initialize the parent `CheckpointManager`, since that is already instantiated by the time users create an `MLFlashpointRLCheckpointManager`. Instead, it receives an instance of its parent in its `__init__` and uses composition to re-expose the behavior of that "parent" instance, overriding certain behaviors.

Namely, it intercepts the given policy's `save_checkpoint` and `load_checkpoint` methods with a custom implementation that uses waterfall logic: try ML Flashpoint first, then fall back to the regular checkpointing logic if needed.
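The composition-plus-waterfall pattern described above can be sketched as follows. Everything here is a minimal illustration: the `CheckpointManager` stand-in, the MLF helper methods, and the cadence parameters (`mlf_every_n_steps`, `standard_every_n_steps`) are assumed names, not the actual NeMo RL or ML Flashpoint API.

```python
class CheckpointManager:
    """Stand-in for NeMo RL's already-instantiated checkpoint manager."""

    def save_checkpoint(self, step, state):
        return f"standard-save@{step}"

    def load_checkpoint(self):
        return {"step": 0, "source": "standard"}


class MLFlashpointRLCheckpointManager:
    """Wraps an existing CheckpointManager via composition (no re-init)."""

    def __init__(self, parent, standard_every_n_steps=100):
        self._parent = parent  # received, not re-initialized
        self._standard_every = standard_every_n_steps

    def __getattr__(self, name):
        # Re-expose any behavior we don't override on the "parent" instance.
        return getattr(self._parent, name)

    def _mlf_save(self, step, state):
        # Placeholder for a fast, in-memory (tmpfs) ML Flashpoint save.
        return f"mlf-save@{step}"

    def _mlf_load(self):
        # Placeholder: None means no MLF checkpoint is available.
        return None

    def save_checkpoint(self, step, state):
        # Dual checkpointing: frequent MLF saves, infrequent standard saves.
        results = [self._mlf_save(step, state)]
        if step % self._standard_every == 0:
            results.append(self._parent.save_checkpoint(step, state))
        return results

    def load_checkpoint(self):
        # Waterfall: try ML Flashpoint first, then regular checkpointing.
        state = self._mlf_load()
        return state if state is not None else self._parent.load_checkpoint()
```

The `__getattr__` delegation is what lets the wrapper override only `save_checkpoint`/`load_checkpoint` while everything else falls through to the parent instance.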