Opaque HTTP rollout endpoint with publish-only delta sync #5
Open
nanjiangwill wants to merge 1 commit into
Add --rollout-endpoint-url to train against an elastic rollout fleet behind a single opaque HTTP endpoint (no per-engine handles). Generation routes to the URL; the disk-delta updater publishes each version to the shared disk dir and advances a `latest` pointer for the fleet to pull, instead of pushing via per-engine RPCs.

Add-only: managed and external-addressed paths are unchanged. Requires delta mode + disk transport (the only cross-cluster channel) and is non-colocate.

- abort: the endpoint cancels surplus when discarding; with --partial-rollout it drains so streaming tasks return their partials (closing each stream disconnects, which aborts the fleet-side request). Endpoint + --partial-rollout requires the streaming rollout, the only path that captures partials client-side.
- streaming: record the weight version on an aborted partial so off-policy correction can weight it (update_from_meta_info is skipped without a finish_reason).
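The publish-then-advance step described in the commit message can be sketched as follows. This is a minimal illustration, not the PR's updater: the directory layout (`v00000001/delta.bin`), the plain-text `latest` file, and the helper names `publish_version` and `read_latest` are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def publish_version(disk_dir: Path, version: int, payload: bytes) -> Path:
    """Write one version's delta, then advance the `latest` pointer.

    Hypothetical layout: <disk_dir>/v<version>/delta.bin plus a text file
    <disk_dir>/latest holding the name of the newest complete version.
    """
    version_dir = disk_dir / f"v{version:08d}"
    version_dir.mkdir(parents=True, exist_ok=False)
    (version_dir / "delta.bin").write_bytes(payload)

    # Write the new pointer to a temp file in the same directory, then rename
    # it over `latest`, so a reader never sees a half-written pointer.
    fd, tmp_name = tempfile.mkstemp(dir=disk_dir, prefix=".latest.")
    with os.fdopen(fd, "w") as f:
        f.write(version_dir.name)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, disk_dir / "latest")
    return version_dir


def read_latest(disk_dir: Path) -> Optional[str]:
    """Fleet-side read: return the newest published version name, if any."""
    try:
        return (disk_dir / "latest").read_text().strip()
    except FileNotFoundError:
        return None
```

Because the payload is written before the pointer moves, a puller that follows `latest` in this sketch only ever lands on a version directory that is already complete. Whether a rename is atomic on a given shared filesystem is a deployment question the sketch does not settle.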
Stacked on the disk-level delta weight sync branch. Adds `--rollout-endpoint-url` to train against an elastic rollout fleet behind a single opaque HTTP endpoint: no per-engine handles, no router worker APIs.

Three things follow from that one flag:

- Generation routes to the URL (`get_model_url`); the rollout server holds no engines (reuses `ExternalRolloutServer` with empty engines); placement allocates 0 rollout GPUs.
- The disk-delta updater publishes each version to `--update-weight-disk-dir` and advances a `latest` pointer (via the existing pre-push commit hook), skipping the per-engine `update_weights_from_disk`/pause/resume RPCs. The fleet pulls and hot-loads on its own.
- The endpoint cancels surplus requests when discarding; with `--partial-rollout` the streaming generation tasks self-break and return the partial trajectories for resumption (each tagged with the weight version it stopped at).

The weight version each trajectory was generated with already flows back via sglang's `meta_info["weight_version"]` into `Sample.weight_versions`, so staleness handling stays the algorithm's concern and is unchanged here.

Add-only: managed and external-addressed paths are byte-for-byte unchanged. Requires `--update-weight-mode delta --update-weight-transport disk` (disk is the only cross-cluster channel; full checkpoints are too large, hence deltas) and is non-colocate; `--partial-rollout` requires the streaming generation path.

Follow-ups (not in this PR): CPU-mockable unit tests for the new path, docs (external-rollout-engines / delta-weight-sync), and an e2e test against a mock pulling endpoint.
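The flag constraints listed above could be checked up front along these lines. This is only a sketch of the rules as stated in the description, not the PR's validation code: the `RolloutArgs` dataclass, the `colocate` and `streaming_rollout` fields, and the default values `"full"` and `"nccl"` are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RolloutArgs:
    # Fields mirror flags named in the description. `colocate`,
    # `streaming_rollout`, and the defaults below are assumptions.
    rollout_endpoint_url: Optional[str] = None
    update_weight_mode: str = "full"
    update_weight_transport: str = "nccl"
    partial_rollout: bool = False
    colocate: bool = False
    streaming_rollout: bool = False


def endpoint_mode_errors(args: RolloutArgs) -> List[str]:
    """Return every constraint violated when the endpoint flag is set."""
    if args.rollout_endpoint_url is None:
        return []  # add-only: other paths are not affected by these checks
    errors = []
    if args.update_weight_mode != "delta":
        errors.append("--rollout-endpoint-url requires --update-weight-mode delta")
    if args.update_weight_transport != "disk":
        errors.append("--rollout-endpoint-url requires --update-weight-transport disk")
    if args.colocate:
        errors.append("--rollout-endpoint-url is non-colocate")
    if args.partial_rollout and not args.streaming_rollout:
        errors.append("--partial-rollout with an endpoint requires the streaming path")
    return errors
```

Collecting all violations instead of raising on the first one lets a single failed launch report every flag that needs to change.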