
Opaque HTTP rollout endpoint with publish-only delta sync #5

Open
nanjiangwill wants to merge 1 commit into disk-delta-weight-sync from disaggregated-rollout


nanjiangwill commented Jun 16, 2026


Stacked on the disk-level delta weight sync branch. Adds --rollout-endpoint-url to train against an elastic rollout fleet behind a single opaque HTTP endpoint — no per-engine handles, no router worker APIs.

Three things follow from that one flag:

  • Generation routes to the URL (get_model_url); the rollout server holds no engines (reuses ExternalRolloutServer with empty engines); placement allocates 0 rollout GPUs.
  • Weights are published, not pushed: the disk-delta updater writes each version to --update-weight-disk-dir and advances a latest pointer (via the existing pre-push commit hook), skipping the per-engine update_weights_from_disk/pause/resume RPCs. The fleet pulls and hot-loads on its own.
  • Abort cannot query a router worker list, because an opaque endpoint exposes none. When surplus samples are discarded, it cancels slime's local pending requests, and the resulting client disconnect aborts the fleet-side requests. With --partial-rollout, the streaming generation tasks break out on their own and return their partial trajectories for resumption, each tagged with the weight version it stopped at.
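
To illustrate the two abort behaviours in the last bullet, here is a minimal asyncio sketch. It is not the code in this PR: `stream_tokens` and `abort_pending` are hypothetical names, and the "stream" is simulated with `asyncio.sleep` rather than a real HTTP connection. With a real client, cancelling the task would close the connection, which is what the fleet observes as a disconnect.

```python
import asyncio
from typing import List


async def stream_tokens(name: str, stop: asyncio.Event) -> List[str]:
    """Hypothetical stand-in for one streaming generation task."""
    tokens: List[str] = []
    for i in range(1000):
        if stop.is_set():
            break  # self-break: return what has been generated so far
        tokens.append(f"{name}-{i}")
        await asyncio.sleep(0.001)  # simulates waiting for the next chunk
    return tokens


async def abort_pending(tasks, stop: asyncio.Event, partial_rollout: bool):
    """Drain partials when partial_rollout is set, otherwise cancel and discard."""
    if partial_rollout:
        stop.set()  # tasks notice the flag and return their partial output
        return list(await asyncio.gather(*tasks))
    for task in tasks:
        task.cancel()  # local cancellation; a real client would disconnect here
    await asyncio.gather(*tasks, return_exceptions=True)
    return []
```

The drain path waits for each task to return, so partials are never lost to cancellation; the discard path keeps nothing.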

The weight version each trajectory was generated with already flows back via sglang's meta_info["weight_version"] into Sample.weight_versions, so staleness handling stays the algorithm's concern — unchanged here.

Add-only: managed and external-addressed paths are byte-for-byte unchanged. Requires --update-weight-mode delta --update-weight-transport disk (disk is the only cross-cluster channel; full checkpoints are too large, hence deltas) and is non-colocate; --partial-rollout requires the streaming generation path.
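
The constraints above could be checked at argument-parsing time roughly as follows. This is a sketch under assumptions, not the validation in this PR: `RolloutArgs` is a hypothetical stand-in for the parsed arguments, the attribute names `colocate` and `streaming_rollout` are guesses, and the default values for mode and transport are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RolloutArgs:
    """Hypothetical stand-in for the parsed CLI namespace."""
    rollout_endpoint_url: Optional[str] = None  # --rollout-endpoint-url
    update_weight_mode: str = "full"            # --update-weight-mode (placeholder default)
    update_weight_transport: str = "nccl"       # --update-weight-transport (placeholder default)
    colocate: bool = False                      # assumed attribute name
    partial_rollout: bool = False               # --partial-rollout
    streaming_rollout: bool = False             # assumed attribute name


def endpoint_mode_errors(args: RolloutArgs) -> List[str]:
    """Return every violated constraint; an empty list means the combination is valid."""
    if args.rollout_endpoint_url is None:
        return []  # managed and external-addressed paths are unaffected
    errors: List[str] = []
    if args.update_weight_mode != "delta":
        errors.append("--rollout-endpoint-url requires --update-weight-mode delta")
    if args.update_weight_transport != "disk":
        errors.append("--rollout-endpoint-url requires --update-weight-transport disk")
    if args.colocate:
        errors.append("--rollout-endpoint-url is non-colocate only")
    if args.partial_rollout and not args.streaming_rollout:
        errors.append("--partial-rollout with an endpoint requires the streaming generation path")
    return errors
```

Collecting all violations instead of failing on the first lets a user fix a bad command line in one pass.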

Follow-ups (not in this PR): CPU-mockable unit tests for the new path, docs (external-rollout-engines / delta-weight-sync), and an e2e test against a mock pulling endpoint.

Add --rollout-endpoint-url to train against an elastic rollout fleet behind a single
opaque HTTP endpoint (no per-engine handles). Generation routes to the URL; the
disk-delta updater publishes each version to the shared disk dir and advances a
`latest` pointer for the fleet to pull, instead of pushing via per-engine RPCs.
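
One way to picture the publish step is the sketch below. It is an assumption-laden illustration, not the updater's actual on-disk format: the `v00000001/delta.bin` layout, the JSON pointer contents, and the function names are all hypothetical. The point it shows is ordering: the version is fully written before `latest` is swapped, so a puller never follows the pointer to an incomplete version.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def publish_version(disk_dir: Path, version: int, payload: bytes) -> Path:
    """Write one version (hypothetical layout), then advance the `latest` pointer."""
    version_dir = disk_dir / f"v{version:08d}"
    version_dir.mkdir(parents=True, exist_ok=False)
    (version_dir / "delta.bin").write_bytes(payload)

    # Write the pointer to a temp file in the same directory, then rename it
    # over `latest`. On a local POSIX filesystem the rename is atomic; shared
    # or network filesystems may give weaker guarantees.
    fd, tmp_name = tempfile.mkstemp(dir=disk_dir, prefix=".latest.")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"version": version, "path": version_dir.name}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, disk_dir / "latest")
    return version_dir


def read_latest(disk_dir: Path) -> Optional[dict]:
    """Fleet-side view: return the pointer contents, or None before the first publish."""
    pointer = disk_dir / "latest"
    if not pointer.exists():
        return None
    return json.loads(pointer.read_text())
```

A puller would poll `read_latest`, compare the version with the one it has loaded, and hot-load the referenced directory when it changes.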

Add-only: managed and external-addressed paths are unchanged. Requires delta mode +
disk transport (the only cross-cluster channel) and is non-colocate.

- abort: the endpoint cancels surplus when discarding; with --partial-rollout it
  drains so streaming tasks return their partials (closing each stream disconnects,
  which aborts the fleet-side request). Endpoint + --partial-rollout requires the
  streaming rollout, the only path that captures partials client-side.
- streaming: record the weight version on an aborted partial so off-policy correction
  can weight it (update_from_meta_info is skipped without a finish_reason).
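
The streaming fix in the last bullet can be sketched as below. This is an illustration only: `PartialSample` is a hypothetical, trimmed-down stand-in for slime's `Sample`, `record_version_on_abort` is an invented name, and skipping a version equal to the last recorded one is an assumption about the desired behaviour.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PartialSample:
    """Hypothetical stand-in for slime's Sample."""
    response: str = ""
    weight_versions: List[str] = field(default_factory=list)
    finish_reason: Optional[Any] = None


def record_version_on_abort(sample: PartialSample, last_meta_info: Dict[str, Any]) -> bool:
    """Tag an aborted partial with the weight version it stopped at.

    Returns True when the sample is an aborted partial that carries a version.
    """
    if last_meta_info.get("finish_reason") is not None:
        return False  # finished normally; the regular meta_info update covers it
    version = last_meta_info.get("weight_version")
    if version is None:
        return False  # nothing to record
    # Assumption: avoid appending the same version twice in a row.
    if not sample.weight_versions or sample.weight_versions[-1] != version:
        sample.weight_versions.append(version)
    return True
```

With the version recorded, a resumed trajectory carries the full list of versions it was generated under, which is what an off-policy correction needs.
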
@nanjiangwill nanjiangwill force-pushed the disaggregated-rollout branch from ec8b0b2 to 4536d5d Compare June 17, 2026 19:30