
Publish-only delta sync, re-derived on the disk-delta base #6

Closed
nanjiangwill wants to merge 9 commits into jvmncs/rollout-endpoint from disaggregated-rollout


@nanjiangwill


Clean re-derivation of this branch's publish-only weight sync, rebuilt on top of the disk-level delta weight sync (THUDM#2089) instead of the old update_weight_from_distributed_delta.py that THUDM#2089 removes.

The large diff vs this branch comes from the base difference: it carries all of THUDM#2089's delta rewrite plus the disaggregated-rollout feature. The publish-only logic itself is ~90 lines (see #5). On THUDM#2089's clean disk-delta base, "publish" just writes the version and advances a latest pointer, skipping the engine RPCs.

Dropped relative to this branch:

  • --update-weight-delta-publish-wait next-sync/sync dual mode (publish is synchronous: write + announce, no in-flight drain).
  • --update-weight-delta-root (unused).
  • the per-request version-pin / request-mutation hook — sglang already returns the version in meta_info["weight_version"] (captured into Sample.weight_versions), and routing/affinity belong to the fleet's router.
  • the second publish hook (reuses the existing pre-push commit hook) and the force-connect special-casing.

Generation routing, placement, and abort follow the existing external-engine pattern (ExternalRolloutServer). Intended to supersede the publish-only approach here once THUDM#2089 lands.

Ship only the changed bytes between weight syncs as a canonical HF delta
checkpoint; rollout hosts apply it into a host-local checkpoint and reload via
the vanilla update_weights_from_disk path. Replaces the NCCL delta transport
from THUDM#1806 with a disk-only path that needs no engine-side delta support.
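
A minimal sketch of the delta idea in this commit, not the implementation in THUDM#2089. It assumes a checkpoint can be modeled as a mapping from tensor name to raw bytes and that the delta is taken at whole-tensor granularity; `compute_delta`, `apply_delta`, and the tensor names are hypothetical.

```python
from typing import Dict, Tuple

# Stand-in for an HF checkpoint: tensor name -> raw tensor bytes (assumption).
Checkpoint = Dict[str, bytes]


def compute_delta(prev: Checkpoint, new: Checkpoint) -> Tuple[Checkpoint, float]:
    """Return the tensors that changed since `prev`, plus the changed-byte density."""
    delta = {name: data for name, data in new.items() if prev.get(name) != data}
    total_bytes = sum(len(data) for data in new.values())
    changed_bytes = sum(len(data) for data in delta.values())
    density = changed_bytes / total_bytes if total_bytes else 0.0
    return delta, density


def apply_delta(local: Checkpoint, delta: Checkpoint) -> Checkpoint:
    """Overlay the delta onto a host-local checkpoint without mutating the input."""
    merged = dict(local)
    merged.update(delta)
    return merged


# Trainer side: only lm_head.weight changed between two syncs.
base = {"embed.weight": b"\x00" * 8, "lm_head.weight": b"\x01" * 8}
step1 = {"embed.weight": b"\x00" * 8, "lm_head.weight": b"\x02" * 8}
delta, density = compute_delta(base, step1)

# Rollout-host side: the local checkpoint now matches the trainer's weights and
# could be reloaded through the ordinary full-checkpoint path.
local = apply_delta(base, delta)
```

In this toy case the delta carries one of two equally sized tensors, so the density is 0.5; that ratio is the kind of quantity the delta-specific metrics (density, wire bytes) would report.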

sync_local_checkpoint (was sync_weights) materializes the base lazily via the
idempotent init_local_checkpoint instead of a background thread; record per-sync
update time in update_weight_metrics; state the pre-read/pre-push hooks' purpose
(non-POSIX filesystem coherence).

The actor's update_weights is already @timer-wrapped (perf/update_weights_time),
so the per-sync total/publish/reload breakdown was duplicate instrumentation.
Keep only the delta-specific metrics (density, wire bytes).

The delta scaffold reworked the update-weight args: delta requires
--update-weight-transport=disk (was nccl-or-disk), needs
--update-weight-local-checkpoint-dir, and the --update-weight-delta-dir
compatibility alias is gone (the directory belongs to the transport, not the
encoding). Drop the alias resolve/backfill/conflict tests, point the transport
and colocate tests at the disk path, and cover the local-checkpoint requirement.

With the delta-dir alias gone, _resolve_update_weight_disk_dir no longer
normalizes anything; it's a single transport-level check, so fold it into
_validate_update_weight_args.

slime_validate_args validates everything else inline; the extracted
_validate_update_weight_args was the lone exception. Fold it in and test it
the same way as the other slime_validate_args checks (make_slime_validate_args).
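
The argument constraints from the commits above can be sketched roughly as follows. This is an illustration, not the body of slime_validate_args: the attribute names follow the usual argparse mapping of the flags named here, while the `update_weight_delta` toggle and the function name `validate_update_weight_args_sketch` are hypothetical.

```python
from argparse import Namespace


def validate_update_weight_args_sketch(args: Namespace) -> None:
    """Reject delta configurations that the disk-only path cannot serve."""
    # Hypothetical toggle: the real flag that enables delta mode is not shown here.
    if not getattr(args, "update_weight_delta", False):
        return  # full-weight sync: nothing delta-specific to check
    if args.update_weight_transport != "disk":
        raise ValueError("delta weight sync requires --update-weight-transport=disk")
    if not args.update_weight_local_checkpoint_dir:
        raise ValueError(
            "delta weight sync requires --update-weight-local-checkpoint-dir"
        )


# A configuration that satisfies both requirements passes silently.
validate_update_weight_args_sketch(
    Namespace(
        update_weight_delta=True,
        update_weight_transport="disk",
        update_weight_local_checkpoint_dir="/tmp/local-ckpt",
    )
)
```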

Materialize the host-local checkpoint in a daemon thread at engine init so the
one-time base copy overlaps sglang launch and the first rollout (which serves
from init-loaded weights) instead of blocking the first delta reload. The first
sync_local_checkpoint's init_local_checkpoint is idempotent and flock-guarded,
so it either finds the copy done or blocks on the same lock, and no join is needed.
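
The "idempotent and flock-guarded" pattern described above might look like the sketch below. It is an assumption-laden illustration rather than the branch's code: the `.complete` marker file, the `.lock` sibling path, and the return value are all hypothetical, and `fcntl.flock` limits it to Unix hosts.

```python
import fcntl
import os
import shutil
import threading


def init_local_checkpoint(base_dir: str, local_dir: str) -> bool:
    """Copy base_dir to local_dir once; return True only for the call that copied."""
    lock_path = local_dir + ".lock"                # hypothetical lock file location
    marker = os.path.join(local_dir, ".complete")  # hypothetical completion marker
    with open(lock_path, "w") as lock_file:
        # Each open() has its own open file description, so a second caller
        # (thread or process) blocks here until the first releases the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker):
                return False  # someone already finished the copy
            shutil.rmtree(local_dir, ignore_errors=True)  # discard a partial copy
            shutil.copytree(base_dir, local_dir)
            with open(marker, "w"):
                pass  # the marker is written last, so it implies a full copy
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def start_background_init(base_dir: str, local_dir: str) -> threading.Thread:
    """Kick off the one-time copy without blocking engine startup."""
    thread = threading.Thread(
        target=init_local_checkpoint, args=(base_dir, local_dir), daemon=True
    )
    thread.start()
    return thread
```

A later sync can call `init_local_checkpoint` directly: it either sees the marker and returns immediately, or waits on the lock while the background copy finishes, which is why the caller never needs to join the thread.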

@nanjiangwill force-pushed the jvmncs/rollout-endpoint branch from 1ab3399 to 1499c3e on June 16, 2026 22:33
@nanjiangwill force-pushed the disaggregated-rollout branch from d70b68a to f506c33 on June 17, 2026 01:11

Add --rollout-endpoint-url to train against an elastic rollout fleet behind a single
opaque HTTP endpoint (no per-engine handles). Generation routes to the URL; the
disk-delta updater publishes each version to the shared disk dir and advances a
`latest` pointer for the fleet to pull, instead of pushing via per-engine RPCs.

Add-only: managed and external-addressed paths are unchanged. Requires delta mode +
disk transport (the only cross-cluster channel) and is non-colocate.
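
The publish step (write the version, then advance `latest`) could be sketched as below. This is a hedged illustration under a local POSIX-style filesystem assumption; the `v{version}` directory layout, the `delta.bin` file name, and both function names are hypothetical and not taken from the branch.

```python
import os
import tempfile
from typing import Optional


def publish_version(disk_dir: str, version: int, payload: bytes) -> str:
    """Write one version to the shared dir, then advance the `latest` pointer."""
    version_dir = os.path.join(disk_dir, f"v{version:08d}")  # hypothetical layout
    os.makedirs(version_dir, exist_ok=True)
    with open(os.path.join(version_dir, "delta.bin"), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # the data must be durable before it is announced

    # Write the pointer to a temp file in the same directory, then rename over
    # `latest`, so a reader sees either the old or the new version, never a
    # half-written pointer.
    fd, tmp_path = tempfile.mkstemp(dir=disk_dir, prefix=".latest.")
    with os.fdopen(fd, "w") as f:
        f.write(str(version))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, os.path.join(disk_dir, "latest"))
    return version_dir


def read_latest(disk_dir: str) -> Optional[int]:
    """Fleet side: return the most recently announced version, if any."""
    try:
        with open(os.path.join(disk_dir, "latest")) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None
```

Ordering is the point of the sketch: the pointer only moves after the version's bytes are on disk, so a puller that follows `latest` never lands on an incomplete version. On a shared filesystem without POSIX rename semantics that guarantee may not hold as written.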

- abort: the endpoint cancels surplus when discarding; with --partial-rollout it
  drains so streaming tasks return their partials (closing each stream disconnects,
  which aborts the fleet-side request). Endpoint + --partial-rollout requires the
  streaming rollout, the only path that captures partials client-side.
- streaming: record the weight version on an aborted partial so off-policy correction
  can weight it (update_from_meta_info is skipped without a finish_reason).

@nanjiangwill force-pushed the disaggregated-rollout branch from f506c33 to 94ac929 on June 17, 2026 04:21

@nanjiangwill

Author

Closing: jvmncs/rollout-endpoint is deprecated. The publish-only work now stacks on the disk-delta branch via #5 (disaggregated-rollout -> disk-delta-weight-sync).
