Disk-level delta weight sync #2089
Open
nanjiangwill wants to merge 8 commits into
This was referenced Jun 16, 2026
Ship only the changed bytes between weight syncs as a canonical HF delta checkpoint; rollout hosts apply it into a host-local checkpoint and reload via the vanilla update_weights_from_disk path. Replaces the NCCL delta transport from THUDM#1806 with a disk-only path that needs no engine-side delta support.
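A minimal sketch of the publish/apply idea, not the code in this PR. It assumes a checkpoint can be modelled as a mapping from tensor name to raw bytes and that a changed tensor is shipped whole; the function names `compute_delta`, `apply_delta`, and `delta_density` are hypothetical.

```python
from typing import Dict

Checkpoint = Dict[str, bytes]  # tensor name -> raw tensor bytes (assumed model)


def compute_delta(old: Checkpoint, new: Checkpoint) -> Checkpoint:
    """Keep only the tensors whose raw bytes differ from the previous sync."""
    return {name: data for name, data in new.items() if old.get(name) != data}


def apply_delta(local: Checkpoint, delta: Checkpoint) -> None:
    """Overwrite the changed tensors in the host-local checkpoint in place."""
    local.update(delta)


def delta_density(new: Checkpoint, delta: Checkpoint) -> float:
    """Fraction of checkpoint bytes that had to be shipped."""
    total = sum(len(data) for data in new.values())
    shipped = sum(len(data) for data in delta.values())
    return shipped / total if total else 0.0
```

Because the comparison works on bytes rather than decoded values, the rollout side needs no knowledge of dtype or layout to apply the result.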
sync_local_checkpoint (was sync_weights) materializes the base lazily via the idempotent init_local_checkpoint instead of a background thread; record per-sync update time in update_weight_metrics; state the pre-read/pre-push hooks' purpose (non-POSIX filesystem coherence).
The actor's update_weights is already @timer-wrapped (perf/update_weights_time), so the per-sync total/publish/reload breakdown was duplicate instrumentation. Keep only the delta-specific metrics (density, wire bytes).
The delta scaffold reworked the update-weight args: delta requires --update-weight-transport=disk (was nccl-or-disk), needs --update-weight-local-checkpoint-dir, and the --update-weight-delta-dir compatibility alias is gone (the directory belongs to the transport, not the encoding). Drop the alias resolve/backfill/conflict tests, point the transport and colocate tests at the disk path, and cover the local-checkpoint requirement.
With the delta-dir alias gone, _resolve_update_weight_disk_dir no longer normalizes anything — it's a single transport-level check, so fold it into _validate_update_weight_args.
slime_validate_args validates everything else inline; the extracted _validate_update_weight_args was the lone exception. Fold it in and test it the same way as the other slime_validate_args checks (make_slime_validate_args).
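The two delta requirements described above could be checked roughly as follows. This is a sketch under assumptions, not the body of slime_validate_args: the attribute names are derived from the flag names in the commit messages, and how delta mode is switched on is not stated here, so it is passed in as a hypothetical `delta_enabled` parameter.

```python
from argparse import Namespace


def validate_update_weight_args(args: Namespace, delta_enabled: bool) -> None:
    """Reject argument combinations that delta weight sync cannot work with."""
    if not delta_enabled:
        return
    # Delta sync is disk-only in this sketch: no NCCL transport.
    if args.update_weight_transport != "disk":
        raise ValueError("delta weight sync requires --update-weight-transport=disk")
    # The delta is applied into a host-local checkpoint, so its directory is mandatory.
    if not args.update_weight_local_checkpoint_dir:
        raise ValueError(
            "delta weight sync requires --update-weight-local-checkpoint-dir"
        )
```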
Materialize the host-local checkpoint in a daemon thread at engine init so the one-time base copy overlaps sglang launch and the first rollout (which serves from init-loaded weights) instead of blocking the first delta reload. The first sync_local_checkpoint's init_local_checkpoint is idempotent and flock-guarded, so it either finds the copy done or blocks on the same lock — no join needed.
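The idempotent, flock-guarded pattern can be sketched as below. This is an illustration rather than the PR's init_local_checkpoint: the signature, the sibling `.lock` file, and the `.complete` marker are assumptions, and `fcntl.flock` is POSIX-only.

```python
import fcntl
import os
import shutil
import threading


def init_local_checkpoint(base_dir: str, local_dir: str) -> bool:
    """Copy base_dir to local_dir at most once; return True if this call copied."""
    local_dir = os.path.abspath(local_dir)
    os.makedirs(os.path.dirname(local_dir), exist_ok=True)
    marker = os.path.join(local_dir, ".complete")  # hypothetical completion marker
    with open(local_dir + ".lock", "w") as lock_file:
        # Blocks while another thread or process holds the lock, i.e. is still copying.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker):
                return False  # an earlier caller already finished the copy
            shutil.rmtree(local_dir, ignore_errors=True)  # discard any partial copy
            shutil.copytree(base_dir, local_dir)
            with open(marker, "w"):
                pass
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def start_background_init(base_dir: str, local_dir: str) -> threading.Thread:
    """Start the one-time copy in a daemon thread so the caller does not wait on it."""
    thread = threading.Thread(
        target=init_local_checkpoint, args=(base_dir, local_dir), daemon=True
    )
    thread.start()
    return thread
```

A later call to `init_local_checkpoint` from the sync path either sees the marker and returns at once or waits on the same lock, which is why the background thread never has to be joined.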
fd7c00d to a0b4b09
Ship only the changed bytes between weight syncs instead of a full checkpoint, for non-colocated training/inference across clusters. The trainer publishes a per-tensor delta as a canonical HF checkpoint directory; each rollout host applies it in place and reloads via the ordinary update_weights_from_disk path, with no delta-specific engine code. Replaces the NCCL delta transport from #1806.

The delta is computed on raw tensor bytes and the engine just reloads a standard checkpoint, so the rollout side stays decoupled from the trainer and is free to use:
Future work
overwrite encoding instead of reloading the full checkpoint (larger sglang refactor).