Disk-level delta weight sync #2089

Open
nanjiangwill wants to merge 8 commits into THUDM:main from modal-projects:disk-delta-weight-sync

@nanjiangwill nanjiangwill commented Jun 16, 2026

Collaborator

Ship only the changed bytes between weight syncs instead of a full checkpoint, for non-colocated training/inference across clusters. The trainer publishes a per-tensor delta as a canonical HF checkpoint directory; each rollout host applies it in place and reloads via the ordinary update_weights_from_disk path — no delta-specific engine code. Replaces the NCCL delta transport from #1806.

The delta is computed on raw tensor bytes and the engine just reloads a standard checkpoint, so the rollout side stays decoupled from the trainer and is free to use:

  • any low precision — int4, nvfp4, mxfp8, fp8-block
  • any attention/moe backend
  • any parallelism scheme
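A minimal sketch of why a byte-level delta is indifferent to dtype, quantization, and parallelism. It assumes a whole-tensor overwrite encoding (a tensor is shipped in full if any of its bytes changed) and models a checkpoint as a mapping from tensor name to raw bytes. The helper names `compute_delta`, `apply_delta`, and `delta_stats` are hypothetical, and "density" is taken here to mean shipped bytes over total bytes; the PR may define either differently.

```python
from typing import Dict

Checkpoint = Dict[str, bytes]  # tensor name -> raw serialized bytes (assumed model)


def compute_delta(old: Checkpoint, new: Checkpoint) -> Checkpoint:
    """Keep only the tensors whose raw bytes differ from the previous sync."""
    return {name: data for name, data in new.items() if old.get(name) != data}


def apply_delta(base: Checkpoint, delta: Checkpoint) -> Checkpoint:
    """Overwrite changed tensors in a copy of the base checkpoint."""
    patched = dict(base)
    patched.update(delta)
    return patched


def delta_stats(new: Checkpoint, delta: Checkpoint) -> Dict[str, float]:
    """Report shipped bytes and the fraction of the checkpoint they represent."""
    total = sum(len(data) for data in new.values())
    wire = sum(len(data) for data in delta.values())
    return {"density": wire / total if total else 0.0, "wire_bytes": wire}


if __name__ == "__main__":
    old = {"layer0.weight": b"\x00" * 8, "layer1.weight": b"\x01" * 8}
    new = {"layer0.weight": b"\x00" * 8, "layer1.weight": b"\x02" * 8}
    delta = compute_delta(old, new)
    assert apply_delta(old, delta) == new
    print(sorted(delta), delta_stats(new, delta))
```

Because the comparison never interprets the bytes, the same code path works whether a tensor holds bf16 values or packed int4 blocks.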

Future work

  • Apply the delta during the engine's weight load instead of patching a host-local checkpoint first (minor sglang change).
  • Read only the changed tensors via the overwrite encoding instead of reloading the full checkpoint (larger sglang refactor).

Ship only the changed bytes between weight syncs as a canonical HF delta
checkpoint; rollout hosts apply it into a host-local checkpoint and reload via
the vanilla update_weights_from_disk path. Replaces the NCCL delta transport
from THUDM#1806 with a disk-only path that needs no engine-side delta support.

sync_local_checkpoint (was sync_weights) materializes the base lazily via the
idempotent init_local_checkpoint instead of a background thread; record per-sync
update time in update_weight_metrics; state the pre-read/pre-push hooks' purpose
(non-POSIX filesystem coherence).

The actor's update_weights is already @timer-wrapped (perf/update_weights_time),
so the per-sync total/publish/reload breakdown was duplicate instrumentation.
Keep only the delta-specific metrics (density, wire bytes).

The delta scaffold reworked the update-weight args: delta requires
--update-weight-transport=disk (was nccl-or-disk), needs
--update-weight-local-checkpoint-dir, and the --update-weight-delta-dir
compatibility alias is gone (the directory belongs to the transport, not the
encoding). Drop the alias resolve/backfill/conflict tests, point the transport
and colocate tests at the disk path, and cover the local-checkpoint requirement.

With the delta-dir alias gone, _resolve_update_weight_disk_dir no longer
normalizes anything; it is a single transport-level check, so fold it into
_validate_update_weight_args.

slime_validate_args validates everything else inline; the extracted
_validate_update_weight_args was the lone exception. Fold it in and test it
the same way as the other slime_validate_args checks (make_slime_validate_args).

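The argument constraints described in the commits above could be checked roughly as follows. This is a sketch, not the code in slime_validate_args: the function name `check_delta_args` and the attribute `update_weight_encoding` are hypothetical, while `update_weight_transport` and `update_weight_local_checkpoint_dir` are assumed to be the argparse destinations of the flags named above.

```python
from argparse import Namespace


def check_delta_args(args: Namespace) -> None:
    """Raise ValueError if delta sync is requested with an unsupported setup."""
    # `update_weight_encoding` is a hypothetical attribute used for illustration.
    if getattr(args, "update_weight_encoding", None) != "delta":
        return  # Nothing delta-specific to validate.

    # Delta sync is disk-only; the NCCL delta transport was removed.
    if getattr(args, "update_weight_transport", None) != "disk":
        raise ValueError("delta weight sync requires --update-weight-transport=disk")

    # Each rollout host patches its own checkpoint copy, so it needs a directory.
    if not getattr(args, "update_weight_local_checkpoint_dir", None):
        raise ValueError(
            "delta weight sync requires --update-weight-local-checkpoint-dir"
        )


if __name__ == "__main__":
    check_delta_args(
        Namespace(
            update_weight_encoding="delta",
            update_weight_transport="disk",
            update_weight_local_checkpoint_dir="/tmp/local-ckpt",
        )
    )
    print("delta args ok")
```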
Materialize the host-local checkpoint in a daemon thread at engine init so the
one-time base copy overlaps sglang launch and the first rollout (which serves
from init-loaded weights) instead of blocking the first delta reload. The first
sync_local_checkpoint's init_local_checkpoint is idempotent and flock-guarded,
so it either finds the copy done or blocks on the same lock — no join needed.
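The "idempotent and flock-guarded" pattern can be sketched as below. This is an illustration under assumptions, not the PR's implementation: the signature of `init_local_checkpoint`, the `.complete` marker file, the `.lock` path, and the helper `start_background_init` are all hypothetical. It relies on `fcntl.flock`, so it is POSIX-only.

```python
import fcntl
import os
import shutil
import tempfile
import threading


def init_local_checkpoint(base_dir: str, local_dir: str) -> bool:
    """Copy base_dir to local_dir once. Returns True only for the caller that copied."""
    marker = os.path.join(local_dir, ".complete")  # hypothetical completion marker
    os.makedirs(os.path.dirname(os.path.abspath(local_dir)), exist_ok=True)
    with open(local_dir + ".lock", "w") as lock_file:
        # Blocks while another thread or process holds the lock and is copying.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker):
                return False  # Someone already finished the copy.
            shutil.rmtree(local_dir, ignore_errors=True)  # Drop any partial copy.
            shutil.copytree(base_dir, local_dir)
            open(marker, "w").close()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def start_background_init(base_dir: str, local_dir: str) -> threading.Thread:
    """Start the one-time copy in a daemon thread so startup is not blocked."""
    thread = threading.Thread(
        target=init_local_checkpoint, args=(base_dir, local_dir), daemon=True
    )
    thread.start()
    return thread


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        base = os.path.join(root, "base")
        os.makedirs(base)
        with open(os.path.join(base, "model.safetensors"), "wb") as f:
            f.write(b"\x00" * 16)
        local = os.path.join(root, "local")
        start_background_init(base, local)
        # The first sync calls the same function: it either copies or waits.
        init_local_checkpoint(base, local)
        assert os.path.exists(os.path.join(local, "model.safetensors"))
```

Each call opens the lock file separately, so the lock is honored between threads of one process as well as between processes, which is why the caller never needs to join the background thread.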
@nanjiangwill nanjiangwill force-pushed the disk-delta-weight-sync branch from fd7c00d to a0b4b09 Compare June 17, 2026 18:17