Releases: deepspeedai/DeepSpeed
v0.19.0
What's Changed
- Update version after latest release (v0.18.9) by @loadams in #7936
- Refactor consolidate transpose by @nathon-lee in #7934
- Fix AutoTP universal checkpoint CI by @tohtana in #7937
- Fix process hang in process-group shutdown by @Flamefire in #7941
- Zero3 defragment utility by @nathon-lee in #7940
- [SP] add SP deny list instead of allow by @kashif in #7887
- fix(zero): detach flat buffer to prevent autograd inplace error on CP… by @delock in #7948
- Fix FPQuantizer build by @Flamefire in #7963
- Fix zero 1 and 2 CPU-offloaded gradient norm by @alek6kun in #7967
- Fix overlap-comm buffer lifetimes by @tohtana in #7965
- Fix DeepCompile+Z3 on PyTorch v2.9/2.10 by @tohtana in #7951
- Fix WarmupCosineLR multi-group initialization by @tohtana in #7969
- Enable PyTorch version selection for full test by @tohtana in #7968
- fix(fp_quantizer): fix UB and negative shift warnings in fp_quantize_impl.cu by @Cursx in #7973
- fix(op_builder): avoid duplicate/wrong -gencode flags by @Cursx in #7974
- Rename dequantization template parameters by @Flamefire in #7976
- Avoid CUDA reinit error in CI tests by @tohtana in #7977
- Fix ZeRO-1/2 CPU-offloaded gradient loss with multiple backward() per step by @roycho96 in #7981
- deepcompile: Fix backward graph recompilation due to unbalanced forward/backward visits by @eternalNight in #7980
- Fix Adam subgroup inconsistency by @st-bang97 in #7982
- Dynamic offload compatible with static optimizer offload by @sfc-gh-truwase in #7979
- Fix modal ci timeout by @sfc-gh-truwase in #7989
- Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1 by @maxyu1115 in #7985
- fix: topkgating major bug by @excepshenal in #7986
- Add DeepSpeed NVTX domain support by @heurry in #7988
- Add Gram Newton-Schulz orthogonalization for Muon optimizer by @delock in #7953
- [AutoSP] (Sequence Parallelism) support for Multimodal Models (ViT + LLM) by @nathon-lee in #7984
- Update version.txt before 0.19.0 release by @loadams in #7995
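The WarmupCosineLR multi-group fix (#7969) concerns a scheduler that must track one warmup-then-cosine schedule per parameter group. A minimal pure-Python sketch of that shape, assuming the usual linear-warmup/cosine-decay formula (names here are illustrative, not DeepSpeed's API):

```python
import math

def warmup_cosine_factor(step, warmup_steps, total_steps, min_ratio=0.0):
    """LR multiplier: linear warmup from 0 to 1, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + (1.0 - min_ratio) * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# A multi-group scheduler must carry one schedule per parameter group,
# e.g. a shorter warmup and a floor for the second group:
group_schedules = [
    lambda s: warmup_cosine_factor(s, warmup_steps=100, total_steps=1000),
    lambda s: warmup_cosine_factor(s, warmup_steps=10, total_steps=1000, min_ratio=0.1),
]
```

Initializing only a single schedule and reusing it across groups is exactly the kind of bug such a fix targets: groups with different warmup lengths or floors would silently share one trajectory.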
New Contributors
- @alek6kun made their first contribution in #7967
- @Cursx made their first contribution in #7973
- @roycho96 made their first contribution in #7981
- @st-bang97 made their first contribution in #7982
- @maxyu1115 made their first contribution in #7985
- @excepshenal made their first contribution in #7986
- @heurry made their first contribution in #7988
Full Changelog: v0.18.9...v0.19.0
v0.18.9 Patch Release
What's Changed
- Respect `$TRITON_HOME` by @Flamefire in #7907
- Add Feature Universal Checkpoint for AutoTP by @nathon-lee in #7908
- fix: remove unnecessary shell=True in ROCm GPU architecture detection by @instantraaamen in #7915
- Don't detect local GPU if `$DS_IGNORE_CUDA_DETECTION` is set by @Flamefire in #7896
- Add HuggingFace tp_plan support for AutoTP by @delock in #7901
- fix: handle non-existent path in is_nfs_path for Triton autotune cache by @Krishnachaitanyakc in #7921
- Fix backward compatibility of torch.amp.custom_fwd for PyTorch < 2.4 by @tohtana in #7920
- Extending Muon Optimizer Support for ZeRO Stage 3 by @PKUWZP in #7919
- Add news item for ASPLOS 2026 Best Paper Award by @PKUWZP in #7923
- fix(superoffload) preserve multi-group updates with shared cpu buffers (#7905) by @xylian86 in #7906
- AGENTS.md: Add pre-commit command to existing CI requirements line by @delock in #7930
- Update README with latest news from DeepSpeed by @PKUWZP in #7931
- Merging AutoSP into DeepSpeed by @neeldani in #7860
- Add fallback to full test by @tohtana in #7933
- Remove Microsoft Corporation copyright from AGENTS.md and CLAUDE.md by @PKUWZP in #7932
- Update version.txt for latest incoming release 0.18.9 by @loadams in #7935
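The ROCm architecture-detection fix (#7915) removes `shell=True` from a subprocess call. A hedged sketch of that pattern: passing an argument list avoids shell quoting and injection pitfalls. `rocm_agent_enumerator` is ROCm's real enumeration tool; the `gfx000` CPU placeholder filtering is an assumption about its typical output.

```python
import subprocess

def parse_rocm_archs(enumerator_output):
    """Deduplicate and sort arch names, dropping the gfx000 CPU placeholder."""
    return sorted({line.strip() for line in enumerator_output.splitlines()
                   if line.strip() and line.strip() != "gfx000"})

def detect_rocm_archs():
    """Run the enumerator with an argument list (no shell=True)."""
    try:
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no ROCm stack present
    return parse_rocm_archs(out)
```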
New Contributors
- @instantraaamen made their first contribution in #7915
- @Krishnachaitanyakc made their first contribution in #7921
- @neeldani made their first contribution in #7860
Full Changelog: v0.18.8...v0.18.9
v0.18.8 Patch Release
What's Changed
- Suppress see_memory_usage logs by @sfc-gh-truwase in #7891
- [Bloom] Fix hangs of bloom test by @k-artem in #7890
- Add user-friendly error for double reduction by @stas00 in #7895
- Fix async_io ops building error on Huawei Ascend NPU by @huangyifan0610 in #7894
- Fix Evoformer's multi-arch dispatch root cause by @tohtana in #7881
- fix: Validate fp16.loss_scale is finite and non-negative by @nathon-lee in #7889
- Add AGENTS.md and CLAUDE.md with project rules for AI coding agents by @delock in #7902
- fix(zero3): use current_stream() instead of default_stream() for grad… by @michaelroyzen in #7898
- Update version by @loadams in #7903
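The fp16 config validation added in #7889 checks that a configured loss scale is finite and non-negative. A minimal sketch of such a check, assuming DeepSpeed's convention that a loss scale of 0 means dynamic scaling (function name is illustrative):

```python
import math

def validate_loss_scale(loss_scale):
    """Reject NaN/inf/negative values early, before they silently corrupt
    gradient scaling deep inside the optimizer step."""
    if not isinstance(loss_scale, (int, float)) or isinstance(loss_scale, bool):
        raise TypeError(f"loss_scale must be a number, got {type(loss_scale).__name__}")
    if not math.isfinite(loss_scale) or loss_scale < 0:
        raise ValueError(f"loss_scale must be finite and non-negative, got {loss_scale}")
    return loss_scale
```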
New Contributors
- @huangyifan0610 made their first contribution in #7894
- @michaelroyzen made their first contribution in #7898
Full Changelog: v0.18.7...v0.18.8
v0.18.7 Patch Release
What's Changed
- Update version post release by @loadams in #7850
- Z1/2 init: flatten params on device by @ksugama in #7828
- Enable shm_comm support for arm by @phalani-paladugu in #7800
- Add news entry for DeepSpeed updates by @PKUWZP in #7854
- Add EXAONE 4.0 model support for Inference V2 by @Bias92 in #7853
- Fix ROCm BF16 conversion intrinsics in inference v2 (#7843) by @tohtana in #7846
- Fix compilation of Evoformer by @Flamefire in #7862
- Throw error when parameter is modified in GatheredParameters by @tohtana in #7832
- Fix Zero-3 static scale assertion in fp16 test by @tohtana in #7866
- Schedule nightly full test by @tohtana in #7870
- Fix broken links and add AutoTP Training tutorial to sidebar nav by @tohtana in #7874
- fix: replace 35 bare except clauses with except Exception by @haosenwang1018 in #7873
- perf: use deque for FIFO queues in sequence parallel, superoffload, and compile by @giulio-leone in #7880
- Fix: only add parameter with grads to parameter group by @delock in #7869
- Fix no-grad grad-fn lookup in ZeRO hook counting on PyTorch 2.3 (#7830) by @tohtana in #7841
- Fix import deepspeed crash on PyTorch v2.3 + Python 3.12 by @tohtana in #7875
- XPU use stock pytorch instead of Intel Extension for PyTorch by @delock in #7877
- Remove amp() from abstract accelerator by @delock in #7879
- Add document section explaining autocast nesting by @tohtana in #7883
- Fix hook count performance regression from v0.18.5 by @tohtana in #7886
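The deque change in #7880 is a standard queue optimization: `list.pop(0)` shifts every remaining element left, so draining a list-backed FIFO is O(n²) overall, while `deque.popleft()` is O(1) per dequeue. A minimal illustration:

```python
from collections import deque

# FIFO backed by a deque instead of a list
fifo = deque()
for item in range(5):
    fifo.append(item)      # enqueue on the right
first = fifo.popleft()     # dequeue from the left in O(1)
```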
New Contributors
- @ksugama made their first contribution in #7828
- @phalani-paladugu made their first contribution in #7800
- @Bias92 made their first contribution in #7853
- @haosenwang1018 made their first contribution in #7873
- @giulio-leone made their first contribution in #7880
Full Changelog: v0.18.6...v0.18.7
v0.18.6 Patch Release
What's Changed
- Update version.txt to 0.18.6 after latest release by @loadams in #7826
- Fix leaf module race condition by @tohtana in #7825
- Skip sequence parallel operations during eval by @jp1924 in #7821
- Support custom partitioning patterns for AutoTP by @tohtana in #7806
- Fix gradient is ready with z2 by @sfc-gh-truwase in #7829
- Fix AutoTP custom patterns: respect use_default_specs by @tohtana in #7827
- Support new python 3.14 annotation handling by @sdvillal in #7831
- fix: replace deprecated fractions.gcd with math.gcd by @Mr-Neutr0n in #7845
- Fix bf16 gradient norm divergence with ZeRO stage 0 by @tohtana in #7839
- Replace torch.jit.script with torch.compile (#7835) by @tohtana in #7840
New Contributors
- @jp1924 made their first contribution in #7821
- @Mr-Neutr0n made their first contribution in #7845
Full Changelog: v0.18.5...v0.18.6
v0.18.5 Patch Release
What's Changed
- Update version.txt after 0.18.4 release by @loadams in #7765
- Various fixes to run on mps by @jeffra in #7767
- Update workflow trigger by @tohtana in #7768
- fix: delete using namespace std. by @nathon-lee in #7766
- fix: update Megatron-DeepSpeed tutorial to match current repo structure by @nathon-lee in #7761
- Add timeout to test workflows by @tohtana in #7774
- Remove cron/PR triggers for outdated V100 tests by @loadams in #7777
- [Docs] Fix `docs/_pages/config-json.md` format by @ooooo-create in #7779
- Update CLA to refer to DCO by @loadams in #7778
- Fix multiprocessing testcase by @k-artem in #7743
- fix: skip compressed allreduce for empty tensors by @T1mn in #7769
- docs: update README.md by @eltociear in #7781
- Fix gradient checkpointing with use_reentrant=True / PyTorch-style backward / ZeRO-3 by @tohtana in #7780
- Fix Ulysses PEFT test by @tohtana in #7784
- Fix Evoformer compilation by @sdvillal in #7760
- fix checkpointing/loading of z0+bf16 by @tohtana in #7786
- Add sequential allgather optimization for ZeRO-3 by @aeeeeeep in #7661
- Fix AutoTP test numerical tolerance with rtol by @tohtana in #7794
- Fix backward for pipeline engine by @tohtana in #7787
- Skip empty parameters in gradient reduction by @tohtana in #7789
- Fix issue with BF16 optimizer selection by @tohtana in #7788
- Fix BF16_Optimizer being used without ZeRO by @tohtana in #7790
- Add full test suite workflow by @tohtana in #7795
- Fix Muon optimizer module path by @tohtana in #7802
- Fix ping-pong buffer index reset and remove redundant stream sync by @undersilence in #7805
- Fix ZeRO stage to choose BF16 optimizer in test by @tohtana in #7803
- Run Evoformer tests sequentially by @tohtana in #7810
- Improve engine's cleanup by @tohtana in #7813
- Ignore evoformer test by @tohtana in #7815
- Fix typos in accelerator setup guide by @nathon-lee in #7818
- Raise clear error on in-place GatheredParameters edits without modifier_rank by @tohtana in #7817
- [Bugfix] Resolve Rank index out of range during BWD when sp_size < world_size in Ulysses by @Flink-ddd in #7809
- Update PyTorch to v2.9 for modal tests by @tohtana in #7816
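The ping-pong buffer fix (#7805) touches a classic double-buffering scheme: two buffers alternate between "being filled" and "being consumed", and the index must reset consistently at the start of each step. A hedged sketch of that structure (class and method names are illustrative):

```python
class PingPongBuffers:
    """Two fixed-size buffers toggled by a 0/1 index."""
    def __init__(self, size):
        self.buffers = [[0] * size, [0] * size]
        self.index = 0

    def current(self):
        return self.buffers[self.index]

    def swap(self):
        self.index ^= 1  # flip between buffer 0 and buffer 1

    def reset(self):
        # Forgetting this at step boundaries leaves the next step
        # starting on whichever buffer the last step happened to end on.
        self.index = 0
```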
New Contributors
- @ooooo-create made their first contribution in #7779
- @T1mn made their first contribution in #7769
- @sdvillal made their first contribution in #7760
- @undersilence made their first contribution in #7805
Full Changelog: v0.18.4...v0.18.5
v0.18.4 Patch Release
What's Changed
- Update version by @sfc-gh-truwase in #7719
- Disable deterministic option in compile tests by @tohtana in #7720
- Fix SuperOffloadOptimizer_Stage3 crash due to missing param_names parameter by @ImaGoodFella in #7715
- [AMD][ROCm] Improve support of AMD by @k-artem in #7448
- fix typo by @stas00 in #7722
- Skip none in backward hook by @tohtana in #7725
- [Engine] Only scale gradients if scale_wrt_gas is True by @kashif in #7724
- Fix test cases that depend on Triton by @k-artem in #7731
- Fix rare hang in DeepSpeed Async I/O wait by releasing the Python GIL by @xylian86 in #7727
- Fix #7733: Replace torch.sqrt with math.sqrt in scale_lr for sqrt method by @Rakshit-gen in #7735
- replace moe checkpoint dp_world_size with seq_dp_world_size by @wukong1992 in #7732
- [BUG] Fix UlyssesSPAttentionHF.register_with_transformers() crash with PEFT models by @Rakshit-gen in #7737
- Add core api update blog by @tohtana in #7738
- Fix Nebula checkpoint engine commit() API mismatch by @Rakshit-gen in #7740
- Fix DecoupledCheckpointEngine deadlock and improve reliability by @Rakshit-gen in #7742
- Fix OnebitLamb NaN propagation with empty parameters by @Rakshit-gen in #7736
- fix: remove premature MPI environment variable check in OpenMPIRunner by @leejianwoo-collab in #7751
- Enable python 3.11 and 3.12 tests by @loadams in #7007
- Add CI workflow to run tests on AWS by @tohtana in #7753
- Add fallback to BF16 support check by @tohtana in #7754
- Fix DeepCompile for PyTorch 2.8/2.9 compatibility by @tohtana in #7755
- Removed amp testcases by @k-artem in #7745
- fix: avoid IndexError in BF16_Optimizer.destroy() when using DummyOptim by @leejianwoo-collab in #7763
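The `scale_lr` fix (#7735) swaps `torch.sqrt` for `math.sqrt`: the learning rate is a plain Python float, and `torch.sqrt` expects a tensor. A minimal sketch of sqrt LR scaling on a scalar (the helper name is illustrative):

```python
import math

def scale_lr_sqrt(base_lr, world_size):
    """Scale a scalar base LR by sqrt(world_size); math.sqrt works directly
    on Python floats, where torch.sqrt would require wrapping in a tensor."""
    return base_lr * math.sqrt(world_size)
```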
New Contributors
- @ImaGoodFella made their first contribution in #7715
- @k-artem made their first contribution in #7448
- @kashif made their first contribution in #7724
- @Rakshit-gen made their first contribution in #7735
- @leejianwoo-collab made their first contribution in #7751
Full Changelog: v0.18.3...v0.18.4
v0.18.3 Patch Release
What's Changed
- Update version.txt after release by @loadams in #7675
- [modal ci] fixes by @stas00 in #7676
- leaf modules: explain better by @stas00 in #7674
- disable nv-lightning-v100.yml CI by @stas00 in #7681
- allow separate learning rates "muon_lr" and "adam_lr" for the Muon optimizer by @delock in #7658
- see_mem_usage: make always work by @stas00 in #7688
- make debug utils more resilient by @stas00 in #7690
- zero stage 1-2: don't pin memory if not configured by @stas00 in #7689
- modal ci: fix group concurrency by @stas00 in #7691
- Use pytorch utils to detect ninja by @Emrys-Merlin in #7687
- Update SECURITY.md to point to GitHub reporting rather than Microsoft by @loadams in #7692
- Add Qwen2.5 to AutoTP model list by @delock in #7696
- Trust intel server for XPU tests by @tohtana in #7698
- PyTorch-compatible backward API by @tohtana in #7665
- Add news about Ray x DeepSpeed Meetup by @PKUWZP in #7704
- Put Muon optimizer momentum buffer on GPU by @delock in #7648
- [ROCm] Relax tolerances for FP8 unit test for fp16 and bf16 cases by @rraminen in #7655
- Fix ds_secondary_tensor possibly being dirty when loading a model or ZeRO checkpoint with ZeRO++ by @zhengchenyu in #7707
- fix: skip aio wait when swap tensors is empty by @xylian86 in #7712
- Low-precision master params/grads/optimizer states by @tohtana in #7700
- Enabled compiled autograd for backward pass by @deepcharm in #7667
- Wall clock timers API by @sfc-gh-truwase in #7714
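Separate "muon_lr" and "adam_lr" (#7658) imply one parameter group per update rule, each with its own lr. A hedged sketch of that grouping using plain dicts, not DeepSpeed's actual grouping code; the embedding/lm_head exclusion mirrors the later #7641 and the shape heuristic is an assumption:

```python
def build_param_groups(named_shapes, muon_lr=0.02, adam_lr=0.001):
    """Split parameters into a Muon group and an Adam group, each carrying
    its own lr. Muon targets 2-D weight matrices; embeddings, lm_head, and
    1-D params (biases, norms) fall back to Adam."""
    muon, adam = [], []
    for name, shape in named_shapes:
        use_muon = len(shape) == 2 and not any(k in name for k in ("embed", "lm_head"))
        (muon if use_muon else adam).append(name)
    return [{"params": muon, "lr": muon_lr, "use_muon": True},
            {"params": adam, "lr": adam_lr, "use_muon": False}]
```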
New Contributors
- @Emrys-Merlin made their first contribution in #7687
Full Changelog: v0.18.2...v0.18.3
v0.18.2 Patch Release
What's Changed
- Update version after 0.18.1 release by @loadams in #7647
- Deduplicate fp32 weights under torch autocast and ZeRO3 by @eternalNight in #7651
- ulysses mpu: additional api by @stas00 in #7649
- ALST/UlyssesSP: more intuitive API wrt variable seqlen by @stas00 in #7656
- Fix misplaced overflow handling return in fused_optimizer.py by @rraminen in #7645
- [bug]: fixed comm_dtype in extra_large_param_to_reduce by @therealnaveenkamal in #7660
- UlyssesSP: TiledMLP doc - recomputes forward twice by @stas00 in #7664
- resolved a 0-dim tensor slicing bug from _get_state_without_padding by @therealnaveenkamal in #7659
- Fix typo in pytorch-profiler.md documentation by @kunheek in #7652
- README refresh by @sfc-gh-truwase in #7668
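The `_get_state_without_padding` fix (#7659) concerns slicing padded optimizer state back to a parameter's true size: a 0-dim scalar state (such as a step counter) must not be sliced. A hedged sketch of the pattern using plain lists in place of tensors (the helper name is illustrative):

```python
def strip_padding(flat, true_numel):
    """Return the first true_numel elements of a padded flat buffer,
    passing 0-dim/scalar state through untouched."""
    if not hasattr(flat, "__len__"):   # scalar state: nothing to slice
        return flat
    return flat[:true_numel]
```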
Full Changelog: v0.18.1...v0.18.2
v0.18.1 Patch Release
What's Changed
- Add ZenFlow code for Stage 3 by @JoshWoo2003 in #7516
- [XPU][CI] recover xpu-max1100 workflow by @Liangliang-Ma in #7630
- Take **kwargs in init of DeepSpeedZeroOptimizer subclasses by @eternalNight in #7634
- add support for tensor learning rate (vs scalar) by @NirSonnenschein in #7633
- Fix illegal memory access with multi_tensor_apply size above INT_MAX by @wangyan-mms in #7639
- No Muon optimizer for embedding and lm_head layers by @delock in #7641
- z2: report param name and not zero id in assert by @stas00 in #7637
- z2: don't pass `dtype` to `report_ipg_memory_usage` by @stas00 in #7636
- Ulysses HF Accelerate integration by @stas00 in #7638
- Add DataStates-LLM: Asynchronous Checkpointing Engine Support by @mauryaavinash95 in #7166
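The `multi_tensor_apply` fix (#7639) addresses a kernel indexing buffers with 32-bit ints, which overflows past INT_MAX elements. The usual remedy is to process the flat buffer in chunks that each fit in an int32 index; a hedged sketch of that chunking (the function is illustrative, not DeepSpeed's code):

```python
INT32_MAX = 2 ** 31 - 1

def chunk_sizes(total_numel, limit=INT32_MAX):
    """Split total_numel elements into chunk sizes, each indexable with int32."""
    chunks = []
    remaining = total_numel
    while remaining > 0:
        step = min(remaining, limit)
        chunks.append(step)
        remaining -= step
    return chunks
```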
New Contributors
- @JoshWoo2003 made their first contribution in #7516
- @wangyan-mms made their first contribution in #7639
Full Changelog: v0.18.0...v0.18.1