As slime grows, we need more thorough CI coverage for an increasing number of backends, deployment patterns, and training features.
This issue serves as a living index of our existing CI tests so that both users and contributors can quickly see:
- which backends are currently exercised by CI
- which parallelism / optimizer / rollout / deployment features are being tested
- which CI labels are expected to be used in practice
- where meaningful coverage gaps still exist
Over time this document should evolve into a compact CI coverage map of slime.
Last audited: 2026-06-11
Note: this issue tracks the CI suites we actually expect maintainers/contributors to use. Some legacy or backup jobs may still exist in the workflow template but are intentionally omitted here until they become part of the normal CI workflow.
CI entrypoints
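Workflow source:
pr-test.yml.j2 is the source of truth. pr-test.yml is generated from the template and should not be edited manually.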
Triggers:
- pull_request to main: opened, reopened, synchronize, labeled
- push to main: cheap always-on CPU / adapter checks only
- workflow_dispatch: can run the matrix manually; supports infinite_run
Most GPU e2e jobs are label-gated on self-hosted runners. The CPU unit tests and agent-adapter tests are always-on for PR / push / manual runs.
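The gating rules above can be modelled as a small predicate. This is only a sketch: the function name, its arguments, and the decision to treat every manual run as "run everything" are assumptions for illustration, and the real conditions live in pr-test.yml.j2.

```python
# Jobs the issue describes as always-on for PR / push / manual runs.
ALWAYS_ON_JOBS = {"cpu-unittest", "agent-adapter-test"}


def should_run(job, gate_label, event, pr_labels=()):
    """Return True if `job` would run for `event` (hypothetical model).

    job        -- job name, e.g. "e2e-test-megatron"
    gate_label -- label that gates the job, e.g. "run-ci-megatron"
    event      -- "pull_request", "push", or "workflow_dispatch"
    pr_labels  -- labels currently applied to the PR
    """
    if job in ALWAYS_ON_JOBS:
        return True
    if event == "workflow_dispatch":
        # Assumption: a manual run selects the whole matrix.
        return True
    if event == "pull_request":
        return gate_label in pr_labels
    # push to main: only the always-on checks run.
    return False
```

For example, `should_run("e2e-test-megatron", "run-ci-megatron", "pull_request", ["run-ci-megatron"])` is true, while the same job on a push to main is skipped.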
Periodic CI trigger PR:
#2053 [DON'T MERGE] run CI
- Long-lived PR used to periodically trigger CI by updating the ci-test file and/or applying CI labels.
- Useful for scheduled/manual CI sweeps without attaching CI-only commits to feature PRs.
Dynamic changed-test entrypoint:
run-ci-changed
- detects changed tests/test_*.py and tests/plugin_contracts/test_*.py
- uses each test file's NUM_GPUS = ... when present, otherwise defaults to 8 GPUs
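The selection rule for run-ci-changed can be sketched as follows. The helper names and the regular expression are hypothetical, and the sketch assumes the two globs are non-recursive and that NUM_GPUS is a top-level integer assignment; the actual detection logic in the workflow may differ.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

DEFAULT_NUM_GPUS = 8
# Directories whose test_*.py files are picked up (assumed non-recursive).
TEST_DIRS = {"tests", "tests/plugin_contracts"}
# Matches a top-level "NUM_GPUS = <int>" assignment.
NUM_GPUS_RE = re.compile(r"^NUM_GPUS\s*=\s*(\d+)", re.MULTILINE)


def is_changed_test(path: str) -> bool:
    """True if a changed path matches one of the two test globs."""
    p = PurePosixPath(path)
    return p.parent.as_posix() in TEST_DIRS and fnmatchcase(p.name, "test_*.py")


def gpus_for(source: str) -> int:
    """Read NUM_GPUS from a test file's source, falling back to the default."""
    match = NUM_GPUS_RE.search(source)
    return int(match.group(1)) if match else DEFAULT_NUM_GPUS


def plan_changed_tests(changed: dict) -> dict:
    """Map each changed test path to its GPU count; `changed` is path -> source."""
    return {path: gpus_for(src) for path, src in changed.items() if is_changed_test(path)}
```

Under these assumptions a changed file outside the two directories is ignored, and a test without a NUM_GPUS line is scheduled on 8 GPUs.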
Common Megatron e2e sanity checks
Most Megatron e2e tests pass --ci-test. For the standard training path, this is intended to catch reference-policy / rollout-policy wiring regressions, including initial KL / log-prob sanity failures.
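As an illustration of what an initial log-prob sanity check might look like, the sketch below compares per-token log-probabilities from the rollout policy and the reference policy. It assumes both start from the same weights, so the mean gap should be close to zero at the first step. The function names and the tolerance are hypothetical and are not the actual --ci-test invariants.

```python
def mean_logprob_gap(rollout_logprobs, ref_logprobs):
    """Mean of (rollout - reference) per-token log-probs, a simple KL estimate."""
    if not rollout_logprobs or len(rollout_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must be non-empty and equally long")
    total = sum(a - b for a, b in zip(rollout_logprobs, ref_logprobs))
    return total / len(rollout_logprobs)


def initial_kl_is_sane(rollout_logprobs, ref_logprobs, tol=1e-3):
    """True if the first-step gap is within `tol` (the tolerance is an assumption)."""
    return abs(mean_logprob_gap(rollout_logprobs, ref_logprobs)) <= tol
```

A large gap at step 0 would suggest that the reference and rollout policies were not wired to the same checkpoint.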
Specialized tests may add extra assertions or intentionally disable part of the default checker. Examples:
- MTP-only gradient test disables the KL checker and asserts that only MTP parameters receive non-zero gradients.
- Disk full / delta weight-update tests assert that expected checkpoint or safetensors files are actually written.
- Fan-out test asserts that the custom compact generate path was called exactly num_rollout * rollout_batch_size times.
- External PD test asserts that disk-backed delta files were produced for pre-launched SGLang workers.
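Two of these specialized assertions can be sketched in a few lines. The helper names, the name-based `mtp_marker` convention, and the shape of `grad_norms` are assumptions made for illustration; the real tests may identify MTP parameters and count calls differently.

```python
def assert_fanout_calls(call_count, num_rollout, rollout_batch_size):
    """Check the generate path ran exactly num_rollout * rollout_batch_size times."""
    expected = num_rollout * rollout_batch_size
    assert call_count == expected, f"expected {expected} calls, got {call_count}"


def assert_only_mtp_grads(grad_norms, mtp_marker="mtp"):
    """Check that only MTP parameters received non-zero gradients.

    grad_norms maps parameter name -> gradient norm; a parameter is treated
    as an MTP parameter if its name contains `mtp_marker` (hypothetical rule).
    """
    leaked = [n for n, g in grad_norms.items() if mtp_marker not in n and g != 0.0]
    active = [n for n, g in grad_norms.items() if mtp_marker in n and g != 0.0]
    assert not leaked, f"non-MTP parameters received gradients: {leaked}"
    assert active, "no MTP parameter received a non-zero gradient"
```

Checks like these turn a run that merely finishes into one that verifies the behavior the test is named after.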
Always-on CPU / contract CI
These run on GitHub-hosted CPU runners for PR / push / manual runs.
cpu-unittest
Trigger: always-on
Coverage:
- Megatron argument validation
- DP scheduling and CP utilities
- metric reporting
- loss CP invariance
- value temperature
- reward-model utilities: F1, GPQA, math, math DAPO, DeepScaler
- sampling and rollout validation
- placement group utilities
- external SGLang engine utilities
- HF checkpoint saver utility
- plugin contracts for rollout, runtime hooks, path loading, and generate APIs
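Tests:
- tests/test_megatron_argument_validation.py
- tests/test_dp_schedule.py
- tests/test_cp_utils.py
- tests/test_metric_report.py
- tests/test_metric_report_dist.py
- tests/test_loss_cp_invariance.py
- tests/test_value_temperature.py
- tests/test_rm_f1.py
- tests/test_rm_gpqa.py
- tests/test_rm_math.py
- tests/test_rm_math_dapo.py
- tests/test_rm_deepscaler.py
- tests/test_sample.py
- tests/test_agent_trajectory.py
- tests/test_rollout_validation.py
- tests/test_placement_group.py
- tests/test_external_sglang_engines.py
- tests/utils/test_hf_checkpoint_saver.py
- tests/plugin_contracts/test_plugin_rollout_contracts.py
- tests/plugin_contracts/test_plugin_runtime_hook_contracts.py
- tests/plugin_contracts/test_plugin_path_loading_contracts.py
- tests/plugin_contracts/test_plugin_generate_contracts.py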
agent-adapter-test
Trigger: always-on
Coverage:
- OpenAI / Anthropic / OpenAI Agents SDK adapter compatibility
Tests:
- tests/test_agent_adapters.py
- tests/test_agent_sdk_adapters.py
Label-gated GPU e2e CI
run-ci-sglang-config / e2e-test-sglang-config
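SGLang YAML config and offload suite.
- tests/test_qwen2.5_0.5B_sglang_config.py: --sglang-config, multiple regular groups, placeholder group, heterogeneous TP
- tests/test_qwen2.5_0.5B_sglang_config_distributed.py: --sglang-config, heterogeneous TP, placeholder group
- tests/test_sglang_config_mixed_offload.py
- tests/test_sglang_config_mixed_offload_ft.py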
run-ci-megatron / e2e-test-megatron
Main Megatron backend e2e suite.
| Test | GPUs | Main coverage |
| --- | --- | --- |
| tests/test_full_disk_weight_update.py | 4 | full checkpoint weight update through disk; asserts full HF checkpoint files are written |
| tests/test_quick_start_glm4_9B.py | 8 | GLM-Z1-9B, GRPO, TP2 + CP2, disaggregated actor/rollout, TIS, per-token loss |
| tests/test_glm4.7_30B_A3B_pd_mooncake.py | 8 | GLM-4.7 Flash MoE, TP2 + PP2 + CP2 + EP4, colocated single-node PD, Mooncake, EAGLE, DP attention / DP LM head |
| tests/test_qwen3_30B_A3B.py | 8 | Qwen3-30B-A3B MoE, GSPO, TP4 + CP2 + EP8, CPU optimizer offload, precision-aware optimizer, DeepEP, FP8 rollout, routing replay, TIS |
| tests/test_qwen3.6_35B_A3B_pd_mooncake.py | 8 | Qwen3.6-35B-A3B, PD + Mooncake, DeepEP, EAGLE, DP attention / DP LM head, debug rollout data save |
| tests/test_qwen3_30B_A3B_r3.py | 8 | Qwen3-30B-A3B, GSPO, rollout routing replay, TIS, DeepEP, FP8 rollout |
| tests/test_qwen3_4B_ppo.py | 8 | PPO, actor+critic config, critic-only warmup, colocated rollout, TP2 + CP2 |
| tests/test_qwen3_4B_ppo_disaggregate.py | 8 | PPO, actor+critic config, critic-only warmup, disaggregated actor/rollout, TP2 + CP2 |
| tests/test_qwen3_4B_ppo_train_critic_only.py | 8 | PPO with longer critic-only phase |
| tests/test_moonlight_16B_A3B.py | 8 | Moonlight MLA MoE, GSPO, TP2 + CP2 + EP8, colocated rollout |
| tests/test_moonlight_16B_A3B_r3.py | 8 | Moonlight MLA MoE, GSPO, rollout routing replay |
| tests/test_mimo_7B_mtp_only_grad.py | 8 | MTP training, EAGLE rollout, MTP-only gradient isolation assertion |
| tests/test_qwen3_0.6B_parallel_check.py | 8 | deterministic/parallel consistency across TP/PP/CP sizes and GPU counts; grad-norm save/load checks |
| tests/test_qwen2.5_0.5B_debug_rollout_then_train.py | 8 | two-phase debug rollout-only then train-only from saved rollout data |
| tests/test_qwen2.5_0.5B_opd_sglang.py | 8 | on-policy distillation with an external SGLang teacher server |
| tests/test_qwen3_4B_external_pd.py | 6 | external pre-launched SGLang PD fleet, Mooncake, external engine discovery, disk-backed delta weight sync |
| tests/test_qwen2.5_0.5B_fully_async_short.py | 4 | fully-async rollout path via train_async.py |
| tests/test_qwen3_4B_streaming_partial_rollout.py | 8 | streaming SGLang rollout, partial rollout, abort/recycle path, off-policy masking in partial rollout |
| tests/test_qwen3.5_0.8B_gsm8k_short.py | 4 | short colocated GRPO smoke, dynamic sampling, fault tolerance |
| tests/test_qwen3.5_0.8B_gsm8k_async_short.py | 4 | short async GRPO smoke, separate rollout GPUs, fault tolerance |
| tests/test_qwen3_4B_ckpt.py | 8 | checkpoint save/load roundtrip; optimizer save/load combinations; async save |
test_qwen3_4B_ckpt.py is run five ways:
- --save-optimizer gpu --load-optimizer gpu
- --save-optimizer gpu --load-optimizer cpu
- --save-optimizer cpu --load-optimizer cpu
- --save-optimizer cpu --load-optimizer gpu
- --async-save
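The five modes amount to the 2 x 2 grid of optimizer save/load placements plus one async-save run. A minimal sketch of enumerating them is shown below; the function name is hypothetical, and how the workflow actually passes these flags to the test is not specified here.

```python
from itertools import product

PLACEMENTS = ("gpu", "cpu")


def ckpt_modes():
    """Return the five argument lists used for the checkpoint test."""
    modes = [
        ["--save-optimizer", save, "--load-optimizer", load]
        for save, load in product(PLACEMENTS, repeat=2)
    ]
    # The async-save run is listed on its own, without optimizer placement flags.
    modes.append(["--async-save"])
    return modes
```

Generating the grid this way keeps the list complete if another placement is ever added to `PLACEMENTS`.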
run-ci-precision / e2e-test-precision
- tests/test_qwen3_0.6B_parallel_check.py
run-ci-ckpt / e2e-test-ckpt
Dedicated checkpoint matrix, currently the same five test_qwen3_4B_ckpt.py modes listed above.
run-ci-image / e2e-test-image
Runs a curated GPU e2e subset inside slimerl/slime-test:latest instead of the default slimerl/slime:latest image. This is useful for validating release/test image contents against representative Megatron, SGLang, checkpoint, PD, and OPD paths.
Backend coverage summary
Megatron backend
Covered by GPU e2e and CPU tests.
Current coverage includes smoke or targeted e2e coverage for:
- Dense models: Qwen2.5-0.5B, Qwen3-0.6B, Qwen3.5-0.8B, Qwen3-4B, GLM-Z1-9B, MiMo-7B
- MoE models: Qwen3-30B-A3B, Qwen3.6-35B-A3B, GLM-4.7 Flash, Moonlight-16B-A3B
- MLA / MoE / EP / DeepEP paths
- TP / PP / CP / EP combinations
- colocated and disaggregated rollout
- single-node PD with Mooncake
- external pre-launched PD fleet
- GRPO, GSPO, PPO
- TIS, routing replay, rollout routing replay
- partial rollout and streaming rollout
- fully-async rollout
- OPD with SGLang teacher
- checkpoint save/load, async save, CPU/GPU optimizer placement roundtrips
- disk-backed full and delta weight update
- MTP training gradient-isolation check
This list means the path is exercised by at least one CI test. It does not imply exhaustive coverage of every combination.
FSDP backend
No FSDP backend test is currently listed in the active pr-test.yml.j2 matrix, and the previous FSDP test entries are no longer present in the current test tree.
Keep this section explicit so users do not assume FSDP is covered by current CI.
Coverage gaps / follow-ups
Missing or weak backend coverage
- FSDP backend CI
- multi-node e2e CI
- dedicated VLM / multimodal e2e CI
- NPU / AMD CI
Operational CI follow-ups
- Decide whether to remove the unused run-ci-short / e2e-test-short job from the workflow template, or keep it as an explicitly documented legacy/manual fallback.
- Add an automatic periodic CI schedule if we want this to be independent of the long-lived trigger PR #2053.
- Make the issue easier to keep in sync with pr-test.yml.j2, ideally by generating part of this coverage map from the workflow matrix.
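As a rough idea of what generating the coverage map could look like, the sketch below renders a table from a list of matrix entries. The entry keys (`test`, `num_gpus`, `coverage`) are hypothetical and do not reflect the actual structure of the matrix in pr-test.yml.j2.

```python
def render_coverage_table(entries):
    """Render matrix entries as a pipe table matching the one in this issue."""
    lines = ["| Test | GPUs | Main coverage |", "| --- | --- | --- |"]
    for entry in entries:
        lines.append(
            f"| {entry['test']} | {entry['num_gpus']} | {entry['coverage']} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_coverage_table([
        {
            "test": "tests/test_qwen3_4B_ckpt.py",
            "num_gpus": 8,
            "coverage": "checkpoint save/load roundtrip",
        },
    ]))
```

A generator like this would still need a hand-written coverage description per test, but the test names and GPU counts could come straight from the workflow.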
Invariant / assertion follow-ups
- Standardize the exact --ci-test invariants across Megatron e2e tests.
- Document which tests are only smoke tests versus tests with explicit post-run assertions.
- Add more targeted assertions for rollout routing replay, fault tolerance recovery, partial rollout recycling, and PD weight sync instead of relying only on successful end-to-end completion where applicable.