[ci] CI coverage tracking #777

@zhuzilin

As slime grows, we need more thorough CI coverage for an increasing number of backends, deployment patterns, and training features.

This issue serves as a living index of our existing CI tests so that both users and contributors can quickly see:

  • which backends are currently exercised by CI
  • which parallelism / optimizer / rollout / deployment features are being tested
  • which CI labels are expected to be used in practice
  • where meaningful coverage gaps still exist

Over time this document should evolve into a compact CI coverage map of slime.

Last audited: 2026-06-11

Note: this issue tracks the CI suites we actually expect maintainers/contributors to use. Some legacy or backup jobs may still exist in the workflow template but are intentionally omitted here until they become part of the normal CI workflow.


CI entrypoints

Workflow source:

Triggers:

  • pull_request to main: opened, reopened, synchronize, labeled
  • push to main: cheap always-on CPU / adapter checks only
  • workflow_dispatch: can run the matrix manually; supports infinite_run

Most GPU e2e jobs are label-gated on self-hosted runners. The CPU unit tests and agent-adapter tests are always-on for PR / push / manual runs.
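The gating rule above can be sketched as a small lookup. This is only an illustration, not the workflow's actual logic: the label and job names are the ones used in this issue, while `jobs_for` is a hypothetical helper, and `workflow_dispatch` is deliberately not modelled beyond the always-on jobs.

```python
# Always-on jobs run for every PR / push / manual event.
ALWAYS_ON_JOBS = ["cpu-unittest", "agent-adapter-test"]

# Label -> job pairs as named in this issue.
LABEL_TO_JOB = {
    "run-ci-sglang-config": "e2e-test-sglang-config",
    "run-ci-megatron": "e2e-test-megatron",
    "run-ci-precision": "e2e-test-precision",
    "run-ci-ckpt": "e2e-test-ckpt",
    "run-ci-image": "e2e-test-image",
}


def jobs_for(event, labels=()):
    """Return the jobs expected to run for an event (hypothetical helper)."""
    jobs = list(ALWAYS_ON_JOBS)
    # Assumption: GPU e2e jobs are only selected by labels on pull requests.
    if event == "pull_request":
        jobs += [job for label, job in LABEL_TO_JOB.items() if label in labels]
    return jobs
```

For example, `jobs_for("pull_request", ["run-ci-ckpt"])` would add `e2e-test-ckpt` to the two always-on jobs, while a push to main would yield only the always-on jobs.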

Periodic CI trigger PR:

  • #2053 [DON'T MERGE] run CI
    • Long-lived PR used to periodically trigger CI by updating the ci-test file and/or applying CI labels.
    • Useful for scheduled/manual CI sweeps without attaching CI-only commits to feature PRs.

Dynamic changed-test entrypoint:

  • run-ci-changed
    • detects changed tests/test_*.py and tests/plugin_contracts/test_*.py
    • uses each test file's NUM_GPUS = ... when present, otherwise defaults to 8 GPUs
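A minimal sketch of the selection rule described above, under stated assumptions: `select_changed_tests` and `gpus_for_test` are hypothetical names, and `NUM_GPUS` is assumed to be a top-level integer literal. The real entrypoint may detect changes and parse the value differently.

```python
import fnmatch
import re
from pathlib import PurePosixPath

# Directories and file pattern taken from the description above.
TEST_DIRS = ("tests", "tests/plugin_contracts")
TEST_FILE_PATTERN = "test_*.py"
DEFAULT_NUM_GPUS = 8

# Assumption: NUM_GPUS is assigned an integer literal at the start of a line.
NUM_GPUS_RE = re.compile(r"^NUM_GPUS\s*=\s*(\d+)", re.MULTILINE)


def select_changed_tests(changed_paths):
    """Keep only changed paths that sit directly in a test directory."""
    selected = []
    for path in changed_paths:
        p = PurePosixPath(path)
        if str(p.parent) in TEST_DIRS and fnmatch.fnmatchcase(p.name, TEST_FILE_PATTERN):
            selected.append(path)
    return selected


def gpus_for_test(source_text):
    """Read NUM_GPUS from a test file's source, defaulting to 8 GPUs."""
    match = NUM_GPUS_RE.search(source_text)
    return int(match.group(1)) if match else DEFAULT_NUM_GPUS
```

Matching on the parent directory rather than a single glob keeps helper files in nested folders out of the selection.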

Common Megatron e2e sanity checks

Most Megatron e2e tests run with the --ci-test flag. For the standard training path, this is intended to catch reference-policy / rollout-policy wiring regressions, including initial KL / log-prob sanity failures.

Specialized tests may add extra assertions or intentionally disable part of the default checker. Examples:

  • MTP-only gradient test disables the KL checker and asserts that only MTP parameters receive non-zero gradients.
  • Disk-backed full / delta weight-update tests assert that the expected checkpoint or safetensors files are actually written.
  • Fan-out test asserts that the custom compact generate path was called exactly num_rollout * rollout_batch_size times.
  • External PD test asserts that disk-backed delta files were produced for pre-launched SGLang workers.
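As an illustration of the kind of post-run assertion listed above, the fan-out check could look roughly like this. It is a sketch, not the test's actual code: `count_calls` and `assert_fan_out` are hypothetical helpers, and only the expected count `num_rollout * rollout_batch_size` comes from the description.

```python
import functools


def count_calls(fn):
    """Wrap a generate function so the number of calls can be inspected."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


def assert_fan_out(generate_fn, num_rollout, rollout_batch_size):
    """Fail unless the wrapped function ran exactly once per sample."""
    expected = num_rollout * rollout_batch_size
    assert generate_fn.calls == expected, (
        f"expected {expected} generate calls, got {generate_fn.calls}"
    )
```

An exact-count assertion catches both a skipped custom path (zero calls) and accidental duplicate dispatch, which a plain "run completed" check would miss.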

Always-on CPU / contract CI

These run on GitHub-hosted CPU runners for PR / push / manual runs.

cpu-unittest

Trigger: always-on

Coverage:

  • Megatron argument validation
  • DP scheduling and CP utilities
  • metric reporting
  • loss CP invariance
  • value temperature
  • reward-model utilities: F1, GPQA, math, math DAPO, DeepScaler
  • sampling and rollout validation
  • placement group utilities
  • external SGLang engine utilities
  • HF checkpoint saver utility
  • plugin contracts for rollout, runtime hooks, path loading, and generate APIs

Tests:

agent-adapter-test

Trigger: always-on

Coverage:

  • OpenAI / Anthropic / OpenAI Agents SDK adapter compatibility

Tests:


Label-gated GPU e2e CI

run-ci-sglang-config / e2e-test-sglang-config

SGLang YAML config and offload suite.

| Test | GPUs | Main coverage |
| --- | --- | --- |
| tests/test_qwen2.5_0.5B_sglang_config.py | 8 | colocated --sglang-config, multiple regular groups, placeholder group, heterogeneous TP |
| tests/test_qwen2.5_0.5B_sglang_config_distributed.py | 8 | non-colocated train/rollout split with --sglang-config, heterogeneous TP, placeholder group |
| tests/test_sglang_config_mixed_offload.py | 8 | multiple SGLang models, updatable actor + frozen ref, offload/onload, frozen-model restore from disk |
| tests/test_sglang_config_mixed_offload_ft.py | 8 | mixed offload plus fault tolerance / engine recovery |

run-ci-megatron / e2e-test-megatron

Main Megatron backend e2e suite.

| Test | GPUs | Main coverage |
| --- | --- | --- |
| tests/test_full_disk_weight_update.py | 4 | full checkpoint weight update through disk; asserts full HF checkpoint files are written |
| tests/test_quick_start_glm4_9B.py | 8 | GLM-Z1-9B, GRPO, TP2 + CP2, disaggregated actor/rollout, TIS, per-token loss |
| tests/test_glm4.7_30B_A3B_pd_mooncake.py | 8 | GLM-4.7 Flash MoE, TP2 + PP2 + CP2 + EP4, colocated single-node PD, Mooncake, EAGLE, DP attention / DP LM head |
| tests/test_qwen3_30B_A3B.py | 8 | Qwen3-30B-A3B MoE, GSPO, TP4 + CP2 + EP8, CPU optimizer offload, precision-aware optimizer, DeepEP, FP8 rollout, routing replay, TIS |
| tests/test_qwen3.6_35B_A3B_pd_mooncake.py | 8 | Qwen3.6-35B-A3B, PD + Mooncake, DeepEP, EAGLE, DP attention / DP LM head, debug rollout data save |
| tests/test_qwen3_30B_A3B_r3.py | 8 | Qwen3-30B-A3B, GSPO, rollout routing replay, TIS, DeepEP, FP8 rollout |
| tests/test_qwen3_4B_ppo.py | 8 | PPO, actor+critic config, critic-only warmup, colocated rollout, TP2 + CP2 |
| tests/test_qwen3_4B_ppo_disaggregate.py | 8 | PPO, actor+critic config, critic-only warmup, disaggregated actor/rollout, TP2 + CP2 |
| tests/test_qwen3_4B_ppo_train_critic_only.py | 8 | PPO with longer critic-only phase |
| tests/test_moonlight_16B_A3B.py | 8 | Moonlight MLA MoE, GSPO, TP2 + CP2 + EP8, colocated rollout |
| tests/test_moonlight_16B_A3B_r3.py | 8 | Moonlight MLA MoE, GSPO, rollout routing replay |
| tests/test_mimo_7B_mtp_only_grad.py | 8 | MTP training, EAGLE rollout, MTP-only gradient isolation assertion |
| tests/test_qwen3_0.6B_parallel_check.py | 8 | deterministic/parallel consistency across TP/PP/CP sizes and GPU counts; grad-norm save/load checks |
| tests/test_qwen2.5_0.5B_debug_rollout_then_train.py | 8 | two-phase debug rollout-only then train-only from saved rollout data |
| tests/test_qwen2.5_0.5B_opd_sglang.py | 8 | on-policy distillation with an external SGLang teacher server |
| tests/test_qwen3_4B_external_pd.py | 6 | external pre-launched SGLang PD fleet, Mooncake, external engine discovery, disk-backed delta weight sync |
| tests/test_qwen2.5_0.5B_fully_async_short.py | 4 | fully-async rollout path via train_async.py |
| tests/test_qwen3_4B_streaming_partial_rollout.py | 8 | streaming SGLang rollout, partial rollout, abort/recycle path, off-policy masking in partial rollout |
| tests/test_qwen3.5_0.8B_gsm8k_short.py | 4 | short colocated GRPO smoke, dynamic sampling, fault tolerance |
| tests/test_qwen3.5_0.8B_gsm8k_async_short.py | 4 | short async GRPO smoke, separate rollout GPUs, fault tolerance |
| tests/test_qwen3_4B_ckpt.py | 8 | checkpoint save/load roundtrip; optimizer save/load combinations; async save |

test_qwen3_4B_ckpt.py is run five ways:

  • --save-optimizer gpu --load-optimizer gpu
  • --save-optimizer gpu --load-optimizer cpu
  • --save-optimizer cpu --load-optimizer cpu
  • --save-optimizer cpu --load-optimizer gpu
  • --async-save
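The five modes above are the four save/load placement combinations plus the async-save run. A sketch of enumerating them follows; `ckpt_modes` and `ckpt_command` are hypothetical helpers, and passing the flags directly to the test script is an assumption about how the workflow invokes it.

```python
from itertools import product

CKPT_TEST = "tests/test_qwen3_4B_ckpt.py"
PLACEMENTS = ("gpu", "cpu")


def ckpt_modes():
    """Return the five flag sets: 2 x 2 optimizer placements plus async save."""
    modes = [
        ["--save-optimizer", save, "--load-optimizer", load]
        for save, load in product(PLACEMENTS, repeat=2)
    ]
    modes.append(["--async-save"])
    return modes


def ckpt_command(mode):
    # Assumption: the flags are forwarded as plain CLI arguments.
    return ["python", CKPT_TEST, *mode]
```

Generating the matrix from the placement tuple keeps the mode list complete if another optimizer placement is ever added.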

run-ci-precision / e2e-test-precision

| Test | GPUs | Main coverage |
| --- | --- | --- |
| tests/test_qwen3_0.6B_parallel_check.py | 8 | dedicated precision / parallel-consistency sweep across TP/PP/CP configurations |

run-ci-ckpt / e2e-test-ckpt

Dedicated checkpoint matrix, currently the same five test_qwen3_4B_ckpt.py modes listed above.

run-ci-image / e2e-test-image

Runs a curated GPU e2e subset inside slimerl/slime-test:latest instead of the default slimerl/slime:latest image. This is useful for validating release/test image contents against representative Megatron, SGLang, checkpoint, PD, and OPD paths.


Backend coverage summary

Megatron backend

Covered by GPU e2e and CPU tests.

Current coverage includes smoke or targeted e2e coverage for:

  • Dense models: Qwen2.5-0.5B, Qwen3-0.6B, Qwen3.5-0.8B, Qwen3-4B, GLM-Z1-9B, MiMo-7B
  • MoE models: Qwen3-30B-A3B, Qwen3.6-35B-A3B, GLM-4.7 Flash, Moonlight-16B-A3B
  • MLA / MoE / EP / DeepEP paths
  • TP / PP / CP / EP combinations
  • colocated and disaggregated rollout
  • single-node PD with Mooncake
  • external pre-launched PD fleet
  • GRPO, GSPO, PPO
  • TIS, routing replay, rollout routing replay
  • partial rollout and streaming rollout
  • fully-async rollout
  • OPD with SGLang teacher
  • checkpoint save/load, async save, CPU/GPU optimizer placement roundtrips
  • disk-backed full and delta weight update
  • MTP training gradient-isolation check

This list means the path is exercised by at least one CI test. It does not imply exhaustive coverage of every combination.

FSDP backend

No FSDP backend test is currently listed in the active pr-test.yml.j2 matrix, and the previous FSDP test entries are no longer present in the current test tree.

Keep this section explicit so users do not assume FSDP is covered by current CI.


Coverage gaps / follow-ups

Missing or weak backend coverage

  • FSDP backend CI
  • multi-node e2e CI
  • dedicated VLM / multimodal e2e CI
  • NPU / AMD CI

Operational CI follow-ups

  • Decide whether to remove the unused run-ci-short / e2e-test-short job from the workflow template, or keep it as an explicitly documented legacy/manual fallback.
  • Add an automatic periodic CI schedule if we want this to be independent of the long-lived trigger PR #2053.
  • Make the issue easier to keep in sync with pr-test.yml.j2, ideally by generating part of this coverage map from the workflow matrix.
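One possible shape for the last follow-up is sketched below. It assumes each matrix entry can be reduced to a dict with hypothetical `test`, `gpus`, and `coverage` keys; the real structure of pr-test.yml.j2 is not described in this issue, so the loading step is left out.

```python
def render_coverage_table(entries):
    """Render matrix entries as a Test / GPUs / Main coverage table."""
    lines = ["| Test | GPUs | Main coverage |", "| --- | --- | --- |"]
    for entry in entries:
        # Hypothetical keys: "test" (path), "gpus" (int), "coverage" (list of str).
        coverage = ", ".join(entry["coverage"])
        lines.append(f"| {entry['test']} | {entry['gpus']} | {coverage} |")
    return "\n".join(lines)


example = [
    {
        "test": "tests/test_qwen3_4B_ppo.py",
        "gpus": 8,
        "coverage": ["PPO", "colocated rollout", "TP2 + CP2"],
    }
]
print(render_coverage_table(example))
```

Keeping the coverage notes next to each matrix entry would let the table in this issue be regenerated rather than edited by hand.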

Invariant / assertion follow-ups

  • Standardize the exact --ci-test invariants across Megatron e2e tests.
  • Document which tests are only smoke tests versus tests with explicit post-run assertions.
  • Add more targeted assertions for rollout routing replay, fault tolerance recovery, partial rollout recycling, and PD weight sync instead of relying only on successful end-to-end completion where applicable.
