[ci] CI coverage tracking

As slime grows, we need more thorough CI coverage for an increasing number of backends, deployment patterns, and training features.

This issue serves as a living index of our existing CI tests so that **both users and contributors** can quickly see:

- which backends are currently exercised by CI
- which parallelism / optimizer / rollout / deployment features are being tested
- which CI labels are expected to be used in practice
- where meaningful coverage gaps still exist

Over time, this issue should evolve into a compact **CI coverage map** of slime.

Last audited: **2026-06-11**

> Note: this issue tracks the CI suites we actually expect maintainers/contributors to use. Some legacy or backup jobs may still exist in the workflow template but are intentionally omitted here until they become part of the normal CI workflow.

---

## CI entrypoints

Workflow source:

- [`pr-test.yml.j2`](https://github.com/THUDM/slime/blob/main/.github/workflows/pr-test.yml.j2) is the source of truth.
- [`pr-test.yml`](https://github.com/THUDM/slime/blob/main/.github/workflows/pr-test.yml) is generated from the template and should not be edited manually.

Triggers:

- `pull_request` to `main`: `opened`, `reopened`, `synchronize`, `labeled`
- `push` to `main`: cheap always-on CPU / adapter checks only
- `workflow_dispatch`: can run the matrix manually; supports `infinite_run`

Most GPU e2e jobs are **label-gated** on self-hosted runners. The CPU unit tests and agent-adapter tests are always-on for PR / push / manual runs.
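As a rough illustration of the gating rule above, the sketch below maps an event and a set of PR labels to the jobs that would run. It assumes that the first name in each suite heading of this issue is the PR label and the second is the job name; `jobs_to_run` is a hypothetical helper, not part of the workflow, and manual `workflow_dispatch` matrix selection is deliberately not modeled.

```python
# Always-on jobs and label -> job pairs as named in this issue.
ALWAYS_ON_JOBS = ["cpu-unittest", "agent-adapter-test"]
LABEL_TO_JOB = {
    "run-ci-sglang-config": "e2e-test-sglang-config",
    "run-ci-megatron": "e2e-test-megatron",
    "run-ci-precision": "e2e-test-precision",
    "run-ci-ckpt": "e2e-test-ckpt",
    "run-ci-image": "e2e-test-image",
}


def jobs_to_run(event, labels=()):
    """Return job names for an event (hypothetical helper, simplified model).

    Always-on jobs run for every event. Label-gated GPU jobs are added only
    for pull requests that carry the matching label.
    """
    jobs = list(ALWAYS_ON_JOBS)
    if event == "pull_request":
        jobs += [job for label, job in LABEL_TO_JOB.items() if label in labels]
    return jobs


if __name__ == "__main__":
    print(jobs_to_run("push"))
    print(jobs_to_run("pull_request", labels={"run-ci-megatron"}))
```

The real selection logic lives in `pr-test.yml.j2`; treat this only as a reading aid for which labels unlock which suites.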

Periodic CI trigger PR:

- [`#2053 [DON'T MERGE] run CI`](https://github.com/THUDM/slime/pull/2053)
  - Long-lived PR used to periodically trigger CI by updating the `ci-test` file and/or applying CI labels.
  - Useful for scheduled/manual CI sweeps without attaching CI-only commits to feature PRs.

Dynamic changed-test entrypoint:

- `run-ci-changed`
  - detects changed `tests/test_*.py` and `tests/plugin_contracts/test_*.py`
  - uses each test file's `NUM_GPUS = ...` when present, otherwise defaults to 8 GPUs
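The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation: `is_ci_test` and `gpus_for_source` are hypothetical helper names, and the regular expression assumes `NUM_GPUS` is a plain integer assignment at the start of a line.

```python
import fnmatch
import re
from pathlib import PurePosixPath

# Directories and filename pattern taken from the bullet list above.
TEST_DIRS = ("tests", "tests/plugin_contracts")
TEST_NAME_GLOB = "test_*.py"
DEFAULT_NUM_GPUS = 8

# Assumption: the GPU count is declared as e.g. `NUM_GPUS = 4` at line start.
NUM_GPUS_RE = re.compile(r"^NUM_GPUS\s*=\s*(\d+)", re.MULTILINE)


def is_ci_test(path):
    """True if a changed path is a test file directly inside a tracked directory."""
    p = PurePosixPath(path)
    return str(p.parent) in TEST_DIRS and fnmatch.fnmatchcase(p.name, TEST_NAME_GLOB)


def gpus_for_source(source):
    """Read the declared GPU count from file contents, defaulting to 8."""
    match = NUM_GPUS_RE.search(source)
    return int(match.group(1)) if match else DEFAULT_NUM_GPUS


if __name__ == "__main__":
    changed = ["tests/test_qwen3_4B_ckpt.py", "slime/utils/misc.py"]
    print([p for p in changed if is_ci_test(p)])
    print(gpus_for_source("NUM_GPUS = 4\n"))
```

Matching on the parent directory rather than a single glob keeps nested helper files from being picked up by accident.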

---

## Common Megatron e2e sanity checks

Most Megatron e2e tests pass `--ci-test`. For the standard training path, this is intended to catch reference-policy / rollout-policy wiring regressions, including initial KL / log-prob sanity failures.

Specialized tests may add extra assertions or intentionally disable part of the default checker. Examples:

- MTP-only gradient test disables the KL checker and asserts that only MTP parameters receive non-zero gradients.
- Disk full / delta weight-update tests assert that expected checkpoint or safetensors files are actually written.
- Fan-out test asserts that the custom compact generate path was called exactly `num_rollout * rollout_batch_size` times.
- External PD test asserts that disk-backed delta files were produced for pre-launched SGLang workers.
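To make the fan-out style of assertion concrete, here is a minimal sketch of a call-count check. It does not reproduce the actual test: `CallCounter`, `assert_fanout`, and `fake_generate` are hypothetical names, and the real test may count calls differently.

```python
class CallCounter:
    """Wrap a callable and count how many times it is invoked."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.fn(*args, **kwargs)


def assert_fanout(counter, num_rollout, rollout_batch_size):
    """Fail unless the wrapped function ran once per prompt per rollout."""
    expected = num_rollout * rollout_batch_size
    assert counter.calls == expected, (
        f"expected {expected} generate calls, got {counter.calls}"
    )


if __name__ == "__main__":
    # Stand-in for a custom generate function (hypothetical).
    def fake_generate(prompt):
        return prompt.upper()

    generate = CallCounter(fake_generate)
    num_rollout, rollout_batch_size = 2, 4
    for _ in range(num_rollout):
        for i in range(rollout_batch_size):
            generate(f"prompt-{i}")
    assert_fanout(generate, num_rollout, rollout_batch_size)
```

An exact-count check like this catches both dropped and duplicated generate calls, which a plain "run completed" check would miss.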

---

## Always-on CPU / contract CI

These run on GitHub-hosted CPU runners for PR / push / manual runs.

### `cpu-unittest`

Trigger: always-on

Coverage:

- Megatron argument validation
- DP scheduling and CP utilities
- metric reporting
- loss CP invariance
- value temperature
- reward-model utilities: F1, GPQA, math, math DAPO, DeepScaler
- sampling and rollout validation
- placement group utilities
- external SGLang engine utilities
- HF checkpoint saver utility
- plugin contracts for rollout, runtime hooks, path loading, and generate APIs

Tests:

- [`tests/test_megatron_argument_validation.py`](https://github.com/THUDM/slime/blob/main/tests/test_megatron_argument_validation.py)
- [`tests/test_dp_schedule.py`](https://github.com/THUDM/slime/blob/main/tests/test_dp_schedule.py)
- [`tests/test_cp_utils.py`](https://github.com/THUDM/slime/blob/main/tests/test_cp_utils.py)
- [`tests/test_metric_report.py`](https://github.com/THUDM/slime/blob/main/tests/test_metric_report.py)
- [`tests/test_metric_report_dist.py`](https://github.com/THUDM/slime/blob/main/tests/test_metric_report_dist.py)
- [`tests/test_loss_cp_invariance.py`](https://github.com/THUDM/slime/blob/main/tests/test_loss_cp_invariance.py)
- [`tests/test_value_temperature.py`](https://github.com/THUDM/slime/blob/main/tests/test_value_temperature.py)
- [`tests/test_rm_f1.py`](https://github.com/THUDM/slime/blob/main/tests/test_rm_f1.py)
- [`tests/test_rm_gpqa.py`](https://github.com/THUDM/slime/blob/main/tests/test_rm_gpqa.py)
- [`tests/test_rm_math.py`](https://github.com/THUDM/slime/blob/main/tests/test_rm_math.py)
- [`tests/test_rm_math_dapo.py`](https://github.com/THUDM/slime/blob/main/tests/test_rm_math_dapo.py)
- [`tests/test_rm_deepscaler.py`](https://github.com/THUDM/slime/blob/main/tests/test_rm_deepscaler.py)
- [`tests/test_sample.py`](https://github.com/THUDM/slime/blob/main/tests/test_sample.py)
- [`tests/test_agent_trajectory.py`](https://github.com/THUDM/slime/blob/main/tests/test_agent_trajectory.py)
- [`tests/test_rollout_validation.py`](https://github.com/THUDM/slime/blob/main/tests/test_rollout_validation.py)
- [`tests/test_placement_group.py`](https://github.com/THUDM/slime/blob/main/tests/test_placement_group.py)
- [`tests/test_external_sglang_engines.py`](https://github.com/THUDM/slime/blob/main/tests/test_external_sglang_engines.py)
- [`tests/utils/test_hf_checkpoint_saver.py`](https://github.com/THUDM/slime/blob/main/tests/utils/test_hf_checkpoint_saver.py)
- [`tests/plugin_contracts/test_plugin_rollout_contracts.py`](https://github.com/THUDM/slime/blob/main/tests/plugin_contracts/test_plugin_rollout_contracts.py)
- [`tests/plugin_contracts/test_plugin_runtime_hook_contracts.py`](https://github.com/THUDM/slime/blob/main/tests/plugin_contracts/test_plugin_runtime_hook_contracts.py)
- [`tests/plugin_contracts/test_plugin_path_loading_contracts.py`](https://github.com/THUDM/slime/blob/main/tests/plugin_contracts/test_plugin_path_loading_contracts.py)
- [`tests/plugin_contracts/test_plugin_generate_contracts.py`](https://github.com/THUDM/slime/blob/main/tests/plugin_contracts/test_plugin_generate_contracts.py)

### `agent-adapter-test`

Trigger: always-on

Coverage:

- OpenAI / Anthropic / OpenAI Agents SDK adapter compatibility

Tests:

- [`tests/test_agent_adapters.py`](https://github.com/THUDM/slime/blob/main/tests/test_agent_adapters.py)
- [`tests/test_agent_sdk_adapters.py`](https://github.com/THUDM/slime/blob/main/tests/test_agent_sdk_adapters.py)

---

## Label-gated GPU e2e CI

### `run-ci-sglang-config` / `e2e-test-sglang-config`

SGLang YAML config and offload suite.

| Test | GPUs | Main coverage |
|---|---:|---|
| [`tests/test_qwen2.5_0.5B_sglang_config.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen2.5_0.5B_sglang_config.py) | 8 | colocated `--sglang-config`, multiple regular groups, placeholder group, heterogeneous TP |
| [`tests/test_qwen2.5_0.5B_sglang_config_distributed.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen2.5_0.5B_sglang_config_distributed.py) | 8 | non-colocated train/rollout split with `--sglang-config`, heterogeneous TP, placeholder group |
| [`tests/test_sglang_config_mixed_offload.py`](https://github.com/THUDM/slime/blob/main/tests/test_sglang_config_mixed_offload.py) | 8 | multiple SGLang models, updatable actor + frozen ref, offload/onload, frozen-model restore from disk |
| [`tests/test_sglang_config_mixed_offload_ft.py`](https://github.com/THUDM/slime/blob/main/tests/test_sglang_config_mixed_offload_ft.py) | 8 | mixed offload plus fault tolerance / engine recovery |

### `run-ci-megatron` / `e2e-test-megatron`

Main Megatron backend e2e suite.

| Test | GPUs | Main coverage |
|---|---:|---|
| [`tests/test_full_disk_weight_update.py`](https://github.com/THUDM/slime/blob/main/tests/test_full_disk_weight_update.py) | 4 | full checkpoint weight update through disk; asserts full HF checkpoint files are written |
| [`tests/test_quick_start_glm4_9B.py`](https://github.com/THUDM/slime/blob/main/tests/test_quick_start_glm4_9B.py) | 8 | GLM-Z1-9B, GRPO, TP2 + CP2, disaggregated actor/rollout, TIS, per-token loss |
| [`tests/test_glm4.7_30B_A3B_pd_mooncake.py`](https://github.com/THUDM/slime/blob/main/tests/test_glm4.7_30B_A3B_pd_mooncake.py) | 8 | GLM-4.7 Flash MoE, TP2 + PP2 + CP2 + EP4, colocated single-node PD, Mooncake, EAGLE, DP attention / DP LM head |
| [`tests/test_qwen3_30B_A3B.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_30B_A3B.py) | 8 | Qwen3-30B-A3B MoE, GSPO, TP4 + CP2 + EP8, CPU optimizer offload, precision-aware optimizer, DeepEP, FP8 rollout, routing replay, TIS |
| [`tests/test_qwen3.6_35B_A3B_pd_mooncake.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3.6_35B_A3B_pd_mooncake.py) | 8 | Qwen3.6-35B-A3B, PD + Mooncake, DeepEP, EAGLE, DP attention / DP LM head, debug rollout data save |
| [`tests/test_qwen3_30B_A3B_r3.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_30B_A3B_r3.py) | 8 | Qwen3-30B-A3B, GSPO, rollout routing replay, TIS, DeepEP, FP8 rollout |
| [`tests/test_qwen3_4B_ppo.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_ppo.py) | 8 | PPO, actor+critic config, critic-only warmup, colocated rollout, TP2 + CP2 |
| [`tests/test_qwen3_4B_ppo_disaggregate.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_ppo_disaggregate.py) | 8 | PPO, actor+critic config, critic-only warmup, disaggregated actor/rollout, TP2 + CP2 |
| [`tests/test_qwen3_4B_ppo_train_critic_only.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_ppo_train_critic_only.py) | 8 | PPO with longer critic-only phase |
| [`tests/test_moonlight_16B_A3B.py`](https://github.com/THUDM/slime/blob/main/tests/test_moonlight_16B_A3B.py) | 8 | Moonlight MLA MoE, GSPO, TP2 + CP2 + EP8, colocated rollout |
| [`tests/test_moonlight_16B_A3B_r3.py`](https://github.com/THUDM/slime/blob/main/tests/test_moonlight_16B_A3B_r3.py) | 8 | Moonlight MLA MoE, GSPO, rollout routing replay |
| [`tests/test_mimo_7B_mtp_only_grad.py`](https://github.com/THUDM/slime/blob/main/tests/test_mimo_7B_mtp_only_grad.py) | 8 | MTP training, EAGLE rollout, MTP-only gradient isolation assertion |
| [`tests/test_qwen3_0.6B_parallel_check.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_0.6B_parallel_check.py) | 8 | deterministic/parallel consistency across TP/PP/CP sizes and GPU counts; grad-norm save/load checks |
| [`tests/test_qwen2.5_0.5B_debug_rollout_then_train.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen2.5_0.5B_debug_rollout_then_train.py) | 8 | two-phase debug rollout-only then train-only from saved rollout data |
| [`tests/test_qwen2.5_0.5B_opd_sglang.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen2.5_0.5B_opd_sglang.py) | 8 | on-policy distillation with an external SGLang teacher server |
| [`tests/test_qwen3_4B_external_pd.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_external_pd.py) | 6 | external pre-launched SGLang PD fleet, Mooncake, external engine discovery, disk-backed delta weight sync |
| [`tests/test_qwen2.5_0.5B_fully_async_short.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen2.5_0.5B_fully_async_short.py) | 4 | fully-async rollout path via `train_async.py` |
| [`tests/test_qwen3_4B_streaming_partial_rollout.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_streaming_partial_rollout.py) | 8 | streaming SGLang rollout, partial rollout, abort/recycle path, off-policy masking in partial rollout |
| [`tests/test_qwen3.5_0.8B_gsm8k_short.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3.5_0.8B_gsm8k_short.py) | 4 | short colocated GRPO smoke, dynamic sampling, fault tolerance |
| [`tests/test_qwen3.5_0.8B_gsm8k_async_short.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3.5_0.8B_gsm8k_async_short.py) | 4 | short async GRPO smoke, separate rollout GPUs, fault tolerance |
| [`tests/test_qwen3_4B_ckpt.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_4B_ckpt.py) | 8 | checkpoint save/load roundtrip; optimizer save/load combinations; async save |

`test_qwen3_4B_ckpt.py` is run five ways:

- `--save-optimizer gpu --load-optimizer gpu`
- `--save-optimizer gpu --load-optimizer cpu`
- `--save-optimizer cpu --load-optimizer cpu`
- `--save-optimizer cpu --load-optimizer gpu`
- `--async-save`
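For reference, the five modes can be enumerated programmatically. The sketch below only builds the argument lists from the flags listed above; `ckpt_modes` is a hypothetical helper, and how the workflow actually passes these flags to the test is not shown here.

```python
from itertools import product


def ckpt_modes():
    """Return the five argument lists: 2x2 optimizer placements plus async save."""
    modes = [
        ["--save-optimizer", save, "--load-optimizer", load]
        for save, load in product(("gpu", "cpu"), repeat=2)
    ]
    modes.append(["--async-save"])
    return modes


if __name__ == "__main__":
    for args in ckpt_modes():
        print(" ".join(args))
```

The enumeration order differs from the list above, but the set of modes is the same.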

### `run-ci-precision` / `e2e-test-precision`

| Test | GPUs | Main coverage |
|---|---:|---|
| [`tests/test_qwen3_0.6B_parallel_check.py`](https://github.com/THUDM/slime/blob/main/tests/test_qwen3_0.6B_parallel_check.py) | 8 | dedicated precision / parallel-consistency sweep across TP/PP/CP configurations |

### `run-ci-ckpt` / `e2e-test-ckpt`

Dedicated checkpoint matrix, currently the same five `test_qwen3_4B_ckpt.py` modes listed above.

### `run-ci-image` / `e2e-test-image`

Runs a curated GPU e2e subset inside `slimerl/slime-test:latest` instead of the default `slimerl/slime:latest` image. This is useful for validating release/test image contents against representative Megatron, SGLang, checkpoint, PD, and OPD paths.

---

## Backend coverage summary

### Megatron backend

Covered by GPU e2e and CPU tests.

CI currently provides smoke or targeted e2e coverage for:

- Dense models: Qwen2.5-0.5B, Qwen3-0.6B, Qwen3.5-0.8B, Qwen3-4B, GLM-Z1-9B, MiMo-7B
- MoE models: Qwen3-30B-A3B, Qwen3.6-35B-A3B, GLM-4.7 Flash, Moonlight-16B-A3B
- MLA / MoE / EP / DeepEP paths
- TP / PP / CP / EP combinations
- colocated and disaggregated rollout
- single-node PD with Mooncake
- external pre-launched PD fleet
- GRPO, GSPO, PPO
- TIS, routing replay, rollout routing replay
- partial rollout and streaming rollout
- fully-async rollout
- OPD with SGLang teacher
- checkpoint save/load, async save, CPU/GPU optimizer placement roundtrips
- disk-backed full and delta weight update
- MTP training gradient-isolation check

Inclusion in this list means a path is exercised by at least one CI test. It does **not** imply exhaustive coverage of every combination.

### FSDP backend

No FSDP backend test is currently listed in the active `pr-test.yml.j2` matrix, and the previous FSDP test entries are no longer present in the current test tree.

Keep this section explicit so users do not assume FSDP is covered by current CI.

---

## Coverage gaps / follow-ups

### Missing or weak backend coverage

- FSDP backend CI
- multi-node e2e CI
- dedicated VLM / multimodal e2e CI
- NPU / AMD CI

### Operational CI follow-ups

- Decide whether to remove the unused `run-ci-short` / `e2e-test-short` job from the workflow template, or keep it as an explicitly documented legacy/manual fallback.
- Add an automatic periodic CI schedule if we want this to be independent of the long-lived trigger PR [`#2053`](https://github.com/THUDM/slime/pull/2053).
- Make the issue easier to keep in sync with `pr-test.yml.j2`, ideally by generating part of this coverage map from the workflow matrix.
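One possible shape for the generation idea in the last item is sketched below: render a suite table in this issue's format from a list of matrix entries. The entry schema (`path`, `num_gpus`, `coverage`) and the `render_coverage_table` helper are assumptions made for illustration; the real workflow matrix may be structured differently and would need its own loader.

```python
REPO_BLOB = "https://github.com/THUDM/slime/blob/main"


def render_coverage_table(entries):
    """Render matrix entries as a Markdown table matching this issue's layout.

    Each entry is a dict with hypothetical keys: path, num_gpus, coverage.
    """
    lines = ["| Test | GPUs | Main coverage |", "|---|---:|---|"]
    for entry in entries:
        link = f"[`{entry['path']}`]({REPO_BLOB}/{entry['path']})"
        lines.append(f"| {link} | {entry['num_gpus']} | {entry['coverage']} |")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        {
            "path": "tests/test_qwen3_4B_ppo_train_critic_only.py",
            "num_gpus": 8,
            "coverage": "PPO with longer critic-only phase",
        }
    ]
    print(render_coverage_table(example))
```

The coverage descriptions would still need to be maintained by hand or stored next to the matrix, since they are not derivable from the test paths alone.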

### Invariant / assertion follow-ups

- Standardize the exact `--ci-test` invariants across Megatron e2e tests.
- Document which tests are only smoke tests versus tests with explicit post-run assertions.
- Add more targeted assertions for rollout routing replay, fault tolerance recovery, partial rollout recycling, and PD weight sync instead of relying only on successful end-to-end completion where applicable.
